Before, I had a cluster of 13 nodes. I added 3 more nodes, and within 5 minutes I lost the whole cluster. I restarted corosync on the nodes one by one, but when I start the 15th node I get this message:
Code:
corosync[29232]: [TOTEM ] Token has not been received in 380 ms
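For context on the `Token has not been received` warnings: corosync.conf(5) documents that when the nodelist contains at least 3 nodes, the effective totem token timeout is `token + (number_of_nodes - 2) * token_coefficient`. The sketch below only evaluates that formula for the old (13) and new (16) node counts. The `token` and `token_coefficient` values it uses are hypothetical example values, not values read from this cluster. The real ones are in `/etc/pve/corosync.conf` (or the corosync defaults if unset), and the running configuration can be inspected with `corosync-cmapctl`.

```python
# Hypothetical example values for illustration only; they are NOT read from
# this cluster. Check /etc/pve/corosync.conf and the running configuration.
EXAMPLE_TOKEN_MS = 3000
EXAMPLE_TOKEN_COEFFICIENT_MS = 650


def effective_token_timeout_ms(nodes: int, token_ms: int, coefficient_ms: int) -> int:
    """Effective totem token timeout as described in corosync.conf(5).

    token_coefficient is only applied when the nodelist has at least 3 nodes.
    """
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * coefficient_ms


if __name__ == "__main__":
    for n in (13, 16):
        t = effective_token_timeout_ms(n, EXAMPLE_TOKEN_MS, EXAMPLE_TOKEN_COEFFICIENT_MS)
        print(f"{n} nodes -> {t} ms")
    # With the example values: 13 nodes -> 10150 ms, 16 nodes -> 12100 ms
```

Whatever the actual values are, going from 13 to 16 nodes adds `3 * token_coefficient` to the effective timeout.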
Then, after a few minutes, the cluster crashes:
Code:
Dec 29 19:11:16 xinfvirtrc23b corosync[29232]: [TOTEM ] Token has not been received in 379 ms
Dec 29 19:11:18 xinfvirtrc23b corosync[29232]: [TOTEM ] Retransmit List: 6 1b 5 8 d 10 16 17 19 1a 1c 1d 1e 24 25 27 28 29 2a 2e 2f 11 26 30 32 31 2b
Dec 29 19:11:26 xinfvirtrc23b corosync[29232]: [TOTEM ] Token has not been received in 385 ms
Dec 29 19:11:28 xinfvirtrc23b corosync[29232]: [TOTEM ] Retransmit List: 10 16 17 19 1a 1d 1e 25 28 2a 30 32 3a 36 39 3e 42 43
Dec 29 19:12:13 xinfvirtrc23b corosync[29232]: [KNET ] link: host: 6 link: 0 is down
Dec 29 19:12:13 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Dec 29 19:12:13 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 6 has no active links
Dec 29 19:12:18 xinfvirtrc23b corosync[29232]: [KNET ] rx: host: 6 link: 0 is up
Dec 29 19:12:18 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Dec 29 19:12:39 xinfvirtrc23b corosync[29232]: [KNET ] link: host: 5 link: 0 is down
Dec 29 19:12:39 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Dec 29 19:12:39 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 5 has no active links
Dec 29 19:12:42 xinfvirtrc23b corosync[29232]: [KNET ] link: host: 2 link: 0 is down
Dec 29 19:12:42 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Dec 29 19:12:42 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 2 has no active links
Dec 29 19:12:44 xinfvirtrc23b corosync[29232]: [KNET ] rx: host: 5 link: 0 is up
Dec 29 19:12:44 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Dec 29 19:12:50 xinfvirtrc23b pvedaemon[47715]: <root@pam> successful auth for user 'root@pam'
Dec 29 19:13:02 xinfvirtrc23b corosync[29232]: [TOTEM ] Token has not been received in 8028 ms
Dec 29 19:14:00 xinfvirtrc23b corosync[29232]: [KNET ] link: host: 1 link: 0 is down
Dec 29 19:14:00 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 29 19:14:00 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 1 has no active links
Dec 29 19:14:12 xinfvirtrc23b corosync[29232]: [TOTEM ] Token has not been received in 388 ms
Dec 29 19:14:27 xinfvirtrc23b corosync[29232]: [TOTEM ] Token has not been received in 389 ms
Dec 29 19:14:39 xinfvirtrc23b corosync[29232]: [KNET ] link: host: 7 link: 0 is down
Dec 29 19:14:39 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 29 19:14:39 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 7 has no active links
Dec 29 19:14:42 xinfvirtrc23b corosync[29232]: [KNET ] link: host: 8 link: 0 is down
Dec 29 19:14:42 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Dec 29 19:14:42 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 8 has no active links
Dec 29 19:14:44 xinfvirtrc23b corosync[29232]: [KNET ] link: host: 9 link: 0 is down
Dec 29 19:14:44 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
Dec 29 19:14:44 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 9 has no active links
Dec 29 19:14:47 xinfvirtrc23b corosync[29232]: [KNET ] link: host: 10 link: 0 is down
Dec 29 19:14:47 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Dec 29 19:14:47 xinfvirtrc23b corosync[29232]: [KNET ] host: host: 10 has no active links
Dec 29 19:14:47 xinfvirtrc23b corosync[29232]: [TOTEM ] Token has not been received in 12683 ms
Dec 29 19:14:54 xinfvirtrc23b sshd[24883]: Accepted publickey for root from 10.201.12.52 port 56162 ssh2: RSA SHA256:NlHJccEIhs53WfwenX3bfoMBj/+KnyePlOdMxGkATEE
Dec 29 19:14:54 xinfvirtrc23b sshd[24883]: pam_unix(sshd:session): session opened for user root by (uid=0)
Dec 29 19:14:54 xinfvirtrc23b systemd-logind[2279]: New session 492 of user root.
Dec 29 19:14:54 xinfvirtrc23b systemd[1]: Started Session 492 of user root.
Dec 29 19:14:54 xinfvirtrc23b systemd[1]: Stopping Corosync Cluster Engine...
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [MAIN ] Node was shut down by a signal
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [SERV ] Unloading all Corosync service engines.
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [QB ] withdrawing server sockets
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [confdb] crit: cmap_dispatch failed: 2
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [QB ] withdrawing server sockets
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [SERV ] Service engine unloaded: corosync configuration map access
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [QB ] withdrawing server sockets
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [SERV ] Service engine unloaded: corosync configuration service
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [status] crit: cpg_dispatch failed: 2
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [status] crit: cpg_leave failed: 2
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [dcdb] crit: cpg_dispatch failed: 2
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [dcdb] crit: cpg_leave failed: 2
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [dcdb] crit: cpg_send_message failed: 9
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [dcdb] crit: cpg_send_message failed: 9
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [dcdb] crit: cpg_send_message failed: 9
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [dcdb] crit: cpg_send_message failed: 9
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [QB ] withdrawing server sockets
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [quorum] crit: quorum_dispatch failed: 2
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [status] notice: node lost quorum
Dec 29 19:14:54 xinfvirtrc23b pve-ha-lrm[4024]: unable to write lrm status file - unable to open file '/etc/pve/nodes/xinfvirtrc23b/lrm_status.tmp.4024' - Device or resource busy
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [QB ] withdrawing server sockets
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [SERV ] Service engine unloaded: corosync profile loading service
Dec 29 19:14:54 xinfvirtrc23b pvesr[15994]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [SERV ] Service engine unloaded: corosync resource monitoring service
Dec 29 19:14:54 xinfvirtrc23b corosync[29232]: [SERV ] Service engine unloaded: corosync watchdog service
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [quorum] crit: quorum_initialize failed: 2
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [quorum] crit: can't initialize service
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [confdb] crit: cmap_initialize failed: 2
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [confdb] crit: can't initialize service
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [dcdb] notice: start cluster connection
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [dcdb] crit: cpg_initialize failed: 2
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [dcdb] crit: can't initialize service
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [status] notice: start cluster connection
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [status] crit: cpg_initialize failed: 2
Dec 29 19:14:54 xinfvirtrc23b pmxcfs[43531]: [status] crit: can't initialize service
Dec 29 19:14:55 xinfvirtrc23b corosync[29232]: [MAIN ] Corosync Cluster Engine exiting normally
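To read a log like the one above at a glance, a small script can extract the token-warning delays and the KNET link down/up events per host ID. This is a hypothetical helper, not part of Proxmox or corosync. Its regular expressions assume the line format shown above and tolerate variable whitespace inside bracketed tags such as `[KNET ]`.

```python
import re
import sys
from collections import defaultdict

# Hypothetical helper: patterns match the corosync line format shown in the log.
# \s* tolerates variable padding inside the bracketed subsystem tags.
TIME_RE = re.compile(r"^\w{3} +\d+ (\d{2}:\d{2}:\d{2})")
TOKEN_RE = re.compile(r"\[TOTEM\s*\] Token has not been received in (\d+) ms")
LINK_DOWN_RE = re.compile(r"\[KNET\s*\] link: host: (\d+) link: \d+ is down")
LINK_UP_RE = re.compile(r"\[KNET\s*\] rx: host: (\d+) link: \d+ is up")


def summarize(lines):
    """Collect token-warning delays and per-host KNET link events.

    Returns (delays_ms, events), where events maps a host ID to a list of
    (timestamp, "down" | "up") tuples in log order.
    """
    delays_ms = []
    events = defaultdict(list)
    for line in lines:
        ts_match = TIME_RE.match(line)
        ts = ts_match.group(1) if ts_match else None
        token = TOKEN_RE.search(line)
        down = LINK_DOWN_RE.search(line)
        up = LINK_UP_RE.search(line)
        if token:
            delays_ms.append(int(token.group(1)))
        elif down:
            events[int(down.group(1))].append((ts, "down"))
        elif up:
            events[int(up.group(1))].append((ts, "up"))
    return delays_ms, dict(events)


# Only runs when a log file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        delays, host_events = summarize(f)
    print("token warnings (ms):", delays)
    for host, evs in sorted(host_events.items()):
        print(f"host {host}: " + ", ".join(f"{t} {state}" for t, state in evs))
```

Usage (script name hypothetical): save the corosync log to a file, e.g. from `journalctl -u corosync`, then run `python3 summarize_corosync.py corosync.log`.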
proxmox 6.3
kernel: 5.4.78-2-pve
Corosync 3.0.4
libknet1/stable,now 1.16-pve1 amd64
Code:
Cluster information
-------------------
Name:             kubeRCbess
Config Version:   42
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Dec 29 20:04:02 2020
Quorum provider:  corosync_votequorum
Nodes:            13
Node ID:          0x00000010
Ring ID:          1.70ce
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   16
Highest expected: 16
Total votes:      13
Quorum:           9
Flags:            Quorate

Membership information
----------------------
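The quorum figures in the `pvecm status` output above follow votequorum's majority rule. Assuming one vote per node, quorum is `expected_votes // 2 + 1`. The sketch below is a simplified model that ignores options such as `wait_for_all`, `last_man_standing`, or a QDevice. The function names are hypothetical.

```python
def votequorum_quorum(expected_votes: int) -> int:
    """Simplified votequorum rule: quorum is a strict majority of expected votes."""
    return expected_votes // 2 + 1


def is_quorate(total_votes: int, expected_votes: int) -> bool:
    """True if the votes currently present reach the quorum threshold."""
    return total_votes >= votequorum_quorum(expected_votes)


if __name__ == "__main__":
    # Values taken from the pvecm status output above.
    expected, total = 16, 13
    q = votequorum_quorum(expected)
    print(f"quorum={q} quorate={is_quorate(total, expected)} spare_votes={total - q}")
    # prints: quorum=9 quorate=True spare_votes=4
```

For the original 13-node cluster, the same rule gives a quorum of 7. With 16 expected votes and only 13 present, the cluster stays quorate but can lose at most 4 more votes.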