I am using a Proxmox cluster with 16 nodes. Everything works fine until I restart a node.
As soon as the node boots up again, it rejoins the corosync cluster. After that, every node logs repeated "Token has not been received in 9075 ms" messages until all nodes have lost their corosync links to each other.
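For reference, the 9075 ms figure looks consistent with an untuned totem configuration on 16 nodes. The sketch below is only arithmetic, assuming the defaults described in corosync.conf(5) (token = 3000 ms, token_coefficient = 650 ms, token_warning = 75 %); my actual corosync.conf may differ, and the function names are made up for illustration.

```python
# Assumed defaults from corosync.conf(5); a cluster's own config may override them.
TOKEN_MS = 3000              # totem.token (assumed default)
TOKEN_COEFFICIENT_MS = 650   # totem.token_coefficient (assumed default)
TOKEN_WARNING_PCT = 75       # totem.token_warning (assumed default)


def effective_token_timeout_ms(nodes: int) -> int:
    """Effective timeout: token + (nodes - 2) * token_coefficient."""
    return TOKEN_MS + max(nodes - 2, 0) * TOKEN_COEFFICIENT_MS


def warning_threshold_ms(nodes: int) -> int:
    """Time without a token after which the warning is logged."""
    return effective_token_timeout_ms(nodes) * TOKEN_WARNING_PCT // 100


if __name__ == "__main__":
    print(effective_token_timeout_ms(16))  # 12100
    print(warning_threshold_ms(16))        # 9075
```

If those assumptions hold, the warning means the token was missing for 75 % of a 12100 ms timeout, so the messages by themselves do not point to a custom token setting.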
As a workaround, I run "systemctl stop corosync" on every node and then start the service again node by node, at 15-second intervals. The corosync connection becomes stable again after that. Occasionally, I also have to run "systemctl restart pve-cluster" on a single node.
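The workaround can be sketched roughly as follows. This is a minimal illustration, not a tested tool: the host names in NODES are hypothetical, root SSH access from the machine running the script to every node is an assumption, and by default the script only prints the commands instead of executing them.

```python
import subprocess
import time

# Hypothetical host names; replace with the real node names of the cluster.
NODES = [f"pve{i:02d}" for i in range(1, 17)]
START_DELAY_S = 15  # interval between starts, as in the manual workaround


def ssh_cmd(node: str, action: str) -> list[str]:
    """Build the ssh command that runs 'systemctl <action> corosync' on a node."""
    return ["ssh", f"root@{node}", "systemctl", action, "corosync"]


def build_plan(nodes, delay_s=START_DELAY_S):
    """Return (command, pause_after_seconds) pairs: stop everywhere, then staggered starts."""
    plan = [(ssh_cmd(n, "stop"), 0) for n in nodes]
    plan += [(ssh_cmd(n, "start"), delay_s) for n in nodes]
    return plan


def run(plan, dry_run=True):
    """Print every command; execute it and wait only when dry_run is False."""
    for cmd, pause in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
            time.sleep(pause)


if __name__ == "__main__":
    run(build_plan(NODES))  # dry run: only prints the commands
```

Calling run(build_plan(NODES), dry_run=False) would actually execute the sequence, so it should only be used once the printed plan looks right.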
Do you know how to solve this problem?
logs:
08:07:35 [KNET ] link: Resetting MTU for link 0 because host 3 joined
08:07:35 [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
08:07:35 [QUORUM] Sync members[16]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17
08:07:35 [QUORUM] Sync joined[1]: 3
08:07:35 [TOTEM ] A new membership (1.ad34) was formed. Members joined: 3
08:07:35 [dcdb] notice: members: 1/1848437, 2/3371638, 3/2568, 4/734264, 5/2041, 6/3143026, 7/2732651, 8/2480799, 9/487855, 10/1903557, 11/3370801, 12/857093, 13/7129, 14/1355, 16/2791, 17/2420
08:07:35 [dcdb] notice: starting data syncronisation
08:07:35 [status] notice: members: 1/1848437, 2/3371638, 3/2568, 4/734264, 5/2041, 6/3143026, 7/2732651, 8/2480799, 9/487855, 10/1903557, 11/3370801, 12/857093, 13/7129, 14/1355, 16/2791, 17/2420
08:07:35 [status] notice: starting data syncronisation
08:07:36 [KNET ] pmtud: Global data MTU changed to: 1397
08:07:36 [QUORUM] Members[16]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17
08:07:36 [MAIN ] Completed service synchronization, ready to provide service.
08:07:36 [dcdb] notice: received sync request (epoch 1/1848437/0000006D)
08:07:36 [status] notice: received sync request (epoch 1/1848437/0000006E)
08:08:05 [TOTEM ] Token has not been received in 9075 ms
08:08:21 [TOTEM ] Token has not been received in 9075 ms
08:08:39 [TOTEM ] Token has not been received in 9075 ms
08:08:59 [TOTEM ] Token has not been received in 9075 ms
08:09:06 [KNET ] link: host: 13 link: 0 is down
08:09:06 [KNET ] link: host: 3 link: 0 is down
08:09:06 [KNET ] link: host: 4 link: 0 is down
08:09:06 [KNET ] link: host: 12 link: 0 is down
08:09:06 [KNET ] link: host: 7 link: 0 is down
08:09:06 [KNET ] link: host: 10 link: 0 is down
08:09:06 [KNET ] link: host: 6 link: 0 is down
08:09:06 [KNET ] link: host: 17 link: 0 is down
08:09:06 [KNET ] link: host: 5 link: 0 is down
08:09:06 [KNET ] link: host: 9 link: 0 is down
08:09:06 [KNET ] link: host: 8 link: 0 is down
08:09:06 [KNET ] link: host: 16 link: 0 is down
08:09:06 [KNET ] link: host: 1 link: 0 is down
08:09:06 [KNET ] link: host: 14 link: 0 is down
08:09:06 [KNET ] host: host: 13 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 13 has no active links
08:09:06 [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 3 has no active links
08:09:06 [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 4 has no active links
08:09:06 [KNET ] host: host: 12 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 12 has no active links
08:09:06 [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 7 has no active links
08:09:06 [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 10 has no active links
08:09:06 [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 6 has no active links
08:09:06 [KNET ] host: host: 17 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 17 has no active links
08:09:06 [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 5 has no active links
08:09:06 [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 9 has no active links
08:09:06 [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 8 has no active links
08:09:06 [KNET ] host: host: 16 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 16 has no active links
08:09:06 [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 1 has no active links
08:09:06 [KNET ] host: host: 14 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 14 has no active links
08:09:08 [KNET ] link: Resetting MTU for link 0 because host 1 joined
08:09:08 [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
08:09:08 [KNET ] pmtud: Global data MTU changed to: 1397
08:09:12 [KNET ] rx: host: 4 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 4 joined
08:09:12 [KNET ] rx: host: 13 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 13 joined
08:09:12 [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 13 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] rx: host: 10 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 10 joined
08:09:12 [KNET ] rx: host: 17 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 17 joined
08:09:12 [KNET ] rx: host: 6 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 6 joined
08:09:12 [KNET ] rx: host: 3 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 3 joined
08:09:12 [KNET ] rx: host: 7 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 7 joined
08:09:12 [KNET ] rx: host: 5 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 5 joined
08:09:12 [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 17 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] rx: host: 8 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 8 joined
08:09:12 [KNET ] rx: host: 14 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 14 joined
08:09:12 [KNET ] rx: host: 16 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 16 joined
08:09:12 [KNET ] rx: host: 9 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 9 joined
08:09:12 [KNET ] rx: host: 12 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 12 joined
08:09:12 [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 14 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 16 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 12 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] pmtud: Global data MTU changed to: 1397
08:09:17 [TOTEM ] Token has not been received in 9075 ms
08:09:32 [TOTEM ] Token has not been received in 9075 ms
08:09:53 [TOTEM ] Token has not been received in 9075 ms
08:10:11 [TOTEM ] Token has not been received in 9075 ms
[...]
08:28:45 [quorum] crit: quorum_initialize failed: 2
08:28:45 [quorum] crit: can't initialize service
08:28:45 [confdb] crit: cmap_initialize failed: 2
08:28:45 [confdb] crit: can't initialize service
08:28:45 [dcdb] crit: cpg_initialize failed: 2
08:28:45 [dcdb] crit: can't initialize service
08:28:45 [status] crit: cpg_initialize failed: 2
08:28:45 [status] crit: can't initialize service
08:28:46 Starting corosync.service - Corosync Cluster Engine...
08:28:46 [MAIN ] Corosync Cluster Engine starting up
08:28:46 [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
08:28:46 [TOTEM ] Initializing transport (Kronosnet).
08:28:47 [TOTEM ] totemknet initialized
08:28:47 [KNET ] pmtud: MTU manually set to: 0
08:28:47 [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
08:28:47 [SERV ] Service engine loaded: corosync configuration map access [0]
08:28:47 [QB ] server name: cmap
08:28:47 [SERV ] Service engine loaded: corosync configuration service [1]
08:28:47 [QB ] server name: cfg
08:28:47 [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
08:28:47 [QB ] server name: cpg
08:28:47 [SERV ] Service engine loaded: corosync profile loading service [4]
08:28:47 [SERV ] Service engine loaded: corosync resource monitoring service [6]
08:28:47 [WD ] Watchdog not enabled by configuration
08:28:47 [WD ] resource load_15min missing a recovery key.
08:28:47 [WD ] resource memory_used missing a recovery key.
08:28:47 [WD ] no resources configured.
08:28:47 [SERV ] Service engine loaded: corosync watchdog service [7]
08:28:47 [QUORUM] Using quorum provider corosync_votequorum
08:28:47 [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
08:28:47 [QB ] server name: votequorum
08:28:47 [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
08:28:47 [QB ] server name: quorum
08:28:47 [TOTEM ] Configuring link 0
08:28:47 [TOTEM ] Configured link number 0: local addr: 192.168.2.199, port=5405
08:28:47 [KNET ] host: host: 14 (passive) best link: 0 (pri: 0)
08:28:47 [KNET ] host: host: 14 has no active links
08:28:47 [KNET ] host: host: 14 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 14 has no active links
08:28:47 [KNET ] host: host: 14 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 14 has no active links
[...]
08:28:47 [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 5 has no active links
08:28:47 Started corosync.service - Corosync Cluster Engine.
08:28:47 [QUORUM] Sync members[1]: 2
08:28:47 [QUORUM] Sync joined[1]: 2
08:28:47 [TOTEM ] A new membership (2.ae09) was formed. Members joined: 2
08:28:47 [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 5 has no active links
08:28:47 [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 5 has no active links
08:28:47 [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:28:47 [KNET ] host: host: 17 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 17 has no active links
[...]
08:28:47 [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 10 has no active links
08:28:47 [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
08:28:47 [QUORUM] Members[1]: 2
08:28:47 [MAIN ] Completed service synchronization, ready to provide service.
08:28:47 [KNET ] host: host: 7 has no active links
08:28:47 [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 7 has no active links
08:28:47 [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 7 has no active links
08:28:47 [KNET ] host: host: 12 (passive) best link: 0 (pri: 1)
[...]
08:28:47 [KNET ] host: host: 11 has no active links
08:28:47 [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 11 has no active links
08:28:51 [status] notice: update cluster info (cluster name ProxmoxCluster1, version = 24)
08:28:51 [dcdb] notice: members: 2/2846234
08:28:51 [dcdb] notice: all data is up to date
08:28:51 [status] notice: members: 2/2846234
08:28:51 [status] notice: all data is up to date
08:28:53 [KNET ] rx: host: 11 link: 0 is up
08:28:53 [KNET ] link: Resetting MTU for link 0 because host 11 joined
08:28:53 [KNET ] rx: host: 4 link: 0 is up