I am using a Proxmox cluster with 16 nodes. Everything works fine until I restart a node.
As soon as this node boots up again, it rejoins the corosync cluster. After that, I receive several "Token has not been received in 9075 ms" messages on each node until all nodes disconnect from every corosync connection.
As a workaround, I run "systemctl stop corosync" on each node and then restart the service node by node at 15-second intervals. The corosync connection becomes stable again after that. Occasionally, I also have to run "systemctl restart pve-cluster" on a single node.
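To make the workaround reproducible, here is a minimal sketch of the order of operations, assuming SSH access as root to every node. The node names (pve01, ...) and the function name staggered_restart_plan are hypothetical, and I am assuming "systemctl start corosync" for bringing the service back after the stop. The script only prints the plan; it does not execute anything.

```python
from typing import List, Tuple


def staggered_restart_plan(
    nodes: List[str], interval_s: int = 15
) -> List[Tuple[int, str, str]]:
    """Build (delay_seconds, node, command) steps.

    Phase 1 stops corosync on every node (delay 0).
    Phase 2 starts it again one node at a time, spaced by interval_s;
    the delay is measured from the moment all stops have completed.
    """
    plan = [(0, node, "systemctl stop corosync") for node in nodes]
    for index, node in enumerate(nodes):
        plan.append((index * interval_s, node, "systemctl start corosync"))
    return plan


if __name__ == "__main__":
    # Hypothetical node names; replace with the real cluster members.
    for delay, node, command in staggered_restart_plan(["pve01", "pve02", "pve03"]):
        print(f"t+{delay:>3}s  ssh root@{node} {command}")
```

With 16 nodes and a 15-second interval, the start phase alone takes 15 * 15 = 225 seconds.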
Do you know how to solve this problem?
logs:
08:07:35 [KNET ] link: Resetting MTU for link 0 because host 3 joined
08:07:35 [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
08:07:35 [QUORUM] Sync members[16]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17
08:07:35 [QUORUM] Sync joined[1]: 3
08:07:35 [TOTEM ] A new membership (1.ad34) was formed. Members joined: 3
08:07:35 [dcdb] notice: members: 1/1848437, 2/3371638, 3/2568, 4/734264, 5/2041, 6/3143026, 7/2732651, 8/2480799, 9/487855, 10/1903557, 11/3370801, 12/857093, 13/7129, 14/1355, 16/2791, 17/2420
08:07:35 [dcdb] notice: starting data syncronisation
08:07:35 [status] notice: members: 1/1848437, 2/3371638, 3/2568, 4/734264, 5/2041, 6/3143026, 7/2732651, 8/2480799, 9/487855, 10/1903557, 11/3370801, 12/857093, 13/7129, 14/1355, 16/2791, 17/2420
08:07:35 [status] notice: starting data syncronisation
08:07:36 [KNET ] pmtud: Global data MTU changed to: 1397
08:07:36 [QUORUM] Members[16]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17
08:07:36 [MAIN ] Completed service synchronization, ready to provide service.
08:07:36 [dcdb] notice: received sync request (epoch 1/1848437/0000006D)
08:07:36 [status] notice: received sync request (epoch 1/1848437/0000006E)
08:08:05 [TOTEM ] Token has not been received in 9075 ms
08:08:21 [TOTEM ] Token has not been received in 9075 ms
08:08:39 [TOTEM ] Token has not been received in 9075 ms
08:08:59 [TOTEM ] Token has not been received in 9075 ms
08:09:06 [KNET ] link: host: 13 link: 0 is down
08:09:06 [KNET ] link: host: 3 link: 0 is down
08:09:06 [KNET ] link: host: 4 link: 0 is down
08:09:06 [KNET ] link: host: 12 link: 0 is down
08:09:06 [KNET ] link: host: 7 link: 0 is down
08:09:06 [KNET ] link: host: 10 link: 0 is down
08:09:06 [KNET ] link: host: 6 link: 0 is down
08:09:06 [KNET ] link: host: 17 link: 0 is down
08:09:06 [KNET ] link: host: 5 link: 0 is down
08:09:06 [KNET ] link: host: 9 link: 0 is down
08:09:06 [KNET ] link: host: 8 link: 0 is down
08:09:06 [KNET ] link: host: 16 link: 0 is down
08:09:06 [KNET ] link: host: 1 link: 0 is down
08:09:06 [KNET ] link: host: 14 link: 0 is down
08:09:06 [KNET ] host: host: 13 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 13 has no active links
08:09:06 [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 3 has no active links
08:09:06 [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 4 has no active links
08:09:06 [KNET ] host: host: 12 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 12 has no active links
08:09:06 [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 7 has no active links
08:09:06 [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 10 has no active links
08:09:06 [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 6 has no active links
08:09:06 [KNET ] host: host: 17 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 17 has no active links
08:09:06 [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 5 has no active links
08:09:06 [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 9 has no active links
08:09:06 [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 8 has no active links
08:09:06 [KNET ] host: host: 16 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 16 has no active links
08:09:06 [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 1 has no active links
08:09:06 [KNET ] host: host: 14 (passive) best link: 0 (pri: 1)
08:09:06 [KNET ] host: host: 14 has no active links
08:09:08 [KNET ] link: Resetting MTU for link 0 because host 1 joined
08:09:08 [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
08:09:08 [KNET ] pmtud: Global data MTU changed to: 1397
08:09:12 [KNET ] rx: host: 4 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 4 joined
08:09:12 [KNET ] rx: host: 13 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 13 joined
08:09:12 [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 13 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] rx: host: 10 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 10 joined
08:09:12 [KNET ] rx: host: 17 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 17 joined
08:09:12 [KNET ] rx: host: 6 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 6 joined
08:09:12 [KNET ] rx: host: 3 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 3 joined
08:09:12 [KNET ] rx: host: 7 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 7 joined
08:09:12 [KNET ] rx: host: 5 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 5 joined
08:09:12 [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 17 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] rx: host: 8 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 8 joined
08:09:12 [KNET ] rx: host: 14 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 14 joined
08:09:12 [KNET ] rx: host: 16 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 16 joined
08:09:12 [KNET ] rx: host: 9 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 9 joined
08:09:12 [KNET ] rx: host: 12 link: 0 is up
08:09:12 [KNET ] link: Resetting MTU for link 0 because host 12 joined
08:09:12 [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 14 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 16 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] host: host: 12 (passive) best link: 0 (pri: 1)
08:09:12 [KNET ] pmtud: Global data MTU changed to: 1397
08:09:17 [TOTEM ] Token has not been received in 9075 ms
08:09:32 [TOTEM ] Token has not been received in 9075 ms
08:09:53 [TOTEM ] Token has not been received in 9075 ms
08:10:11 [TOTEM ] Token has not been received in 9075 ms
[...]
08:28:45 [quorum] crit: quorum_initialize failed: 2
08:28:45 [quorum] crit: can't initialize service
08:28:45 [confdb] crit: cmap_initialize failed: 2
08:28:45 [confdb] crit: can't initialize service
08:28:45 [dcdb] crit: cpg_initialize failed: 2
08:28:45 [dcdb] crit: can't initialize service
08:28:45 [status] crit: cpg_initialize failed: 2
08:28:45 [status] crit: can't initialize service
08:28:46 Starting corosync.service - Corosync Cluster Engine...
08:28:46 [MAIN ] Corosync Cluster Engine starting up
08:28:46 [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
08:28:46 [TOTEM ] Initializing transport (Kronosnet).
08:28:47 [TOTEM ] totemknet initialized
08:28:47 [KNET ] pmtud: MTU manually set to: 0
08:28:47 [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
08:28:47 [SERV ] Service engine loaded: corosync configuration map access [0]
08:28:47 [QB ] server name: cmap
08:28:47 [SERV ] Service engine loaded: corosync configuration service [1]
08:28:47 [QB ] server name: cfg
08:28:47 [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
08:28:47 [QB ] server name: cpg
08:28:47 [SERV ] Service engine loaded: corosync profile loading service [4]
08:28:47 [SERV ] Service engine loaded: corosync resource monitoring service [6]
08:28:47 [WD ] Watchdog not enabled by configuration
08:28:47 [WD ] resource load_15min missing a recovery key.
08:28:47 [WD ] resource memory_used missing a recovery key.
08:28:47 [WD ] no resources configured.
08:28:47 [SERV ] Service engine loaded: corosync watchdog service [7]
08:28:47 [QUORUM] Using quorum provider corosync_votequorum
08:28:47 [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
08:28:47 [QB ] server name: votequorum
08:28:47 [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
08:28:47 [QB ] server name: quorum
08:28:47 [TOTEM ] Configuring link 0
08:28:47 [TOTEM ] Configured link number 0: local addr: 192.168.2.199, port=5405
08:28:47 [KNET ] host: host: 14 (passive) best link: 0 (pri: 0)
08:28:47 [KNET ] host: host: 14 has no active links
08:28:47 [KNET ] host: host: 14 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 14 has no active links
08:28:47 [KNET ] host: host: 14 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 14 has no active links
[...]
08:28:47 [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 5 has no active links
08:28:47 Started corosync.service - Corosync Cluster Engine.
08:28:47 [QUORUM] Sync members[1]: 2
08:28:47 [QUORUM] Sync joined[1]: 2
08:28:47 [TOTEM ] A new membership (2.ae09) was formed. Members joined: 2
08:28:47 [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 5 has no active links
08:28:47 [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 5 has no active links
08:28:47 [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:28:47 [KNET ] host: host: 17 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 17 has no active links
[...]
08:28:47 [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 10 has no active links
08:28:47 [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
08:28:47 [QUORUM] Members[1]: 2
08:28:47 [MAIN ] Completed service synchronization, ready to provide service.
08:28:47 [KNET ] host: host: 7 has no active links
08:28:47 [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 7 has no active links
08:28:47 [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 7 has no active links
08:28:47 [KNET ] host: host: 12 (passive) best link: 0 (pri: 1)
[...]
08:28:47 [KNET ] host: host: 11 has no active links
08:28:47 [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
08:28:47 [KNET ] host: host: 11 has no active links
08:28:51 [status] notice: update cluster info (cluster name ProxmoxCluster1, version = 24)
08:28:51 [dcdb] notice: members: 2/2846234
08:28:51 [dcdb] notice: all data is up to date
08:28:51 [status] notice: members: 2/2846234
08:28:51 [status] notice: all data is up to date
08:28:53 [KNET ] rx: host: 11 link: 0 is up
08:28:53 [KNET ] link: Resetting MTU for link 0 because host 11 joined
08:28:53 [KNET ] rx: host: 4 link: 0 is up
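
To compare nodes, the log above can be condensed into a few numbers. This is a minimal sketch, assuming the log has been saved as text. The two patterns are copied from the messages in this log; the function name summarize is hypothetical. It searches the whole text with regular expressions, so it does not depend on how the entries are separated.

```python
import re
from collections import Counter

# Patterns copied from the corosync messages shown in the log.
TOKEN_RE = re.compile(r"Token has not been received in (\d+) ms")
LINK_DOWN_RE = re.compile(r"link: host: (\d+) link: (\d+) is down")


def summarize(log_text: str) -> dict:
    """Count token timeouts and list the hosts whose knet link went down."""
    timeouts = [int(ms) for ms in TOKEN_RE.findall(log_text)]
    down_hosts = Counter(int(host) for host, _link in LINK_DOWN_RE.findall(log_text))
    return {
        "token_timeouts": len(timeouts),
        "timeout_ms": sorted(set(timeouts)),
        "hosts_link_down": sorted(down_hosts),
    }


if __name__ == "__main__":
    sample = (
        "08:08:59 [TOTEM ] Token has not been received in 9075 ms\n"
        "08:09:06 [KNET ] link: host: 13 link: 0 is down\n"
        "08:09:06 [KNET ] link: host: 3 link: 0 is down\n"
    )
    print(summarize(sample))
```

Running this on the log of every node would show whether all nodes lose the same peers at the same second, as in the 08:09:06 burst above.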