corosync sets all links down after a single node reboots

sndex

Sep 10, 2024
I am using a Proxmox cluster with 16 nodes. Everything works fine until I restart a node.

As soon as the node has booted, it rejoins the corosync cluster. Every node then logs several "Token has not been received in 9075 ms" messages, until all nodes have lost all of their corosync links.
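
For context, the 9075 ms figure looks consistent with stock corosync timing for 16 nodes. The sketch below assumes the defaults documented in corosync.conf(5) (token 3000 ms, token_coefficient 650 ms, token_warning 75%). These values are assumptions and are not read from my corosync.conf, and the function names are only illustrative.

```python
# Assumed corosync 3.1 defaults; none of these values are taken from my config.
TOKEN_MS = 3000              # totem.token
TOKEN_COEFFICIENT_MS = 650   # totem.token_coefficient
TOKEN_WARNING_PCT = 75       # totem.token_warning


def effective_token_ms(nodes, token=TOKEN_MS, coefficient=TOKEN_COEFFICIENT_MS):
    """Runtime token timeout; the coefficient only applies from 3 nodes upward."""
    if nodes < 3:
        return token
    return token + (nodes - 2) * coefficient


def warning_threshold_ms(nodes, warning_pct=TOKEN_WARNING_PCT):
    """Time without a token after which the TOTEM warning is logged."""
    return effective_token_ms(nodes) * warning_pct // 100


print(effective_token_ms(16))    # 12100
print(warning_threshold_ms(16))  # 9075
```

If that assumption holds, the warning appears at 75% of a 12100 ms token timeout, so each message means the token was late but had not necessarily been declared lost yet.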

As a workaround, I run "systemctl stop corosync" on every node and then start the service again, one node at a time, at 15-second intervals. After that, the corosync connection is stable again. Occasionally, I also have to run "systemctl restart pve-cluster" on a single node.
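
The workaround can be sketched as a small script. This is only an illustration of the sequence described above, not a tested tool: the host names are hypothetical, passwordless root SSH to every node is an assumption, and the default is a dry run that only prints the steps.

```python
import subprocess
import time

INTERVAL_S = 15  # gap between service starts, as in the workaround


def build_plan(nodes, interval_s=INTERVAL_S):
    """Return (delay_before_s, node, command) steps.

    corosync is stopped on every node first, then started node by node.
    """
    plan = [(0, node, "systemctl stop corosync") for node in nodes]
    for i, node in enumerate(nodes):
        delay = 0 if i == 0 else interval_s
        plan.append((delay, node, "systemctl start corosync"))
    return plan


def run_plan(plan, dry_run=True):
    """Print the steps, or execute them over SSH when dry_run is False."""
    for delay, node, command in plan:
        if dry_run:
            print(f"wait {delay}s, then on {node}: {command}")
            continue
        time.sleep(delay)
        # Assumption: passwordless root SSH to each node is available.
        subprocess.run(["ssh", f"root@{node}", command], check=True)


if __name__ == "__main__":
    nodes = [f"pve{n:02d}" for n in range(1, 4)]  # hypothetical host names
    run_plan(build_plan(nodes))  # dry run: prints the steps only
```

Building the plan separately from running it makes it possible to check the order and delays before anything touches the cluster.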

Do you know how to solve this problem?

logs:
08:07:35 [KNET  ] link: Resetting MTU for link 0 because host 3 joined
08:07:35 [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
08:07:35 [QUORUM] Sync members[16]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17
08:07:35 [QUORUM] Sync joined[1]: 3
08:07:35 [TOTEM ] A new membership (1.ad34) was formed. Members joined: 3
08:07:35 [dcdb] notice: members: 1/1848437, 2/3371638, 3/2568, 4/734264, 5/2041, 6/3143026, 7/2732651, 8/2480799, 9/487855, 10/1903557, 11/3370801, 12/857093, 13/7129, 14/1355, 16/2791, 17/2420
08:07:35 [dcdb] notice: starting data syncronisation
08:07:35 [status] notice: members: 1/1848437, 2/3371638, 3/2568, 4/734264, 5/2041, 6/3143026, 7/2732651, 8/2480799, 9/487855, 10/1903557, 11/3370801, 12/857093, 13/7129, 14/1355, 16/2791, 17/2420
08:07:35 [status] notice: starting data syncronisation
08:07:36 [KNET  ] pmtud: Global data MTU changed to: 1397
08:07:36 [QUORUM] Members[16]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17
08:07:36 [MAIN  ] Completed service synchronization, ready to provide service.
08:07:36 [dcdb] notice: received sync request (epoch 1/1848437/0000006D)
08:07:36 [status] notice: received sync request (epoch 1/1848437/0000006E)
08:08:05 [TOTEM ] Token has not been received in 9075 ms
08:08:21 [TOTEM ] Token has not been received in 9075 ms
08:08:39 [TOTEM ] Token has not been received in 9075 ms
08:08:59 [TOTEM ] Token has not been received in 9075 ms
08:09:06 [KNET  ] link: host: 13 link: 0 is down
08:09:06 [KNET  ] link: host: 3 link: 0 is down
08:09:06 [KNET  ] link: host: 4 link: 0 is down
08:09:06 [KNET  ] link: host: 12 link: 0 is down
08:09:06 [KNET  ] link: host: 7 link: 0 is down
08:09:06 [KNET  ] link: host: 10 link: 0 is down
08:09:06 [KNET  ] link: host: 6 link: 0 is down
08:09:06 [KNET  ] link: host: 17 link: 0 is down
08:09:06 [KNET  ] link: host: 5 link: 0 is down
08:09:06 [KNET  ] link: host: 9 link: 0 is down
08:09:06 [KNET  ] link: host: 8 link: 0 is down
08:09:06 [KNET  ] link: host: 16 link: 0 is down
08:09:06 [KNET  ] link: host: 1 link: 0 is down
08:09:06 [KNET  ] link: host: 14 link: 0 is down
08:09:06 [KNET  ] host: host: 13 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 13 has no active links
08:09:06 [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 3 has no active links
08:09:06 [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 4 has no active links
08:09:06 [KNET  ] host: host: 12 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 12 has no active links
08:09:06 [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 7 has no active links
08:09:06 [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 10 has no active links
08:09:06 [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 6 has no active links
08:09:06 [KNET  ] host: host: 17 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 17 has no active links
08:09:06 [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 5 has no active links
08:09:06 [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 9 has no active links
08:09:06 [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 8 has no active links
08:09:06 [KNET  ] host: host: 16 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 16 has no active links
08:09:06 [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 1 has no active links
08:09:06 [KNET  ] host: host: 14 (passive) best link: 0 (pri: 1)
08:09:06 [KNET  ] host: host: 14 has no active links
08:09:08 [KNET  ] link: Resetting MTU for link 0 because host 1 joined
08:09:08 [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
08:09:08 [KNET  ] pmtud: Global data MTU changed to: 1397
08:09:12 [KNET  ] rx: host: 4 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 4 joined
08:09:12 [KNET  ] rx: host: 13 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 13 joined
08:09:12 [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] host: host: 13 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] rx: host: 10 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 10 joined
08:09:12 [KNET  ] rx: host: 17 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 17 joined
08:09:12 [KNET  ] rx: host: 6 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 6 joined
08:09:12 [KNET  ] rx: host: 3 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 3 joined
08:09:12 [KNET  ] rx: host: 7 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 7 joined
08:09:12 [KNET  ] rx: host: 5 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 5 joined
08:09:12 [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] host: host: 17 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] rx: host: 8 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 8 joined
08:09:12 [KNET  ] rx: host: 14 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 14 joined
08:09:12 [KNET  ] rx: host: 16 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 16 joined
08:09:12 [KNET  ] rx: host: 9 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 9 joined
08:09:12 [KNET  ] rx: host: 12 link: 0 is up
08:09:12 [KNET  ] link: Resetting MTU for link 0 because host 12 joined
08:09:12 [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] host: host: 14 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] host: host: 16 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] host: host: 12 (passive) best link: 0 (pri: 1)
08:09:12 [KNET  ] pmtud: Global data MTU changed to: 1397
08:09:17 [TOTEM ] Token has not been received in 9075 ms
08:09:32 [TOTEM ] Token has not been received in 9075 ms
08:09:53 [TOTEM ] Token has not been received in 9075 ms
08:10:11 [TOTEM ] Token has not been received in 9075 ms
[...]
08:28:45 [quorum] crit: quorum_initialize failed: 2
08:28:45 [quorum] crit: can't initialize service
08:28:45 [confdb] crit: cmap_initialize failed: 2
08:28:45 [confdb] crit: can't initialize service
08:28:45 [dcdb] crit: cpg_initialize failed: 2
08:28:45 [dcdb] crit: can't initialize service
08:28:45 [status] crit: cpg_initialize failed: 2
08:28:45 [status] crit: can't initialize service
08:28:46 Starting corosync.service - Corosync Cluster Engine...
08:28:46 [MAIN  ] Corosync Cluster Engine  starting up
08:28:46 [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
08:28:46 [TOTEM ] Initializing transport (Kronosnet).
08:28:47 [TOTEM ] totemknet initialized
08:28:47 [KNET  ] pmtud: MTU manually set to: 0
08:28:47 [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
08:28:47 [SERV  ] Service engine loaded: corosync configuration map access [0]
08:28:47 [QB    ] server name: cmap
08:28:47 [SERV  ] Service engine loaded: corosync configuration service [1]
08:28:47 [QB    ] server name: cfg
08:28:47 [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
08:28:47 [QB    ] server name: cpg
08:28:47 [SERV  ] Service engine loaded: corosync profile loading service [4]
08:28:47 [SERV  ] Service engine loaded: corosync resource monitoring service [6]
08:28:47 [WD    ] Watchdog not enabled by configuration
08:28:47 [WD    ] resource load_15min missing a recovery key.
08:28:47 [WD    ] resource memory_used missing a recovery key.
08:28:47 [WD    ] no resources configured.
08:28:47 [SERV  ] Service engine loaded: corosync watchdog service [7]
08:28:47 [QUORUM] Using quorum provider corosync_votequorum
08:28:47 [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
08:28:47 [QB    ] server name: votequorum
08:28:47 [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
08:28:47 [QB    ] server name: quorum
08:28:47 [TOTEM ] Configuring link 0
08:28:47 [TOTEM ] Configured link number 0: local addr: 192.168.2.199, port=5405
08:28:47 [KNET  ] host: host: 14 (passive) best link: 0 (pri: 0)
08:28:47 [KNET  ] host: host: 14 has no active links
08:28:47 [KNET  ] host: host: 14 (passive) best link: 0 (pri: 1)
08:28:47 [KNET  ] host: host: 14 has no active links
08:28:47 [KNET  ] host: host: 14 (passive) best link: 0 (pri: 1)
08:28:47 [KNET  ] host: host: 14 has no active links
[...]
08:28:47 [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
08:28:47 [KNET  ] host: host: 5 has no active links
08:28:47 Started corosync.service - Corosync Cluster Engine.
08:28:47 [QUORUM] Sync members[1]: 2
08:28:47 [QUORUM] Sync joined[1]: 2
08:28:47 [TOTEM ] A new membership (2.ae09) was formed. Members joined: 2
08:28:47 [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
08:28:47 [KNET  ] host: host: 5 has no active links
08:28:47 [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
08:28:47 [KNET  ] host: host: 5 has no active links
08:28:47 [KNET  ] link: Resetting MTU for link 0 because host 2 joined
08:28:47 [KNET  ] host: host: 17 (passive) best link: 0 (pri: 1)
08:28:47 [KNET  ] host: host: 17 has no active links
[...]
08:28:47 [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
08:28:47 [KNET  ] host: host: 10 has no active links
08:28:47 [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
08:28:47 [QUORUM] Members[1]: 2
08:28:47 [MAIN  ] Completed service synchronization, ready to provide service.
08:28:47 [KNET  ] host: host: 7 has no active links
08:28:47 [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
08:28:47 [KNET  ] host: host: 7 has no active links
08:28:47 [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
08:28:47 [KNET  ] host: host: 7 has no active links
08:28:47 [KNET  ] host: host: 12 (passive) best link: 0 (pri: 1)
[...]
08:28:47 [KNET  ] host: host: 11 has no active links
08:28:47 [KNET  ] host: host: 11 (passive) best link: 0 (pri: 1)
08:28:47 [KNET  ] host: host: 11 has no active links
08:28:51 [status] notice: update cluster info (cluster name  ProxmoxCluster1, version = 24)
08:28:51 [dcdb] notice: members: 2/2846234
08:28:51 [dcdb] notice: all data is up to date
08:28:51 [status] notice: members: 2/2846234
08:28:51 [status] notice: all data is up to date
08:28:53 [KNET  ] rx: host: 11 link: 0 is up
08:28:53 [KNET  ] link: Resetting MTU for link 0 because host 11 joined
08:28:53 [KNET  ] rx: host: 4 link: 0 is up
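
To make a log like the one above easier to scan, the token warnings and link state changes can be counted with a few lines of Python. This is a minimal sketch that assumes the "HH:MM:SS [SUBSYS] message" layout shown here; the regular expressions and the function name are my own and not part of corosync.

```python
import re
from collections import Counter

# Matches e.g. "08:09:06 [KNET  ] link: host: 13 link: 0 is down"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) \[(\w+)\s*\] (.*)$")
# Matches the link state part: "host: 13 link: 0 is down"
LINK_RE = re.compile(r"host: (\d+) link: (\d+) is (up|down)")


def summarize(lines):
    """Return (timestamps of token warnings, Counter of link up/down events)."""
    token_warnings = []
    link_events = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # e.g. systemd lines without a [SUBSYS] tag
        timestamp, subsystem, message = match.groups()
        if subsystem == "TOTEM" and message.startswith("Token has not been received"):
            token_warnings.append(timestamp)
        link = LINK_RE.search(message)
        if subsystem == "KNET" and link:
            link_events[link.group(3)] += 1
    return token_warnings, link_events


sample = [
    "08:08:59 [TOTEM ] Token has not been received in 9075 ms",
    "08:09:06 [KNET  ] link: host: 13 link: 0 is down",
    "08:09:06 [KNET  ] host: host: 13 has no active links",
    "08:09:12 [KNET  ] rx: host: 4 link: 0 is up",
]
warnings, events = summarize(sample)
print(warnings, dict(events))
```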
 
