Issues with corosync

carlosmp

Hi - recently our 3-node cluster started bouncing (rebooting) unstoppably. I have manually stopped watchdog-mux, which seems to be the only way I could get our cluster to stop bouncing.

I've been trying to see what happened, and it seems the root cause is the corosync token. pvecm status shows:

Cluster information
-------------------
Name: clus21a
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed Oct 1 18:49:16 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.3415
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.25.60.11 (local)
0x00000002 1 172.25.60.12
0x00000003 1 172.25.60.13

Going through the log:

Oct 01 18:34:30 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
Oct 01 18:34:43 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
Oct 01 18:34:44 h21a corosync[22624]: [TOTEM ] A processor failed, forming new configuration: token timed out (5650ms), waiting 2000ms for consensus.
Oct 01 18:34:48 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
Oct 01 18:34:50 h21a corosync[22624]: [TOTEM ] A processor failed, forming new configuration: token timed out (5650ms), waiting 2000ms for consensus.
Oct 01 18:34:56 h21a corosync[22624]: [TOTEM ] Token has not been received in 11693 ms
Oct 01 18:35:04 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
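In case it's useful to anyone reading along, I believe the per-node knet link state can be checked with corosync-cfgtool, something like:

Code:
# show the local node ID and the up/down state of each knet link
corosync-cfgtool -s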

root@h21a:~# corosync-cmapctl | grep members
runtime.members.1.config_version (u64) = 8
runtime.members.1.ip (str) = r(0) ip(172.25.60.11)
runtime.members.1.join_count (u32) = 1
runtime.members.1.status (str) = joined
runtime.members.2.config_version (u64) = 8
runtime.members.2.ip (str) = r(0) ip(172.25.60.12)
runtime.members.2.join_count (u32) = 35
runtime.members.2.status (str) = joined
runtime.members.3.config_version (u64) = 8
runtime.members.3.ip (str) = r(0) ip(172.25.60.13)
runtime.members.3.join_count (u32) = 22
runtime.members.3.status (str) = joined
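The join counts above (35 and 22 for nodes 2 and 3) make me think those nodes have dropped out and rejoined many times. If I understand the logging right, that churn should be countable in the journal, e.g.:

Code:
# count the membership reformations corosync has logged in the last day
journalctl -u corosync --since "24 hours ago" | grep -ci "new membership"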

ha-manager status shows some machines on, others stopped. If I try to remove one from ha-manager so I can start it manually:

root@h21a:~# ha-manager remove 107
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
delete resource failed: cfs-lock 'domain-ha' error: no quorum!
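(If I understand it right, this error is the cluster filesystem refusing writes without quorum; the standard emergency workaround, for reference, would be something like:)

Code:
# EMERGENCY ONLY: lower the expected vote count so this node becomes
# quorate and /etc/pve is writable again; revert (pvecm expected 3)
# once the cluster is healthy
pvecm expected 1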

I can ping the cluster members with no packet loss. I'm at my wits' end trying to see what could be causing our issue.

Any ideas/pointers?

TIA
 
Hi!

Oct 01 18:34:30 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
Oct 01 18:34:43 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
Oct 01 18:34:44 h21a corosync[22624]: [TOTEM ] A processor failed, forming new configuration: token timed out (5650ms), waiting 2000ms for consensus.
Oct 01 18:34:48 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
Oct 01 18:34:50 h21a corosync[22624]: [TOTEM ] A processor failed, forming new configuration: token timed out (5650ms), waiting 2000ms for consensus.
Oct 01 18:34:56 h21a corosync[22624]: [TOTEM ] Token has not been received in 11693 ms
Oct 01 18:35:04 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
What network does corosync run on? Does it have a dedicated network interface, or is it shared with other network resources? These are quite high numbers; the network should have very low latency. [0]
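As a rough first check, I'd sample latency over a longer window between the nodes (addresses taken from your pvecm output above), e.g.:

Code:
# sample round-trip latency for a few minutes; for corosync the spikes
# matter more than the average, since the token must circulate well
# inside its timeout
ping -c 1000 -i 0.2 -q 172.25.60.12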

root@h21a:~# ha-manager remove 107
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
delete resource failed: cfs-lock 'domain-ha' error: no quorum!
This looks like the HA stack was interrupted while reading/writing its files, which also locks the HA Manager out of reading its CRM commands, since that needs the lock. Normally there's a timeout on locking the HA domain, so it's likely the lock couldn't be released anymore because there was no quorum or the node failed in the middle. I'd check the corosync situation from above first.

[0] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_network
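If the latency really can't be brought down, another option (with the usual caveats) is raising the token timeout in the totem section of /etc/pve/corosync.conf; a sketch with an illustrative value, keeping in mind that config_version must be bumped on every edit:

Code:
totem {
  # existing options (cluster_name, version, etc.) stay as they are
  token: 10000        # illustrative: raise the token timeout to 10 s
  config_version: 4   # bump from the current value on each edit
}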
 
They were, which is why I was checking the network side of things, but other than corosync it all seemed fine. We could always get port 5405 to respond as open, but 5404 seemed to be closing intermittently, which is why I suspect something with corosync. Running 500+ pings between hosts looked fairly clean:

Worst case, from host C to B:
500 packets transmitted, 494 received, 1.2% packet loss, time 510971ms
rtt min/avg/max/mdev = 0.096/0.122/0.204/0.019 ms
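That said, corosync/knet speaks UDP, so a port scan showing 5404 as intermittently "closed" may not mean much; I believe the bound sockets can be checked locally with something like:

Code:
# list the UDP sockets corosync is holding (knet defaults to 5405)
ss -uanp | grep corosync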

The interface for corosync is shared with vmbr0, but it has been since this was set up almost 5 years ago now. Not sure why that would crop up as an issue now, but I'll move the corosync IP to a dedicated 1G adapter on each node, since corosync should have minimal traffic.
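If I've read the docs right, the move means pointing each node's ring0_addr in /etc/pve/corosync.conf at the new subnet and bumping config_version; roughly (the 10.10.10.x address is just a placeholder):

Code:
node {
  name: h21a
  nodeid: 1
  quorum_votes: 1
  ring0_addr: 10.10.10.11   # placeholder: dedicated corosync subnet
}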

Assuming that goes through, I've done the following on each node to get the VMs running:

Code:
# stop the HA services and the cluster stack
systemctl stop pve-ha-lrm pve-ha-crm corosync pve-cluster
# keep corosync from starting again on boot while troubleshooting
systemctl disable corosync
# restart the cluster filesystem in local mode so /etc/pve is writable without quorum
pmxcfs -l

I've then placed the config files for the VMs I want to run on each host in /etc/pve/nodes/h21x/qemu-server/xxx.conf and started them, as sketched below. Do I need to make all of these match on all hosts (h21a, h21b, h21c) before restarting, or will corosync figure this out? I had already run ha-manager remove on all the VMs to make sure they were out of the HA manager.
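(In case it matters: my understanding is that a guest config should live under exactly one node's directory, and that moving it within /etc/pve is what reassigns the VM, e.g. for 107:)

Code:
# a VM config belongs under exactly one node's directory; moving it
# within /etc/pve reassigns the VM to that node
mv /etc/pve/nodes/h21b/qemu-server/107.conf /etc/pve/nodes/h21a/qemu-server/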

Am I correct to assume that after redoing the IP interfaces, corosync will figure itself out? Should I do anything else in preparation?
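For completeness, this is how I'm planning to undo the local mode afterwards, assuming these steps correctly reverse what I did above:

Code:
# stop the local-mode pmxcfs started earlier with pmxcfs -l
killall pmxcfs
# re-enable corosync and bring the cluster stack back up
systemctl enable corosync
systemctl start pve-cluster corosync pve-ha-crm pve-ha-lrm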

Thanks