Issues with corosync

carlosmp

Renowned Member
Jun 2, 2010
Hi - our 3-node cluster recently started bouncing (rebooting) nonstop. I have manually stopped watchdog-mux, which seems to be the only way I could get the cluster to stop bouncing.

I've been trying to work out what happened, and it seems the root cause is corosync token timeouts. pvecm status shows:

Cluster information
-------------------
Name: clus21a
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed Oct 1 18:49:16 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.3415
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.25.60.11 (local)
0x00000002 1 172.25.60.12
0x00000003 1 172.25.60.13

Going through the log:

Oct 01 18:34:30 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
Oct 01 18:34:43 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
Oct 01 18:34:44 h21a corosync[22624]: [TOTEM ] A processor failed, forming new configuration: token timed out (5650ms), waiting 2000ms for consensus.
Oct 01 18:34:48 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
Oct 01 18:34:50 h21a corosync[22624]: [TOTEM ] A processor failed, forming new configuration: token timed out (5650ms), waiting 2000ms for consensus.
Oct 01 18:34:56 h21a corosync[22624]: [TOTEM ] Token has not been received in 11693 ms
Oct 01 18:35:04 h21a corosync[22624]: [TOTEM ] Token has not been received in 4237 ms
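If I'm reading corosync.conf(5) right, the runtime token timeout is token + (nodes - 2) * token_coefficient, which matches the numbers above: 5650 ms implies token is effectively 5000 with the default 650 ms coefficient on 3 nodes, and the "Token has not been received" warnings fire at 75% of that (4237 ms). If this turns out to be latency spikes rather than a dead link, I'm considering raising the timeout in /etc/pve/corosync.conf, roughly like this (untested, values are a guess; config_version has to be bumped so pmxcfs syncs it out):

totem {
  version: 2
  cluster_name: clus21a
  # must be incremented past the current value for the change to propagate
  config_version: 9
  # runtime timeout = token + (nodes - 2) * token_coefficient,
  # so 10000 here means 10650 ms on this 3-node cluster
  token: 10000
  # (existing interface/crypto settings left as they are)
}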

root@h21a:~# corosync-cmapctl | grep members
runtime.members.1.config_version (u64) = 8
runtime.members.1.ip (str) = r(0) ip(172.25.60.11)
runtime.members.1.join_count (u32) = 1
runtime.members.1.status (str) = joined
runtime.members.2.config_version (u64) = 8
runtime.members.2.ip (str) = r(0) ip(172.25.60.12)
runtime.members.2.join_count (u32) = 35
runtime.members.2.status (str) = joined
runtime.members.3.config_version (u64) = 8
runtime.members.3.ip (str) = r(0) ip(172.25.60.13)
runtime.members.3.join_count (u32) = 22
runtime.members.3.status (str) = joined
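The join_count of 35 and 22 for the other two nodes suggests they have been dropping out and rejoining over and over, not just once. The next thing I plan to check is the knet link and neighbor state on each node, which as far as I know is:

root@h21a:~# corosync-cfgtool -s
root@h21a:~# corosync-cfgtool -n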

ha-manager status shows some machines started and others stopped. If I try to remove one from ha-manager so I can start it manually:

root@h21a:~# ha-manager remove 107
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
trying to acquire cfs lock 'domain-ha' ...
delete resource failed: cfs-lock 'domain-ha' error: no quorum!
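For now, rather than killing watchdog-mux by hand, I'm planning to stop the HA services on every node to keep them from fencing while I debug; if I understand the HA stack right, stopping the LRM on all nodes first and then the CRM disarms the watchdog cleanly:

root@h21a:~# systemctl stop pve-ha-lrm    # first, on every node
root@h21a:~# systemctl stop pve-ha-crm    # then, on every node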

I can ping all cluster members with no packet loss. I'm at my wits' end trying to figure out what could be causing this.
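Since clean pings don't prove much about the UDP traffic knet actually uses, I also want to pull the per-link latency counters from the corosync stats map (key names may vary between corosync versions):

root@h21a:~# corosync-cmapctl -m stats | grep latency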

Any ideas/pointers?

TIA