New cluster: [TOTEM ] FAILED TO RECEIVE after 4 minutes

encore

Well-Known Member
May 4, 2018
Hi,

I set up a new cluster of 4 nodes.
Exactly 4 minutes after I restarted corosync, the cluster became unquorate.
One node formed a new corosync ring while the other 3 nodes kept their old ring.
Those 3 nodes are still working fine; only the one node left the ring.
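For reference, quorum and membership state can be checked with the stock PVE/corosync tooling (nothing specific to my setup assumed here):

Code:
pvecm status            # quorum info and member list as PVE sees it
corosync-quorumtool -s  # the same from corosync's point of view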

Syslog from the failing node, which failed at 09:30 after working fine at 09:26:
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [status] notice: members: 1/1881, 2/1749, 3/1778, 4/1758
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [status] notice: starting data syncronisation
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: received sync request (epoch 1/1881/00000017)
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [status] notice: received sync request (epoch 1/1881/00000011)
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: received all states
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: leader is 1/1881
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: synced members: 1/1881, 2/1749, 3/1778, 4/1758
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: start sending inode updates
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: sent all (0) updates
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: all data is up to date
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [status] notice: received all states
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [status] notice: all data is up to date
Oct 12 09:27:00 pve-dal systemd[1]: Starting Proxmox VE replication runner...
Oct 12 09:27:00 pve-dal systemd[1]: Started Proxmox VE replication runner.
Oct 12 09:28:00 pve-dal systemd[1]: Starting Proxmox VE replication runner...
Oct 12 09:28:00 pve-dal systemd[1]: Started Proxmox VE replication runner.
Oct 12 09:29:00 pve-dal systemd[1]: Starting Proxmox VE replication runner...
Oct 12 09:29:00 pve-dal systemd[1]: Started Proxmox VE replication runner.
Oct 12 09:30:00 pve-dal systemd[1]: Starting Proxmox VE replication runner...
Oct 12 09:30:00 pve-dal systemd[1]: Started Proxmox VE replication runner.
Oct 12 09:30:49 pve-dal corosync[27180]: error [TOTEM ] FAILED TO RECEIVE
Oct 12 09:30:49 pve-dal corosync[27180]: debug [TOTEM ] entering GATHER state from 6(failed to receive).
Oct 12 09:30:49 pve-dal corosync[27180]: [TOTEM ] FAILED TO RECEIVE
Oct 12 09:30:49 pve-dal corosync[27180]: [TOTEM ] entering GATHER state from 6(failed to receive).
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] entering GATHER state from 0(consensus timeout).
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] Creating commit token because I am the rep.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] Saving state aru 7a7 high seq received 7be
Oct 12 09:30:51 pve-dal corosync[27180]: debug [MAIN ] Storing new sequence id for ring 111c
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] entering COMMIT state.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] got commit token
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] entering RECOVERY state.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] TRANS [0] member 10.10.21.1:
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] position [0] member 10.10.21.1:
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] previous ring seq 1118 rep 10.10.21.1
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] aru 7a7 high delivered 7a7 received flag 0
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] copying all old ring messages from 7a8-7be.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] Originated 13 messages in RECOVERY.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] got commit token
Oct 12 09:30:51 pve-dal corosync[27180]: [TOTEM ] entering GATHER state from 0(consensus timeout).
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] Sending initial ORF token
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru d
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] install seq 0 aru d high seq received d
[..]
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru d
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] install seq d aru d high seq received d
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] retrans flag count 4 token aru d install seq d aru d d
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] Resetting old ring state
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] recovery to regular 1-d
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] comparing if ring id is for this processors old ring seqno 1970
[..]
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] comparing if ring id is for this processors old ring seqno 1982
Oct 12 09:30:51 pve-dal corosync[27180]: debug [MAIN ] Member left: r(0) ip(10.10.21.2)
Oct 12 09:30:51 pve-dal corosync[27180]: debug [MAIN ] Member left: r(0) ip(10.10.21.3)
Oct 12 09:30:51 pve-dal corosync[27180]: debug [MAIN ] Member left: r(0) ip(10.10.21.4)
Oct 12 09:30:51 pve-dal corosync[27180]: [TOTEM ] Creating commit token because I am the rep.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] waiting_trans_ack changed to 1
Oct 12 09:30:51 pve-dal corosync[27180]: debug [SYNC ] call init for locally known services
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] entering OPERATIONAL state.
Oct 12 09:30:51 pve-dal corosync[27180]: notice [TOTEM ] A new membership (10.10.21.1:4380) was formed. Members left: 3 2 4
Oct 12 09:30:51 pve-dal corosync[27180]: notice [TOTEM ] Failed to receive the leave message. failed: 3 2 4
Oct 12 09:30:51 pve-dal corosync[27180]: debug [SYNC ] enter sync process
Oct 12 09:30:51 pve-dal corosync[27180]: debug [SYNC ] Committing synchronization for corosync configuration map access
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CMAP ] Not first sync -> no action
Oct 12 09:30:51 pve-dal corosync[27180]: warning [CPG ] downlist left_list: 3 received
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] got joinlist message from node 0x1
Oct 12 09:30:51 pve-dal corosync[27180]: debug [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] my downlist: members(old:4 left:3)
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list_entries:3
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[0] group:pve_dcdb_v1\x00, ip:r(0) ip(10.10.21.3) , pid:1749
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.10.21.2) , pid:1778
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[2] group:pve_dcdb_v1\x00, ip:r(0) ip(10.10.21.4) , pid:1758
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list_entries:3
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.10.21.3) , pid:1749
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[1] group:pve_kvstore_v1\x00, ip:r(0) ip(10.10.21.2) , pid:1778
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.10.21.4) , pid:1758
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.10.21.1) , pid:1881
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.10.21.1) , pid:1881

Network is fine; corosync has its own 10G SFP+ network:
10.10.21.1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.037/0.093/0.486/0.041
10.10.21.1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.041/0.101/0.486/0.042
10.10.21.2 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.029/0.059/0.468/0.032
[..]

Corosync config:
logging {
  debug: on
  to_syslog: yes
}

nodelist {
  node {
    name: bondbabe001-62000-bl13
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.21.3
  }
  node {
    name: bondsir001-62000-bl12
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.21.2
  }
  node {
    name: captive001-62000-bl14
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.21.4
  }
  node {
    name: pve-dal
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.21.1
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pve-dal
  config_version: 5
  interface {
    bindnetaddr: 10.10.21.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
The totem cluster_name is the same as one node's name; I hope that isn't a problem.
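To see which ring the node actually ends up on after the split, I check with the standard corosync CLI:

Code:
corosync-cfgtool -s   # local node id plus status of each configured ring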

Any ideas?
 
Network is fine; corosync has its own 10G SFP+ network:
10.10.21.1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.037/0.093/0.486/0.041
10.10.21.1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.041/0.101/0.486/0.042
10.10.21.2 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.029/0.059/0.468/0.032
[..]

That's only the quick test; it does not tell you anything about your network over longer durations. Since you see issues after 4 minutes, the following longer-running test (10 minutes) will tell us more:

Code:
omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...
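
With the addresses from your config, that would be:

Code:
omping -c 600 -i 1 -q 10.10.21.1 10.10.21.2 10.10.21.3 10.10.21.4

-c 600 together with -i 1 sends one probe per second for roughly 10 minutes, and -q prints only the final summary. Run it on all four nodes in parallel and watch the multicast %loss column.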

I set up a new cluster of 4 nodes.

Please also always state the version of Proxmox VE you're using. Proxmox VE 5 (which you use) has quite a different cluster stack than the newer Proxmox VE 6...
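
The full package list can be posted with:

Code:
pveversion -v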
 
Any ideas?

Off the top of my head, I'd say there's no multicast querier active on the network, so the switch cuts off multicast traffic for the PVE cluster's multicast group after a few minutes (5 by default)...

So either disable IGMP snooping, or enable a querier (if possible on your switch)
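
If the corosync network happens to run over a Linux bridge on the hosts, the kernel bridge can do either of those itself; a minimal sketch (the bridge name vmbr1 is a placeholder, not taken from your config):

Code:
# disable IGMP snooping on the bridge entirely ...
echo 0 > /sys/class/net/vmbr1/bridge/multicast_snooping
# ... or keep snooping but let the bridge act as IGMP querier
echo 1 > /sys/class/net/vmbr1/bridge/multicast_querier

On a physical switch the equivalent settings are vendor-specific.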
 
