Hi,
I set up a new cluster of 4 nodes. Exactly 4 minutes after I restarted corosync, the cluster became unquorate: one node forms a new corosync ring of its own, while the other 3 nodes keep their old ring. Those 3 nodes keep working fine; only the one node leaves the ring.

One side note: the totem cluster_name (pve-dal) is the same as one node's hostname. I hope that isn't a problem.
Syslog from the failing node: it was still fine at 09:26 and failed at 09:30.
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [status] notice: members: 1/1881, 2/1749, 3/1778, 4/1758
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [status] notice: starting data syncronisation
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: received sync request (epoch 1/1881/00000017)
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [status] notice: received sync request (epoch 1/1881/00000011)
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: received all states
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: leader is 1/1881
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: synced members: 1/1881, 2/1749, 3/1778, 4/1758
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: start sending inode updates
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: sent all (0) updates
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [dcdb] notice: all data is up to date
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [status] notice: received all states
Oct 12 09:26:28 pve-dal pmxcfs[1881]: [status] notice: all data is up to date
Oct 12 09:27:00 pve-dal systemd[1]: Starting Proxmox VE replication runner...
Oct 12 09:27:00 pve-dal systemd[1]: Started Proxmox VE replication runner.
Oct 12 09:28:00 pve-dal systemd[1]: Starting Proxmox VE replication runner...
Oct 12 09:28:00 pve-dal systemd[1]: Started Proxmox VE replication runner.
Oct 12 09:29:00 pve-dal systemd[1]: Starting Proxmox VE replication runner...
Oct 12 09:29:00 pve-dal systemd[1]: Started Proxmox VE replication runner.
Oct 12 09:30:00 pve-dal systemd[1]: Starting Proxmox VE replication runner...
Oct 12 09:30:00 pve-dal systemd[1]: Started Proxmox VE replication runner.
Oct 12 09:30:49 pve-dal corosync[27180]: error [TOTEM ] FAILED TO RECEIVE
Oct 12 09:30:49 pve-dal corosync[27180]: debug [TOTEM ] entering GATHER state from 6(failed to receive).
Oct 12 09:30:49 pve-dal corosync[27180]: [TOTEM ] FAILED TO RECEIVE
Oct 12 09:30:49 pve-dal corosync[27180]: [TOTEM ] entering GATHER state from 6(failed to receive).
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] entering GATHER state from 0(consensus timeout).
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] Creating commit token because I am the rep.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] Saving state aru 7a7 high seq received 7be
Oct 12 09:30:51 pve-dal corosync[27180]: debug [MAIN ] Storing new sequence id for ring 111c
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] entering COMMIT state.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] got commit token
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] entering RECOVERY state.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] TRANS [0] member 10.10.21.1:
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] position [0] member 10.10.21.1:
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] previous ring seq 1118 rep 10.10.21.1
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] aru 7a7 high delivered 7a7 received flag 0
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] copying all old ring messages from 7a8-7be.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] Originated 13 messages in RECOVERY.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] got commit token
Oct 12 09:30:51 pve-dal corosync[27180]: [TOTEM ] entering GATHER state from 0(consensus timeout).
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] Sending initial ORF token
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru d
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] install seq 0 aru d high seq received d
[..]
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru d
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] install seq d aru d high seq received d
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] retrans flag count 4 token aru d install seq d aru d d
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] Resetting old ring state
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] recovery to regular 1-d
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] comparing if ring id is for this processors old ring seqno 1970
[..]
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] comparing if ring id is for this processors old ring seqno 1982
Oct 12 09:30:51 pve-dal corosync[27180]: debug [MAIN ] Member left: r(0) ip(10.10.21.2)
Oct 12 09:30:51 pve-dal corosync[27180]: debug [MAIN ] Member left: r(0) ip(10.10.21.3)
Oct 12 09:30:51 pve-dal corosync[27180]: debug [MAIN ] Member left: r(0) ip(10.10.21.4)
Oct 12 09:30:51 pve-dal corosync[27180]: [TOTEM ] Creating commit token because I am the rep.
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] waiting_trans_ack changed to 1
Oct 12 09:30:51 pve-dal corosync[27180]: debug [SYNC ] call init for locally known services
Oct 12 09:30:51 pve-dal corosync[27180]: debug [TOTEM ] entering OPERATIONAL state.
Oct 12 09:30:51 pve-dal corosync[27180]: notice [TOTEM ] A new membership (10.10.21.1:4380) was formed. Members left: 3 2 4
Oct 12 09:30:51 pve-dal corosync[27180]: notice [TOTEM ] Failed to receive the leave message. failed: 3 2 4
Oct 12 09:30:51 pve-dal corosync[27180]: debug [SYNC ] enter sync process
Oct 12 09:30:51 pve-dal corosync[27180]: debug [SYNC ] Committing synchronization for corosync configuration map access
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CMAP ] Not first sync -> no action
Oct 12 09:30:51 pve-dal corosync[27180]: warning [CPG ] downlist left_list: 3 received
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] got joinlist message from node 0x1
Oct 12 09:30:51 pve-dal corosync[27180]: debug [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] my downlist: members(old:4 left:3)
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list_entries:3
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[0] groupve_dcdb_v1\x00, ip:r(0) ip(10.10.21.3) , pid:1749
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[1] groupve_dcdb_v1\x00, ip:r(0) ip(10.10.21.2) , pid:1778
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[2] groupve_dcdb_v1\x00, ip:r(0) ip(10.10.21.4) , pid:1758
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list_entries:3
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[0] groupve_kvstore_v1\x00, ip:r(0) ip(10.10.21.3) , pid:1749
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[1] groupve_kvstore_v1\x00, ip:r(0) ip(10.10.21.2) , pid:1778
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] left_list[2] groupve_kvstore_v1\x00, ip:r(0) ip(10.10.21.4) , pid:1758
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] joinlist_messages[0] groupve_kvstore_v1\x00, ip:r(0) ip(10.10.21.1) , pid:1881
Oct 12 09:30:51 pve-dal corosync[27180]: debug [CPG ] joinlist_messages[1] groupve_dcdb_v1\x00, ip:r(0) ip(10.10.21.1) , pid:1881
The network is fine; corosync has its own 10G SFP+ network:
10.10.21.1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.037/0.093/0.486/0.041
10.10.21.1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.041/0.101/0.486/0.042
10.10.21.2 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.029/0.059/0.468/0.032
[..]
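For reference, the loss/latency figures above are omping output; a command along these lines (my assumption of the usual invocation), run simultaneously on all 4 nodes, produces them:

```shell
# Run concurrently on every node; exercises both unicast and multicast.
# 10000 packets at 1 ms interval, -F = don't wait for all instances, -q = summary only.
omping -c 10000 -i 0.001 -F -q 10.10.21.1 10.10.21.2 10.10.21.3 10.10.21.4
```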
Corosync config:
logging {
  debug: on
  to_syslog: yes
}

nodelist {
  node {
    name: bondbabe001-62000-bl13
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.21.3
  }
  node {
    name: bondsir001-62000-bl12
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.21.2
  }
  node {
    name: captive001-62000-bl14
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.21.4
  }
  node {
    name: pve-dal
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.21.1
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pve-dal
  config_version: 5
  interface {
    bindnetaddr: 10.10.21.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
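In case it helps with diagnosing: after the split I compare the ring state on each node with the standard tools (nothing custom):

```shell
# Ring ID, quorum state and vote counts as corosync sees them.
corosync-quorumtool -s
# Current totem membership from the runtime cmap keys.
corosync-cmapctl | grep members
# The Proxmox view of the same information.
pvecm status
```

On the failed node the ring ID and member list differ from the other three, matching the "new membership ... Members left: 3 2 4" line in the log.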
Any ideas?