We have recently started experiencing this issue on PVE 3.0. Our cluster had been working well for around 6 months, but since upgrading to 3.0 this problem has been hitting us.
I have a very basic setup with 2 nodes. Upon boot the cluster works for several minutes, then the logs begin to flood with [TOTEM ] Retransmit List messages and the cluster eventually falls over with:
Code:
Jul 3 10:46:08 vhbtgmar04 corosync[5942]: [TOTEM ] FAILED TO RECEIVE
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] CLM CONFIGURATION CHANGE
Jul 3 10:46:10 vhbtgmar04 pmxcfs[6145]: [status] notice: node lost quorum
Jul 3 10:46:10 vhbtgmar04 pmxcfs[6145]: [dcdb] notice: members: 1/6145
Jul 3 10:46:10 vhbtgmar04 kernel: dlm: closing connection to node 2
Jul 3 10:46:10 vhbtgmar04 pmxcfs[6145]: [dcdb] notice: members: 1/6145
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] New Configuration:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] #011r(0) ip(10.10.11.12)
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] Members Left:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] #011r(0) ip(10.10.11.11)
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] Members Joined:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CMAN ] quorum lost, blocking activity
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [QUORUM] Members[1]: 1
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] CLM CONFIGURATION CHANGE
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] New Configuration:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] #011r(0) ip(10.10.11.12)
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] Members Left:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CLM ] Members Joined:
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [CPG ] chosen downlist: sender r(0) ip(10.10.11.12) ; members(old:2 left:1)
Jul 3 10:46:10 vhbtgmar04 corosync[5942]: [MAIN ] Completed service synchronization, ready to provide service.
If I restart cman and pve-cluster the node comes back to life, but then fails again after roughly the same time period.
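The restart that brings it back is roughly the following, run on the node that has lost quorum (pvecm status is just to confirm quorum has returned):
Code:
# restart the cluster stack on the affected node (PVE 3.0 init scripts)
service cman restart
service pve-cluster restart
# confirm quorum has returned
pvecm status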
I initially thought the issue was related to my bonding configuration, so I put the PVE control/cluster traffic on its own dedicated NIC (eth0), which made no difference. I then reloaded both nodes fresh from the 3.0 installer, as my cluster had previously been upgraded all the way from 2.1 to 3.0; this also made no difference.
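For clarity, the cluster traffic now sits directly on eth0, roughly like this excerpt from /etc/network/interfaces on node 1 (the /24 netmask is an assumption for the sketch; the rest of the traffic stays on the existing bond):
Code:
# /etc/network/interfaces excerpt on vhbtgmar04 (node 1)
# dedicated corosync/cluster interface - netmask assumed, adjust to match
auto eth0
iface eth0 inet static
        address 10.10.11.12
        netmask 255.255.255.0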
I also tried reducing the TOTEM netmtu, which did not resolve the issue (I verified the MTU on the wire was correct with tcpdump).
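The tcpdump check was along these lines, watching the corosync multicast group reported by cman_tool status below (eth0 is the dedicated cluster NIC):
Code:
# watch corosync multicast traffic on the cluster NIC; the UDP length shown in
# each summary line should not exceed the configured netmtu
tcpdump -ni eth0 udp and host 239.192.214.162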
Versions from NODE1:
Code:
Version: 6.2.0
Config Version: 4
Cluster Name: pve-marua
Cluster Id: 54987
Cluster Member: Yes
Cluster Generation: 2092
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: vhbtgmar04
Node ID: 1
Multicast addresses: 239.192.214.162
Node addresses: 10.10.11.12
Versions from NODE2:
Code:
Version: 6.2.0
Config Version: 4
Cluster Name: pve-marua
Cluster Id: 54987
Cluster Member: Yes
Cluster Generation: 2092
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: vhbtgmar05
Node ID: 2
Multicast addresses: 239.192.214.162
Node addresses: 10.10.11.11
The machines are Xeon 54xx series with Intel e1000e NICs, connected via an Extreme Networks x460 switch.
Any ideas on the cause of this issue? What can we do to try and get this resolved?