Quorum lost after upgrading pve node

rolandrapide

New Member
Feb 27, 2014
Hello,

We have a five-node Proxmox cluster. These are the versions we are currently running:


pve04 pve-manager/3.3-1/a06c9f73 (running kernel: 3.10.0-4-pve)
pve05 pve-manager/3.3-1/a06c9f73 (running kernel: 3.10.0-4-pve)
pve07 pve-manager/3.3-5/bfebec03 (running kernel: 3.10.0-5-pve)
pve08 pve-manager/3.2-4/e24a91c1 (running kernel: 2.6.32-29-pve)
pve10 pve-manager/3.2-4/e24a91c1 (running kernel: 2.6.32-29-pve)


Pve07 is the node we upgraded to the newest version currently available. Unfortunately, after about 5 minutes quorum is lost. When we run the cluster without pve07, the cluster works fine.
When we run a cluster of only pve04, pve05 and pve07, it also functions correctly. The same goes for pve07, pve08 and pve10 (we haven't tested this combination for very long).


The following error messages are in the syslog:


Code:
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [status] crit: cpg_send_message failed: 9
15:27:32  pmxcfs[8674]: [dcdb] notice: cpg_join retry 7260
15:27:33  pmxcfs[8674]: [dcdb] notice: cpg_join retry 7270
15:27:34  pmxcfs[8674]: [dcdb] notice: cpg_join retry 7280
15:27:35  pmxcfs[8674]: [dcdb] notice: cpg_join retry 7290


Restarting all pve services sometimes restores quorum, but after about 5 minutes the same error occurs.


Pve07 command output: pvecm status
Code:
Version: 6.2.0
Config Version: 19
Cluster Name: tcn01
Cluster Id: 3233
Cluster Member: Yes
Cluster Generation: 90436
Membership state: Cluster-Member
Nodes: 4
Expected votes: 5
Total votes: 4
Node votes: 1
Quorum: 3
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: pve07
Node ID: 2
Multicast addresses: x.x.x.x
Node addresses: x.x.x.x
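
For anyone reading these numbers: the "Quorum: 3" value is consistent with a simple majority of the 5 expected votes. The following is only a minimal sketch of that arithmetic, assuming a strict-majority rule; the function names are made up for illustration and are not part of any Proxmox tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for a strict majority (assumed rule: floor(n / 2) + 1)."""
    return expected_votes // 2 + 1


def is_quorate(total_votes: int, expected_votes: int) -> bool:
    """True when the votes currently seen reach the majority threshold."""
    return total_votes >= quorum_threshold(expected_votes)


# Values taken from the pvecm status output above.
print(quorum_threshold(5))  # 3
print(is_quorate(4, 5))     # True
```

Under that assumption, a five-node cluster with one vote per node stays quorate as long as any node can still see at least three votes, itself included.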


Pve07 command output: pvecm nodes
Code:
Node  Sts   Inc   Joined               Name
   2   M  89992   2015-01-27 19:03:54  pve07
   3   X  90376                        pve08
   4   M  90436   2015-01-28 14:22:23  pve05
   5   M  90436   2015-01-28 14:22:23  pve04
  10   M  90436   2015-01-28 14:22:23  pve10


Pve04 command output: pvecm status
Code:
Version: 6.2.0
Config Version: 19
Cluster Name: tcn01
Cluster Id: 3233
Cluster Member: Yes
Cluster Generation: 90436
Membership state: Cluster-Member
Nodes: 4
Expected votes: 5
Total votes: 3
Node votes: 1
Quorum: 3
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: pve04
Node ID: 5
Multicast addresses: x.x.x.x
Node addresses: x.x.x.x

Pve04 command output: pvecm nodes
Code:
Node  Sts   Inc   Joined               Name
   2   M  90436   2015-01-28 14:22:23  pve07
   3   X      0                        pve08
   4   X  90388                        pve05
   5   M  90052   2015-01-28 09:27:36  pve04
  10   M  90388   2015-01-28 09:27:36  pve10
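
One way to compare the two membership views is to parse both `pvecm nodes` tables and list the nodes whose status differs. This is a rough sketch, assuming the column layout shown above and assuming that `M` marks a member and `X` a node that is not currently a member; the helper names are hypothetical.

```python
PVE07_VIEW = """
Node  Sts   Inc   Joined               Name
   2   M  89992   2015-01-27 19:03:54  pve07
   3   X  90376                        pve08
   4   M  90436   2015-01-28 14:22:23  pve05
   5   M  90436   2015-01-28 14:22:23  pve04
  10   M  90436   2015-01-28 14:22:23  pve10
"""

PVE04_VIEW = """
Node  Sts   Inc   Joined               Name
   2   M  90436   2015-01-28 14:22:23  pve07
   3   X      0                        pve08
   4   X  90388                        pve05
   5   M  90052   2015-01-28 09:27:36  pve04
  10   M  90388   2015-01-28 09:27:36  pve10
"""


def parse_pvecm_nodes(text: str) -> dict:
    """Map node name -> status column for each data row of the table."""
    states = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that does not start with a node ID.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        # The name is always the last field; "Joined" may be empty.
        states[fields[-1]] = fields[1]
    return states


def disagreements(view_a: dict, view_b: dict) -> dict:
    """Nodes present in both views whose status differs, as (a, b) pairs."""
    return {
        name: (view_a[name], view_b[name])
        for name in sorted(view_a.keys() & view_b.keys())
        if view_a[name] != view_b[name]
    }


print(disagreements(parse_pvecm_nodes(PVE07_VIEW), parse_pvecm_nodes(PVE04_VIEW)))
# {'pve05': ('M', 'X')}
```

For the two outputs above, the only status mismatch is pve05: pve07 lists it as a member while pve04 does not.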

Any suggestions on how we can debug this issue?


Thanks in advance!
 
The problem is solved. It appears to have been caused by a kernel flag for multicast, specifically /sys/devices/virtual/net/YOURADAPTER/bridge/multicast_querier. The pve07 node was sending multicast messages but not receiving them. Hope this helps someone in the future!
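
For anyone who wants to compare this flag across nodes, here is a minimal sketch that reads it for every bridge on a host. It assumes the sysfs layout from the path above; the function name and the `sysfs_net` parameter are made up for illustration, and it only reports the current value, since the post does not say which value was the working one.

```python
from pathlib import Path


def read_querier_flags(sysfs_net: str = "/sys/devices/virtual/net") -> dict:
    """Return {bridge_name: flag_value} for every bridge found under sysfs_net."""
    flags = {}
    for flag_file in sorted(Path(sysfs_net).glob("*/bridge/multicast_querier")):
        # Path shape: <sysfs_net>/<bridge>/bridge/multicast_querier
        bridge = flag_file.parent.parent.name
        flags[bridge] = flag_file.read_text().strip()
    return flags


if __name__ == "__main__":
    for bridge, value in read_querier_flags().items():
        print(f"{bridge}: multicast_querier={value}")
```

Running it on each node and comparing the printed values should show whether the upgraded node differs from the others.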
 
Thanks for posting the solution; I was curious about what was going on. I'm also curious how the flag got toggled.