Node won't participate in quorum after LACP - Multicast working

QuantumSchema

Mar 25, 2019
Hey everyone,

I've recently run into a problem with our Proxmox cluster. We're increasing the number of network uplinks on each node and immediately hit a problem: nodes can't participate in the quorum/cluster with both links active.

Here's a brief overview of the configuration:
  • 14 node Proxmox cluster
  • 2x Juniper switches clustered/stacked in a Virtual Chassis (appears as one switch to hosts)
  • Primary node (node 1 for conversation's sake) is connected to the 1st switch in the Virtual Chassis.
  • Primary node has LACP enabled and active, but only one port is connected at the moment.
  • Testing node (node 2 for conversation's sake) is connected to both the 1st and 2nd switches in the Virtual Chassis.
  • Testing node shows both links active and LACP is negotiated and active
  • Testing node appears offline in cluster web UI.
  • Multicast appears to be working, as confirmed by running omping between the primary and testing nodes.
  • Testing node logs "pvestatd: ipcc_send_rec failed: Connection refused" and "pvestatd: status update error: Connection refused" when pve-cluster is restarted.
  • Testing node logs "pvesr: error with cfs lock 'file-replication_cfg': no quorum!" when pvesr starts up.
  • pvecm status on the testing node shows quorum "Activity blocked".
  • If the Testing node's second uplink that goes into the second switch is disconnected, the node quickly rejoins the cluster.
  • If the Testing node's first uplink that goes into the first switch is disconnected, the node quickly drops from the cluster but all data traffic still continues to function (guest and management).

I'm kind of at a loss. Everything looks okay from a network configuration standpoint. Multicast is working. Pings are fine. The hosts files look okay.

Any help would be greatly appreciated!

Here's what a few of the log files look like:

omping -v -c 600 -i 1 -q 192.168.0.2 192.168.0.3
Code:
root@node02:~# omping -v -c 600 -i 1 -q 192.168.0.2 192.168.0.3

192.168.0.2 : waiting for response msg
192.168.0.2 : waiting for response msg
192.168.0.2 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.0.2 : given amount of query messages was sent

192.168.0.2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.027/0.170/0.230/0.040
192.168.0.2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.037/0.194/0.259/0.039
Waiting for 3000 ms to inform other nodes about instance exit

journalctl -xe
Code:
Mar 25 08:48:56 node02 systemd[1]: Starting The Proxmox VE cluster filesystem...

-- Subject: Unit pve-cluster.service has begun start-up
-- Defined-By: systemd
-- Support: [removed]
--
-- Unit pve-cluster.service has begun starting up.
Mar 25 08:48:56 node02 pmxcfs[1258860]: [status] notice: update cluster info (cluster name  production-cluster, version = 95)
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[1] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[2] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[3] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[4] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: status update error: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: status update time (5.013 seconds)
Mar 25 08:48:57 node02 snmpd[3257]: error on subcontainer 'ia_addr' insert (-1)
Mar 25 08:48:58 node02 systemd[1]: Started The Proxmox VE cluster filesystem.
-- Subject: Unit pve-cluster.service has finished start-up
-- Defined-By: systemd
-- Support: [removed]
--
-- Unit pve-cluster.service has finished starting up.
--
-- The start-up result is done.
Mar 25 08:49:00 node02 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support: [removed]
--
-- Unit pvesr.service has begun starting up.
Mar 25 08:49:00 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:01 node02 corosync[1258776]: notice  [TOTEM ] A new membership (192.168.0.3:1355660) was formed. Members
Mar 25 08:49:01 node02 corosync[1258776]:  [TOTEM ] A new membership (192.168.0.3:1355660) was formed. Members
Mar 25 08:49:01 node02 corosync[1258776]: warning [CPG   ] downlist left_list: 0 received
Mar 25 08:49:01 node02 corosync[1258776]: notice  [QUORUM] Members[1]: 3
Mar 25 08:49:01 node02 corosync[1258776]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Mar 25 08:49:01 node02 corosync[1258776]:  [CPG   ] downlist left_list: 0 received
Mar 25 08:49:01 node02 corosync[1258776]:  [QUORUM] Members[1]: 3
Mar 25 08:49:01 node02 corosync[1258776]:  [MAIN  ] Completed service synchronization, ready to provide service.
Mar 25 08:49:01 node02 pmxcfs[1258860]: [dcdb] notice: members: 3/1258860
Mar 25 08:49:01 node02 pmxcfs[1258860]: [dcdb] notice: all data is up to date
Mar 25 08:49:01 node02 pmxcfs[1258860]: [status] notice: members: 3/1258860
Mar 25 08:49:01 node02 pmxcfs[1258860]: [status] notice: all data is up to date
Mar 25 08:49:01 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:01 node02 kernel: sd 15:0:0:0: alua: supports implicit and explicit TPGS
Mar 25 08:49:01 node02 kernel: sd 15:0:0:0: alua: device naa.6e843b6e043cb3cd830cd4b86db6b3d5 port group 0 rel port 1
Mar 25 08:49:02 node02 kernel: sd 15:0:0:0: alua: port group 00 state A non-preferred supports TOlUSNA
Mar 25 08:49:02 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:03 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:04 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:05 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:06 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:07 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:08 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:09 node02 pvesr[1258894]: error with cfs lock 'file-replication_cfg': no quorum!
Mar 25 08:49:09 node02 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Mar 25 08:49:09 node02 systemd[1]: Failed to start Proxmox VE replication runner.
-- Subject: Unit pvesr.service has failed

pvecm status
Code:
root@node02:~# pvecm status
Quorum information
------------------
Date:             Mon Mar 25 08:58:43 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000003
Ring ID:          3/1355816
Quorate:          No

Votequorum information
----------------------
Expected votes:   14
Highest expected: 14
Total votes:      1
Quorum:           8 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 192.168.0.3 (local)
 
Good morning (or evening) everyone! Just bumping this in the hope that someone might know where to look - I've been pounding my head on the desk troubleshooting this for the past week.
 
Hm - I would guess that maybe the two switches do not forward the corosync traffic over their inter-chassis links?

* omping uses a different port and multicast address for its tests - so it could by chance end up hashed onto the working link of the test node, while the corosync traffic ends up on the link to the second switch. The layer 3 info (IP) is almost always used for LACP hashing; layer 4 (the port) is only used depending on the config and capabilities of the switch.

maybe try running the omping tests while the test node is only plugged into the second switch?

* also compare the config of both switches (if they have separate configs) - with a focus on the multicast settings.
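
e.g. on the Juniper side, something along these lines should show whether igmp-snooping is configured and which ports have joined the groups - the exact operational command differs between legacy and ELS Junos, so treat this as a starting point rather than exact syntax:
Code:
# on the Virtual Chassis: is igmp-snooping configured for the VLAN?
show configuration protocols igmp-snooping

# which interfaces have joined which multicast groups
# (ELS Junos; on legacy EX images the command is "show igmp-snooping membership")
show igmp snooping membership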

hope this helps!
 
Thanks Stoiko! I'll give it a go here in a sec!
 
On the LACP config, we are using a layer2+3 hash policy (the bond stanza looks roughly like the sketch below). What you mentioned made it sound like a layer3+4 hash policy might be better? From what I've read, layer3+4 isn't fully 802.3ad compliant, so we steered clear of it. Thoughts?
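
For reference, the relevant part of /etc/network/interfaces on the test node - NIC names, netmask and gateway here are placeholders rather than our exact values:
Code:
# sketch of the LACP bond + bridge on the test node (names/addresses are placeholders)
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address 192.168.0.3
        netmask 255.255.255.0
        gateway 192.168.0.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0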
 
Here's what I see using the multicast address of the cluster and having the test node plugged into only the second switch. Thoughts?


Code:
root@node02:~# corosync-cmapctl -g totem.interface.0.mcastaddr
totem.interface.0.mcastaddr (str) = 239.192.174.244
root@node02:~# omping -v -c 600 -i 1 -q 192.168.0.2 192.168.0.3 -m 239.192.174.244
192.168.0.2 : waiting for response msg
192.168.0.2 : joined (S,G) = (*, 239.192.174.244), pinging
192.168.0.2 : given amount of query messages was sent

192.168.0.2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.031/0.189/0.295/0.032
192.168.0.2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.047/0.163/0.256/0.032
Waiting for 3000 ms to inform other nodes about instance exit
 
regarding the lacp hash policy - if possible I would stick with layer 2+3 (it seems more robust and doesn't depend on the switch also supporting it, since it's in the standard). The only potential gain from 3+4 is that connections between two given nodes could be split across the two links - but in a 14-node cluster I'd expect the traffic to get distributed anyway (although any given pair of nodes will always use only one link).
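
you can also double-check what the node side actually negotiated (assuming the bond is called bond0):
Code:
# negotiated mode, transmit hash policy and link state of the bond
grep -E 'Bonding Mode|Hash Policy|MII Status' /proc/net/bonding/bond0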

Apart from that - check your journal (basically skim through the whole thing, but focus on corosync.service and then later on pve-cluster.service/pmxcfs). Check that on all nodes in the cluster, not only on the one that isn't joining.
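
e.g. something like this on every node:
Code:
# corosync and pmxcfs (pve-cluster) messages since the last boot
journalctl -b -u corosync -u pve-cluster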

are the `/etc/corosync/corosync.conf` files identical on all nodes?
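
a quick way to compare is to checksum them on each node (`/etc/pve/corosync.conf` is the cluster-wide copy that gets synced to `/etc/corosync/corosync.conf`):
Code:
# run on every node and compare the output
md5sum /etc/corosync/corosync.conf /etc/pve/corosync.conf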

If that doesn't help - try to see where the packets are dropped with tcpdump (on the bond and on both member interfaces of the bond).
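
for example (bond/NIC names are placeholders; 5404/5405 are the corosync defaults - check mcastport in corosync.conf if you changed it):
Code:
# on the bond itself
tcpdump -ni bond0 udp port 5404 or udp port 5405
# and separately on each physical member of the bond
tcpdump -ni eno1 udp port 5404 or udp port 5405
tcpdump -ni eno2 udp port 5404 or udp port 5405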

hope this helps!
 
