Node won't participate in quorum after LACP - Multicast working

QuantumSchema

Mar 25, 2019
Hey everyone,

I've recently run into a problem with our Proxmox cluster. We're increasing the number of network uplinks on each node and immediately hit a problem: nodes can't participate in the quorum/cluster with both links active.

Here's a brief overview of the configuration:
  • 14 node Proxmox cluster
  • 2x Juniper switches clustered/stacked in a Virtual Chassis (appears as one switch to hosts)
  • Primary node (node 1 for conversation's sake) is connected to the 1st switch in the Virtual Chassis.
  • Primary node has LACP enabled and active, but only one port is connected at the moment.
  • Testing node (node 2 for conversation's sake) is connected to both the 1st and 2nd switches in the Virtual Chassis.
  • Testing node shows both links active and LACP is negotiated and active
  • Testing node appears offline in cluster web UI.
  • Multicast appears to be working, as confirmed by running omping between the primary and testing nodes.
  • Testing node logs "pvestatd: ipcc_send_rec failed: Connection refused" and "pvestatd: status update error: Connection refused" when pve-cluster is restarted.
  • Testing node logs "pvesr: error with cfs lock 'file-replication_cfg': no quorum!" when pvesr starts up.
  • pvecm status on the testing node shows quorum "Activity blocked".
  • If the Testing node's second uplink that goes into the second switch is disconnected, the node quickly rejoins the cluster.
  • If the Testing node's first uplink that goes into the first switch is disconnected, the node quickly drops from the cluster but all data traffic still continues to function (guest and management).

I'm kind of at a loss. Everything looks okay from a network configuration standpoint. Multicast is working. Pings are fine. The hosts files look okay.

Any help would be greatly appreciated!

Here's what a few of the log files look like:

omping -v -c 600 -i 1 -q 192.168.0.2 192.168.0.3
Code:
root@node02:~# omping -v -c 600 -i 1 -q 192.168.0.2 192.168.0.3

192.168.0.2 : waiting for response msg
192.168.0.2 : waiting for response msg
192.168.0.2 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.0.2 : given amount of query messages was sent

192.168.0.2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.027/0.170/0.230/0.040
192.168.0.2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.037/0.194/0.259/0.039
Waiting for 3000 ms to inform other nodes about instance exit

journalctl -xe
Code:
Mar 25 08:48:56 node02 systemd[1]: Starting The Proxmox VE cluster filesystem...

-- Subject: Unit pve-cluster.service has begun start-up
-- Defined-By: systemd
-- Support: [removed]
--
-- Unit pve-cluster.service has begun starting up.
Mar 25 08:48:56 node02 pmxcfs[1258860]: [status] notice: update cluster info (cluster name  production-cluster, version = 95)
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[1] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[2] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[3] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[4] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: status update error: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: status update time (5.013 seconds)
Mar 25 08:48:57 node02 snmpd[3257]: error on subcontainer 'ia_addr' insert (-1)
Mar 25 08:48:58 node02 systemd[1]: Started The Proxmox VE cluster filesystem.
-- Subject: Unit pve-cluster.service has finished start-up
-- Defined-By: systemd
-- Support: [removed]
--
-- Unit pve-cluster.service has finished starting up.
--
-- The start-up result is done.
Mar 25 08:49:00 node02 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support: [removed]
--
-- Unit pvesr.service has begun starting up.
Mar 25 08:49:00 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:01 node02 corosync[1258776]: notice  [TOTEM ] A new membership (192.168.0.3:1355660) was formed. Members
Mar 25 08:49:01 node02 corosync[1258776]:  [TOTEM ] A new membership (192.168.0.3:1355660) was formed. Members
Mar 25 08:49:01 node02 corosync[1258776]: warning [CPG   ] downlist left_list: 0 received
Mar 25 08:49:01 node02 corosync[1258776]: notice  [QUORUM] Members[1]: 3
Mar 25 08:49:01 node02 corosync[1258776]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Mar 25 08:49:01 node02 corosync[1258776]:  [CPG   ] downlist left_list: 0 received
Mar 25 08:49:01 node02 corosync[1258776]:  [QUORUM] Members[1]: 3
Mar 25 08:49:01 node02 corosync[1258776]:  [MAIN  ] Completed service synchronization, ready to provide service.
Mar 25 08:49:01 node02 pmxcfs[1258860]: [dcdb] notice: members: 3/1258860
Mar 25 08:49:01 node02 pmxcfs[1258860]: [dcdb] notice: all data is up to date
Mar 25 08:49:01 node02 pmxcfs[1258860]: [status] notice: members: 3/1258860
Mar 25 08:49:01 node02 pmxcfs[1258860]: [status] notice: all data is up to date
Mar 25 08:49:01 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:01 node02 kernel: sd 15:0:0:0: alua: supports implicit and explicit TPGS
Mar 25 08:49:01 node02 kernel: sd 15:0:0:0: alua: device naa.6e843b6e043cb3cd830cd4b86db6b3d5 port group 0 rel port 1
Mar 25 08:49:02 node02 kernel: sd 15:0:0:0: alua: port group 00 state A non-preferred supports TOlUSNA
Mar 25 08:49:02 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:03 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:04 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:05 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:06 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:07 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:08 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:09 node02 pvesr[1258894]: error with cfs lock 'file-replication_cfg': no quorum!
Mar 25 08:49:09 node02 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Mar 25 08:49:09 node02 systemd[1]: Failed to start Proxmox VE replication runner.
-- Subject: Unit pvesr.service has failed

pvecm status
Code:
root@node02:~# pvecm status
Quorum information
------------------
Date:             Mon Mar 25 08:58:43 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000003
Ring ID:          3/1355816
Quorate:          No

Votequorum information
----------------------
Expected votes:   14
Highest expected: 14
Total votes:      1
Quorum:           8 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 192.168.0.3 (local)
 
Good morning (or evening) everyone! Just bumping this in the hope that someone might know where to look - I've been pounding my head on the desk troubleshooting this for the past week.
 
Hm - I would guess that maybe the two switches do not forward the corosync traffic over their inter-chassis links?

* omping uses a different port and multicast address for its tests - so it could by chance end up hashed onto the working link of the test node, while the corosync traffic ends up on the link to the second switch. The layer 3 info (IP) is almost always used for LACP hashing; layer 4 (the port) is only used depending on the config and capabilities of the switch.

maybe try running the omping tests while the test node is only plugged into the second switch?

* also compare the config of both switches (if they have separate configs) - with a focus on the multicast settings.
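
e.g. on the Juniper side, something along these lines should show whether igmp-snooping is configured and which ports have joined the groups - the exact operational command differs between legacy and ELS Junos, so treat this as a starting point rather than exact syntax:
Code:
# on the Virtual Chassis: is igmp-snooping configured for the VLAN?
show configuration protocols igmp-snooping

# which interfaces have joined which multicast groups
# (ELS Junos; on legacy EX images the command is "show igmp-snooping membership")
show igmp snooping membership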

hope this helps!
 
Thanks Stoiko! I'll give it a go here in a sec!
 
On the LACP config, we are using a layer2+3 hash policy (the bond stanza looks roughly like the sketch below). What you mentioned made it sound like a layer3+4 hash policy might be better? From what I've read, layer3+4 isn't fully 802.3ad compliant, so we steered clear of it. Thoughts?
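
For reference, the relevant part of /etc/network/interfaces on the test node - NIC names, netmask and gateway here are placeholders rather than our exact values:
Code:
# sketch of the LACP bond + bridge on the test node (names/addresses are placeholders)
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address 192.168.0.3
        netmask 255.255.255.0
        gateway 192.168.0.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0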
 
Here's what I see using the multicast address of the cluster and having the test node plugged into only the second switch. Thoughts?


Code:
root@node02:~# corosync-cmapctl -g totem.interface.0.mcastaddr
totem.interface.0.mcastaddr (str) = 239.192.174.244
root@node02:~# omping -v -c 600 -i 1 -q 192.168.0.2 192.168.0.3 -m 239.192.174.244
192.168.0.2 : waiting for response msg
192.168.0.2 : joined (S,G) = (*, 239.192.174.244), pinging
192.168.0.2 : given amount of query messages was sent

192.168.0.2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.031/0.189/0.295/0.032
192.168.0.2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.047/0.163/0.256/0.032
Waiting for 3000 ms to inform other nodes about instance exit
 
regarding the lacp hash policy - if possible I would stick with layer 2+3 (it seems more robust and doesn't depend on the switch also supporting it, since it's in the standard). The only potential gain from 3+4 is that connections between two given nodes could be split across the two links - but in a 14-node cluster I'd expect the traffic to get distributed anyway (although any given pair of nodes will always use only one link).
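
you can also double-check what the node side actually negotiated (assuming the bond is called bond0):
Code:
# negotiated mode, transmit hash policy and link state of the bond
grep -E 'Bonding Mode|Hash Policy|MII Status' /proc/net/bonding/bond0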

Apart from that - check your journal (basically skim through the whole thing, but focus on corosync.service and then later on pve-cluster.service/pmxcfs). Check that on all nodes in the cluster, not only on the one that isn't joining.
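
e.g. something like this on every node:
Code:
# corosync and pmxcfs (pve-cluster) messages since the last boot
journalctl -b -u corosync -u pve-cluster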

are the `/etc/corosync/corosync.conf` files identical on all nodes?
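
a quick way to compare is to checksum them on each node (`/etc/pve/corosync.conf` is the cluster-wide copy that gets synced to `/etc/corosync/corosync.conf`):
Code:
# run on every node and compare the output
md5sum /etc/corosync/corosync.conf /etc/pve/corosync.conf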

If that doesn't help - try to see where the packets are dropped with tcpdump (on the bond and on both member interfaces of the bond).
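
for example (bond/NIC names are placeholders; 5404/5405 are the corosync defaults - check mcastport in corosync.conf if you changed it):
Code:
# on the bond itself
tcpdump -ni bond0 udp port 5404 or udp port 5405
# and separately on each physical member of the bond
tcpdump -ni eno1 udp port 5404 or udp port 5405
tcpdump -ni eno2 udp port 5404 or udp port 5405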

hope this helps!
 
