Hey everyone,
I've recently run into a problem with our Proxmox cluster. We're increasing the number of network uplinks on each node, and a node with both links active can no longer participate in the quorum/cluster.
Here's a brief overview of the configuration:
- 14 node Proxmox cluster
- 2x Juniper switches clustered/stacked in a Virtual Chassis (appears as one switch to hosts)
- Primary node (node 1 for conversation sake) is connected to the 1st switch in the cluster
- Primary node has LACP enabled and active but only one port connected at the moment.
- Testing node (node 2 for conversation sake) is connected to both the 1st switch and 2nd switch in the cluster.
- Testing node shows both links up, with LACP negotiated and active (the bond definition is sketched below this list).
- Testing node appears offline in cluster web UI.
- Multicast appears to be working, as confirmed by running omping between the primary and testing nodes.
- Testing node logs "pvestatd: ipcc_send_rec failed: Connection refused" and "pvestatd: status update error: Connection refused" when pve-cluster is restarted.
- Testing node logs "pvesr: error with cfs lock 'file-replication_cfg': no quorum!" during pvesr startup.
- pvecm status on the testing node shows quorum "Activity blocked".
- If the Testing node's second uplink (into the second switch) is disconnected, the node quickly rejoins the cluster.
- If the Testing node's first uplink (into the first switch) is disconnected, the node quickly drops from the cluster, but all data traffic (guest and management) continues to function.
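For reference, the bond on the testing node is defined roughly like the sketch below in /etc/network/interfaces (NIC names, netmask/gateway, and hash policy here are placeholders from memory, not a verbatim copy of our config); the primary node is set up the same way but currently only has one member link cabled.

Code:
# hypothetical NIC names; the two ports go to switch 1 and switch 2 of the Virtual Chassis
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer2+3

# management/cluster bridge on top of the bond
auto vmbr0
iface vmbr0 inet static
        address 192.168.0.3
        netmask 255.255.255.0
        gateway 192.168.0.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0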
I'm kind of at a loss. Everything looks okay from a network configuration standpoint: multicast is working, pings are fine, and the hosts files look correct.
Any help would be greatly appreciated!
Here's what the relevant command output and logs look like:
omping -v -c 600 -i 1 -q 192.168.0.2 192.168.0.3
Code:
root@node02:~# omping -v -c 600 -i 1 -q 192.168.0.2 192.168.0.3
192.168.0.2 : waiting for response msg
192.168.0.2 : waiting for response msg
192.168.0.2 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.0.2 : given amount of query messages was sent
192.168.0.2 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.027/0.170/0.230/0.040
192.168.0.2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.037/0.194/0.259/0.039
Waiting for 3000 ms to inform other nodes about instance exit
journalctl -xe
Code:
Mar 25 08:48:56 node02 systemd[1]: Starting The Proxmox VE cluster filesystem...
-- Subject: Unit pve-cluster.service has begun start-up
-- Defined-By: systemd
-- Support: [removed]
--
-- Unit pve-cluster.service has begun starting up.
Mar 25 08:48:56 node02 pmxcfs[1258860]: [status] notice: update cluster info (cluster name production-cluster, version = 95)
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[1] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[2] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[3] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: ipcc_send_rec[4] failed: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: status update error: Connection refused
Mar 25 08:48:56 node02 pvestatd[10230]: status update time (5.013 seconds)
Mar 25 08:48:57 node02 snmpd[3257]: error on subcontainer 'ia_addr' insert (-1)
Mar 25 08:48:58 node02 systemd[1]: Started The Proxmox VE cluster filesystem.
-- Subject: Unit pve-cluster.service has finished start-up
-- Defined-By: systemd
-- Support: [removed]
--
-- Unit pve-cluster.service has finished starting up.
--
-- The start-up result is done.
Mar 25 08:49:00 node02 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support: [removed]
--
-- Unit pvesr.service has begun starting up.
Mar 25 08:49:00 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:01 node02 corosync[1258776]: notice [TOTEM ] A new membership (192.168.0.3:1355660) was formed. Members
Mar 25 08:49:01 node02 corosync[1258776]: [TOTEM ] A new membership (192.168.0.3:1355660) was formed. Members
Mar 25 08:49:01 node02 corosync[1258776]: warning [CPG ] downlist left_list: 0 received
Mar 25 08:49:01 node02 corosync[1258776]: notice [QUORUM] Members[1]: 3
Mar 25 08:49:01 node02 corosync[1258776]: notice [MAIN ] Completed service synchronization, ready to provide service.
Mar 25 08:49:01 node02 corosync[1258776]: [CPG ] downlist left_list: 0 received
Mar 25 08:49:01 node02 corosync[1258776]: [QUORUM] Members[1]: 3
Mar 25 08:49:01 node02 corosync[1258776]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 25 08:49:01 node02 pmxcfs[1258860]: [dcdb] notice: members: 3/1258860
Mar 25 08:49:01 node02 pmxcfs[1258860]: [dcdb] notice: all data is up to date
Mar 25 08:49:01 node02 pmxcfs[1258860]: [status] notice: members: 3/1258860
Mar 25 08:49:01 node02 pmxcfs[1258860]: [status] notice: all data is up to date
Mar 25 08:49:01 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:01 node02 kernel: sd 15:0:0:0: alua: supports implicit and explicit TPGS
Mar 25 08:49:01 node02 kernel: sd 15:0:0:0: alua: device naa.6e843b6e043cb3cd830cd4b86db6b3d5 port group 0 rel port 1
Mar 25 08:49:02 node02 kernel: sd 15:0:0:0: alua: port group 00 state A non-preferred supports TOlUSNA
Mar 25 08:49:02 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:03 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:04 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:05 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:06 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:07 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:08 node02 pvesr[1258894]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 25 08:49:09 node02 pvesr[1258894]: error with cfs lock 'file-replication_cfg': no quorum!
Mar 25 08:49:09 node02 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Mar 25 08:49:09 node02 systemd[1]: Failed to start Proxmox VE replication runner.
-- Subject: Unit pvesr.service has failed
pvecm status
Code:
root@node02:~# pvecm status
Quorum information
------------------
Date:             Mon Mar 25 08:58:43 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000003
Ring ID:          3/1355816
Quorate:          No

Votequorum information
----------------------
Expected votes:   14
Highest expected: 14
Total votes:      1
Quorum:           8 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 192.168.0.3 (local)
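If I'm reading that right, quorum is 8 because that's the majority of the 14 expected votes (floor(14/2) + 1 = 8), so a single-member partition like this one can never become quorate on its own.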