[SOLVED] Six node cluster has lost its unity

edtrumbull · Sep 29, 2023

I have a cluster of six servers running proxmox 7.4.16.
Three of these servers run ceph, and provide the storage for the VMs hosted by the other three servers. Ceph version is 17.2.6

This configuration has been running without problems for half a year or more.

Today, when I opened the GUI, there were no nodes visible.

I can ping each of the nodes and ssh them without problems. Each node
can ping and ssh the others.

If I run pvecm status on the different nodes, I get different
results.

For example, ceph node 0 gives me:
root@ceph-00:~# pvecm status
Cluster information
-------------------
Name: Sanford
Config Version: 10
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Fri Sep 29 10:10:10 2023
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.646c
Quorate: No

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 1
Quorum: 4 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.29.134.20 (local)

but node kvm-01 gives me:
oot@skvm-01:/etc/ssh# pvecm status
Cluster information
-------------------
Name: Sanford
Config Version: 10
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Fri Sep 29 10:10:47 2023
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 1.647c
Quorate: No

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 1
Quorum: 4 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.29.134.11 (local)

and as far as I can determine none of the nodes are in agreement. It
is also worth noting that the time to run the pvecm status command is
quite long - on the order of 45 seconds to a minute

Is there a way to bring these wayward nodes back together?
It appears to me that the VM are still running, and I'd prefer not to
reboot nodes willy-nilly unless I have to.

bbgeek17 · Sep 29, 2023

edtrumbull said:
I have a cluster of six servers running proxmox 7.4.16.

This is not a recommended setup. Running an even number of nodes in a cluster is an easy precursor to a split-brain scenario. A brief network outage can put the cluster in a degraded mode, which would require manual administrative intervention to resolve.
As per PVE HA documentation, you should have an external quorum device in an even-numbered cluster:
https://pve.proxmox.com/wiki/Cluster_Manager#:~:text=We support QDevices,QDevice at all.

You mentioned that different nodes have different view of the cluster completeness, but you have not presented that view. One way to resolve this is to reboot all nodes and let them re-negotiate cluster membership. If you want to avoid a reboot, you need to get better understanding of which nodes are talking to each other and which are not. Then you can attempt to restart cluster services on one node at a time, with the goal of bringing more nodes into a cluster: https://pve.proxmox.com/wiki/Cluste...may have to restart the corosync service via:

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

edtrumbull · Sep 29, 2023

Thank you for your answer, bbgeek17.
I should have mentioned that HA is not configured in this cluster, sorry.
However, the systemctl restart corosync command you pointed to in the documents resolved the problem.

bbgeek17 · Sep 29, 2023

edtrumbull said:
I should have mentioned that HA is not configured in this cluster, sorry.

Glad to hear you got it resolved. As you noticed, cluster health is not dependent on having actual HA-aware resources (ie VMs). The HA is implied once you construct the cluster, its not just for a single pane of glass management.

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

Search

Search

[SOLVED] Six node cluster has lost its unity

edtrumbull

New Member

bbgeek17

Distinguished Member

edtrumbull

New Member

bbgeek17

Distinguished Member