Quorum question - 5 Nodes over 2 data centers

elnino54

New Member
Oct 29, 2024
Hi all,
We are in the process of validating Proxmox for our production environment, which is currently built as a single cluster.

DC1 will have 3 PVE hosts, DC2 will have 2 PVE hosts.

DC's are connected with dual 10Gbit fibre

Each DC also has dual internet connections, and SDWAN connections to each, so connectivity is very well maintained.

What would be the best way to ensure an appropriate quorum be retained in the event of an entire DC outage? Have a qdevice at a 3rd site with more than one vote?
 
You need the same number of nodes (or votes) in both DC. Then you add a third location to host the Quorum Device. That's the only way I know to achieve HA without manual intervention in case of a disaster.
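For reference, registering an external QDevice follows the pattern from the Cluster Manager documentation. A sketch, assuming a small Debian host at the third site (the IP is hypothetical; these commands require a live cluster):

```
# On the third-site host: run the qnetd vote server
apt install corosync-qnetd

# On every PVE cluster node: install the qdevice client
apt install corosync-qdevice

# On one PVE node: register the QDevice with the cluster
pvecm qdevice setup 203.0.113.10
```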

Disclaimer: I do not own such a setup.
 
That's not really relevant for my case - as I said, the DCs are connected via dual 10Gbit (dark) fibre, so it's <1ms between DCs.

According to the documentation, a qdevice CAN be over a slower connection, so should be fine on SDWAN.
What will be the storage of the cluster?
Each site also has its own HA-mirrored storage (Pure) with real-time replication.
Yes, implementing a quorum device at a third DC with, say, 3 votes could be effective.
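For illustration only (the official docs quoted further down in this thread discourage vote tweaking), the QDevice's weight lives in the quorum/device section of corosync.conf, which `pvecm qdevice setup` normally writes for you. A hypothetical hand-edited fragment might look like:

```
quorum {
  provider: corosync_votequorum
  device {
    model: net
    votes: 3              # illustrative weight; PVE manages this itself
    net {
      tls: on
      host: 203.0.113.10  # hypothetical third-site qnetd host
      algorithm: lms      # "last man standing"; PVE selects the algorithm
    }
  }
}
```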
That's what I was thinking might be the best option, but I wasn't sure whether there was anything I'm not allowing for.
 
If the cluster consists of an odd number of nodes, a QDevice should be avoided:

We support QDevices for clusters with an even number of nodes and recommend it for 2 node clusters, if they should provide higher availability. For clusters with an odd node count, we currently discourage the use of QDevices. The reason for this is the difference in the votes which the QDevice provides for each cluster type. Even numbered clusters get a single additional vote, which only increases availability, because if the QDevice itself fails, you are in the same position as with no QDevice at all.

On the other hand, with an odd numbered cluster size, the QDevice provides (N-1) votes — where N corresponds to the cluster node count. This alternative behavior makes sense; if it had only one additional vote, the cluster could get into a split-brain situation. This algorithm allows for all nodes but one (and naturally the QDevice itself) to fail. However, there are two drawbacks to this:

If the QNet daemon itself fails, no other node may fail or the cluster immediately loses quorum. For example, in a cluster with 15 nodes, 7 could fail before the cluster becomes inquorate. But, if a QDevice is configured here and it itself fails, no single node of the 15 may fail. The QDevice acts almost as a single point of failure in this case.

The fact that all but one node plus QDevice may fail sounds promising at first, but this may result in a mass recovery of HA services, which could overload the single remaining node. Furthermore, a Ceph server will stop providing services if only ((N-1)/2) nodes or less remain online.

If you understand the drawbacks and implications, you can decide yourself if you want to use this technology in an odd numbered cluster setup. https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
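The vote arithmetic from that quote can be sketched numerically for the 5-node cluster discussed here (an odd cluster's QDevice contributes N-1 votes per the docs above):

```shell
# Vote math for an odd-sized cluster with a QDevice (per the quoted docs).
nodes=5                           # the 3+2 cluster in this thread
qdevice_votes=$((nodes - 1))      # odd cluster: QDevice gets N-1 = 4 votes
total=$((nodes + qdevice_votes))  # 5 + 4 = 9 votes in play
quorum=$((total / 2 + 1))         # majority threshold: 5 votes
echo "total=$total quorum=$quorum"
# → total=9 quorum=5
# QDevice alive: a single surviving node (1 + 4 = 5) keeps quorum.
# QDevice dead: all 5 node votes are needed, so no node may fail.
```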

I would also avoid messing around with changing votes. I see these possibilities:
- Add another node to the two-node DC so you have a six-node cluster, with the qdevice as the seventh vote
- Split the five-node cluster into a three-node and a two-node cluster, and add the qdevice only to the two-node cluster. Use the Proxmox Datacenter Manager for cross-cluster migration.

Since both options probably don't fit your requirements (otherwise you would have already chosen one of them, wouldn't you?), I hope others have better ideas, sorry.
Edit: Added reference to proxmox datacenter manager
 
Since both options probably don't fit your requirements (otherwise you would have already chosen one of them, wouldn't you?), I hope others have better ideas, sorry.
I'm not ruling out splitting it into two clusters, but I don't know what that looks like in reality, or what the consequences would be. It certainly crossed my mind, but it seemed overcomplicated for what I'm trying to achieve.

I'm aware that a qdevice isn't recommended for odd-numbered configurations, but this did seem like a plausible solution if the qdevice has more weight.

I still might be overcomplicating everything, as the likelihood of a failure that would cause a loss of quorum is extremely low. I just don't want a situation where a site isolates itself unnecessarily. It might be a case of trial-and-error DR scenario testing.
 
There is nothing inherently wrong with stretching your cluster in this way, latency permitting.

Take this wiki page as an example (even though it is focused on Ceph storage, the point still stands): https://pve.proxmox.com/wiki/Stretch_Cluster

I would strongly encourage you to have an even number of PVE nodes at both sites to eliminate the need to tweak the number of votes provided by the qdevice.

If you want seamless failover, you need to keep all of these PVE instances in the same cluster (Datacenter).

PDM remote migrations or other "hackier" methods like syncing VM configuration files between clusters would "work", but would surely add unnecessary complexity and extra manual upkeep and/or scripting to even come close to seamless failover territory.

Hope that helps.
 
Thanks, that's a great explanation and matches our system well. We would still have seamless failover, because we have replicated storage at each DC (Pure Storage).

Would a potential solution to this be to install an extra node in the DC with 2 servers, but effectively excluded from the compute pool, and then retain the qdevice at a third site?

I'm thinking something like a NUC or other non-server type PC, just for the vote.
 
Would a potential solution to this be to install an extra node in the DC with 2 servers, but effectively excluded from the compute pool, and then retain the qdevice at a third site?
This would work. It's not an ideal solution, but if you can ensure that no workloads ever get scheduled on it (e.g. by using something like prox-lb, or the affinity rules of Proxmox VE's native ha-manager, see https://pve.proxmox.com/pve-docs/chapter-ha-manager.html), it should be "good enough".
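On the ha-manager side, one classic way to keep the vote-only node empty is a restricted HA group pinned to the real compute nodes (newer PVE releases express the same idea as node affinity rules). A sketch with made-up node and group names, to be run on the cluster:

```
# Create an HA group limited to the compute nodes; restricted=1 means
# HA resources in this group may ONLY run on the listed nodes.
ha-manager groupadd compute-only --nodes "pve1,pve2,pve3,pve4,pve5" --restricted 1

# Pin an HA-managed VM to that group, so it never lands on the vote-only node.
ha-manager add vm:100 --group compute-only
```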