Even no. of PVE nodes?

proxwolfe

Jun 20, 2020
Hi,

I have a three node home lab cluster.

The reason I set it up like this is that it is recommended to have an uneven number of nodes in order to avoid a split-brain situation when one of the nodes fails.

After having used PVE for a while now, it is dawning on me that this is only relevant if (and maybe to the extent) you need HA.

In the HA manager I found a place where I can define a group. Does that mean that I could have an even numbered (4 node) cluster as well but within that cluster define an uneven numbered (3 node) group for HA purposes? And the fourth node would be ignored for HA purposes?

The reason I am asking is that I have a fourth server that I would like to sometimes use for special applications (as it has more oomph than the rest of the bunch) but I don't want to have it running all the time (as it consumes as much power as the rest of the bunch together). When I need it, it would be very convenient to migrate VMs to it easily. But I don't want to risk HA no longer functioning as intended.

So, would that work?

Thanks!
 
So, would that work?
As long as you still have at least 3 out of 4 nodes, yes. If you only had 2 out of 4, you would not be able to change things on the cluster (start/stop/configure) and nodes may get fenced.

You could run a 5th node on a Pi-like system that only acts as a quorum device if you're unsure whether you want to risk this. You could also set the expected votes to 1 so that at least one node is enough. If you don't have shared storage in your cluster and only want to use the cluster as one management interface, you should always be fine. A split-brain setup is only bad if you have shared storage and could potentially write to the same data from multiple nodes, in which case the result would be undefined.
 
The reason I set it up like this is that it is recommended to have an uneven number of nodes in order to avoid a split-brain situation when one of the nodes fails.

You will likely never have a split-brain when one node fails, as long as all systems are connected to the same switches in the same room etc. If one node fails, the other 3 remain and will form a quorum. Just think about which components might fail and whether the failure would lead to a situation where 2 systems remain online while the other two also remain online but can no longer see the first 2 nodes. In an "all in one room with the same switches" setup this is usually very unlikely.

After having used PVE for a while now, it is dawning on me that this is only relevant if (and maybe to the extent) you need HA.

In the HA manager I found a place where I can define a group. Does that mean that I could have an even numbered (4 node) cluster as well but within that cluster define an uneven numbered (3 node) group for HA purposes? And the fourth node would be ignored for HA purposes?

Yes, this 4th node would never get a VM if you restrict the VMs via an HA group to only run on nodes 1-3. But as soon as you put one resource under HA, all 4 nodes are accounted for in HA. There will be a timestamp for each node. AFAIK it doesn't matter; node 4 could still fence itself even if it has no active HA resource on it. But maybe someone from the Proxmox team can confirm.
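For illustration, such a restricted group could be created on the CLI roughly like this (group name, node names and VM ID are placeholders, not taken from this thread):

Code:
# create an HA group limited to the three 24/7 nodes; "restricted" means
# resources in this group are never recovered onto nodes outside of it
ha-manager groupadd only-24x7 --nodes pve1,pve2,pve3 --restricted 1
# put a VM under HA management and pin it to that group
ha-manager add vm:100 --group only-24x7 --state started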

The reason I am asking is that I have a fourth server that I would like to sometimes use for special applications (as it has more oomph than the rest of the bunch) but I don't want to have it running all the time (as it consumes as much power as the rest of the bunch together). When I need it, it would be very convenient to migrate VMs to it easily. But I don't want to risk HA no longer functioning as intended.

You could set up qm remote-migrate to live-migrate from your 3-node cluster to this single node.
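That would look roughly like this (VM ID, API token, host, fingerprint, bridge and storage names are all placeholders; the feature is still marked experimental, so check the qm man page on your PVE version for the exact syntax):

Code:
# live-migrate VM 100 from the cluster to the standalone node, keeping ID 100 there
qm remote-migrate 100 100 \
  'apitoken=PVEAPIToken=root@pam!migrate=<SECRET>,host=192.168.1.40,fingerprint=<FP>' \
  --target-bridge vmbr0 --target-storage local-lvm --online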
 
A split-brain setup is only bad if you have shared storage and could potentially write to the same data from multiple nodes, in which case the result would be undefined.
Good point: Yes, I do have shared storage (all three original nodes are also CEPH nodes with OSDs). The fourth node would probably be also a CEPH node (but without OSDs).

As long as you still have at least 3 out of 4 nodes, yes. If you only had 2 out of 4, you would not be able to change things on the cluster (start/stop/configure) and nodes may get fenced.
In my idea, node no. 4 would be completely ignored for all quorum purposes. Most of the time it would be offline anyway. So I don't want a situation where the failure of one of the three "24/7" nodes would lead to fencing (because only 2 out of 4 are available). Ideally, it should at all times be 2 out of 3. Is that possible?
 
If you want to run a cluster with an even number of nodes, the recommended way is to set up a QDevice [1]. This can be any device, or even a VM/container, though it would be preferable if it is not running inside the cluster itself. Unlike corosync, the QDevice does not even need a very low-latency link.

Do note that with 4 nodes + qdevice you need 3 votes to reach quorum.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
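The setup itself is short; roughly (the IP is a placeholder, see [1] for details):

Code:
apt install corosync-qnetd       # on the external QDevice host (e.g. a Pi)
apt install corosync-qdevice     # on every cluster node
pvecm qdevice setup 192.168.1.50 # run once on one cluster node, pointing at the QDevice host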
 
You will likely never have a split-brain when one node fails, as long as all systems are connected to the same switches in the same room etc. If one node fails, the other 3 remain and will form a quorum. Just think about which components might fail and whether the failure would lead to a situation where 2 systems remain online while the other two also remain online but can no longer see the first 2 nodes. In an "all in one room with the same switches" setup this is usually very unlikely.
Makes sense. All nodes are connected via the same infrastructure and are located in the same room. (I would love to have a geographically distributed cluster, but the latency of the connections available to me (i.e. end-user DSL lines) is too high for Corosync, as I understand it.)

But I do fiddle around with the cluster from time to time and so I have actually had a situation more than once where only one node was down (because of me, whether intentional or not) while the others weren't.

So this is the HA situation I have in mind.

Yes, this 4th node would never get a VM if you restrict the VMs via an HA group to only run on nodes 1-3. But as soon as you put one resource under HA, all 4 nodes are accounted for in HA. AFAIK it doesn't matter; node 4 could still fence itself even if it has no active HA resource on it. But maybe someone from the Proxmox team can confirm.
Yeah, this is the crucial point, I think. I need node no. 4 to be ignored for all quorum purposes. I had hoped I could achieve this by leaving it out of the group.

You could set up qm remote-migrate to live-migrate from your 3-node cluster to this single node.
Hah! I had read about plans to implement this a while ago, but I had completely missed that it is already available. So thank you very much for pointing this out. This could become a game changer for me.

Alas, atm the process still looks quite cumbersome. I'm afraid by the time I'd actually manage to send a VM across, I could have easily backed it up to my PBS and restored it from there on node no. 4.

But I will definitely keep this in mind.
 
If you want to run a cluster with an even number of nodes, the recommended way is to set up a QDevice [1]. This can be any device, or even a VM/container, though it would be preferable if it is not running inside the cluster itself. Unlike corosync, the QDevice does not even need a very low-latency link.
I did run the third node for a while as a (full-featured) VM off the server on which my PBS resides, before giving it its own server hardware. So this is an option. That VM was part of both the dedicated Corosync network and the dedicated Ceph network. I would like to avoid the need to bring a "new fifth node VM" into the dedicated Corosync network as well, but it would need to be, right? (Or could it just be on the normal management network?)

Do note that with 4 nodes + qdevice you need 3 votes to reach quorum.
This is what probably breaks it for me. I (think I) need to retain a 2-out-of-3 quorum rule. This is what I had hoped to achieve by setting up an HA group comprising only the original three nodes. But, as jsterr suspected, that won't work, if I understand you correctly. As soon as the cluster has four nodes, four becomes the benchmark, and I can't have it ignore the fourth node for quorum purposes, right?
 
Note that the QDevice provides one vote so an example quorate cluster would be 2 nodes + QDevice. Again, you can use anything as a QDevice, a Raspberry Pi for example, or a VM/Container as long as it is not hosted in the cluster itself.
 
Note that the QDevice provides one vote so an example quorate cluster would be 2 nodes + QDevice.
Well, my cluster has three nodes and would, from time to time (when I turn on no. 4), have four nodes. And I am trying to find a solution that works both in the three node scenario as well as in the four node scenario. If it were possible to totally ignore node no. 4 when it is online, that could be such solution.
Again, you can use anything as a QDevice, a Raspberry Pi for example, or a VM/Container as long as it is not hosted in the cluster itself.
Good, but would it need to be on the dedicated Corosync network or would it be sufficient for it to be on the normal management network?
 
Good, but would it need to be on the dedicated Corosync network or would it be sufficient for it to be on the normal management network?
Yes, the normal management network would be sufficient. I have a cluster running right now with two dedicated corosync networks, and the QDevice is in the management network.
 
And how would I implement that?

Now, I have three nodes and two form a quorum.

When I add the fourth server, I would also add the quorum device, right? Then this would give me five "nodes" out of which I need four for quorum. But node no. 4 will be offline most of the time. If one of the original "24/7" nodes fails, the cluster would not have quorum, right?

Am I missing something here?
 
You need 3 votes for quorum, two nodes plus qdevice provide 3.
 
Okay, new try:

Now, I have three nodes with three votes and need two for quorum.

If I add a fourth server, I also add the quorum device. This would give me four nodes (the quorum device does not count here) with four votes out of which I need three for quorum.

So when node no. 4 is offline (which is most of the time), I still have three nodes that have quorum all by themselves (the quorum device does not need to do anything).

If one of the three original "24/7" nodes then fails, I only have two nodes with two votes and this is when the quorum device helps out and gives the remaining two nodes its vote so that they still have the required three votes for quorum.

Did I get it right this time?
 
If one of the three original "24/7" nodes then fails, I only have two nodes with two votes and this is when the quorum device helps out and gives the remaining two nodes its vote so that they still have the required three votes for quorum.

Yes, that's more like it. Do note that from a voting perspective nodes and QDevices are not very different, so yes, the QDevice does count and you should count 3 votes in your case; the QDevice will emit its vote even if you have your entire cluster up.

Just as you take into account the 4th node going down, you also need to account for the case when the QDevice is off. In that case, 3 of your 4 nodes will suffice too.
 
One more try:

Now, I have three nodes with three votes and need two for quorum.

If I add a fourth server, I also add the quorum device. This would give me five "nodes" (the quorum device counts) with five votes out of which I need three for quorum.

So when node no. 4 is offline (which is most of the time), I still have three nodes that have quorum all by themselves (the quorum device will vote anyway but that doesn't hurt).

If one of the three original "24/7" nodes then fails, I only have two nodes with two votes and this is when the quorum device's vote matters. It gives the remaining two nodes its vote so that they still have the required three votes for quorum.

If the quorum device fails, there is no harm done, as long as at least two of the four real nodes are online.

Is that correct now? I feel like I'm ready to deploy that thing :)

Oh, one more thing:

Should I just install it directly on my cluster-external (Proxmox Backup) server? Or would it be better to put it in a VM? (My PBS is on the same management network as the cluster (but not part of the Corosync network), and it shares the hardware with a PVE which, however, only hosts one ISO fileserver VM that I want online even when the cluster fails; so there is still room for another cluster-external VM.)
 
I am not sure I understand the setup you mention, but the only requirements are that it not be on the same machine as one of the cluster nodes and that it be reachable (e.g. via ping) from the nodes in the cluster. It can be on a Proxmox Backup Server using the same management network, for example.

The reason you don't want it to be on the same machine as one node in the cluster is that for all intents and purposes that would be a node with two votes but multiple failure points; at that point you might as well set up that node to have two votes. Having a node with two votes is a bad idea too: you cannot have just any two members of the cluster absent at the same time, in contrast with a dedicated QDevice, where any two votes can be absent at a time.
 
Here's another approach:
  • you have three main nodes. They shall always be running. One node may have a malfunction and the other two shall continue to be available
  • you add a fourth node. The rule above shall stay intact regardless of the availability of this fourth node
You can specify a "quorum_votes" for each corosync node. The default is "1" of course. You start with three nodes --> Expected=3; Quorum=2. When you add node four you end up with Expected=4; Quorum=3 --> problematic (w/o QDev)

Now: prepare Node 1+2+3 to count as two votes! Having only three nodes you now get Expected=6; Quorum=4. The actual behavior is completely unmodified. All normal rules apply.

Adding the fourth node with weight = 1 (!) now results in Expected=7; Quorum=4.

If I am right you can now lose one "normal" node AND the fourth one without losing Quorum! The surviving two "normal" nodes still have "4" votes = maintaining the Quorum.
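For illustration, the nodelist in /etc/pve/corosync.conf would then look roughly like this (names and addresses are placeholders; corosync.conf should only be edited the way the Cluster Manager documentation describes):

Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 2
    ring0_addr: 10.0.0.1
  }
  # pve2 and pve3 analogous, also with quorum_votes: 2
  node {
    name: pve4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.0.4
  }
}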


While I am sure this approach is not really recommended, it is technically possible...

Good luck :)
 
If the quorum device fails, there is no harm done, as long as at least two of the four real nodes are online.

If both qdev and 2 hosts are off, your cluster will have no quorum (Expected=5, votes=2). Your sentence should be more like "If the quorum device fails, there is no harm done, as long as at least three of the four real nodes are online."

Besides Proxmox quorum, you should keep in mind Ceph quorum: if two of your three monitors are down, all I/O is paused. Also, if not enough OSDs are available to honor your min_size (typically 2), I/O will be halted too.
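For checking that side, a few standard Ceph commands can be used (the pool name is a placeholder):

Code:
ceph -s                            # overall health, including monitor quorum
ceph quorum_status                 # which monitors are currently in quorum
ceph osd pool get mypool min_size  # the min_size that must stay satisfied for I/O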
 
Keep in mind that if you use CEPH, you need to take care of the votes there too. Corosync and Ceph are AFAIK totally separate and need to be taken care of separately as well.
 
Keep in mind that if you use CEPH, you need to take care of the votes there too. Corosync and Ceph are AFAIK totally separate and need to be taken care of separately as well.
Good point.

So my plan was to have the fourth node join CEPH but not host any OSDs. If I also don't set it up as a Monitor or Manager, would that keep it neutral in the quorum count?
 
