cluster quorum

stuartbh

Active Member
Dec 2, 2019
I recently created a 2-node cluster and then had reason to take the second node offline. In doing so, I noticed that the entire PVE cluster stopped functioning until the second node rejoined. Is there a way to tell PVE that, when only one node of a two-node cluster is functioning, it should still consider itself quorate (without adding a third node at the moment)?

I was contemplating temporarily removing the second node, but then I thought I read that doing so would mean I need to reinstall the entire second node before it can rejoin the cluster. Is this correct?

I ran an update on the second node and doing so messed it up somehow, and I am looking at that now (I used standard apt commands to upgrade the PVE node). I may end up needing to reinstall it anyway; if so, I suppose I ought to delete the node from the cluster first and rejoin after the install.

Thanks in advance to anyone providing insights and do stay safe and healthy during these challenging times.

Stuart
 
Our wiki should answer your questions: https://pve.proxmox.com/wiki/Cluster_Manager
You can lower the required quorum (not recommended) or use an external quorum device.

Yes, removing the node from the cluster means you have to reinstall it.
 
Is there a way to tell PVE that, when only one node of a two-node cluster is functioning, it should still consider itself quorate
Yes, there is: pvecm expected 1

In the Cluster management documentation you will find two ways of removing a cluster node. While it is possible to remove a node without having to reinstall, it is not recommended.

While you technically don't need it, adding a QDevice for external vote support might be interesting to you. With this you don't need a third full node, but get the benefits of having a 3-node cluster.
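In case it helps, a rough sketch of how adding a QDevice usually looks (the IP below is only a placeholder for the external machine; see the Cluster Manager documentation linked above for the authoritative steps):

Code:
# On the external machine that will provide the third vote (any Debian-based host):
apt install corosync-qnetd

# On both PVE cluster nodes:
apt install corosync-qdevice

# Then, from one of the cluster nodes (root SSH access to the external host is
# needed during setup; 192.0.2.50 is a placeholder address):
pvecm qdevice setup 192.0.2.50

# Afterwards the Qdevice should show up with one vote in:
pvecm status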
 
I noticed that there is a place in the /etc/pve/corosync.conf file to set the number of votes each node gets. What would be wrong with just giving each node two votes (or only the working node two votes)? Would this not also solve the problem for the moment?

The other temporary solution would be adding the line "pvecm expected 1" to the /etc/pve/corosync.conf file at the bottom?

I am not sure why the other node says pve-manager is not configured and will not be configured, but I may end up just deleting the node from the cluster and reinstalling it anyway.

Stuart
 
What would be wrong with just giving each node two votes?
Nothing really, but it would have the same effect. For your cluster to be quorate it needs to have >50% of votes. If both of your nodes have 2 votes and one goes down, it will still be non-quorate just like if both of them had one vote.
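To put numbers on it: with 1 vote each, the total is 2 and quorum requires 2 votes, so a single surviving node (1 of 2) is non-quorate; with 2 votes each, the total is 4 and quorum requires 3 votes, so a single surviving node (2 of 4) is still non-quorate.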

Changing these values in general will most likely just lead to unintended consequences though and is most certainly not what you are looking for.

the other temporary solution would be adding the line "pvecm expected 1" to the /etc/pve/corosync.conf file at the bottom?
No, pvecm is a command line tool (the Proxmox Virtual Environment Cluster Manager). When your cluster is non-quorate (i.e. at least half of all nodes are dead), the remaining nodes switch PVE management into read-only mode. Because of that you will no longer be able to change or manage your VMs and containers, and you will also not be able to log in to the GUI. This is done to avoid cluster split-brain problems in which nodes end up in inconsistent states.
Sometimes you still want to be able to manage your node, though (e.g. if you want to disband the cluster). In this case you can log in to your node (e.g. with SSH) and then issue the command pvecm expected 1. This tells the node to override the expected-votes value for the time being.
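For example (a minimal sketch on the surviving node; exact output varies):

Code:
# SSH into the remaining node as root (the GUI is unavailable while non-quorate)
pvecm status      # the Votequorum section will show "Quorate: No"
pvecm expected 1  # lower the expected votes so this single node becomes quorate
pvecm status      # should now show "Quorate: Yes"; /etc/pve is writable again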
 
Nothing really, but it would have the same effect. For your cluster to be quorate it needs to have >50% of votes. If both of your nodes have 2 votes and one goes down, it will still be non-quorate just like if both of them had one vote.

Changing these values in general will most likely just lead to unintended consequences though and is most certainly not what you are looking for.


No, pvecm is a command line tool (the Proxmox Virtual Environment Cluster Manager). When your cluster is non-quorate (i.e. at least half of all nodes are dead), the remaining nodes switch PVE management into read-only mode. Because of that you will no longer be able to change or manage your VMs and containers, and you will also not be able to log in to the GUI. This is done to avoid cluster split-brain problems in which nodes end up in inconsistent states.
Sometimes you still want to be able to manage your node, though (e.g. if you want to disband the cluster). In this case you can log in to your node (e.g. with SSH) and then issue the command pvecm expected 1. This tells the node to override the expected-votes value for the time being.

When I attempted to execute that command I got this:
Unable to set expected votes: CS_ERR_INVALID_PARAM

Once I shut the second node down, the command worked (I'd suggest that error is not very descriptive).

Also, once I fix the second node, how do I revert the expected-votes parameter back to its normal setting? Or would that just get fixed as a function of me deleting the second node from the cluster, reinstalling, and having the newly reinstalled node rejoin the cluster?

I am surely going to try setting up a QDevice as a third vote when quorum elections are held. I have a laptop that I use daily that runs Linux, so doing that is really an easy cheat/solution. I hadn't really thought about it until just now.

I also have an HP T620 with 4GB of RAM and a 16GB internal SSD that runs pfSense, and I'm thinking of installing Proxmox on it and then running pfSense under Proxmox (as the only VM the T620 would have).

The pfSense instance (running directly on bare metal) is using only "23% of 3449 MiB" (approximately 800MB) of the 4GB of RAM and 4.9GB of the 11GB root partition on the SSD (the other 4GB of SSD is for swap, which is 0% used). I imagine Proxmox itself does not use more than 2GB of RAM; does that sound right? At any rate, another 4GB of RAM would not run more than $30 USD. I know a total of 8GB of RAM would be plenty, though if the price differential is negligible maybe I'd get an 8GB or 16GB RAM module.

It seems that my Proxmox installation is using about 7GB of DASD and pfSense about 5GB of DASD, so a 16GB SSD might be cutting it too close, I think.

If I did this, it would allow me to have another node in the cluster and also to have HA with pfSense on the T620 and another pfSense instance installed on one of the other two cluster nodes.

Have a safe and healthy day during these most challenging times.

Stuart
 
Sometimes you still want to be able to manage your node, though (e.g. if you want to disband the cluster). In this case you can log in to your node (e.g. with SSH) and then issue the command pvecm expected 1. This tells the node to override the expected-votes value for the time being.
I disbanded the cluster and left one node, but it requires quorum at startup. pvecm expected 1 has to be entered every time it starts. How do I make this value permanent?
 
When I run the command "pvecm expected 1" I'm able to start VMs and everything, but it is not permanent. My idea is that maybe we could make a startup service script that automatically runs the command when the Proxmox node starts up.
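Something like this untested sketch is what I have in mind (unit name, ordering and the sleep are my own guesses; as the reply below points out, forcing expected votes at every boot is risky):

Code:
# /etc/systemd/system/force-expected-votes.service  (hypothetical)
[Unit]
Description=Force expected votes to 1 on a standalone PVE node (risky workaround)
After=corosync.service pve-cluster.service

[Service]
Type=oneshot
# give corosync a moment to settle, then lower expected votes;
# sh -c avoids hard-coding the path to pvecm
ExecStart=/bin/sh -c 'sleep 10 && pvecm expected 1'

[Install]
WantedBy=multi-user.target

It would then be enabled with systemctl enable force-expected-votes.service.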
 
My idea is that maybe we could make a startup service script that automatically runs the command when the Proxmox node starts up.

Everything is possible. From my point of view, manipulating "expected 1" is for disaster recovery only. If you believe you need this continuously, then you are doing something wrong. Please be careful; you are begging for trouble - if not today, then probably in the long run...

Just my 2€¢
 
Okay, I have 2 nodes: a powerful PC node and a mini PC node for 24/7 things. I only use the PC node if I need a server for testing things or heavy game servers. It's offline most of the time, and I don't want to switch IPs every time; I want to be able to migrate VMs then.
 
I only use the PC node if I need a server for testing things

Then just do it the way you want - everything is possible :-)

Your system is small enough for this. You know everything about it. You can handle problems when they occur. (My approach is to avoid risky situations to continuously stay "stable", that's why I run clusters instead of single systems - so I am biased.)

Have fun!
 
Then just do it the way you want - everything is possible :)

Your system is small enough for this. You know everything about it. You can handle problems when they occur. (My approach is to avoid risky situations to continuously stay "stable", that's why I run clusters instead of single systems - so I am biased.)

Have fun!

I think what the OP actually needs is two_node: 1 but read up on it:

https://manpages.debian.org/bookworm/corosync/votequorum.5.en.html

The ideal approach in this case is - not to have the two clustered at all.

EDIT: And if that's really not an option, also check auto_tie_breaker: 1.
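Roughly, those live in the quorum section of /etc/pve/corosync.conf (a sketch only; remember to also bump config_version in the totem section when editing by hand):

Code:
quorum {
  provider: corosync_votequorum
  # Option A: classic two-node mode (implies wait_for_all: 1, so after a full
  # shutdown both nodes must be seen once before quorum is granted again)
  two_node: 1
  # Option B: instead, let a designated node win a 50/50 split
  # auto_tie_breaker: 1
  # auto_tie_breaker_node: lowest
}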
 
Okay, I have 2 nodes: a powerful PC node and a mini PC node for 24/7 things. I only use the PC node if I need a server for testing things or heavy game servers. It's offline most of the time, and I don't want to switch IPs every time; I want to be able to migrate VMs then.
Your nodes are not equally important. Just give the 24/7 node a second vote in corosync.conf. It will be able to reach quorum alone (2 of 3 votes), and you don't risk a split-brain condition as you would with pvecm expected 1.
 
Yeah, I just wanted to say that: the best and safest solution to losing quorum is to give the node that is always on a second vote. You can do that by editing /etc/pve/corosync.conf (e.g. with nano), finding your node, setting its votes to 2, and then restarting corosync with systemctl restart corosync.
And then with pvecm status you can see that quorum is still true because there are still 2 votes with 1 node :D
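Roughly, the node entry in /etc/pve/corosync.conf would end up looking like this (the key is quorum_votes; the name and address below are placeholders, and config_version in the totem section should be incremented so the change propagates):

Code:
node {
  name: alwayson        # the 24/7 node (placeholder name)
  nodeid: 1
  quorum_votes: 2       # was 1
  ring0_addr: 192.0.2.10
}

# then apply and verify:
systemctl restart corosync
pvecm status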
 
Yeah, I just wanted to say that: the best and safest solution to losing quorum is to give the node that is always on a second vote. You can do that by editing /etc/pve/corosync.conf (e.g. with nano), finding your node, setting its votes to 2, and then restarting corosync with systemctl restart corosync.
And then with pvecm status you can see that quorum is still true because there are still 2 votes with 1 node :D

I used to do this too, but it messes everything up later if you want to e.g. start using a Q device. You basically get the same result with auto_tie_breaker, which was made for this. Also, PVE update scripts throw funny warnings on anything other than 1 vote per cluster node...
 
Something I was going to try but haven't had the time... I have three PVE servers but it's just too much energy usage and not enough need to have them clustered, but I really like clustering over single nodes. I'm thinking of creating a three-node cluster whereby the third (backup) server will automatically shut down after seeing a 2+ vote quorum, and if one of the main servers sees a 1-vote quorum for, say, more than 30 seconds or so, then it will send a WoL magic packet to the backup server to turn it on. Then, if the main server which had gone down comes back online, the backup will notice it and (perhaps with a cron job?) will shut itself back down to maintain a 2-vote maximum. All the benefit of a 3-node cluster with 2/3 the energy usage.

If networking between them fails then they turn off the VMs anyway, so no difference there. I can afford having two or three minutes of downtime so for me it'd be OK. Apart from that down time, any downsides you can think of or reasons why that wouldn't work? It seems like it'd be better than running three nodes OR running a vote proxy, for my specific use case, that is, where I want 24/7 operation but I can deal with just two 9's or a few minutes of down time on occasion.

Oh, and if power goes out, all three come back on, then the backup turns itself off when it sees enough votes. Thoughts?? Anything I'm failing to consider??
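For the wake-up half, I was picturing something like this untested sketch running from cron on each main node (the MAC address is a placeholder, and wakeonlan would need to be installed):

Code:
#!/bin/sh
# wake the backup node if this node no longer sees a second vote
BACKUP_MAC="aa:bb:cc:dd:ee:ff"   # placeholder

VOTES=$(pvecm status | awk '/^Total votes:/ {print $3}')
if [ "${VOTES:-0}" -lt 2 ]; then
    wakeonlan "$BACKUP_MAC"
fi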
 
Something I was going to try but haven't had the time... I have three PVE servers but it's just too much energy usage and not enough need to have them clustered, but I really like clustering over single nodes. I'm thinking of creating a three-node cluster whereby the third (backup) server will automatically shut down after seeing a 2+ vote quorum, and if one of the main servers sees a 1-vote quorum for, say, more than 30 seconds or so, then it will send a WoL magic packet to the backup server to turn it on. Then, if the main server which had gone down comes back online, the backup will notice it and (perhaps with a cron job?) will shut itself back down to maintain a 2-vote maximum. All the benefit of a 3-node cluster with 2/3 the energy usage.

Well, it's not ALL the benefits: with genuine 3 nodes, there aren't any 2 "main" ones of which at least one needs to be up. (It is actually possible, with certain corosync settings, to have even a single surviving node keep running with quorum.)

If networking between them fails then they turn off the VMs anyway, so no difference there.

When networking is down and quorum lost, nothing really happens to running VMs, they will keep running.

I can afford having two or three minutes of downtime so for me it'd be OK. Apart from that down time, any downsides you can think of or reasons why that wouldn't work?

The way PVE implements High Availability (which I suppose is the reason you have 2 nodes up) will bring more carnage than benefit with 2 nodes left up without a Q device. Basically, if node 1 is fencing, node 2 might already start fencing as well before node 3 can start; meanwhile node 1 is back up and not seeing anyone, since node 3, with its delayed start, has begun fencing too; rinse and repeat.

It seems like it'd be better than running three nodes OR running a vote proxy, for my specific use case, that is, where I want 24/7 operation but I can deal with just two 9's or a few minutes of down time on occasion.

A Q device would be fine.

Oh, and if power goes out, all three come back on, then the backup turns itself off when it sees enough votes.

This tiny part would actually be fine.
 
Well, it's not ALL the benefits: with genuine 3 nodes, there aren't any 2 "main" ones of which at least one needs to be up. (It is actually possible, with certain corosync settings, to have even a single surviving node keep running with quorum.)



When networking is down and quorum lost, nothing really happens to running VMs, they will keep running.



The way PVE implements High Availability (which I suppose is the reason you have 2 nodes up) will bring more carnage than benefit with 2 nodes left up without a Q device. Basically, if node 1 is fencing, node 2 might already start fencing as well before node 3 can start; meanwhile node 1 is back up and not seeing anyone, since node 3, with its delayed start, has begun fencing too; rinse and repeat.



A Q device would be fine.



This tiny part would actually be fine.
I'll have to read more about fencing, but as I understood it, if I do NOT use it and do not add these nodes to the fencing domain, and replicate the VMs that need replication but which are largely static (though perhaps I could sync them weekly or something for the occasional change), then the issue you describe above wouldn't be an actual issue, correct? My understanding is that it's a bigger problem with shared storage which wouldn't be used. I'd have all of my most needed VMs replicated to the spare from both "main" devices whereby, when the spare spins up, it would start whichever VMs are needed and not anything extraneous. When the main one comes back online the VMs would sync anyway and handoff would occur again albeit with another delay after the shutdown of the backup.

Because the thing is, high availability is NOT my main goal. I prefer the simplicity of running all of the nodes in a cluster, and my issue has always been that they won't run with only a single node without changing the quorum setup as you alluded to above (change one of them to have two votes, which may be the route I go but offers no redundancy, or use a Q device, which I don't like and has no redundancy). What I want you could perhaps call 'medium availability'. It's a homelab, so nothing is mission critical; if I have to wait 5 minutes that's OK, but I'd prefer to not have HAProxy, my VPN, or Home Assistant go down for much longer than that. Apart from historical data on Home Assistant, these VMs in particular would be fine if run with data that's a week or two old.

Essentially I'm looking for a 2-node cluster with a cold spare for only the more critical VMs, which in normal circumstances may be running on either of those nodes. Is there another way you can think of to do that? I can't run everything on one node so HA with a Q device isn't going to work since one node failing will leave me without services and yet, paying for another 130W of continuous load (old Broadwell V4 servers) just isn't doable for me financially either. If it were, say, a database of customer info, I'd totally agree with you... but given the use case do you really think it'd cause lots of problems?
 
I'll have to read more about fencing, but as I understood it, if I do NOT use it and do not add these nodes to the fencing domain, and replicate the VMs that need replication but which are largely static (though perhaps I could sync them weekly or something for the occasional change), then the issue you describe above wouldn't be an actual issue, correct?

But what's the point of the third node, if VMs can't ever migrate to it?

My understanding is that it's a bigger problem with shared storage which wouldn't be used. I'd have all of my most needed VMs replicated to the spare from both "main" devices whereby, when the spare spins up, it would start whichever VMs are needed and not anything extraneous.

How will you be replicating to a node that is off? You will be scripting those auto-starts without HA?

When the main one comes back online the VMs would sync anyway and handoff would occur again albeit with another delay after the shutdown of the backup.

That is another HA migration? Because VMs do not migrate back in PVE unless the node fails.

Because the thing is, high availability is NOT my main goal. I prefer the simplicity of running all of the nodes in a cluster

There's nothing simple about clustering in PVE; it's a brittle system, and it also shreds your SSDs (more). If HA is not a goal, it is literally simpler to have one host with all the VMs, not even PVE.

and my issue has always been that they won't run with only a single node without changing the quorum setup as you alluded to above (change one of them to have two votes, which may be the route I go but offers no redundancy, or use a Q device, which I don't like and has no redundancy).

I am not sure what you mean by the Q device providing no redundancy; it literally allows - reliably - 2 nodes to be in a cluster, which would otherwise be a limited-features setup.

What I want you could perhaps call 'medium availability'. It's a homelab, so nothing is mission critical; if I have to wait 5 minutes that's OK

Then I would prefer a single host with disaster recovery = good backups. If I already had two hosts, I would run them separately and keep replicating without clustering.

, but I'd prefer to not have HAProxy, my VPN, or Home Assistant go down for much longer than that. Apart from historical data on Home Assistant, these VMs in particular would be fine if run with data that's a week or two old.

See above.

Essentially I'm looking for a 2-node cluster with a cold spare for only the more critical VMs, which in normal circumstances may be running on either of those nodes.

In my opinion, you are not looking for a cluster; you are using an ill-chosen solution for the use case and letting it force you to believe you require a cluster, because that's the only thing it can provide.

Is there another way you can think of to do that?

If you insisted on PVE, then it would need (without a Q device, if you insist on that too) the sort of corosync options* that the Proxmox team does not vouch for: last_man_standing, and then do not use HA; or, if you insist on HA (as implemented by PVE), that would be a bit of a gamble, but wait_for_all.

* https://manpages.debian.org/unstable/corosync/votequorum.5.en.html
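For illustration only, those options would go into the quorum section of corosync.conf, roughly like this (not something the Proxmox team vouches for, as said):

Code:
quorum {
  provider: corosync_votequorum
  # recalculate expected votes downwards once a reduced membership has been
  # stable for last_man_standing_window milliseconds (default 10000)
  last_man_standing: 1
  last_man_standing_window: 10000
  # require all nodes to be seen at least once after a full cluster restart
  # before quorum is granted
  wait_for_all: 1
}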

I can't run everything on one node so HA with a Q device isn't going to work since one node failing will leave me without services

I am not sure I got you on this one.

and yet, paying for another 130W of continuous load (old Broadwell V4 servers) just isn't doable for me financially either.

This is a recurrent topic on the forum. I cannot know from your limited description what else on those old-gen servers cannot be easily substituted, but generally speaking a single new mini PC can, compute- and storage-wise, easily handle similar workloads at ~25W in my experience. For some homelabbing, the old servers could then be on-demand only, or simply virtualised on their own.

If it were, say, a database of customer info, I'd totally agree with you... but given the use case do you really think it'd cause lots of problems?

Your originally described setup? If you ignored everything above and pressed on with PVE on 2 old hosts, I still believe it's an overly complicated setup as opposed to just using the corosync config options.
 
