HA & last_man_standing + wait_for_all

esi_y

Nov 29, 2023
It's nowhere in the official PVE docs, but corosync does support last_man_standing, and when it is used with HA it is suggested to also set wait_for_all. I found some previous threads, but none in relation to HA.

Now I understand the officially endorsed PVE way would be to just use a QDevice, but this does not cover particular situations, for instance maximizing off-grid time by cascade-shutting-down nodes until only 3 remain, running only a few essential HA services that won't overload them.

My question is - has anyone been running this in production, or at least for a reasonably long period on a reasonably large cluster (10+ nodes), to test for any anomalies when nodes go down and then start back up, and its effect on the HA stack in PVE?

Note: As the rebalancing might end up with an even node count, I suppose I had better set auto_tie_breaker as well, but that should have no influence on the two above.
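
To make it concrete, the kind of quorum section I have in mind would look roughly like this (just a sketch of the options in question, not a tested recommendation):

Code:
quorum {
    provider: corosync_votequorum
    last_man_standing: 1
    # suggested together with LMS when HA is in use
    wait_for_all: 1
    # possibly, to cover an even node count after rebalancing
    auto_tie_breaker: 1
}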
 
I guess these configs are not really popular?
Seems so.

Actually, I now tried to understand (from the Red Hat docs) what auto_tie_breaker and last_man_standing would do. While I am sure I did not really understand all the details and implications, I am fairly sure I do not want this. In my picture a "split brain" is always possible, which might be fatal for known reasons.

Sorry, I have no well-founded knowledge...
 
In my picture a "split brain" is always possible

I understood the options (and recommended combos) are there precisely so that you do not end up in a split-brain. What specifically gave you that impression?
 
It makes sense that you should set wait_for_all if you set last_man_standing, because otherwise a minority fragment could become quorate. So after a cold start we need all nodes connected for the cluster to become quorate.

But this means that if just ONE of your nodes dies after a power loss, the other nodes won't become quorate on their own. The bigger the cluster, the bigger the risk. On a 10+ node cluster, this is definitely a no-go for me.
 
It makes sense that you should set wait_for_all if you set last_man_standing, because otherwise a minority fragment could become quorate. So after a cold start we need all nodes connected for the cluster to become quorate.

But this means that if just ONE of your nodes dies after a power loss, the other nodes won't become quorate on their own. The bigger the cluster, the bigger the risk. On a 10+ node cluster, this is definitely a no-go for me.

My understanding of this scenario is different: say a power loss occurred and the cluster went down (graceful shutdowns) from 8 nodes to 3 thanks to LMS. Now assume the power loss lasted so long that even the remaining nodes had to go off. After power resumes there are 8 nodes starting up, expected votes was last set to 3 and WFA is on, so any 3 nodes would do to get it to start off. Which part did I get wrong?
 
What specifically gave you that impression?
It was just my not-well-founded impression, just to give at least *any* feedback after your third post without any reaction ;-)

For auto_tie_breaker I see no advantage compared to the well established QDev.

For last_man_standing I do not like the requirement of wait_for_all; see also #4.

Nevertheless: an additional mechanism to address this problem would be great, so basically I am with you.

But I believe I do not have this problem. My (only!) production cluster at work has five nodes running 24/7; a test cluster has four, but I intentionally ignore this topic. At home... I manually tune it via "pvecm expected"...
 
It was just my not-well-founded impression, just to give at least *any* feedback after your third post without any reaction ;-)

Thank you. :)

For auto_tie_breaker I see no advantage compared to the well established QDev.

I agree, but a QDevice cannot be used with LMS.

For last_man_standing I do not like the requirement of wait_for_all; see also #4.

This is for HA only.

Nevertheless: an additional mechanism to address this problem would be great, so basically I am with you.

I will go on to do some tests with the said scenario; I believe it should be fine to regain quorum with the last expected value. But generally I do not expect a total power loss - as in, upon power loss, the UPS lets the nodes know, predetermined ones start shutting down, and 3 remain. They should keep running until power resumes. If not, I thought it would only take 3 nodes initially to get quorum, as 3 was the last expected value.

But I believe I do not have this problem. My (only!) production cluster at work has five nodes running 24/7; a test cluster has four, but I intentionally ignore this topic. At home... I manually tune it via "pvecm expected"...

I mean, I am not forcing anyone to answer; I just wondered why it hasn't been discussed more - it's what corosync provides, so why not use it?
 
At home... I manually tune it via "pvecm expected"...

The thing is, one could basically even script this from an independent machine; the issue is of course that that machine then becomes the SPOF (at least for dynamically bringing the expected value down as nodes check out into the darkness). Stretching it a bit, it would be funny to run such a script as a guest on the cluster itself. But I kind of hoped this is all built into corosync's LMS.
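
As a sketch of what I mean (purely hypothetical - the node names are made up and it assumes passwordless SSH from that independent machine to a node that still answers):

Code:
#!/bin/sh
# hypothetical watchdog: recount reachable nodes and lower expected votes accordingly
NODES="pve1 pve2 pve3 pve4 pve5 pve6 pve7 pve8"
up=0
for n in $NODES; do
    ping -c1 -W1 "$n" >/dev/null 2>&1 && up=$((up+1))
done
# never drop below the 3 essential nodes
[ "$up" -lt 3 ] && up=3
# pvecm must run on a cluster node, hence the SSH hop
ssh root@pve1 "pvecm expected $up"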
 
The thing is, one could basically even script this from an independent machine; the issue is of course that that machine then becomes the SPOF
As said, this is my "solution" at home. It is simply not critical. I do not need the cluster to restart after a power failure without my assistance. (But it would, as most nodes would start up...) Actually, I have had only one single power failure in 15 years. Possibly I will think differently when I deploy serious home automation some time in the future, which is not planned yet.

At work all nodes will restart after a power failure. Unfortunately the inter-dependencies between services are not very well tested, and a cold start of everything at once requires manual intervention...

So as usual: the correct way of handling a shrinking number of active nodes depends on the circumstances and specific requirements :cool:
 
My understanding of this scenario is different: say a power loss occurred and the cluster went down (graceful shutdowns) from 8 nodes to 3 thanks to LMS. Now assume the power loss lasted so long that even the remaining nodes had to go off. After power resumes there are 8 nodes starting up, expected votes was last set to 3 and WFA is on, so any 3 nodes would do to get it to start off. Which part did I get wrong?

You got the part with the "any 3 nodes would do to get it to start off" wrong.
You were (or maybe still are) working with data on the last standing nodes. If *any* nodes could become quorate, you could run into a split brain or data corruption.
The quorum has the purpose of ensuring the latest data is used for the datacenter files (/etc/pve) and the VMs + LXCs. It also prevents multiple parallel starts of a VM/LXC.

With LMS you may shrink the quorate cluster for example from 8 to 3 nodes. How should the nodes know there is a 3/8 = 37.5% quorate fragment?
At the same time it is not desirable for just the last nodes standing to become quorate (after they have died too), as there may already be a 50% + 1 quorate partition.
The 50% + 1 is the default behavior: the cluster tries to become quorate as soon as possible (with 8 nodes, any 5 are enough).

This is why you set WFA. That way you are sure you won't have any smaller quorate node fragments after a cold start, but you really have to "wait for all" as described in the man for votequorum(5):
When WFA is enabled, the cluster will be quorate for the first time only after all nodes have been visible at least once at the same time.
So if just one node fails to come back after a full power loss, the whole cluster will stay offline/not quorate. And this is more likely to happen the more nodes you have.
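
A quick way to watch these mechanics on a live cluster (the vote counts and the active flags are reported by votequorum):

Code:
# shows Expected votes, Total votes, Quorum and the active flags
# (e.g. Quorate, WaitForAll, LastManStanding)
corosync-quorumtool -s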
 
You got the part with the "any 3 nodes would do to get it to start off" wrong.
You were (or maybe still are) working with data on the last standing nodes. If *any* nodes could become quorate, you could run into a split brain or data corruption.
The quorum has the purpose of ensuring the latest data is used for the datacenter files (/etc/pve) and the VMs + LXCs. It also prevents multiple parallel starts of a VM/LXC.

Right, good explanation, logical. Now I wonder why WFA is then only recommended for HA scenarios - per the docs it is not strictly necessary with LMS. I suppose as long as parallel starts are not at stake (HA), a split cluster is not the end of the world after such an event. But I would really want to test it, i.e. without WFA, to see how the cluster fractions converge (or not). This does not solve the HA scenario though.

With LMS you may shrink the quorate cluster for example from 8 to 3 nodes. How should the nodes know there is a 3/8 = 37.5% quorate fragment?
At the same time it is not desirable for just the last nodes standing to become quorate (after they have died too), as there may already be a 50% + 1 quorate partition.
The 50% + 1 is the default behavior: the cluster tries to become quorate as soon as possible (with 8 nodes, any 5 are enough).

This is why you set WFA. That way you are sure you won't have any smaller quorate node fragments after a cold start, but you really have to "wait for all" as described in the man for votequorum(5)

This is beyond the scope of my post here, but I am now wondering why it is wait_for_all, specifically that "all" - true, there may be a situation where e.g. the last standing 3 nodes are slow to start after power resumes, while the remaining (and "stale") 5 are already up and quorate (without WFA). WFA clearly ensures that the cluster gets the most recent data from the nodes that were the last ones standing before the power loss. But strictly speaking it does not need them all. It only needs N - LMS_minimum + 1 nodes up to guarantee that at least one of those previously standing is online to feed in the most recent data. This is just me thinking aloud though; it would need a corosync patch, which I might attempt (unless someone nudges me with why my logic here is unsound).

So if just one node fails to come back after a full power loss, the whole cluster will stay offline/not quorate. And this is more likely to happen the more nodes you have.

Yes, this answers my question, so thanks for that. On the other hand, it's not the end of the world for my scenarios, as I do not really anticipate a total power loss - the very reason I started this thread was to figure out how to shrink the quorum so that just 3 nodes can keep running off the UPS for as long as possible. An actual resume from zero nodes would be such a rare event that I can afford to require manual intervention - I might even want that. I realise this is not the same for everyone though.
 
For anyone digging down the same rabbit hole [1] in the future:

C:
    /* If ATB is set and the cluster has an odd number of nodes then wait_for_all needs
     * to be set so that an isolated half+1 without the tie breaker node
     * does not have quorum on reboot.
     */
    if ((auto_tie_breaker != ATB_NONE) && (node_expected_votes % 2) &&
        (!wait_for_all)) {
        if (last_man_standing) {
            /* if LMS is set too, it's a fatal configuration error. We can't dictate to the user what
             *  they might want so we'll just quit.
             */

There goes the ATB... but I will persevere with LMS & WFA.

[1] https://github.com/corosync/corosyn...ff9e2c300c2/exec/votequorum.c#L1381C1-L1391C1
 
Did you end up finding a "working configuration" that would utilize both LMS and WFA? I would be eager to try this at scale as well but I've been having issues setting it up under /etc/pve/corosync.conf
 
Did you end up finding a "working configuration" that would utilize both LMS and WFA? I would be eager to try this at scale as well but I've been having issues setting it up under /etc/pve/corosync.conf
The problem is not LMS + WFA, it was having ATB at the same time.
 
The problem is not LMS + WFA, it was having ATB at the same time.
In my use case I just want to be able to shut down more than half of the nodes within the cluster in a controlled fashion and still stay quorate with the remaining ones by having LMS recalculate the expected votes. As I can ensure that there is always an odd number of nodes running at any given time, I believe I don't need ATB in my use case. Please feel free to correct me if I am wrong on this; I am still quite new to both Proxmox and Corosync after spending many years working with VMware's family of products.

That being said, configuring the quorum section under /etc/pve/corosync.conf has been an uphill battle for me, as I have not been able to find an example that would utilize both LMS and WFA.
 
You should basically be fine with man 5 votequorum [1] and include (amongst other things):

Code:
quorum {
    provider: corosync_votequorum
    last_man_standing: 1
    wait_for_all: 1
}

You can even play around with parameters such as last_man_standing_window. Another option is to go with a QDevice.
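
For instance (the value here is just illustrative; if the option is left out, the window defaults to 10000 ms):

Code:
quorum {
    provider: corosync_votequorum
    last_man_standing: 1
    # time (in ms) to wait before recalculating the votes
    last_man_standing_window: 20000
    wait_for_all: 1
}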

This may be confusing, but while LMS (from [1]) cannot be used together with a QDevice, the QDevice setup itself has similar options with the same names - you can read up on the algorithm part in man 8 corosync-qdevice [2].

Or describe your issue... ;)

[1] https://manpages.debian.org/bookworm/corosync/votequorum.5.en.html
[2] https://manpages.debian.org/bookworm/corosync-qdevice/corosync-qdevice.8.en.html
 
BTW, before anyone comes here to tell me I am dispensing all sorts of advice on an "unsupported" config: yes, it is not something the PVE team tests or will support you with. But these are normal/proper corosync options.
 
I think my current issue is best described by "lack of experience". Proxmox is particularly picky about the configuration under /etc/pve/, and my last attempt at this resulted in bricking the cluster into a state where none of the nodes could see one another. :cool: Thank you for your help with this, it's greatly appreciated - and like you said, please don't try this in production, people.

Just one additional question regarding the configuration: should I include expected_votes: while using this? It's in the Debian documentation, but I have not seen it in any of the examples here on the forums.
 
And did you not forget to update the config_version when you were making changes [1]?

The expected_votes is the total number of votes the cluster expects; the quorum is derived from it. Normally, when you are just tweaking your setup, you don't need to touch it.
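
To illustrate both points (placeholder values; with the nodelist { } section present, corosync calculates expected_votes on its own):

Code:
totem {
    version: 2
    # increment this on every edit, otherwise the new config will not be applied
    config_version: 8
    # ... rest of the totem section unchanged
}

quorum {
    provider: corosync_votequorum
    # no expected_votes needed here - it is derived from the nodelist { } section
    last_man_standing: 1
    wait_for_all: 1
}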

Corosync is actually something you can test even without PVE, it is interesting to watch it "behave". ;)

I would suggest when testing this:
1) Do NOT use anything HA, remove any HA services you added and reboot the master node (you can check it with ha-manager status);
2) Update your corosync config and watch what happens in journalctl -u corosync;
3) If your cluster gets messed up, do not dismantle it, instead fix your corosync config. :)

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_configuration
 
Also, when making changes, you can have a tmux open for each node watching corosync-quorumtool -m; if you are tweaking config values, learn to use corosync-cfgtool with -s and -R, and eventually you may need to see what corosync-cmapctl shows, also with -t.

The pvecm command is just a wrapper, as you'll see.
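
As a quick reference (my shorthand descriptions - check the respective man pages for details):

Code:
corosync-quorumtool -m     # continuously monitor quorum and membership
corosync-cfgtool -s        # show the link status of the local node
corosync-cfgtool -R        # tell all nodes to reload corosync.conf
corosync-cmapctl           # dump the runtime configuration/state database
corosync-cmapctl -t <key>  # track changes on a given key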
 
