Help me understand why my entire cluster just went offline.

Sep 28, 2023
I've been running PVE 7.x for about a year now on a three-node cluster. It's been working great. The cluster operates in HA mode and I see my nodes listed on the HA page.

But something very bad happened just now and I'd like to know if this is the expected behavior. Here is the timeline.
  1. I have had three 7.3-x nodes running in HA mode for over a year. No problems.
  2. I added a new (fourth) node running PVE 8.0.x. It joined correctly, then I powered it off so I could move it into my datacenter later.
  3. I removed all VMs and replications to/from node 3 to prepare to remove it (node 3) from my cluster.
  4. Not realizing that I was about to knock out half of my total number of joined nodes (quorum), I powered off node 3. I now had 2 of 4 nodes running.
  5. As node 3 powered down, ALL guest VMs unexpectedly shut down on nodes 1 and 2.
  6. Attempting to force the VMs back up on nodes 1 and 2 resulted in an error saying "cluster not ready - no quorum? (500)"
Ok, reading through the documentation tells me that issuing a command to any (online) node in the cluster (in this case, a start-VM command) requires a vote, and a majority (not a tie, an actual majority) of nodes must "approve" the operation before it is accepted. Ok, I can accept that. I also became aware that, in an emergency, you can use the "pvecm expected x" command to lift this requirement.
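
For reference, here is roughly how I understand that emergency override is used (a sketch only; the vote count and VMID below are placeholders for my situation, not something I have verified end to end):
  # lower the number of votes the cluster expects, so the surviving nodes regain quorum
  # (2 here, matching the two nodes still online)
  pvecm expected 2
  # after that, starting a VM on a surviving node should work again
  qm start 100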

But my more important question is: Why did nodes 1 and 2 suddenly power off all running VMs? Is this by design? Can I modify this behavior?
 
First we need to define what quorum is: quorum is a majority in a given set of nodes. When your set of nodes is 3, the majority is 2.
Once you bump your set to 4, even if only temporarily, the majority is now 3. The fact that you shut down the 4th node after joining it does not change the quorum requirement. In fact, what you did was introduce the first failure in the cluster.
By shutting down the 3rd node you introduced a double failure in your cluster, reduced the number of live nodes to below quorum, and created a split-brain scenario.
Think of it this way: you had 4 nodes, then A and B disconnected from C and D. Each part has equal power in the cluster; neither side knows if the other two made any progress, i.e. took over services and started changing shared data. VM 100 could end up running on both A and C, causing data corruption.
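
If you want to see this from a node's point of view, the quorum state can be inspected on any surviving node with standard PVE tooling (field names may differ slightly between versions):
  # shows expected votes, the votes currently seen, and whether this node is quorate
  pvecm status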

At this point the appropriate recovery step is to shut down services, in some cases commit suicide (reboot), or STONITH (shoot the other node in the head).
The behavior of the cluster that you experienced is the most correct one in the described scenario. I would strongly discourage you from modifying it without an intimate familiarity with Corosync/Pacemaker cluster operations.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
bbgeek17,

What you said above makes sense for the reasons you gave. The machines committing suicide to prevent two VMs from writing to the same virtual disk and corrupting it makes sense. However, I have two extra points to put on the table concerning my scenario.

  1. I am NOT using shared storage for the back-end of these VMs. Rather, it's local ZFS with daily replication jobs to the other nodes. So technically it would be impossible for two VMs that are no longer coordinating to corrupt the VM storage on some shared storage array. Does PVE have any provision to disable suicide in this scenario?
  2. As you know, nodes 3 and 4 did not technically fail. They were gracefully shut down. I was hoping the cluster would be aware of this fact, not assume they had hard-failed, and allow nodes 1 and 2 to carry on.
Thanks for your reply!
 
I am NOT using shared storage for the back-end of these VMs. Rather, it's local ZFS with daily replication jobs to the other nodes. So technically it would be impossible for two VMs that are no longer coordinating to corrupt the VM storage on some shared storage array. Does PVE have any provision to disable suicide in this scenario?
Those are details of your specific installation. Whether it is shared storage, or an application that runs in the VM, reads/writes data to a database, and must run in one location only, corruption is always possible. The default behavior of a properly implemented cluster is to put as many bumpers as possible around users who are not HA-savvy.
There are many resources available with additional information, e.g. https://pve.proxmox.com/wiki/High_Availability#_node_maintenance

As you know, nodes 3 and 4 did not technically fail. They were gracefully shut down. I was hoping the cluster would be aware of this fact, not assume they had hard-failed, and allow nodes 1 and 2 to carry on.
I am guessing "graceful shutdown" means that you typed "reboot -p" or "shutdown now"? If yes, then it may be graceful for the OS, but not for cluster operations. As far as the remaining cluster nodes are concerned, that node has simply disappeared from the network. Keep in mind there are no guarantees of strict ordering during shutdown: the network could be brought down before the cluster processes.

Using cluster maintenance mode, so that the quorum change or double failure is planned for rather than treated as an unexpected failure, or ejecting the node from the cluster completely, is the right approach.
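
A minimal sketch of what that could look like (the node name is a placeholder; the node-maintenance command is only available on recent PVE releases, so check the wiki page linked above for your version):
  # put the node's HA services into maintenance mode before a planned shutdown
  ha-manager crm-command node-maintenance enable pve-node3
  # ...power it off, do the work, bring it back, then:
  ha-manager crm-command node-maintenance disable pve-node3
  # or, if the node is being retired for good: power it off permanently and,
  # from one of the remaining nodes, remove it from the cluster
  pvecm delnode pve-node3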


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
