Unexpected reboot of the 2nd node during a failover test

faisa7847

New Member
Jul 25, 2024
Hello, I have a 5-node cluster. I shut down the 4th and 5th nodes due to a high-latency issue.
We didn't have any problems at the beginning, but now, while doing a failover test by shutting down the first node, the 2nd node reboots.

This was the last log from the 2nd node before it rebooted:
Oct 19 10:58:00 prx2 watchdog-mux[1323]: client watchdog expired - disable watchdog updates
Oct 19 10:58:09 prx2 pvescheduler[5162]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Oct 19 10:58:09 prx2 pvescheduler[5161]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
 
Hello
Did I get this right: two nodes out of five were already powered off, and then you powered down another node?
If so, then yes, this is expected if HA is active. In a five-node cluster you need at least three nodes to have a majority, i.e. your cluster can lose (for maintenance or due to failure) up to two nodes and still remain operational. But if a third node goes down, the remaining two cannot tell the difference between being the ones still operational and there being, e.g., a network issue with the other three nodes operational on the "other side" of it. Hence quorum is lost, and the node self-fences through the watchdog because HA services are active.
Oct 19 10:58:00 prx2 watchdog-mux[1323]: client watchdog expired - disable watchdog updates
Log messages leading up to this would be interesting, especially from corosync and pmxcfs/pve-cluster. Those can help to confirm the above, or shed light on why this happened if my interpretation is wrong.
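
For reference, something like this (a sketch, assuming the standard PVE unit names and a time window around the Oct 19 10:58 entries above; adjust as needed) pulls the relevant messages on the affected node:

journalctl -u corosync -u pve-cluster -u watchdog-mux --since "2024-10-19 10:50" --until "2024-10-19 11:00"

That covers corosync (membership/quorum), pve-cluster (pmxcfs) and watchdog-mux in one go.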
 
Hey, thanks for the response. I just removed the 4th and 5th nodes and failover is working fine now, but when I shut down the 1st and 2nd nodes, the 3rd node reboots. I am guessing it also goes into a fencing state?
 
With "removed 4th and 5th node" you mean from the cluster?

In that case yes: your total node count is then 3, so 2 nodes are required to ensure you have >50% of the votes for cluster quorum. That means only one node can be offline; if a second one goes down, the last one also loses cluster quorum, and if it hosts HA services it will indeed self-fence.

Not 100% sure I'm on the same page as you, but it doesn't hurt to state it:
In general, in a cluster with N nodes, more than half of them, i.e. floor(N / 2) + 1 nodes, must be online for the cluster to work. This works out as:
Total # of Nodes    Required Online Nodes for Quorum    Nodes That Can Be Offline (Total - Required)
2                   2                                   0
3                   2                                   1
4                   3                                   1
5                   3                                   2
6                   4                                   2
7                   4                                   3
8                   5                                   3
9                   5                                   4

and so on.
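
If it helps, the majority arithmetic from the table can be reproduced with a small shell loop (nothing Proxmox-specific here, just floor(n/2) + 1):

for n in 2 3 4 5 6 7 8 9; do
    required=$(( n / 2 + 1 ))    # strict majority of n votes
    echo "$n nodes: $required required online, $(( n - required )) may be offline"
done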
 
Hey, thanks for this info, it helped a lot.
Regards
 
