Unexpected reboot of the 2nd node during a failover test

faisa7847

New Member
Jul 25, 2024
Hello, I have a 5-node cluster. I shut down the 4th and 5th nodes due to a high-latency issue.
We didn't have any problems at the beginning, but now, while doing a failover test by shutting down the first node, the 2nd node reboots.

This was the last log output from the 2nd node before it rebooted:
Oct 19 10:58:00 prx2 watchdog-mux[1323]: client watchdog expired - disable watchdog updates
Oct 19 10:58:09 prx2 pvescheduler[5162]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Oct 19 10:58:09 prx2 pvescheduler[5161]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
 
Hello
Did I get this right: two nodes out of five were already powered off, and then you powered down another node?
If so, then yes, this is expected if HA is active. In a five-node cluster you need at least three nodes to have a majority, i.e. your cluster can "lose" up to two nodes (for maintenance or due to failure) and still remain operational. But if a third node goes down, the remaining two cannot tell whether they are the only ones still operational or whether there is, e.g., a network issue and the other three nodes are still running on the "other side" of that network problem. The result is loss of quorum, and because the HA service is active the node self-fences through the watchdog.
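To make the vote math concrete, here is a minimal sketch of the majority calculation (assuming the default of one quorum vote per node and no QDevice):

TOTAL_VOTES=5                       # one vote per node in a five-node cluster
NEEDED=$(( TOTAL_VOTES / 2 + 1 ))   # majority = floor(votes / 2) + 1 = 3
echo "quorum needs $NEEDED of $TOTAL_VOTES votes"
# With nodes 4 and 5 already powered off only 3 votes remain; shutting down
# one more node leaves 2 votes, below the majority of 3, so the surviving
# HA-active nodes lose quorum and self-fence via the watchdog.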
Oct 19 10:58:00 prx2 watchdog-mux[1323]: client watchdog expired - disable watchdog updates
The log messages leading up to this would be interesting, especially from corosync and pmxcfs/pve-cluster. Those could confirm the above, or help figure out why this happened if my interpretation is wrong.
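If the journal from the node that rebooted is still available, something along these lines should pull the relevant window (the timestamps are only placeholders for the minutes before the watchdog message; corosync, pve-cluster and watchdog-mux are the standard unit names):

journalctl -u corosync -u pve-cluster -u watchdog-mux \
    --since "2024-10-19 10:50" --until "2024-10-19 11:00"
pvecm status    # shows the current membership and quorum state of the cluster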
 
Hey, thanks for the response. I just removed the 4th and 5th node from the cluster, and failover is working fine now. But when I shut down the 1st and 2nd nodes, the 3rd node reboots. I am guessing it also goes into a fencing state?
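For what it's worth, the same majority rule covers the reduced three-node cluster; a quick sketch under the same one-vote-per-node assumption:

TOTAL_VOTES=3
NEEDED=$(( TOTAL_VOTES / 2 + 1 ))   # majority = 2
# Shutting down nodes 1 and 2 leaves a single vote, below the majority of 2,
# so node 3 loses quorum and, with HA resources configured, self-fences
# through the watchdog.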
 
