Why does my PVE node still get rebooted by Watchdog when HA VMs are set to Ignored during a network outage?

Oct 14, 2025
Hi everyone,

I have a question regarding the specific behavior of Proxmox VE's High Availability (HA) and Watchdog mechanism.

Our production environment consists of a PVE cluster with several nodes distributed across different server rooms in the office building. Recently, due to major internal network maintenance, we knew that the backbone connection between these rooms would be disconnected for about 30 minutes.

To prepare for this, my primary goal was to ensure that even if the network was cut and the nodes lost cluster quorum, the currently running VMs would stay powered on (even if they were temporarily unreachable from the outside). Most importantly, I wanted to prevent the PVE nodes from being force-rebooted by the fencing mechanism, as a hard reboot is quite stressful for the hardware and for the critical services running on it.

My strategy was as follows: before the scheduled maintenance, I manually changed the HA state of every managed VM from started to ignored. My reasoning was that once a resource is set to ignored, the HA Manager should stop supervising these VMs, and therefore, there should be no reason for the watchdog to trigger a fencing operation.
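For reference, the bulk state change can also be scripted. This is only a sketch: the `awk` filter assumes the usual `service vm:<id> (...)` line format of `ha-manager status`, and it should be reviewed before use on a production cluster.

```shell
#!/bin/sh
# Sketch only: switch every HA-managed resource to 'ignored' so the HA
# stack stops supervising it (the guest itself keeps running).
disarm_ha_resources() {
    # 'ha-manager status' prints lines like "service vm:103 (pve02, started)";
    # field 2 is the service ID that 'ha-manager set' expects.
    for sid in $(ha-manager status | awk '/^service/ {print $2}'); do
        ha-manager set "$sid" --state ignored
    done
}

# Only attempt this on an actual PVE node.
if command -v ha-manager >/dev/null 2>&1; then
    disarm_ha_resources
fi
```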

However, once the network actually went down, the isolated node still performed a hard reboot after about a minute. The system log revealed the culprit: "kernel: watchdog: watchdog0: watchdog did not stop!". It is clear that the watchdog triggered because the node lost quorum, but I am confused as to why it remained active even though I had set all VM HA states to "ignored".

---

To investigate the root cause, I set up a simulation in my lab with three nodes (pve01, pve02, pve03) and reproduced the issue using an HA-managed VM (ID: 103) running on pve02:

1. Changing the HA state: I changed the state of VM 103 from started to ignored. The VM continued to run fine on pve02. In my understanding, since VM 103 was no longer under HA management, pve02 should not have been subject to the HA watchdog's fencing.

2026-03-31_11-02.png

2. Simulating network failure: I disabled the network interface on pve02 to simulate a loss of cluster quorum and isolate the node.

3. Watchdog-triggered reboot: Within a minute of the disconnection, the pve02 console started flashing "watchdog: watchdog0: watchdog did not stop!". Almost immediately after, the entire physical host rebooted. There was no graceful shutdown for the VM; it was a hard power-off.

2026-03-31_11-04.png

My questions are:

  • Why does the watchdog still trigger fencing even if all VMs are set to ignored?
  • Is there any way to prevent a node from rebooting during a temporary network outage without completely deleting the HA configuration?
I want the node to just "stay as is" and keep the VMs running even if it's isolated, rather than resorting to this aggressive self-reboot. What is the correct way to handle this situation? Thank you!
 
After all HA resources have been set to 'ignored', it takes roughly 10 minutes for the LRM and 15 minutes for the CRM to release their watchdogs and become idle. Perhaps the network interruption, in both the production and the test setup, happened while some of the LRM services and/or the CRM master service were still active?
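Since those timings are only rough, it is safer to poll than to wait a fixed time. A minimal sketch, assuming the `lrm <node> (idle, ...)` line format of `ha-manager status` output:

```shell
#!/bin/sh
# Sketch: wait until the local node's LRM reports 'idle' in the HA status,
# i.e. until it has released its watchdog. Assumes the status line format
# "lrm <node> (idle, ...)".
wait_for_lrm_idle() {
    node=$(hostname)
    until ha-manager status | grep -q "^lrm $node (idle"; do
        echo "LRM on $node is still not idle, waiting..."
        sleep 60
    done
    echo "LRM on $node is idle; the watchdog should be released"
}
```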

There is a new feature in pve-ha-manager (>= 5.1.2), currently on the pve-test repository, which adds more user-friendly HA disarming and rearming commands that can handle this automatically.

Another option, to disarm the HA stack manually today, is to stop the pve-ha-lrm service on every node, then check with ha-manager status that all HA resources are in the 'freeze' state, and check with systemctl status pve-ha-lrm that the watchdog was closed on each node. Only if all of that was successful, finally stop the pve-ha-crm service on each node with systemctl stop pve-ha-crm.
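Scripted end-to-end, that sequence might look roughly like the sketch below. The node list and root SSH access are assumptions for illustration; in practice the two verification steps should be inspected by a human before the CRM is stopped.

```shell
#!/bin/sh
# Sketch of the manual HA-disarm sequence (NODES is an example list).
NODES="pve01 pve02 pve03"

disarm_ha_stack() {
    # 1) Stop the LRM on every node; HA resources go into 'freeze'.
    for node in $NODES; do
        ssh "root@$node" systemctl stop pve-ha-lrm
    done

    # 2) Verify: all resources 'freeze', and each LRM closed its watchdog.
    #    (Inspect this output manually before continuing.)
    ha-manager status
    for node in $NODES; do
        ssh "root@$node" systemctl status pve-ha-lrm
    done

    # 3) Only after the checks succeed, stop the CRM on every node.
    for node in $NODES; do
        ssh "root@$node" systemctl stop pve-ha-crm
    done
}
```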
 
Hi dakralex,

Thank you for the detailed explanation!

2026-04-01_14-18.png

Based on your tip, I reviewed my previous screenshots and noticed that the LRM on pve02 was indeed still in the "active" state. This confirms your suspicion and likely explains why the watchdog still triggered a fencing reboot during the network outage even though the resources were set to "ignored."

Regarding the manual disarming of the HA stack, I have a follow-up question:

I tried running systemctl restart pve-ha-lrm to see if it would transition pve02 to "idle". After the restart, the LRM status did become "idle". I then tested a simulated network disconnection, and this time the node remained stable: the watchdog did not trigger a reboot even after a long wait.
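For anyone reproducing this, the checks I ran after the restart looked roughly like the sketch below (the exact wording of the journal messages is from my own logs, not a documented contract):

```shell
#!/bin/sh
# Sketch: restart the LRM and confirm it went idle and released its watchdog.
recheck_lrm() {
    systemctl restart pve-ha-lrm

    # The node status line should now read "lrm <node> (idle, ...)".
    ha-manager status

    # Recent LRM journal lines; look for a message about the watchdog
    # being closed/disabled.
    journalctl -u pve-ha-lrm -n 20 --no-pager
}
```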

Is this a valid or recommended way to force the LRM into an idle state? Are there any specific risks associated with restarting the pve-ha-lrm service while HA resources are set to "ignored"? I want to make sure I'm not inadvertently causing other cluster inconsistencies.

Looking forward to your insights!