[TUTORIAL] [High Availability] Watchdog reboots

esi_y · Nov 29, 2023
First of all, you can recognise watchdog-induced reboots of your node by the end of the previous boot's log containing entries such as:

Code:
watchdog-mux: Client watchdog expired - disable watchdog updates

kernel: watchdog: watchdog0: watchdog did not stop!

You should probably start by reading the official documentation on the topic [1].

Nevertheless, when it comes to understanding the watchdog behaviour, it is more complicated than what the documentation covers. I tried to touch on that once in another post [2] where the OP was experiencing his own woes; in fact, the staff post referenced within [3] explains the matter much better than the official docs, which are a simplification at best in my opinion. There also seems to be some confusion about active and inactive, or so-called disarmed, watchdogs.



Watchdog(s)


First of all, there is a watchdog active at any given point on any standard install of a PVE node, whether you have ever used the HA stack or not. This follows from the very design of the PVE solution: even if you do not have any hardware watchdog [4], by default you get a software-emulated watchdog device called softdog [5].

Now whether or not you already know how watchdogs work in general, the PVE implementation is a bit of a gymnastics act. The softdog module is loaded no matter what; you can verify this with lsmod | grep softdog. A watchdog is essentially a ticking time bomb which, when it goes off, causes a reboot; the only way to keep the countdown from reaching zero is to reset it every once in a while. It works by providing a device which, once open, needs to be touched within defined intervals, and unless that happens regularly or the device is properly closed, the system will absolutely reboot. The module is loaded for a reason - to be used.
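The countdown principle can be sketched in plain shell. This is purely a simulation - no real watchdog device is touched; the 4-second feed cadence is made up for illustration, and only the 10-second timeout reflects the value watchdog-mux actually sets:

```shell
# Simulation only: a countdown that survives solely because it keeps being "fed".
timeout=10      # seconds the softdog would wait (watchdog-mux sets 10 on PVE)
last_feed=0
for now in $(seq 1 12); do
    # a healthy feeder resets the countdown every 4 "seconds"
    [ $((now % 4)) -eq 0 ] && last_feed=$now
    # if nobody fed the device within the timeout, the "bomb" goes off
    if [ $((now - last_feed)) -ge $timeout ]; then
        echo "watchdog expired at t=$now -> reboot"
        exit 1
    fi
done
echo "node survived: the watchdog was fed in time"
```

Comment out the feeding line and the loop reaches the expiry branch instead - which is exactly what happens to a node when watchdog-mux stops touching the device.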

Now this is exactly what PVE does when it loads its watchdog-mux.service, which, as its name implies, is there to handle the feature in a staged (i.e. elaborate) way. This service loads on every node, every single time, irrespective of your HA stack use. It absolutely does open the watchdog device no matter what [6] and it keeps it open on a running node. NB: it sets the timer to 10 seconds, which means that if something prevents watchdog-mux from keeping the softdog happy, your node will reboot.

The primary purpose of watchdog-mux.service is to listen on a socket to what it calls clients; notably, when the service has active clients, it signifies so (confusingly) by creating /run/watchdog-mux.active/. The clients are pve-ha-crm.service and pve-ha-lrm.service. The principle is that such clients set a subordinate timer [7] with watchdog-mux.service, which in turn separately monitors whether they manage to check in with it within the specified interval - the higher threshold of 60 seconds used for self-fencing. If such a service unexpectedly dies, watchdog-mux.service stops resetting the softdog device, and that causes a reboot.
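The cascade of the two timeouts can also be simulated in plain shell. The 60-second client threshold and the 10-second device timeout are the values discussed above; the 10-second check-in cadence and the client "dying" at t=30 are invented for the sketch:

```shell
# Simulation only: a client (think LRM) that stops checking in at t=30 means
# watchdog-mux stops feeding softdog once the 60 s client threshold is
# exceeded; softdog then fires 10 s after the very last keep-alive.
client_timeout=60   # mux gives up on a silent client after 60 s
device_timeout=10   # softdog reboots 10 s after the last keep-alive
last_client=0
last_keepalive=0
reboot_at=0
for t in $(seq 1 110); do
    # the client checks in every 10 "seconds", but dies at t=30
    if [ "$t" -le 30 ] && [ $((t % 10)) -eq 0 ]; then last_client=$t; fi
    # mux only feeds the device while the client is within its threshold
    if [ $((t - last_client)) -lt $client_timeout ]; then last_keepalive=$t; fi
    # once feeding stops, the device expires device_timeout later
    if [ $((t - last_keepalive)) -ge $device_timeout ]; then
        reboot_at=$t
        echo "softdog fires at t=${t}s"
        break
    fi
done
```

Note the node does not reboot at t=30 when the client dies, nor at t=90 when the client threshold lapses, but only once the device timeout on top of that runs out - the staged behaviour described above.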

This is also triggered when HA is active (CRM and/or LRM active on that node at that moment) and quorum is lost, even though the machine is not otherwise in a frozen state. This is because a node without quorum will fail to obtain its lock within the cluster, at which point it stops feeding watchdog-mux.service [8].

In turn, that is why HA services can only be "recovered" within the HA stack after a set period: recovery should never start unless it can be assumed that the node which went incommunicado for whatever reason (it could be an intermittent but persisting network issue) at least did its part by not keeping duplicate services running despite having been cut off.

The cascaded nature of the watchdog multiplexing, the CRM (which is "migratory") and the LRM (which is only "active" on a node with HA services running, including for 10 minutes after the last such service migrated away), and the time-sensitive dependency on the node being in the primary component of the cluster (in the quorum), as well as on all services feeding the watchdog(s) running without any hiccups, make it much more difficult to answer "what might have gone wrong" without more detailed logs.

Debugging this is often tedious if one takes on the endeavour, and it is easier to blame an upstream component (corosync) or a network flicker (the user).



In case you do NOT use High Availability


If your only question is how to truly disable everything that can fire off a watchdog reboot, the answer is getting rid of watchdog-mux.service. Do not kill it, as it would then fail to properly close the softdog device, which will cause a reboot. The same would happen if you stop it while it has active "clients".

Before that, you therefore have to get rid of pve-ha-crm.service and pve-ha-lrm.service. Stop them, in this (reverse) order, and then disable them. Upon upgrades, well, you get the idea ... it was not designed to be neatly turned off, so you also have to mask them.
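Put together, the sequence could look like the sketch below. Treat it as exactly that - a sketch: verify the unit names on your own version first, and keep in mind that stopping watchdog-mux while clients are still attached will fence the node, so the order matters:

```shell
# Sketch only - do NOT run blindly; order matters (clients first, mux last).
systemctl stop pve-ha-crm.service
systemctl stop pve-ha-lrm.service
systemctl stop watchdog-mux.service
systemctl disable pve-ha-crm.service pve-ha-lrm.service
# mask them so package upgrades cannot silently re-enable the HA stack
systemctl mask pve-ha-crm.service pve-ha-lrm.service
```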

You can also blacklist the module:

Bash:
tee /etc/modprobe.d/softdog-deny.conf << 'EOF'
blacklist softdog
install softdog /bin/false
EOF

NOTE: Be sure you understand what disabling the watchdog means in case you ever re-enable HA, and why that is NOT a good idea. In all other cases, it is fairly reasonable to not want such features active.



NOTE: A bug report was actually filed [9] regarding some rough edges in the HA stack. As of today, the bug is still present.



[1] https://pve.proxmox.com/wiki/High_Availability#ha_manager_fencing
[2] https://forum.proxmox.com/threads/unexpected-fencing.136345/#post-634179
[3] https://forum.proxmox.com/threads/i...p-the-only-ones-to-fences.122428/#post-532470
[4] https://www.kernel.org/doc/html/latest/watchdog/
[5] https://github.com/torvalds/linux/blob/master/drivers/watchdog/softdog.c
[6] https://github.com/proxmox/pve-ha-m...e0e8cdb2d0a37d47e0464/src/watchdog-mux.c#L157
[7] https://github.com/proxmox/pve-ha-m...e0e8cdb2d0a37d47e0464/src/watchdog-mux.c#L249
[8] https://github.com/proxmox/pve-ha-m...fe0e8cdb2d0a37d47e0464/src/PVE/HA/LRM.pm#L231
[9] https://bugzilla.proxmox.com/show_bug.cgi?id=5243
 
Additional notes

You can observe the watchdog-mux keep-alives with:

Bash:
strace -t -e ioctl  -p $(pidof watchdog-mux)  | grep WDIOC_KEEPALIVE

And for the device:

Bash:
wdctl /dev/watchdog0

You cannot use alternative watchdog handlers:
Bash:
# apt install --dry-run -o Debug::pkgProblemResolver=true watchdog

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Starting pkgProblemResolver with broken count: 1
Starting 2 pkgProblemResolver with broken count: 1
Investigating (0) pve-ha-manager:amd64 < 4.0.3 @ii K Ib >
Broken pve-ha-manager:amd64 Conflicts on watchdog:amd64 < none -> 5.16-1+b2 @un puN >
  Considering watchdog:amd64 9998 as a solution to pve-ha-manager:amd64 9
  Removing pve-ha-manager:amd64 rather than change watchdog:amd64
Investigating (0) qemu-server:amd64 < 8.0.10 @ii K Ib >
Broken qemu-server:amd64 Depends on pve-ha-manager:amd64 < 4.0.3 @ii R > (>= 3.0-9)
  Considering pve-ha-manager:amd64 9 as a solution to qemu-server:amd64 7
  Removing qemu-server:amd64 rather than change pve-ha-manager:amd64
Investigating (0) pve-container:amd64 < 5.0.8 @ii K Ib >
Broken pve-container:amd64 Depends on pve-ha-manager:amd64 < 4.0.3 @ii R > (>= 3.0-9)
  Considering pve-ha-manager:amd64 9 as a solution to pve-container:amd64 6
  Removing pve-container:amd64 rather than change pve-ha-manager:amd64
Investigating (0) pve-manager:amd64 < 8.1.4 @ii K Ib >
Broken pve-manager:amd64 Depends on pve-container:amd64 < 5.0.8 @ii R > (>= 5.0.5)
  Considering pve-container:amd64 6 as a solution to pve-manager:amd64 1
  Removing pve-manager:amd64 rather than change pve-container:amd64
Investigating (0) proxmox-ve:amd64 < 8.1.0 @ii K Ib >
Broken proxmox-ve:amd64 Depends on pve-manager:amd64 < 8.1.4 @ii R > (>= 8.0.4)
  Considering pve-manager:amd64 1 as a solution to proxmox-ve:amd64 0
  Removing proxmox-ve:amd64 rather than change pve-manager:amd64
Done
The following packages will be REMOVED:
  proxmox-ve pve-container pve-ha-manager pve-manager qemu-server
The following NEW packages will be installed:
  watchdog
0 upgraded, 1 newly installed, 5 to remove and 4 not upgraded.
Remv proxmox-ve [8.1.0]
Remv pve-manager [8.1.4]
Remv qemu-server [8.0.10] [pve-ha-manager:amd64 ]
Remv pve-ha-manager [4.0.3] [pve-container:amd64 ]
Remv pve-container [5.0.8]
Inst watchdog (5.16-1+b2 Debian:12.5/stable [amd64])
Conf watchdog (5.16-1+b2 Debian:12.5/stable [amd64])

And you cannot remove the HA stack on its own:
https://forum.proxmox.com/threads/cannot-remove-pve-ha-manager-why.141940/#post-636316
 
