First of all, you can recognise watchdog-induced reboots of your node by the end of the previous boot's log containing entries such as:
Code:
watchdog-mux: Client watchdog expired - disable watchdog updates
kernel: watchdog: watchdog0: watchdog did not stop!
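To check whether a node actually went down this way, you can grep the previous boot's journal for these messages. A minimal sketch; the journalctl invocation assumes persistent journald logging so that -b -1 can reach the previous boot, and the grep pattern is simply sanity-checked here against the two sample entries above:

```shell
# On the node itself, the previous boot's log can be searched with:
#   journalctl -b -1 | grep -iE 'watchdog-mux|watchdog did not stop'
# The pattern can be exercised against the sample entries:
printf '%s\n' \
  'watchdog-mux: Client watchdog expired - disable watchdog updates' \
  'kernel: watchdog: watchdog0: watchdog did not stop!' \
  | grep -ciE 'watchdog-mux|watchdog did not stop'
# prints 2, i.e. both sample lines match
```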
You should probably start with reading the official documentation on the topic [1].
Nevertheless, when it comes to understanding the watchdog behaviour, things are a bit more complicated than what the documentation covers. I touched on this once in another post [2] where the OP was experiencing his own woes; in fact, the staff post referenced within [3] explains the matter much better than the official docs, which are a simplification at best, in my opinion. There seems to be some confusion about an active versus an inactive, or so-called disarmed, watchdog.
Watchdog(s)
First of all, there's a watchdog active at any given point on any standard install of a PVE node, whether you have ever used the HA stack or not. This is down to the very design of the PVE solution: even if you do not have any hardware watchdog [4], you get a software-emulated watchdog device called softdog [5] by default.
Now, whether or not you already know how watchdogs work in general, the PVE implementation is a bit of gymnastics. The softdog module is loaded no matter what; you can verify this with lsmod | grep softdog. Consider that a watchdog is essentially a ticking time bomb which, when it goes off, causes a reboot; the only way to keep the countdown from reaching zero is to reset it every once in a while. It works by providing a device which, once opened, needs to be touched within defined intervals, and unless that happens regularly, or the device is properly closed, the system will absolutely reboot. The module is loaded for a reason - to be used.

This is exactly what PVE does when it loads its watchdog-mux.service, which, as its name implies, is there to handle the feature in a staged (i.e. elaborate) way. This service loads on every node, every single time, irrespective of your HA stack use. It opens the watchdog device no matter what [6] and keeps it open on a running node. NB: it sets its timer to 10 seconds, which means that if something prevents watchdog-mux from keeping the softdog happy, your node will reboot.

The primary purpose of watchdog-mux.service is to listen on a socket to what it calls clients. Notably, when the service has active clients, it signifies so (confusingly) by creating a directory, /run/watchdog-mux.active/. The clients are pve-ha-crm.service and pve-ha-lrm.service. The principle replicates the general watchdog logic: the clients set a subordinate timer [7] with watchdog-mux.service, which in turn separately monitors whether they check in within the specified intervals, here the higher threshold of 60 seconds used for self-fencing. If such a client service dies unexpectedly, watchdog-mux.service stops resetting the softdog device, and that causes a reboot.

This is also triggered when HA is active (CRM and/or LRM active on that node at that moment) and quorum is lost, even though the machine is not otherwise in a frozen state. That is because a node without quorum will fail to obtain its lock within the cluster, at which point it stops feeding watchdog-mux.service [8].

In turn, that is why HA services can only be "recovered" within the HA stack after a delay: the recovery must never start unless the expectation can be met that the node that went incommunicado, for whatever reason (it could be intermittent but persisting network issues), at least did its part by not keeping duplicate services running despite having been cut off.
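The client state mentioned above can be inspected on a live node. A small sketch; the helper name is made up for illustration, and the only fact assumed from the text is that /run/watchdog-mux.active/ exists exactly while watchdog-mux has active clients:

```shell
# has_mux_clients DIR - report whether the watchdog-mux "active clients"
# marker directory exists; on a PVE node DIR is /run/watchdog-mux.active.
has_mux_clients() {
    if [ -d "$1" ]; then
        echo "HA clients connected - node is armed for self-fencing"
    else
        echo "no clients - watchdog-mux merely keeps the device open"
    fi
}

# On a real node you would call:
#   has_mux_clients /run/watchdog-mux.active
```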
The cascaded nature of the watchdog multiplexing, the CRM (which is "migratory") and the LRM (which is only "active" on a node with HA services running, including for 10 minutes after the last such service was migrated away), and the time-sensitive dependency on the node being in the primary component of the cluster (in the quorum), as well as on all the services feeding the watchdog(s) running without any hiccups, make it much more difficult to answer "what might have gone wrong" without more detailed logs.
Debugging this is often tedious if one takes on the endeavour, and it is easier to blame an upstream component (corosync) or a network flicker (the user).
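As a toy illustration of that cascade (this is not the actual watchdog-mux code; the 60-second client threshold is taken from the description above, and the simulated client is assumed to go silent at t=30s):

```shell
#!/bin/sh
# Simulate the mux's client timeout: a client (CRM/LRM) checks in every
# second until t=30, then goes silent; once 60s pass without a check-in,
# the mux stops feeding softdog and the node would self-fence.
CLIENT_TIMEOUT=60
last_checkin=0
t=0
fenced=no
while [ "$t" -lt 120 ]; do
    if [ "$t" -lt 30 ]; then
        last_checkin=$t              # client still alive and checking in
    fi
    if [ $((t - last_checkin)) -ge "$CLIENT_TIMEOUT" ]; then
        fenced=yes                   # threshold exceeded -> self-fence
        break
    fi
    t=$((t + 1))
done
echo "fenced=$fenced at t=${t}s"     # last check-in at t=29, fence at t=89
```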
In case you do NOT use High Availability
If your only question is how to really disable everything that can fire off a kernel watchdog reboot, the answer is getting rid of watchdog-mux.service. Do not kill it: killed, it fails to close the softdog device properly, which will itself cause a reboot. The same would happen if you stopped it while it still had active "clients".

Before that, you therefore have to get rid of pve-ha-crm.service and pve-ha-lrm.service. You stop them in this (reverse) order, and then you disable them. Upon upgrades - well, you get the idea; it was not designed to be neatly turned off, so you would have to mask them.

You can also blacklist the module:
Bash:
tee /etc/modprobe.d/softdog-deny.conf << 'EOF'
blacklist softdog
install softdog /bin/false
EOF
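The stop/disable/mask sequence described above might look like the following. This is a sketch only, for a node you are prepared to reboot if something goes wrong; note the ordering, HA clients first and the multiplexer last:

```shell
# Stop the HA clients first, so watchdog-mux is not left with live clients:
systemctl stop pve-ha-lrm.service
systemctl stop pve-ha-crm.service
# Disable and mask them so package upgrades cannot bring them back:
systemctl disable pve-ha-lrm.service pve-ha-crm.service
systemctl mask pve-ha-lrm.service pve-ha-crm.service
# Only now stop (never kill) the multiplexer itself, then mask it too:
systemctl stop watchdog-mux.service
systemctl mask watchdog-mux.service
```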
NOTE: Be sure you understand what disabling the watchdog means in case you ever re-enable HA, and why running HA without it is NOT a good idea. In all other cases, it is fairly reasonable to want such features inactive.
NOTE: A bug report was actually filed [9] regarding some rough edges in the HA stack. As of today, the bug is still present.
[1] https://pve.proxmox.com/wiki/High_Availability#ha_manager_fencing
[2] https://forum.proxmox.com/threads/unexpected-fencing.136345/#post-634179
[3] https://forum.proxmox.com/threads/i...p-the-only-ones-to-fences.122428/#post-532470
[4] https://www.kernel.org/doc/html/latest/watchdog/
[5] https://github.com/torvalds/linux/blob/master/drivers/watchdog/softdog.c
[6] https://github.com/proxmox/pve-ha-m...e0e8cdb2d0a37d47e0464/src/watchdog-mux.c#L157
[7] https://github.com/proxmox/pve-ha-m...e0e8cdb2d0a37d47e0464/src/watchdog-mux.c#L249
[8] https://github.com/proxmox/pve-ha-m...fe0e8cdb2d0a37d47e0464/src/PVE/HA/LRM.pm#L231
[9] https://bugzilla.proxmox.com/show_bug.cgi?id=5243