in the original setup there were 10 VMs protected by ha-manager, distributed to 3 (out of 5) compute nodes in our 8 node cluster. However, all 5 compute nodes emergency-rebooted at once - as this did not really match my expectation ("only" the 3 nodes running HA loads should've restarted) I'm hesitant to trust ha-manager too much - 2 hosts whose watchdog should've been in status idle / disarmed (no HA resources) nevertheless were force-restarted.
Hey @pwizard! I don't think you expect an answer from me, but having speed-read your thread so far, I think the (outstanding) gist for you is the quoted part above.
And then there is your earlier concern:
@aaron Do I understand you correctly - LRM/CRM is the
only component that arms the watchdog and as long as they are disabled, there is no other component of Proxmox, not even corosync / pmxcfs itself, that could trigger a reboot? pmxcfs simply switches to read-only if losing quorum, correct?
For instance, I recently filed a bug report [1]. As you can see, there are definitely some rough edges in the HA stack.
The other thing is, when it comes to understanding the watchdog behaviour, it's a bit more complicated. I tried to touch on that in another post [2] where
@bjsko was experiencing similar woes; in fact, the post from @t.lamprecht referenced within [3]
explains it much better than the official docs, which in my opinion are a simplification at best.
First of all, there's a watchdog active at any given point on any standard PVE node install, whether you have ever used the HA stack or not. This is by the very design of the PVE solution: even if you do not have any hardware watchdog [4], by default you get a software-emulated watchdog device called softdog [5].
Now whether you already know how watchdogs work in general or not, the PVE implementation is a bit of gymnastics. The softdog module is loaded no matter what; you can verify this with lsmod | grep softdog. Consider that a watchdog is essentially a ticking time bomb which causes a reboot when it goes off, so the only way to keep the countdown from reaching zero is to reset it every once in a while. It works by providing a device which, once open, needs to be touched within defined intervals; unless that happens regularly, or the device is properly closed, the system will absolutely reboot. The module is loaded for a reason - to be used.
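If you want to look at the device itself, a non-intrusive way is via sysfs (a sketch; it assumes the default softdog shows up as watchdog0 and that your kernel exposes these attributes - the numbering and available files may differ):
Code:
# confirm the softdog module is loaded
lsmod | grep softdog
# peek at the watchdog via sysfs without opening the device itself
cat /sys/class/watchdog/watchdog0/identity
cat /sys/class/watchdog/watchdog0/timeout
cat /sys/class/watchdog/watchdog0/state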
Now this is exactly what PVE does when it loads its watchdog-mux.service, which, as its name implies, is there to handle the feature in a staged (read: more elaborate than necessary) way.
This service loads on every node, every single time, irrespective of your HA stack use. It absolutely does open the watchdog device no matter what [6] and it keeps it open on a running node. NB: it sets its timer to 10 seconds, which means that if something prevents watchdog-mux from keeping the softdog happy, your node will reboot.
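You can see this on any node, HA or not (a sketch; the fuser call assumes the default /dev/watchdog path):
Code:
# watchdog-mux runs regardless of whether HA was ever used
systemctl status watchdog-mux.service
# show which process holds the watchdog device open
fuser -v /dev/watchdog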
The safer way to prevent this from happening is to get rid of the watchdog-mux service manually - but carefully. Do not kill it, as it would then fail to close the softdog device, which would also cause a reboot. The same would happen if you stop it while it has active "clients", because ...
You see, the primary purpose of the watchdog-mux.service is to listen on a socket to what it calls clients. Notably, when the service has active clients, it signifies so (confusingly) by creating /run/watchdog-mux.active/. The clients are the pve-ha-crm.service and pve-ha-lrm.service - the two you were pointed to above regarding the HA stack documentation. The principle replicates the general watchdog logic one level up: such clients set a subordinate timer [7] with the watchdog-mux.service, which in turn separately monitors whether they were able to check in within the specified intervals - that is the higher, 60-second threshold used for self-fencing.
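A quick way to tell whether watchdog-mux has any active clients on a node at a given moment (the directory name comes from the source linked above; the socket check is just an extra sanity test):
Code:
# present only while CRM/LRM are registered as watchdog-mux clients
ls -d /run/watchdog-mux.active/ 2>/dev/null && echo "HA watchdog clients active" || echo "no active clients"
# the listening socket of the multiplexer itself
ss -xlp | grep watchdog-mux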
If such a client service unexpectedly dies, it will cause the watchdog-mux.service to stop resetting the softdog device, and that will cause a reboot.
This is also triggered when HA is active (CRM and/or LRM active on that node at that moment) and quorum is lost, despite the machine not otherwise being in a frozen state. That is because a node without quorum will fail to obtain its lock within the cluster, at which point it stops feeding the watchdog-mux.service [8].
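So when a fencing event is being reconstructed, the quorum state at that moment matters as much as the HA services themselves; on a live node you can check it with:
Code:
# quorum view from the corosync side
corosync-quorumtool -s
# the PVE wrapper shows much the same, plus cluster membership
pvecm status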
In turn, that is why HA services can only be "recovered" within the HA stack after a delay: recovery must never start unless it can be assumed that the node which went incommunicado - for whatever reason, e.g. intermittent but persisting network issues - at least did its part by not keeping duplicate services running despite having been cut off.
The cascaded nature of the watchdog multiplexing, the CRM (which is "migratory"), the LRM (which is only "active" on a node with HA services running, including for 10 minutes after the last such service was migrated away), the time-sensitive dependency on the node being in the primary component of the cluster (in the quorum), and the need for all the watchdog-feeding services to run without any hiccups make it much more difficult to answer your question - what might have gone wrong - without more detailed logs.
Definitely beyond grep 'Oct 31 16:3' and corosync alone. As you can imagine from the above, it takes a hell of a lot of "structured" debugging if one takes on the endeavour, and it's easier to blame an upstream component (corosync) or a network flicker (the user).
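If you do take it on, a broader starting point than a single grep might be something like this (the timestamps are placeholders; narrow them to the minutes around the fencing event):
Code:
# pull the HA stack, watchdog-mux, corosync and pmxcfs logs for the window of interest
journalctl --since "2023-10-31 16:25" --until "2023-10-31 16:45" \
  -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux -u corosync -u pve-cluster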
But if your only question is how to really disable anything that fires off the kernel watchdog reboots, it comes down to getting rid of the watchdog-mux.service. Before that, you have to do the same with the pve-ha-crm.service and pve-ha-lrm.service - you stop those first, then watchdog-mux, and then you disable them. Upon upgrades, well, you get the idea ...
it was not designed to be neatly turned off. It's always going to haunt you.
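A sketch of that sequence (assuming the node currently carries no HA resources; note that watchdog-mux may turn out to be a static unit, in which case masking rather than disabling is needed to keep it down):
Code:
# stop the watchdog-mux clients first ...
systemctl stop pve-ha-lrm.service pve-ha-crm.service
# ... and only then the multiplexer, so it closes the softdog device cleanly
systemctl stop watchdog-mux.service
# keep them from coming back on the next boot
systemctl disable pve-ha-lrm.service pve-ha-crm.service
systemctl mask watchdog-mux.service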
Or you go full resistance...
Code:
# keep the softdog module from ever being loaded again, even on explicit request
tee /etc/modprobe.d/softdog-deny.conf << 'EOF'
blacklist softdog
install softdog /bin/false
EOF
... or they address it.
[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5243
[2] https://forum.proxmox.com/threads/unexpected-fencing.136345/#post-634179
[3] https://forum.proxmox.com/threads/i...p-the-only-ones-to-fences.122428/#post-532470
[4] https://www.kernel.org/doc/html/latest/watchdog/
[5] https://github.com/torvalds/linux/blob/master/drivers/watchdog/softdog.c
[6] https://github.com/proxmox/pve-ha-m...e0e8cdb2d0a37d47e0464/src/watchdog-mux.c#L157
[7] https://github.com/proxmox/pve-ha-m...e0e8cdb2d0a37d47e0464/src/watchdog-mux.c#L249
[8] https://github.com/proxmox/pve-ha-m...fe0e8cdb2d0a37d47e0464/src/PVE/HA/LRM.pm#L231