[TUTORIAL] [High Availability] Watchdog reboots

esi_y

First of all, you can recognise watchdog-induced reboots of your node from the end of the last boot's log containing entries such as:

Code:
watchdog-mux: Client watchdog expired - disable watchdog updates

kernel: watchdog: watchdog0: watchdog did not stop!

You should probably start with reading the official documentation on the topic [1].

Nevertheless, when it comes to understanding the watchdog behaviour, it is a bit more complicated than what the docs cover. I tried to touch on that once in another post [2] where the OP was experiencing his own woes; in fact, the staff post referenced within [3] explains the matter much better than the official docs, which are a simplification at best in my opinion. There also seems to be some confusion about active and inactive, or so-called disarmed, watchdogs.



Watchdog(s)


First of all, there is a watchdog active at any given point on any standard install of a PVE node, whether you have ever used the HA stack or not. This is down to the very design of the PVE solution: even if you do not have any hardware watchdog [4], you get a software-emulated watchdog device called softdog [5] by default.

Now whether you already know how watchdogs work in general or not, the PVE solution is a bit of gymnastics in its implementation. The softdog module is loaded no matter what; you can verify this with lsmod | grep softdog. Consider that a watchdog is essentially a ticking time bomb which, when it goes off, causes a reboot - the only way not to have the countdown reach zero is to reset it every once in a while. It works by providing a device which, once open, needs to be touched within defined intervals; unless that happens regularly, or the device is properly closed, the system will absolutely reboot. The module is loaded for a reason - to be used.
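A quick way to see both the module and the device it provides (just a sanity check, nothing touches the device here):

Bash:
# confirm the softdog module is loaded
lsmod | grep softdog
# and that the kernel exposes a watchdog device (node names can differ)
ls -l /dev/watchdog /dev/watchdog0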

Now this is exactly what PVE does when it loads its watchdog-mux.service, which, as its name implies, is there to handle the feature in a staged (i.e. elaborate) way. This service loads on every node, every single time, irrespective of your HA stack use. It absolutely does open the watchdog device no matter what [6] and it keeps it open on a running node. NB It sets its timer to 10 seconds, which means that if something prevents watchdog-mux from keeping the softdog happy, your node will reboot.
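You can convince yourself that it is indeed watchdog-mux holding the device open - a sketch, assuming the fuser utility (psmisc) and the watchdog sysfs interface are available:

Bash:
# which process holds /dev/watchdog0 open - expect watchdog-mux
fuser -v /dev/watchdog0
# the timeout the device is armed with
cat /sys/class/watchdog/watchdog0/timeout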

The primary purpose of the watchdog-mux.service is to listen on a socket to what it calls clients. Notably, when the service has active clients, it signifies so (confusingly) by creating /run/watchdog-mux.active/. The clients are the pve-ha-crm.service and pve-ha-lrm.service. The principle replicates the general logic: such clients set a subordinate timer [7] with the watchdog-mux.service, which in turn separately monitors whether they manage to check in with it within the specified interval - that is the higher threshold of 60 seconds used for self-fencing. If such a service unexpectedly dies, it causes the watchdog-mux.service to stop resetting the softdog device, and that causes a reboot.
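Whether watchdog-mux currently has such clients - i.e. whether the node would self-fence if they misbehave - can be checked via that very directory (a sketch):

Bash:
# the directory exists only while CRM/LRM are registered as clients
[ -d /run/watchdog-mux.active ] && echo "watchdog-mux has active clients" || echo "no active clients"
systemctl status pve-ha-crm.service pve-ha-lrm.service --no-pager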

This is also triggered when HA is active (CRM and/or LRM active on that node at that moment) and quorum is lost, even though the machine is not otherwise in a frozen state. This is because a node without quorum will fail to obtain its lock within the cluster, at which point it stops feeding the watchdog-mux.service [8].
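Whether a node currently sits in the quorate part of the cluster can be checked with pvecm (a sketch):

Bash:
# quick quorum check on the node in question
pvecm status | grep -E 'Quorate|Total votes'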

In turn, that is why HA services can only be "recovered" within the HA stack after a period: the recovery should never start unless the expectation can be met that the node that went incommunicado for whatever reason (it could be intermittent but persisting network issues) at least did its part by not keeping duplicate services running despite having been cut off.

The cascaded nature of the watchdog multiplexing, the CRM (which is "migratory") and the LRM (which is only "active" on a node with HA services running, including for 10 minutes after the last such service migrated away), and the time-sensitive dependency on the node being in the primary component of the cluster (in the quorum), as well as on all the services feeding the watchdog(s) running without any hiccups, make it much more difficult to answer "what might have gone wrong" without more detailed logs.

Debugging this is often tedious if one takes on the endeavour, and it is easier to blame an upstream component (corosync) or a network flicker (the user).



In case you do NOT use High Availability


If your only question is how to really disable anything that fires off the kernel watchdog reboots, the answer is getting rid of the watchdog-mux.service. Do not kill it, as it would then fail to close the softdog device, which would cause a reboot. The same would happen if you stopped it while it still had active "clients".

Before that, you therefore have to get rid of pve-ha-crm.service and pve-ha-lrm.service. You stop them in reverse order: first pve-ha-lrm, then pve-ha-crm. And then you disable them. Upon upgrades, well, you get the idea ... it was not designed to be neatly turned off, so you would have to mask them as well.
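Put together, the sequence might look like this - a sketch of the steps above, only for nodes where you are sure HA will not be used:

Bash:
# stop the HA services first, so watchdog-mux is left without clients
systemctl stop pve-ha-lrm.service
systemctl stop pve-ha-crm.service
# keep them from coming back on boot or with upgrades
systemctl disable pve-ha-lrm.service pve-ha-crm.service
systemctl mask pve-ha-lrm.service pve-ha-crm.service
# only now stop the multiplexer itself - without clients it closes the softdog device cleanly
systemctl stop watchdog-mux.service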

You can also blacklist the module:

Bash:
tee /etc/modprobe.d/softdog-deny.conf << 'EOF'
blacklist softdog
install softdog /bin/false
EOF

NOTE: Be sure you understand what disabling the watchdog means in case you were to ever re-enable HA, and why that combination is NOT a good idea. In all other cases, it is fairly reasonable to not want such features active.
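Should you later change your mind, the reverse would roughly be (a sketch):

Bash:
# drop the blacklist, unmask and re-enable the HA services, then reboot so softdog loads again
rm /etc/modprobe.d/softdog-deny.conf
systemctl unmask pve-ha-lrm.service pve-ha-crm.service
systemctl enable pve-ha-lrm.service pve-ha-crm.service
reboot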



NOTE: There was actually a bug report filed [9] regarding some rough edges in the HA stack. As of today, the bug is still present.



[1] https://pve.proxmox.com/wiki/High_Availability#ha_manager_fencing
[2] https://forum.proxmox.com/threads/unexpected-fencing.136345/#post-634179
[3] https://forum.proxmox.com/threads/i...p-the-only-ones-to-fences.122428/#post-532470
[4] https://www.kernel.org/doc/html/latest/watchdog/
[5] https://github.com/torvalds/linux/blob/master/drivers/watchdog/softdog.c
[6] https://github.com/proxmox/pve-ha-m...e0e8cdb2d0a37d47e0464/src/watchdog-mux.c#L157
[7] https://github.com/proxmox/pve-ha-m...e0e8cdb2d0a37d47e0464/src/watchdog-mux.c#L249
[8] https://github.com/proxmox/pve-ha-m...fe0e8cdb2d0a37d47e0464/src/PVE/HA/LRM.pm#L231
[9] https://bugzilla.proxmox.com/show_bug.cgi?id=5243
 
Additional notes

You can check for watchdog-mux behaviour with:

Bash:
strace -t -e ioctl -p $(pidof watchdog-mux) 2>&1 | grep WDIOC_KEEPALIVE

And for the device:

Bash:
wdctl /dev/watchdog0

You cannot use alternative watchdog handlers:
Bash:
# apt install --dry-run -o Debug::pkgProblemResolver=true watchdog

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Starting pkgProblemResolver with broken count: 1
Starting 2 pkgProblemResolver with broken count: 1
Investigating (0) pve-ha-manager:amd64 < 4.0.3 @ii K Ib >
Broken pve-ha-manager:amd64 Conflicts on watchdog:amd64 < none -> 5.16-1+b2 @un puN >
  Considering watchdog:amd64 9998 as a solution to pve-ha-manager:amd64 9
  Removing pve-ha-manager:amd64 rather than change watchdog:amd64
Investigating (0) qemu-server:amd64 < 8.0.10 @ii K Ib >
Broken qemu-server:amd64 Depends on pve-ha-manager:amd64 < 4.0.3 @ii R > (>= 3.0-9)
  Considering pve-ha-manager:amd64 9 as a solution to qemu-server:amd64 7
  Removing qemu-server:amd64 rather than change pve-ha-manager:amd64
Investigating (0) pve-container:amd64 < 5.0.8 @ii K Ib >
Broken pve-container:amd64 Depends on pve-ha-manager:amd64 < 4.0.3 @ii R > (>= 3.0-9)
  Considering pve-ha-manager:amd64 9 as a solution to pve-container:amd64 6
  Removing pve-container:amd64 rather than change pve-ha-manager:amd64
Investigating (0) pve-manager:amd64 < 8.1.4 @ii K Ib >
Broken pve-manager:amd64 Depends on pve-container:amd64 < 5.0.8 @ii R > (>= 5.0.5)
  Considering pve-container:amd64 6 as a solution to pve-manager:amd64 1
  Removing pve-manager:amd64 rather than change pve-container:amd64
Investigating (0) proxmox-ve:amd64 < 8.1.0 @ii K Ib >
Broken proxmox-ve:amd64 Depends on pve-manager:amd64 < 8.1.4 @ii R > (>= 8.0.4)
  Considering pve-manager:amd64 1 as a solution to proxmox-ve:amd64 0
  Removing proxmox-ve:amd64 rather than change pve-manager:amd64
Done
The following packages will be REMOVED:
  proxmox-ve pve-container pve-ha-manager pve-manager qemu-server
The following NEW packages will be installed:
  watchdog
0 upgraded, 1 newly installed, 5 to remove and 4 not upgraded.
Remv proxmox-ve [8.1.0]
Remv pve-manager [8.1.4]
Remv qemu-server [8.0.10] [pve-ha-manager:amd64 ]
Remv pve-ha-manager [4.0.3] [pve-container:amd64 ]
Remv pve-container [5.0.8]
Inst watchdog (5.16-1+b2 Debian:12.5/stable [amd64])
Conf watchdog (5.16-1+b2 Debian:12.5/stable [amd64])

And you cannot remove the HA stack on its own:
https://forum.proxmox.com/threads/cannot-remove-pve-ha-manager-why.141940/#post-636316
 
First of all, thanks for your Efforts in shedding Light on this Topic.

I currently rely on the Watchdog on some Raspberry Pi SBC in order to force a Reboot in case they freeze for whatever Reason.

I was planning on doing the same with Proxmox VE (single Standalone Host, no HA) and to my Disbelief trying to install watchdog prompted the entire removal of Proxmox VE :rolleyes: .

So I cannot activate Proxmox HA because it's a Standalone Host (no Cluster). And I cannot install the Watchdog otherwise.

And it seems that one of my Hosts freezes from Time to Time (around every 2 Weeks). It might be related to an Intel X710 NIC that I recently started using. Or it might be somewhat completely unrelated, I don't know. I cannot SSH or ping. And in the IPMI Console I just see some text (no real Error/Panic Message) but I cannot do anything. Pressing ENTER or any other Key doesn't change anything on the Screen. While it didn't Panic, it seems to have Frozen.

Thus I was looking for a Way to automatically reboot it in case this happens again.

I see your explanation is more about how to disable the Watchdog / Softdog. Do you have any insight as to how to enable it ?

Otherwise the only other alternative is to setup some kind of Remote Monitoring and trigger a System Reset via IPMI/BMC but that's additional Work :rolleyes: .
 
you can still enable HA on a single node (some people do that to automatically restart guests that might crash, for example), which will still arm the watchdog and fence your system if it becomes unresponsive ;)
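for example, adding a guest as an HA resource is a one-liner (a sketch, VMID 100 assumed):

Bash:
# ask the HA stack to keep this guest started
ha-manager add vm:100 --state started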
 
you can still enable HA on a single node (some people do that to automatically restart guests that might crash, for example), which will still arm the watchdog and fence your system if it becomes unresponsive ;)
I'm a bit confused though ... https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_requirements_3 says (about HA):
You must meet the following requirements before you start with HA:

  • at least three cluster nodes (to get reliable quorum)
  • shared storage for VMs and containers
  • hardware redundancy (everywhere)
  • use reliable “server” components
  • hardware watchdog - if not available we fall back to the linux kernel software watchdog (softdog)
  • optional hardware fencing devices

So it would appear it's not possible to enable HA with a Single / Standalone Node (well, it also doesn't make much sense to enable HA in such case obviously).

Are you talking about HA Manager Fencing Instead for a Single Node ?

There are also some Notes about Fencing in the Wiki.

But how is this achieved in a Standalone Setup ? I understood the softdog is enabled by default yet the Host didn't reboot by itself once the Freeze occurred.

Or the softdog is loaded, but no action (reboot) is Taken ? I guess that since it's the Kernel that froze (apparently), of course the Kernel cannot kick itself and reboot :rolleyes:.
 
well, to achieve actual HA you obviously need three nodes minimum and proper redundancy. to enable the HA mode/feature of PVE you don't. the linked wiki article is very outdated (see the note on top), the fencing part of the regular docs should still be correct. if your board has a proper (supported/working) hardware watchdog, you can use that. else you need to use the softdog, which may or may not work for all failure scenarios relevant for your use case. you can check the logs to see which watchdog is used and whether it is armed.
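for example (a sketch, the exact messages depend on hardware and versions):

Bash:
# which driver watchdog-mux picked up on this boot
journalctl -b -u watchdog-mux | grep -i 'watchdog driver'
# whether the kernel registered a hardware watchdog or fell back to softdog
dmesg | grep -iE 'watchdog|softdog'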
 
Glad I found this topic! I built a 2-node cluster with a Pi quorum device, and every time I isolated one node by pulling the network plug, the remaining node rebooted due to the watchdog. You can see it below: I pulled the plug on 172.16.1.21, and 172.16.1.20 stops sending corosync beacons because the watchdog simply kills that node too. Then I'm stuck with zero nodes left....

Is this expected behaviour in Proxmox 9.0.5, or is some code inverted ;) I do understand that a reboot of the isolated/unplugged node is most of the time a good idea. But the remaining node, which is in sync with the quorum, also reboots despite a quorum value of 2 out of 3. Super weird behaviour?

Does disabling HA in the GUI fix this behaviour? I can live without HA.

1755797702340.png
 
Glad I found this topic! I built a 2-node cluster with a Pi quorum device, and every time I isolated one node by pulling the network plug, the remaining node rebooted due to the watchdog. You can see it below: I pulled the plug on 172.16.1.21, and 172.16.1.20 stops sending corosync beacons because the watchdog simply kills that node too. Then I'm stuck with zero nodes left....

Is this expected behaviour in Proxmox 9.0.5, or is some code inverted ;) I do understand that a reboot of the isolated/unplugged node is most of the time a good idea. But the remaining node, which is in sync with the quorum, also reboots despite a quorum value of 2 out of 3. Super weird behaviour?

Does disabling HA in the GUI fix this behaviour? I can live without HA.

View attachment 89719
Are you sure you set a different "weight" (number of Votes) for the "remaining" Node ?

What is :ffff:172.16.1.20 and :ffff:172.16.1.21 ? It looks like 2 Hosts are down, not only 1 ... But who is monitoring/pinging them then ? Do you have another Device doing the Monitoring or ?
 
Are you sure you set a different "weight" (number of Votes) for the "remaining" Node ?

What is :ffff:172.16.1.20 and :ffff:172.16.1.21 ? It looks like 2 Hosts are down, not only 1 ... But who is monitoring/pinging them then ? Do you have another Device doing the Monitoring or ?
Thanks for your reply! The output is captured from the Pi, the quorum device. Both nodes, pve (172.16.1.20) and pve1 (172.16.1.21), went down when I unplugged the LAN port of pve1: you see .21 disconnecting, and in my humble opinion node .20 should have remained up. But that one also disconnected and rebooted due to watchdog-mux, which does not make sense, because in a cluster you want uptime ;-)

I'm not familiar with setting weights; they all have one vote now, which makes 3 in total? That leaves a majority of 2 out of 3 when I unplug one node from the network?

Code:
root@pve:~# pvecm status
Cluster information
-------------------
Name:             Home
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Aug 21 22:13:20 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.26d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20 (local)
0x00000002          1    A,V,NMW 172.16.1.21
0x00000000          1            Qdevice
 
Thanks for your reply! The output is captured from the Pi, the quorum device. Both nodes, pve (172.16.1.20) and pve1 (172.16.1.21), went down when I unplugged the LAN port of pve1: you see .21 disconnecting, and in my humble opinion node .20 should have remained up. But that one also disconnected and rebooted due to watchdog-mux, which does not make sense, because in a cluster you want uptime ;-)

I'm not familiar with setting weights; they all have one vote now, which makes 3 in total? That leaves a majority of 2 out of 3 when I unplug one node from the network?

Code:
root@pve:~# pvecm status
Cluster information
-------------------
Name:             Home
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Aug 21 22:13:20 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.26d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20 (local)
0x00000002          1    A,V,NMW 172.16.1.21
0x00000000          1            Qdevice
Well, it says quorate, so it should be fine. Is this after unplugging the Cable on pve1, and pve2 also rebooted ? Are you sure it's a reboot and not e.g. some Network IP Address Conflict due to Reconfiguration or something along those Lines ? Does uptime on the 2nd Host confirm that it rebooted (uptime should yield a very low Value in Minutes if it indeed rebooted).

It's been a long Time since I "played" with Clusters, since it's not really suited for an Ad-Hoc Homelab Deployment where many of my Systems might go up or down (intentionally) based on the Need and the Cost of Electricity :).

I remember some Years back I added some higher "Weight" to those Nodes that were intended to always stay on, while leaving a Weight of 1 to the Ad-Hoc ones.

Probably it was done by setting the votes: Entry to something higher than 1 (like 10 IIRC in my Case) in /etc/corosync/corosync.conf.
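For Illustration, such a per-Node Entry might look roughly like this (Node Names, Addresses and the Value 10 are purely hypothetical; in current PVE the Key is quorum_votes, and one would normally edit /etc/pve/corosync.conf and bump config_version rather than the File under /etc/corosync directly):

Code:
nodelist {
  node {
    # hypothetical always-on node, given more weight
    name: nodeA
    nodeid: 1
    quorum_votes: 10
    ring0_addr: 10.0.0.1
  }
  node {
    # hypothetical ad-hoc node
    name: nodeB
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.2
  }
}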

On the other Hand, I'm NOT familiar with the Concept of a "Quorum Device" apart from what Google apparently returns.

It seems to be a "Lightweight Quorum Device" ?

By the Way, how is the Networking between all these Devices ? Are you sure you don't have something else going on in the other Pi ?
 
Yes, it's a lightweight Quorum device, just heartbeating like a node to create an extra witness for the cluster, to prevent bad failover decisions / split-brains. My network is out of scope, that's my day job, they have 100% reachability.

I can write a lot of tech details, but this one summarises it all: after adding the watchdog blacklist from this topic, all my issues were gone and the node stays up, which is confirmed by the uptime counters. I even tried to disable the Debian kernel watchdog; the only solution was the blacklist code from this topic to disable the Proxmox watchdog-mux. The reboot is 100% initiated by watchdog-mux, but without an explainable reason - other than that the pve-ha-crm.service and pve-ha-lrm.service are failing on the unplugged node, i.e. it is not reachable any more from the remaining nodes, and that causes the watchdog reset on the remaining node, as written in the start post, with 100% downtime of all the virtualised servers.

I'm happy I noticed this at home and not at a customer production site ;-) Maybe @fabian can shine his light on it?
 
Yes, it's a lightweight Quorum device, just heartbeating like a node to create an extra witness for the cluster, to prevent bad failover decisions / split-brains.
Alright :).

My network is out of scope, that's my day job, they have 100% reachability.
Never say never ;). If they have 100% reachability why is the other Node going down ;) ?

You say you are certain it's the Watchdog, but does the Watchdog or the PVE Service give any indication as to why ? Is it communication Failure, Quorum Failure, etc ?

There seems to be a delay of up to 63 Seconds, although I'm not sure what the corosync Sampling Time / Ping Interval is.

There are some Notes in the Section lost agent lock in the Wiki, not sure if that explains it:
https://pve.proxmox.com/wiki/High_Availability

That seems to match the 63 Seconds you described in your initial Post.

Anything else from the Detailed logs ?

Read the Logs
The HA Stack logs every action it makes. This helps to understand what and also why something happens in the cluster. Here its important to see what both daemons, the LRM and the CRM, did. You may use journalctl -u pve-ha-lrm on the node(s) where the service is and the same command for the pve-ha-crm on the node which is the current master.



I can write a lot of tech details, but this one summarises it all: after adding the watchdog blacklist from this topic, all my issues were gone and the node stays up, which is confirmed by the uptime counters. I even tried to disable the Debian kernel watchdog; the only solution was the blacklist code from this topic to disable the Proxmox watchdog-mux. The reboot is 100% initiated by watchdog-mux, but without an explainable reason - other than that the pve-ha-crm.service and pve-ha-lrm.service are failing on the unplugged node, i.e. it is not reachable any more from the remaining nodes, and that causes the watchdog reset on the remaining node, as written in the start post, with 100% downtime of all the virtualised servers.

I'm happy I noticed this at home and not at a customer production site ;-) Maybe @fabian can shine his light on it?
Well I'm not an Expert in Cluster as I said already BUT ... your Cluster looks like it's only for Configuration Purposes (like mine was), to basically prevent editing the Cluster if too many Members are down, allow (manual) migration of Containers / Virtual Machines from one Host to the other using the GUI, etc.

You are NOT doing HA i.e. running Ceph or Similar, with actual replication of the Data going on in real Time / all the Time, right ?

The Question I have for you is then: why is only the second Node rebooting and NOT the Raspberry Pi ? There's probably something specific about the 2nd Node that causes it to reboot. And the first Item on my List would be the Network.

Can you setup a monitoring System with a sampling Time of approximately 1 Second in that L2 Segment of your Network to see if you can ping every Host from everywhere at all Times, even when you unplug pve1 ? My guess is that pve2 rebooted because the Connection, either to a specific Host (you might have configured that at one Point) or in general towards e.g. your Router, was for some Reason lost.

If you have Firewalls / Routing between any of these, make sure that you don't have some dependency on pve1. Be it a Virtual Router, Nameserver, IPv6 Router Advertisement Daemon (radvd), etc. Such that if pve1 goes down, then the other Host won't have access to anything, Times out, Watchdog triggers and System reboots.
 
Thanks, let me do two tests. One without HA configured on VM level and one with HA configured on VM level.

I enabled the watchdog back to the default Proxmox state, and I just pulled the network plug on pve1 like yesterday. Below is the output of this test, which shows normal behaviour. The remaining pve node stays up and healthy! The cluster shows that it has lost a node but is still reachable; that's also proof that the network is working fine and there are no firewalls in between or any packet drops.

I'll post them separately:

Test without HA: pulling the network plug on pve1 while pve remains running

Starting point
Code:
root@pve1:~# ls /dev/watchdog*
lsmod | grep softdog
systemctl status watchdog-mux
ha-manager status
/dev/watchdog  /dev/watchdog0
softdog                16384  2
● watchdog-mux.service - Proxmox VE watchdog multiplexer
     Loaded: loaded (/usr/lib/systemd/system/watchdog-mux.service; static)
     Active: active (running) since Fri 2025-08-22 08:18:35 CEST; 3min 46s ago
 Invocation: 2180962d2d7f41e78cc4476ffe3e4ccb
   Main PID: 670 (watchdog-mux)
      Tasks: 1 (limit: 38034)
     Memory: 224K (peak: 1.8M)
        CPU: 22ms
     CGroup: /system.slice/watchdog-mux.service
             └─670 /usr/sbin/watchdog-mux

Aug 22 08:18:35 pve1 watchdog-mux[670]: Watchdog driver 'Software Watchdog', version 0
quorum OK
master pve1 (idle, Thu Aug 21 19:27:09 2025)
lrm pve (idle, Fri Aug 22 08:22:22 2025)
lrm pve1 (idle, Fri Aug 22 08:22:20 2025)


root@pve1:~# ha-manager status
pvecm status
systemctl status pve-ha-crm pve-ha-lrm
quorum OK
master pve1 (idle, Thu Aug 21 19:27:09 2025)
lrm pve (idle, Fri Aug 22 08:24:07 2025)
lrm pve1 (idle, Fri Aug 22 08:24:05 2025)
Cluster information
-------------------
Name:             Home
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 22 08:24:08 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.27f
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20
0x00000002          1    A,V,NMW 172.16.1.21 (local)
0x00000000          1            Device
Code:
root@pve1:~# ls /dev/watchdog*
lsmod | grep softdog
systemctl status watchdog-mux
ha-manager status
/dev/watchdog  /dev/watchdog0
softdog                16384  2
● watchdog-mux.service - Proxmox VE watchdog multiplexer
     Loaded: loaded (/usr/lib/systemd/system/watchdog-mux.service; static)
     Active: active (running) since Fri 2025-08-22 08:18:35 CEST; 27min ago
 Invocation: 2180962d2d7f41e78cc4476ffe3e4ccb
   Main PID: 670 (watchdog-mux)
      Tasks: 1 (limit: 38034)
     Memory: 224K (peak: 1.8M)
        CPU: 96ms
     CGroup: /system.slice/watchdog-mux.service
             └─670 /usr/sbin/watchdog-mux

Aug 22 08:18:35 pve1 watchdog-mux[670]: Watchdog driver 'Software Watchdog', version 0
quorum OK
master pve1 (idle, Thu Aug 21 19:27:09 2025)
lrm pve (idle, Fri Aug 22 08:45:37 2025)
lrm pve1 (idle, Fri Aug 22 08:45:37 2025)

root@pve1:~# pvecm status
Cluster information
-------------------
Name:             Home
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 22 08:45:35 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.287
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20
0x00000002          1    A,V,NMW 172.16.1.21 (local)
0x00000000          1            Qdevice

Pulled the network plug on pve1, therefore no SSH access is possible any more; logs from pve:
Code:
root@pve:~# date && echo "DISCONNECT pve1 NOW"
Fri Aug 22 08:47:14 AM CEST 2025
DISCONNECT pve1 NOW


Code:
root@pve:~# journalctl -f -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux
Aug 22 08:19:27 pve watchdog-mux[746]: Watchdog driver 'Software Watchdog', version 0
Aug 22 08:19:31 pve pve-ha-crm[1208]: starting server
Aug 22 08:19:31 pve pve-ha-crm[1208]: status change startup => wait_for_quorum
Aug 22 08:19:52 pve pve-ha-lrm[1389]: starting server
Aug 22 08:19:52 pve pve-ha-lrm[1389]: status change startup => wait_for_agent_lock
Aug 22 08:48:14 pve pve-ha-lrm[1389]: loop take too long (67 seconds)


root@pve:~# journalctl -f -u corosync -u pve-cluster -u pmxcfs
Aug 22 08:47:15 pve corosync[1143]:   [KNET  ] link: host: 2 link: 0 is down
Aug 22 08:47:15 pve corosync[1143]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 22 08:47:15 pve corosync[1143]:   [KNET  ] host: host: 2 has no active links
Aug 22 08:47:24 pve corosync[1143]:   [TOTEM ] Token has not been received in 22500 ms
Aug 22 08:47:32 pve corosync[1143]:   [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Aug 22 08:48:08 pve corosync[1143]:   [QUORUM] Sync members[1]: 1
Aug 22 08:48:08 pve corosync[1143]:   [QUORUM] Sync left[1]: 2
Aug 22 08:48:08 pve corosync[1143]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Aug 22 08:48:08 pve corosync[1143]:   [TOTEM ] A new membership (1.28b) was formed. Members left: 2
Aug 22 08:48:08 pve corosync[1143]:   [TOTEM ] Failed to receive the leave message. failed: 2
Aug 22 08:48:08 pve pmxcfs[972]: [dcdb] notice: members: 1/972
Aug 22 08:48:08 pve pmxcfs[972]: [status] notice: members: 1/972
Aug 22 08:48:09 pve corosync[1143]:   [QUORUM] Members[1]: 1
Aug 22 08:48:09 pve corosync[1143]:   [MAIN  ] Completed service synchronization, ready to provide service.


Every 2.0s: ha-manager status && echo "--- Cluster Status ---" && pvecm status                                                                                                        pve: Fri Aug 22 08:51:12 2025

quorum OK
master pve1 (idle, Thu Aug 21 19:27:09 2025)
lrm pve (idle, Fri Aug 22 08:51:09 2025)
lrm pve1 (old timestamp - dead?, Fri Aug 22 08:47:02 2025)
--- Cluster Status ---
Cluster information
-------------------
Name:             Home
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 22 08:51:13 2025
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.28b
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20 (local)
0x00000000          1            Qdevice

PiServer quorum device: only .21 (pve1) is lost. All went fine, the VMs remain up and running, like I expect from a cluster.
Code:
pi@piserver:~$ journalctl -f -u corosync-qnetd
Aug 21 18:22:22 piserver systemd[1]: Starting corosync-qnetd.service - Corosync Qdevice Network daemon...
Aug 21 18:22:22 piserver systemd[1]: Started corosync-qnetd.service - Corosync Qdevice Network daemon.
Aug 21 18:30:50 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:55348 doesn't sent any message during 12000ms. Disconnecting
Aug 21 18:31:53 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:34180 doesn't sent any message during 12000ms. Disconnecting
Aug 21 18:36:28 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:37328 doesn't sent any message during 12000ms. Disconnecting
Aug 21 18:59:44 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:45948 doesn't sent any message during 12000ms. Disconnecting
Aug 21 19:00:53 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:44904 doesn't sent any message during 12000ms. Disconnecting
Aug 21 19:08:08 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:56990 doesn't sent any message during 12000ms. Disconnecting
Aug 21 19:33:24 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:36562 doesn't sent any message during 12000ms. Disconnecting
Aug 21 20:15:31 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:43138 doesn't sent any message during 12000ms. Disconnecting
Aug 22 08:25:46 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:51076 doesn't sent any message during 12000ms. Disconnecting

Aug 22 08:47:17 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:33306 doesn't sent any message during 12000ms. Disconnecting
no .20 ip showing up
 
Thanks, let me do two tests. One without HA configured on VM level and one with HA configured on VM level.

I enabled the watchdog back to the default Proxmox state, and I just pulled the network plug on pve1 like yesterday. Below is the output of this test, which shows normal behaviour. The remaining pve node stays up and healthy! The cluster shows that it has lost a node but is still reachable; that's also proof that the network is working fine and there are no firewalls in between or any packet drops.

So what exactly is the difference between these 2 Tests and what you did yesterday, which (apparently) resulted in the "healthy" Node also rebooting ?
 
@silverstone the only difference between these tests is enabling HA.

My conclusion: CRM is rebooting the wrong node and causing a full cluster outage via watchdog fencing on the remaining node.

Test number two with HA configured:
1755845639898.png

pve-ha-crm chimes in and does the configuration; the protected VM is running on the node that got isolated in this test.
Code:
Aug 22 08:53:32 pve pve-ha-crm[1208]: successfully acquired lock 'ha_manager_lock'
Aug 22 08:53:32 pve pve-ha-crm[1208]: watchdog active
Aug 22 08:53:32 pve pve-ha-crm[1208]: status change wait_for_quorum => master
Aug 22 08:53:32 pve pve-ha-crm[1208]: adding new service 'vm:102' on node 'pve1'
Aug 22 08:53:32 pve pve-ha-crm[1208]: service 'vm:102': state changed from 'request_start' to 'started'  (node = pve1)


Cluster up | stable | watchdog running
Code:
root@pve1:~# ls /dev/watchdog*
lsmod | grep softdog
systemctl status watchdog-mux
ha-manager status
/dev/watchdog  /dev/watchdog0
softdog                16384  2
● watchdog-mux.service - Proxmox VE watchdog multiplexer
     Loaded: loaded (/usr/lib/systemd/system/watchdog-mux.service; static)
     Active: active (running) since Fri 2025-08-22 08:18:35 CEST; 37min ago
 Invocation: 2180962d2d7f41e78cc4476ffe3e4ccb
   Main PID: 670 (watchdog-mux)
      Tasks: 1 (limit: 38034)
     Memory: 224K (peak: 1.8M)
        CPU: 132ms
     CGroup: /system.slice/watchdog-mux.service
             └─670 /usr/sbin/watchdog-mux

Aug 22 08:18:35 pve1 watchdog-mux[670]: Watchdog driver 'Software Watchdog', version 0
quorum OK
master pve (active, Fri Aug 22 08:55:52 2025)
lrm pve (idle, Fri Aug 22 08:55:54 2025)
lrm pve1 (active, Fri Aug 22 08:55:54 2025)
service vm:102 (pve1, started)
root@pve1:~# pvecm status
Cluster information
-------------------
Name:             Home
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 22 08:56:08 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.28f
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20
0x00000002          1    A,V,NMW 172.16.1.21 (local)
0x00000000          1            Qdevice
Code:
root@pve:~# ls /dev/watchdog*
lsmod | grep softdog
systemctl status watchdog-mux
ha-manager status
/dev/watchdog  /dev/watchdog0
softdog                16384  2
● watchdog-mux.service - Proxmox VE watchdog multiplexer
     Loaded: loaded (/usr/lib/systemd/system/watchdog-mux.service; static)
     Active: active (running) since Fri 2025-08-22 08:19:27 CEST; 35min ago
 Invocation: 316928de3c9c422685bef7c18dd0dd22
   Main PID: 746 (watchdog-mux)
      Tasks: 1 (limit: 38055)
     Memory: 224K (peak: 1.8M)
        CPU: 62ms
     CGroup: /system.slice/watchdog-mux.service
             └─746 /usr/sbin/watchdog-mux

Aug 22 08:19:27 pve watchdog-mux[746]: Watchdog driver 'Software Watchdog', version 0
quorum OK
master pve (active, Fri Aug 22 08:55:22 2025)
lrm pve (idle, Fri Aug 22 08:55:24 2025)
lrm pve1 (active, Fri Aug 22 08:55:24 2025)
service vm:102 (pve1, started)
root@pve:~# pvecm status
Cluster information
-------------------
Name:             Home
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 22 08:55:30 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.28f
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20 (local)
0x00000002          1    A,V,NMW 172.16.1.21
0x00000000          1            Qdevice
root@pve:~#

And there is the reboot:
Code:
root@pve:~# journalctl -f -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux

Aug 22 08:48:14 pve pve-ha-lrm[1389]: loop take too long (67 seconds)
Aug 22 08:53:32 pve pve-ha-crm[1208]: successfully acquired lock 'ha_manager_lock'
Aug 22 08:53:32 pve pve-ha-crm[1208]: watchdog active
Aug 22 08:53:32 pve pve-ha-crm[1208]: status change wait_for_quorum => master
Aug 22 08:53:32 pve pve-ha-crm[1208]: adding new service 'vm:102' on node 'pve1'
Aug 22 08:53:32 pve pve-ha-crm[1208]: service 'vm:102': state changed from 'request_start' to 'started'  (node = pve1)

Aug 22 08:59:13 pve watchdog-mux[746]: client watchdog is about to expire
Aug 22 08:59:23 pve watchdog-mux[746]: client watchdog expired - disable watchdog updates
Aug 22 08:59:24 pve watchdog-mux[746]: exit watchdog-mux with active connections
Read from remote host pve.high.lan: Connection reset by peer
Connection to pve.high.lan closed.
client_loop: send disconnect: Broken pipe
Code:
root@pve:~# journalctl -f -u corosync -u pve-cluster -u pmxcfs
Aug 22 08:52:50 pve pmxcfs[972]: [dcdb] notice: sent all (2) updates
Aug 22 08:52:50 pve pmxcfs[972]: [dcdb] notice: all data is up to date
Aug 22 08:52:50 pve pmxcfs[972]: [status] notice: received all states
Aug 22 08:52:50 pve pmxcfs[972]: [status] notice: all data is up to date
Aug 22 08:52:50 pve pmxcfs[972]: [status] notice: dfsm_deliver_queue: queue length 1
Aug 22 08:58:40 pve corosync[1143]:   [KNET  ] link: host: 2 link: 0 is down
Aug 22 08:58:40 pve corosync[1143]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 22 08:58:40 pve corosync[1143]:   [KNET  ] host: host: 2 has no active links
Aug 22 08:58:49 pve corosync[1143]:   [TOTEM ] Token has not been received in 22500 ms
Aug 22 08:58:56 pve corosync[1143]:   [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Aug 22 08:59:32 pve corosync[1143]:   [QUORUM] Sync members[1]: 1
Aug 22 08:59:32 pve corosync[1143]:   [QUORUM] Sync left[1]: 2
Aug 22 08:59:32 pve corosync[1143]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Aug 22 08:59:32 pve corosync[1143]:   [TOTEM ] A new membership (1.293) was formed. Members left: 2
Aug 22 08:59:32 pve corosync[1143]:   [TOTEM ] Failed to receive the leave message. failed: 2
Aug 22 08:59:32 pve pmxcfs[972]: [dcdb] notice: members: 1/972
Aug 22 08:59:32 pve pmxcfs[972]: [status] notice: members: 1/972
Read from remote host pve.high.lan: Connection reset by peer

root@pve:~# journalctl -f
Aug 22 08:58:56 pve corosync[1143]:   [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Aug 22 08:59:13 pve watchdog-mux[746]: client watchdog is about to expire
Aug 22 08:59:23 pve watchdog-mux[746]: client watchdog expired - disable watchdog updates
Aug 22 08:59:24 pve watchdog-mux[746]: exit watchdog-mux with active connections
Aug 22 08:59:24 pve kernel: watchdog: watchdog0: watchdog did not stop!
Aug 22 08:59:28 pve postfix/qmgr[1132]: 92FD75A0495: from=<root@pve.high.lan>, size=2685, nrcpt=1 (queue active)
Aug 22 08:59:28 pve postfix/local[28264]: error: open database /etc/aliases.db: No such file or directory
Aug 22 08:59:28 pve postfix/local[28264]: warning: hash:/etc/aliases is unavailable. open database /etc/aliases.db: No such file or directory
Aug 22 08:59:28 pve postfix/local[28264]: warning: hash:/etc/aliases: lookup of 'root' failed
Aug 22 08:59:28 pve postfix/local[28264]: 92FD75A0495: to=<root@pve.high.lan>, orig_to=<root>, relay=local, delay=21558, delays=21558/0.01/0/0.01, dsn=4.3.0, status=deferred (alias database unavailable)
Aug 22 08:59:32 pve corosync[1143]:   [QUORUM] Sync members[1]: 1
Aug 22 08:59:32 pve corosync[1143]:   [QUORUM] Sync left[1]: 2
Aug 22 08:59:32 pve corosync[1143]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Aug 22 08:59:32 pve corosync[1143]:   [TOTEM ] A new membership (1.293) was formed. Members left: 2
Aug 22 08:59:32 pve corosync[1143]:   [TOTEM ] Failed to receive the leave message. failed: 2
Aug 22 08:59:32 pve pmxcfs[972]: [dcdb] notice: members: 1/972
Aug 22 08:59:32 pve pmxcfs[972]: [status] notice: members: 1/972
Read from remote host pve.high.lan: Connection reset by peer

separate grep on watchdog
Aug 22 08:59:13 pve watchdog-mux[746]: client watchdog is about to expire
Aug 22 08:59:23 pve watchdog-mux[746]: client watchdog expired - disable watchdog updates
Aug 22 08:59:24 pve watchdog-mux[746]: exit watchdog-mux with active connections
Aug 22 08:59:24 pve kernel: watchdog: watchdog0: watchdog did not stop!

Here is the PiServer quorum device losing the other node too, due to the watchdog reset
Code:
pi@piserver:~$ journalctl -f -u corosync-qnetd
Aug 21 18:22:22 piserver systemd[1]: Starting corosync-qnetd.service - Corosync Qdevice Network daemon...
Aug 21 18:22:22 piserver systemd[1]: Started corosync-qnetd.service - Corosync Qdevice Network daemon.
Aug 21 18:30:50 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:55348 doesn't sent any message during 12000ms. Disconnecting
Aug 21 18:31:53 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:34180 doesn't sent any message during 12000ms. Disconnecting
Aug 21 18:36:28 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:37328 doesn't sent any message during 12000ms. Disconnecting
Aug 21 18:59:44 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:45948 doesn't sent any message during 12000ms. Disconnecting
Aug 21 19:00:53 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:44904 doesn't sent any message during 12000ms. Disconnecting
Aug 21 19:08:08 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:56990 doesn't sent any message during 12000ms. Disconnecting
Aug 21 19:33:24 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:36562 doesn't sent any message during 12000ms. Disconnecting
Aug 21 20:15:31 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:43138 doesn't sent any message during 12000ms. Disconnecting
Aug 22 08:25:46 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:51076 doesn't sent any message during 12000ms. Disconnecting
Aug 22 08:47:17 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:33306 doesn't sent any message during 12000ms. Disconnecting

Aug 22 08:58:34 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:49094 doesn't sent any message during 12000ms. Disconnecting
Aug 22 08:59:44 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:39420 doesn't sent any message during 12000ms. Disconnecting
 
Can you disable the HA resource, wait ~10 minutes until all LRMs are idle, and then do the following please? With no active LRM, the nodes won't fence.

1. get pvecm status while all nodes are up and working
2. disconnect one of the nodes
3. get pvecm status from the node that can still talk to the qdevice, and also from the disconnected node

Because what should happen is that the node on which the network to the qdevice works should report 2/3 votes present. If that is not the case, I am very curious how the network is set up.
 
Can you disable the HA resource, wait ~10 minutes until all LRMs are idle, and then do the following please? With no active LRM, the nodes won't fence.

1. get pvecm status while all nodes are up and working
2. disconnect one of the nodes
3. get pvecm status from the node that can still talk to the qdevice, and also from the disconnected node

Because what should happen is that the node on which the network to the qdevice works should report 2/3 votes present. If that is not the case, I am very curious how the network is set up.
Good morning Aaron, thanks for joining. Indeed, my first test post from today fits your question, right? (Not my last post.) It does show the content you asked for: a test without HA.

As my disconnected host got isolated and my KVM is being repaired, I'm unable to catch the info from there. But let me show you the output of the remaining node without HA, and with the other node disconnected.

The remaining node is still able to talk to the quorum device without a glitch. Network and quorum device are communicating well, so why does enabling HA result, from the end user's perspective, in a full cluster reboot? Is that expected?

Code:
Every 2.0s: ha-manager status && echo "--- Cluster Status ---" && pvecm status                                                                                                        pve: Fri Aug 22 08:51:12 2025

quorum OK
master pve1 (idle, Thu Aug 21 19:27:09 2025)
lrm pve (idle, Fri Aug 22 08:51:09 2025)
lrm pve1 (old timestamp - dead?, Fri Aug 22 08:47:02 2025)
--- Cluster Status ---
Cluster information
-------------------
Name:             Home
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 22 08:51:13 2025
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.28b
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20 (local)
0x00000000          1            Qdevice
 
My conclusion: CRM is rebooting the wrong node and causing a full cluster outage via watchdog fencing on the remaining node.
There is a misunderstanding in how fencing works.
It is handled by the LRM on each node. If it is in "active" mode and the host lost the connection to the quorum for more than 60 seconds, it will not renew the watchdog. Once the watchdog runs out, the host will reboot/fence. So you would see a fence within 60 to 70 seconds, depending on the current status of the watchdog. The watchdog timeout is 10 seconds.

But, and I hope I got this right, the setup looks like this?
Code:
       ┌───────┐       
       │QDevice│       
       └───┬───┘       
           │           
        ┌──┴───┐       
   ┌────┤Switch├────┐   
   │    └──────┘    │   
┌──┴──┐          ┌──┴──┐
│Node1│          │Node2│
└─────┘          └─────┘
Node1 → pve
Node2 → pve1

You have HA guests on pve/Node1 and disconnect pve1/Node2 and pve/Node1 fenced itself?
If the QDevice was reachable, it should not have fenced. If you do another test with HA, please check the quorum status of it within that first 60 second grace period.
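For example, something like this running on the remaining node during the test should show whether it keeps 2 out of 3 votes via the QDevice (just a sketch):

Bash:
watch -n 2 "pvecm status | grep -E 'Quorate|Total votes|Qdevice'"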

I have been running such a setup for years now and it works as expected, without any unexpected fencing.
 
@aaron that's also my understanding: pve/Node1 (with 95% of the VMs) should not have fenced; I expect the HA VM to be booted on the remaining node. Small addition: the HA guests are on pve1/Node2, I disconnect pve1/Node2, and pve/Node1 fenced itself.

I'm more than happy to do the test again; are you able to provide the commands/logs you want to analyse?