Getting rid of watchdog emergency node reboot

pwizard · Nov 21, 2023

Hello,

after repeated negative experience with nodes rebooting for no proper reason (this was not the first time but the most impactful one by far) we're wondering how we can get rid of the "reboot nodes PVE believes to misbehave" behavior.

What do we need to do in order to accomplish this?

At the moment we've got an 8 node cluster and we've already removed all resources (10 VMs) from HA and then stopped and "systemctl mask"ed both LRM and CRM services.

Is this update-safe? Are the services re-enabled after apt-update or a major upgrade, say from 7.4 to 8.0? Are there still fencing/watchdog mechanisms at play independent of LRM/CRM?

Best regards,

Patrick

Philipp Hufnagl · Nov 22, 2023

Hello

The services should restart after an upgrade. If you upgrade from PVE 7 to PVE 8 there is a guide helping you to avoid issues.

However, I am not sure if the approach you are taking will solve your issue. The root cause of your issue could easily be a poor network connection or faulty hardware. Have you

- sent pings between the nodes to ensure you have a stable network with < 10 ms latency?
- looked into the complete journal (for example with journalctl -b) to see if something is reported there?
- looked in the individual task logs of failed tasks (with pvenode task log UPID:....)?

EDIT: To get to know your system better, could you give me the output of dmidecode -t processor bios system.

aaron · Nov 22, 2023

Check before the update if all LRMs are in "idle" mode. Then they should not fence themselves in case they lose the connection to the quorate part of the cluster.
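
A minimal sketch of that check, not an official tool. It assumes `ha-manager status` prints one line per LRM in the form `lrm <node> (<state>, <timestamp>)`; the function name `non_idle_lrms` is made up for illustration.

```shell
#!/bin/sh
# List every LRM that is NOT idle, reading `ha-manager status` text on stdin.
# Assumed line format: "lrm <node> (<state>, <timestamp>)".
non_idle_lrms() {
    awk '$1 == "lrm" {
        state = $3
        gsub(/[(,]/, "", state)   # "(active," -> "active"
        if (state != "idle") print $2 ": " state
    }'
}

# Usage on a cluster node (prints nothing when every LRM is idle):
#   ha-manager status | non_idle_lrms
```

If the output is empty, no LRM reported a state other than idle at the time of the call.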

I am not sure if masking the services will help. And it sounds a bit like throwing a big hammer at the problem. There might be other unforeseen side effects in the future as well.

Did you add additional Corosync links? Ideally, one physical network will just be used by Corosync to avoid interference by other services that might take up all the bandwidth.

pwizard · Nov 29, 2023

@Philipp Hufnagl
- network is stable, enterprise NICs (Mellanox ConnectX) and switches (Juniper QFX) only, via 2x10G and 2x 25G links. Latency about 0.1 ms
- logs should be over at the other thread, I wanted to use this thread specifically to clear up with you guys how to get rid of HA, but of course we can continue the conversation either here or there. Only message that jumped out to me was "pve-ha-crm[]: loop take too long (63 seconds)"
- there are no failed tasks involved (well, we wanted to transfer the first "test" VM back from another node to the newly ifreloaded, evacuated node prox14, via GUI on storage node proxstore11 and that failed, but that's because at that point prox14 was already rebooting)

@aaron Do I understand you correctly - LRM/CRM is the only component that arms the watchdog and as long as they are disabled, there is no other component of Proxmox, not even corosync / pmxcfs itself, that could trigger a reboot? pmxcfs simply switches to read-only if losing quorum, correct?

In our case, 3 nodes (out of 5 compute nodes, or out of 8 nodes if counting the storage nodes that take part in the vote as well) were running HA resources. At most 1 should've rebooted, as not more than 1 should've become non-quorate. At most 3 should've rebooted if there had been a complete quorum partition with 5 in one basket and 3 in the other, which is basically impossible due to physical location. Instead, all 5 compute nodes rebooted. Basically the only scenario where the 5 nodes would be expected to reboot like they did would be if they all lost connection to at least 4 other cluster members, as they would then be below quorum, and even then only the nodes running HA VMs should reboot? Should I see an "emergency reboot initiated by watchdog" message in the logs or is it "BAM! - DEAD" immediately?
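
For reference, the quorum arithmetic behind that reasoning can be sketched as follows. This assumes one vote per node and the usual majority rule (floor(N/2) + 1); the function names are made up for illustration.

```shell
#!/bin/sh
# Votes needed for quorum with N total votes: floor(N/2) + 1.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Is a node quorate if it can still reach REACHABLE other one-vote nodes?
# The node's own vote is counted as well.
is_quorate() {
    total=$1 reachable=$2
    [ $(( reachable + 1 )) -ge "$(quorum_threshold "$total")" ]
}

quorum_threshold 8          # prints 5
is_quorate 8 4 && echo "4 peers reachable: quorate"
is_quorate 8 3 || echo "3 peers reachable: not quorate"
```

With 8 votes, a node that loses at least 4 peers can reach at most 3 others, which gives 4 votes including its own and is below the threshold of 5.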

I do not understand which error condition would even allow this to happen, but as long as there is not even a scenario where this could/should have happened due to administrator error I'm not in the mood to risk catastrophic breakdown of the entire (!) cluster again, and nor is my boss. I do understand both of you are not happy with "simply masking the perpetrators", that's why I'm wondering what you would suggest instead? I enabled a feature to increase availability and all it got me was a complete meltdown. Without HA the worst case situation (extreme situations aside) would be 1 host down, with only a fifth of all VMs affected, statistically speaking, until somebody restarted them manually. Doesn't seem worse than what we have with HA at the moment (a loaded gun that could go off at any moment)

Current network config for the compute nodes, bond0=2x10G LACP, bond1=2x25G LACP

Code:

# VM bridge
auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

# Corosync / management / VM migration
auto vlan75
iface vlan75 inet static
        address 192.168.75.10/24
        gateway 192.168.75.254
        vlan-raw-device bond1

# storage / Ceph
auto vlan76
iface vlan76 inet static
        address 192.168.76.10/24
        vlan-raw-device bond0

# storage / iSCSI to NAS
auto vlan302
iface vlan302 inet static
        address 172.19.76.10/16
        vlan-raw-device bond0

We had just finished logically (via ifreload -a) enabling the new bond1 on the last of the 5 compute nodes, prox14, when the crash happened. Doing the same on the other 4 nodes on the same afternoon did not lead to any negative impact at all, as would be expected. No VM was running on prox14 at all at that point in time.
Obviously the distribution of roles / functions to the VLANs / bonds is not ideal at the moment (we would've improved on it certainly following our original changes on that afternoon), but it shouldn't kill ALL THE COMPUTE. Especially considering at that point in time corosync traffic had 2x25G bandwidth nearly all for itself, there was no VM migration going on then, and at any rate there would never be enough traffic to saturate the links as up until that day all nodes only had the 2x10G links running ALL of the traffic in parallel, management, storage, and guest VMs, and yet there was never a hint of congestion.

sb-jw · Nov 29, 2023

Can you describe the entire cluster in a little more detail? I would like to get an overall impression of your setup.

aaron · Nov 29, 2023

pwizard said:
Do I understand you correctly - LRM/CRM is the only component that arms the watchdog and as long as they are disabled, there is no other component of Proxmox, not even corosync / pmxcfs itself, that could trigger a reboot? pmxcfs simply switches to read-only if losing quorum, correct?

Yes. If the LRM is in "active" mode, because there currently are, or recently have been, HA guests on it, it will fence itself if it loses the connection to the majority of the cluster for more than a minute.
If the LRM is in "idle" mode, the /etc/pve/ directory will go into read-only mode only.

sb-jw said:
Can you describe the entire cluster in a little more detail? I would like to get an overall impression of your setup.

This please, and can you please show the /etc/pve/corosync.conf contents and the output of pvecm status?

pwizard said:
Should I see an "emergency reboot initiated by watchdog" message in the logs or is it "BAM! - DEAD" immediately?

There is a log about the watchdog that ran out, but with the hard reset right after, it will not always get written down to disk. The best indicator is to check the logs of the fenced node and the other nodes in the cluster as well for anything corosync related. It will show in much detail if the connection to a node in the cluster changes and how it affects the memberships that are formed. With that information, it is possible to create a timeline of what happened regarding Corosync and the Proxmox VE cluster.
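
One way to build such a timeline, as a rough sketch: export the journal of every node with ISO timestamps and merge the relevant lines. The function name `cluster_timeline` and the log file names are placeholders, and the sketch assumes all nodes log in the same time zone so that the timestamps sort lexicographically.

```shell
#!/bin/sh
# Merge per-node journal exports into one chronological timeline.
# Assumes each file was written by `journalctl -o short-iso` on one node,
# so every line starts with an ISO 8601 timestamp.
cluster_timeline() {
    grep -h -E 'corosync|pve-ha-(crm|lrm)|watchdog-mux' "$@" | LC_ALL=C sort -s -k1,1
}

# Hypothetical usage (file names are placeholders):
#   journalctl -o short-iso --since "2023-10-31 16:00" --until "2023-10-31 17:00" > prox13.log
#   cluster_timeline prox10.log prox13.log proxstore11.log
```

The filter keeps only lines mentioning corosync, the two HA services, or watchdog-mux; widen the pattern if other units turn out to matter.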

pwizard · Nov 29, 2023

Sure:

8 nodes:

5x Lenovo x3650 M5 as compute nodes , 2x Xeon E5-2690 v4, 768 GB RAM, 1x Mellanox ConnectX-3 (2x 10G ports), 1x ConnectX-4 (2x 25G ports)
3x Supermicro SC846E16-RJBOD1 as Ceph storage nodes , 96 GB RAM, 1x Mellanox ConnectX-4 (2x 25G ports)

All in the same datacenter, interconnected via Juniper QFX EVPN-VXLAN fabric.

The storage nodes were set up with 1x NIC (2x 25G), while compute nodes had 1x NIC (2x 10G), each port connected as LACP LAG to one switch in a ToR pair, so from the point of view of the Proxmox nodes there was only one (logical) network connection with either 50G or 20G bandwidth, where the 3 VLAN IDs noted here were running tagged along with multiple customer VLAN tags for the guest VMs.

VLAN 75 = Management / cluster communication
VLAN 76 = storage (Ceph)
VLAN 302 = storage (iSCSI to NAS)

All 8 nodes are part of the same Proxmox (HA) cluster, so 8 votes.
3 storage nodes are running Ceph (and not running any VM or containers) in order to expose shared storage for the other 5 nodes to run their workload off.

A few weeks ago we found 25G NICs lying around in the datacenter and decided to upgrade the 5x compute nodes, and as we didn't need the 10G NICs as spares we kept them in and thought about how we'd be able to separate traffic between the two LAG bonds.

We added the 25G NICs by simply evacuating each node in turn, shutting it down, adding the card, booting it back up, moving the VMs back - all that happened with 0 issues.

At that point in time they'd booted with the old network config that only knew of 1 bond and had functionally remained unchanged for years (update to PVE 7 necessitated a change in the order of interfaces within /etc/network/interfaces, vmbr0 before any vlanXY interface, as only one VLAN interface would go online otherwise) and everything was just fine.
A few days later we'd again evacuate each compute node in turn, but didn't stop any PVE service (neither LRM nor CRM, didn't think it necessary, HA issues should only trigger on 1 host and only if there was a protected VM running which it didn't), changed /etc/network/interfaces as above, and "ifreload -a".
Waited for a few minutes and as nothing unexpected happened we moved the VMs back. Same procedure for all compute nodes, just when the colleague wanted to move the first VM back onto prox14, the last node to be modified, the entire compute part of the HA cluster was already down. During the evacuations and reboots of the nodes the HA master role had moved to proxstore11 which (along with proxstore12 and proxstore13) had not been involved with the recent changes at all, not rebooted, no service stopped, no VMs moved back or from (always running 0 VMs) and should've had a clear view and connection to all of the other 7 nodes (6 nodes while ifreload was doing its thing).

I note that I missed the dmidecode output, see attached.

Code:

Cluster information
-------------------
Name:             Galaxis
Config Version:   31
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Nov 29 14:56:57 2023
Quorum provider:  corosync_votequorum
Nodes:            8
Node ID:          0x00000009
Ring ID:          1.14015
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   8
Highest expected: 8
Total votes:      8
Quorum:           5 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.75.11
0x00000002          1 192.168.75.12
0x00000003          1 192.168.75.13
0x00000004          1 192.168.75.10
0x00000005          1 192.168.75.14
0x00000009          1 192.168.75.111 (local)
0x0000000a          1 192.168.75.112
0x0000000b          1 192.168.75.113

sb-jw · Nov 29, 2023

pwizard said:
All in the same datacenter, interconnected via Juniper QFX EVPN-VXLAN fabric.

Do I understand correctly that your servers are distributed in the data center and are not all connected to the same MLAG but are connected within the data center via evpn?

If that is the case, then your latency and bandwidth between the nodes depends largely on the rest of the network. So if even one link has increased load from other customers in the meantime, this will also affect you directly.

For me, that would be the main trigger for your problem.

pwizard · Nov 29, 2023

Yes, the 8 servers are distributed over at least 4, not sure if 6 or 8, different ToR switches. They're not all connected to the same switch (pair).

We're far (reeeallly far) away from saturating the 3x 100G uplinks between the Juniper ToR switches and their spines, even thinking about microbursts. Not even legitimate DDoS attacks > 100G passing through the same infrastructure have ever made a blip in the corosync logs about missing heartbeats or whatever their terminology is.

Speaking of which, I forgot to attach the corosync logs for the servers, as obtained with "journalctl -u corosync.service | grep 'Oct 31 16:3'"

pwizard · Dec 19, 2023

@aaron @Philipp Hufnagl

It seems as if we keep getting sidetracked with trying to find a root cause, but when I provide what was asked for the case might seem too complex, too random, too rare to be worth bothering with outside of a proper support ticket. I understand that.

But what about the original query - if we entered into the "HA functionality" years ago, how do we opt out of it? Or rather, would "watchdog = inactive" log messages on all nodes guarantee that no such "synchronized emergency reboot" can happen again, or what else do we need for that to be true? (in our case 2 nodes rebooted that did not carry any HA-protected resource - watchdog should've been inactive there already)

Best regards,

Patrick

Philipp Hufnagl · Jan 2, 2024

Hello

There is a chapter about High Availability in the documentation that I would recommend you read. HA (High Availability) is turned on by default in Proxmox, which should cause next to no overhead. You can disable it by running the following on all nodes:

Code:

systemctl disable pve-ha-lrm pve-ha-crm

However, while this should not cause any problems, it is more common to leave HA running (as it is on a default installation) and just remove the VMs from HA with the following command:

Code:

ha-manager remove vm:<VMID>
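
With several HA-managed VMs, the same command can be generated per VMID. This is only a sketch: the helper name `ha_remove_cmds` and the VMIDs are placeholders, and it prints the commands rather than running them.

```shell
#!/bin/sh
# Print one `ha-manager remove` command per VMID given as an argument.
ha_remove_cmds() {
    for vmid in "$@"; do
        echo "ha-manager remove vm:$vmid"
    done
}

ha_remove_cmds 101 102   # VMIDs are placeholders
```

Review the printed lines, then run them on a cluster node.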

pwizard · Jan 4, 2024

During the holidays I've upgraded all 8 nodes from PVE 7.4 to 8.1 (and Ceph Quincy to Reef) successfully - both HA services remained masked throughout the entire process, the pve7to8 script didn't care, only dist-upgrade logged (cosmetic) messages that the services could not be started because masked, and no node went down unexpectedly, nor did the entire PVE cluster at once.
So looks like we can use this as long-term solution and see whether this leads to any issues down the road (I don't expect it to)

@Philipp Hufnagl
in the original setup there were 10 VMs protected by ha-manager, distributed to 3 (out of 5) compute nodes in our 8 node cluster. However, all 5 compute nodes emergency-rebooted at once - as this did not really match my expectation ("only" the 3 nodes running HA loads should've restarted) I'm hesitant to trust ha-manager too much - 2 hosts whose watchdog should've been in status idle / disarmed (no HA resources) nevertheless were force-restarted.

Issue is "solved" for us now that the current setup survived a major version upgrade without any issues or side effects, but I think I should not mark the entire thread as SOLVED out of respect to the fact that our solution is not recommended / heavy-handed. But for our scenario it suffices.

Thanks,

Patrick

Magneto · Feb 18, 2024

This is rather concerning.

How does one set up high availability for VMs, to auto-restart when a host node fails, if HA breaks the whole cluster?

esi_y · Feb 18, 2024

pwizard said:
in the original setup there were 10 VMs protected by ha-manager, distributed to 3 (out of 5) compute nodes in our 8 node cluster. However, all 5 compute nodes emergency-rebooted at once - as this did not really match my expectation ("only" the 3 nodes running HA loads should've restarted) I'm hesitant to trust ha-manager too much - 2 hosts whose watchdog should've been in status idle / disarmed (no HA resources) nevertheless were force-restarted.

Hey @pwizard! I don't think you expect an answer from me, but as I speed-read your thread so far I think the (outstanding) gist for you is the above quoted part.

And then the earlier concern as:

pwizard said:
@aaron Do I understand you correctly - LRM/CRM is the only component that arms the watchdog and as long as they are disabled, there is no other component of Proxmox, not even corosync / pmxcfs itself, that could trigger a reboot? pmxcfs simply switches to read-only if losing quorum, correct?

I have recently filed, for instance, a bugreport [1]. As you can see, there are definitely some rough edges in the HA stack.

The other thing is, when it comes to understanding the watchdog behaviour, it's a bit more complicated. I tried to touch on that in another post [2] where @bjsko was experiencing similar woes, in fact the post from @t.lamprecht referenced within [3] explains it much better than the official docs, which are a simplification at best in my opinion.

First of all, there's a watchdog active at any given point on any standard install of PVE node, whether you ever used HA stack or not. This is because of the very design of the PVE solution, even if you do not have any hardware watchdog [4], where by default you get a software-emulated watchdog device called softdog [5].

Now whether you already know how watchdogs work in general or not, the PVE solution is a bit of a gymnastics with its implementation. The softdog module is loaded no matter what, you can verify so with lsmod | grep softdog. When you consider that a watchdog is essentially a ticking time bomb, which when it goes off causes a reboot, then the only way not to have the countdown reach zero is to reset it every once in a while. The way it works is by providing a device which, if open, then needs to be touched within defined intervals and unless that happens regularly or the device is properly closed, will absolutely cause the system to reboot. The module is loaded for a reason - to be used.

Now this is exactly what PVE does when it loads its watchdog-mux.service, which as its name implies is there to handle the feature in a staged (read: more elaborate than necessary) way. This service loads on every node, every single time, irrespective of your HA stack use. It absolutely does open the watchdog device no matter what [6] and it keeps it open on a running node. NB It sets its timer to 10 seconds, this then means that if something prevents the watchdog-mux from keeping the softdog happy, your node will reboot. The safer way to prevent this from happening is to get rid of - but do not stop it - the watchdog-mux service manually. Do not kill it, as it will fail to close the softdog device which will also cause a reboot. Same would happen if you stop it with active "clients" because ...

You see, the primary purpose of the watchdog-mux.service is to listen on a socket to what it calls clients. Notably, when the service has active clients, it will signify so (confusingly) by creating a /run/watchdog-mux.active/ directory. The clients are the pve-ha-crm.service and pve-ha-lrm.service. These are the two you were pointed to above for their documentation about the HA stack. The principle is supposed to replicate the general logic that such clients set a subordinate timer [7] with the watchdog-mux.service, which in turn monitors separately whether they were able to check in with it within the specified intervals; that's the higher threshold of 60 seconds for self-fencing. If such a service unexpectedly dies, it will cause the watchdog-mux.service to stop resetting the softdog device and that will cause a reboot.
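
To see which state a node is in, something along these lines should do. It is a sketch only: the function name `watchdog_mux_state` is made up, and the marker path is the one named above.

```shell
#!/bin/sh
# Report whether watchdog-mux currently has registered clients, based on
# the marker path mentioned above. Pass another path as $1 to try it out.
watchdog_mux_state() {
    marker=${1:-/run/watchdog-mux.active}
    if [ -e "$marker" ]; then
        echo "active: watchdog-mux has clients (CRM and/or LRM)"
    else
        echo "inactive: no clients registered with watchdog-mux"
    fi
}

# On a node, alongside:
#   lsmod | grep softdog
#   systemctl is-active watchdog-mux.service
watchdog_mux_state
```

Note that "inactive" here only means no clients are registered; as described above, watchdog-mux keeps the watchdog device open either way.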

This is also triggered when HA is active (CRM and/or LRM active on that node at that moment) and quorum is lost, even though the machine is not otherwise in a frozen state. This is because a node without quorum will fail to obtain its lock within the cluster, at which point it will stop feeding the watchdog-mux.service [8].

In turn, that is why HA services can only be "recovered" within HA stack after a period, the recovery should never start unless the expectation can be met that the node that went incommunicado for whatever reason (could be intermittent but persisting network issues) at least did its part by not having the duplicate services going on albeit having been cut-off.

The cascaded nature of the watchdog multiplexing, CRM (which is "migratory") and LRM (which is only "active" on a node with HA services running, including 10 minutes past the last such migrated away) and the time-sensitive dependency on node being in primary component of the cluster (in the quorum) as well as all services feeding the watchdog(s) running without any hiccups make it much more difficult to answer your question, what might have gone wrong, without more detailed logs.

Definitely beyond grep 'Oct 31 16:3' and corosync alone. As you can imagine from the above, it will be hell of a "structured" debugging if one takes on the endeavour and it's easier to blame upstream component (corosync) or network flicker (user).

But if your only question is how to really disable anything that fires off the kernel watchdog reboots, it is getting rid of the watchdog-mux.service. Before that you have to do the same with pve-ha-crm.service and pve-ha-lrm.service. You stop them in this (reverse) order. And then, you disable them. Upon upgrades, well, you get the idea ... it was not designed to be neatly turned off. It's always going to haunt you.
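
As a dry-run sketch of that sequence: the helper below only prints the commands, HA services first and watchdog-mux last. The name `ha_teardown_plan` is made up, the LRM-before-CRM order is my reading of "reverse order", and whether `disable` alone is enough for every unit is not something I have verified (the OP used `mask` for LRM and CRM).

```shell
#!/bin/sh
# Print (do not run) the stop/disable sequence described above:
# HA services first, watchdog-mux last; stop everything, then disable.
ha_teardown_plan() {
    units="pve-ha-lrm.service pve-ha-crm.service watchdog-mux.service"
    for u in $units; do echo "systemctl stop $u"; done
    for u in $units; do echo "systemctl disable $u"; done
}

ha_teardown_plan
```

Run the printed commands by hand as root, one at a time, and make sure watchdog-mux no longer has active clients before stopping it.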

Or you go full resistance...

Code:

tee /etc/modprobe.d/softdog-deny.conf << 'EOF'
blacklist softdog
install softdog /bin/false
EOF

... or they address it.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5243
[2] https://forum.proxmox.com/threads/unexpected-fencing.136345/#post-634179
[3] https://forum.proxmox.com/threads/i...p-the-only-ones-to-fences.122428/#post-532470
[4] https://www.kernel.org/doc/html/latest/watchdog/
[5] https://github.com/torvalds/linux/blob/master/drivers/watchdog/softdog.c
[6] https://github.com/proxmox/pve-ha-m...e0e8cdb2d0a37d47e0464/src/watchdog-mux.c#L157
[7] https://github.com/proxmox/pve-ha-m...e0e8cdb2d0a37d47e0464/src/watchdog-mux.c#L249
[8] https://github.com/proxmox/pve-ha-m...fe0e8cdb2d0a37d47e0464/src/PVE/HA/LRM.pm#L231

UdoB · Feb 18, 2024

tempacc346235 said:
Or you go full resistance...

Well... I will never disable the watchdog, I want to have a works-as-intended HA-stack. But your post is a good read!

esi_y · Feb 19, 2024

UdoB said:
Well... I will never disable the watchdog, I want to have a works-as-intended HA-stack.

If you use the HA-stack, you need it, but I found it vaguely documented in terms of when one is troubleshooting (as OP here) and needs to e.g. exclude watchdog itself having gone haywire as being the culprit. The issue is that it may also happen that a reboot due to the watchdog timer having expired did not manage to flush its log to disk. When things go wrong, they might go badly wrong - it's the whole point of watchdog to manage to reboot no matter what and the kernel softdog is particularly good at it indeed.

Then there are cases when one actually wants to find a frozen system, well, frozen (can always restart it over IPMI). It makes a difference if e.g. hardware issue froze the system or rebooted it (PSU, etc.). Finding that the last boot log ends in nothing in particular is frustrating.

And for me, it would be important to include a special section in the docs on the watchdog alone, as it is active on any install, at all times. Talking of "disarmed" etc. on a rather complex implementation is a simplification that brings more confusion/uncertainty. I will be filing related BZ issues on the docs and some more, just wanted to have it better relate to real-world (forum) instances where this would be beneficial (there are more such inquiries over time here).

UdoB said:
But your post is a good read!

Thanks Udo!

pwizard · Feb 19, 2024

@UdoB
It's the entire point of my post that HA did not work the way it should and took down my entire production cluster. Only 3 out of the 5 compute nodes held any HA resources during the outage (let's say 4 nodes if erring on the side of caution - maybe 10 minutes ago a resource was active on a 4th node, but it shouldn't have been), but all 5 of the compute nodes went down hard.
The CRM master, incidentally, which we found out to be "watchdog armed" at all times as well, did not reboot.

This does not conform to any official nor unofficial explanation of how Proxmox HA and Fencing is supposed to work, therefore I prefer to run my production workload on the safer side rather than risking another unexplained full production outage.

@tempacc346235
Great post certainly.
In our case, just like in at least one other forum thread you posted in, we've seen the "60 seconds" threshold causing the reboot (I believe logs said 61 or 62 seconds -> loop too long) for no discernible reason - if TOTEM protocol really only sends one token in a circle then every node would ultimately notice lack of token and reboot, so obviously the algorithm works differently (sending 2 tokens in opposite directions? More complex?), but that cannot explain why 5 servers sitting in different racks and connected to different switches would all fail to pass the token for a full minute (which is eternity for processor and network processing)

esi_y · Feb 19, 2024

The issue I see there is that you have "just" 8 nodes, nothing too bad. Then since it is production "bond0=2x10G LACP, bond1=2x25G LACP" ... would suggest you should not have corosync issues. When you see logs like:

Code:

Oct 31 16:33:26 prox13 corosync[1748]:   [TOTEM ] A new membership (1.14001) was formed. Members left: 5
Oct 31 16:33:26 prox13 corosync[1748]:   [TOTEM ] Failed to receive the leave message. failed: 5
Oct 31 16:33:26 prox13 corosync[1748]:   [QUORUM] Members[7]: 1 2 3 4 9 10 11

That's totally normal to see, even if something was wrong with node 5, so what, you have 7 left for the quorum. When I see the other nodes checking out, that's the haywire thing - but I think someone above confused cause with effect. Of course if you have 5 nodes reboot they will leave the chat. And of course you intermittently lose quorum. So say your watchdog rebooted them for no good reason, it was the fact they rebooted that made them go, 5 of them, for long enough period for the rest to fence, etc, you get the idea.

I suspect a bug in CRM/LRM.

There's one big giveaway from the above - you do not mention (but I have not seen logs) that you have nodes lost on corosync intermittently, certainly not for extended periods. If they are not having those troubles with HA (properly) disabled, then ... the inference I would make was that the corosync got hiccups from the watchdog rampage. But the standard answer here is "your cluster is too big" or maybe your "linksys in the cabinet was rebooting".

If you want to take care and sanitize your full logs from the period including say whole several days before ... it might be worth having a look at. Yes, from ALL the nodes. And not just corosync. But even if we conclude it must have been the watchdog -> corosync and not the other way around, the HA stack is hard to debug - this will be one thing no one would disagree with me on here, I am sure.

esi_y · Feb 19, 2024

I will just reference this here right now:

https://forum.proxmox.com/threads/proxmox-restarted-unexpectedly.141850/#post-635814

The point is - it's run the way it should not, lost quorum could be expected with such setup, but why do the watchdogs trigger? And on 2 nodes at a time?

EDIT: And another ...

https://forum.proxmox.com/threads/all-nodes-restart-when-a-nodes-have-connectivity-issue.141864/

esi_y · Feb 20, 2024

Linking through related: (Why does a standalone node have a watchdog active?)

https://forum.proxmox.com/threads/f...alone-hosts-or-workaround.141422/#post-636300
