Proxmox Cluster: Disable watchdog restart

SkySpy · Jan 15, 2023

I am currently running a cluster of two nodes plus a QDevice as the third vote.
My nodes are LUKS-encrypted and need to be unlocked via SSH.

If I remove the network cable from one of my nodes, it does a hard reboot without shutting down cleanly (Problem 1).
Then the dropbear within the initramfs tries to get an IP address via DHCP, but it times out after around 2 minutes (Problem 2).
Even if the network connection were there, I would need to re-enter my password to boot the device (Problem 3).

Is there any way to disable the reboot of the machine if the connection to the other nodes is lost?

The problem gets even worse if I restart my network (including router/switches): both nodes would then reboot and get stuck.

sterzy · Jan 16, 2023

SkySpy said:
Is there any way to disable the reboot of the machine if the connection to the other nodes is lost?

Are you using HA? The problem you are describing sounds a lot like fencing [1]. Fencing is necessary to ensure that the same VM doesn't run on two nodes at once. If you allowed that, the guest would quickly end up in an inconsistent state and would probably start corrupting its own data. So no, if you use HA, you would not want to turn off fencing.

Please note that a node cannot distinguish between a temporary network outage and the failure of the other nodes. So nodes will always fence themselves if they lose quorum, no matter the reason for the failure.
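To make the quorum rule concrete, here is a minimal sketch of a strict-majority check for the setup described above (two nodes plus a QDevice, so three votes in total). The function name `has_quorum` and the vote counts are illustrative assumptions, not the actual cluster code.

```python
def has_quorum(votes_seen: int, total_votes: int) -> bool:
    """Strict majority: more than half of all expected votes must be visible."""
    return votes_seen > total_votes // 2


# Assumption: two nodes plus one QDevice, one vote each.
TOTAL_VOTES = 3

# A node with its network cable pulled only sees its own vote.
isolated_node_quorate = has_quorum(1, TOTAL_VOTES)

# The other node still reaches the QDevice, so it sees two votes.
remaining_side_quorate = has_quorum(2, TOTAL_VOTES)

print("isolated node quorate:", isolated_node_quorate)
print("remaining side quorate:", remaining_side_quorate)
```

The isolated node ends up on the minority side whether the cable was pulled or the other machines really died, which is why it cannot tell the two cases apart.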

If you need to do maintenance, you could temporarily disarm the HA daemons, as outlined here [2].

[1]: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
[2]: https://forum.proxmox.com/threads/c...soon-as-i-join-a-new-node.116804/#post-510633

SkySpy · Jan 16, 2023

Yes I am using HA.
I would like to keep fencing, just without forcing my machines into a hard reboot.

sterzy · Jan 17, 2023

SkySpy said:
I would like to keep fencing, just without forcing my machines into a hard reboot.

Well, then what would you like it to do? This is just how fencing is implemented in Proxmox VE. From the manual chapter I've linked:

During normal operation, ha-manager regularly resets the watchdog timer to prevent it from elapsing. If, due to a hardware fault or program error, the computer fails to reset the watchdog, the timer will elapse and trigger a reset of the whole server (reboot).

The reasoning behind this is also given a couple of paragraphs before that:

There are different methods to fence a node, for example, fence devices which cut off the power from the node or disable their communication completely. Those are often quite expensive and bring additional critical components into a system, because if they fail you cannot recover any service.

We thus wanted to integrate a simpler fencing method, which does not require additional external hardware. This can be done using watchdog timers.
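The watchdog behaviour quoted above can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Watchdog` class, the `simulate` helper, and the 60-second timeout are made up for the example, and the rule "only reset the timer while quorate" is a simplification of what the HA stack does.

```python
from typing import List, Optional


class Watchdog:
    """Toy watchdog timer: it expires unless it is reset before the deadline."""

    def __init__(self, timeout: float, now: float = 0.0) -> None:
        self.timeout = timeout
        self.deadline = now + timeout

    def reset(self, now: float) -> None:
        # Push the deadline one full timeout into the future.
        self.deadline = now + self.timeout

    def expired(self, now: float) -> bool:
        return now >= self.deadline


def simulate(quorum_by_second: List[bool], timeout: float = 60.0) -> Optional[int]:
    """Return the second at which the node would be hard reset, or None."""
    wd = Watchdog(timeout)
    for second, quorate in enumerate(quorum_by_second):
        if wd.expired(second):
            return second  # timer elapsed: the whole server is reset
        if quorate:
            wd.reset(second)  # healthy: keep the watchdog from elapsing
    return None


# Quorate for 10 seconds, then the network cable is pulled.
timeline = [True] * 10 + [False] * 100
print("reset at second:", simulate(timeline))
```

Note that nothing in the timer asks why the resets stopped. That is the whole point of the design: it needs no working network or cooperating process to fire.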

So I am sorry to say that I am unsure what to change and how you would then want to keep the guarantees provided by the current implementation in place.

[1]: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing

SkySpy · Jan 17, 2023

Hi Stefan,
I would prefer that the server just snapshot and shut down all the VMs/containers on its node. Alternatively, it could call kexec and reboot that way.

sterzy · Jan 18, 2023

That just isn't how watchdogs work in this case [1]. The problem is that the watchdog doesn't really know whether this one node has failed or whether it's just something like a network outage. So a watchdog will just reset the system. A regular system shutdown would be less reliable, as processes could block the system from going down for a while, which may then cause inconsistencies if the VM is spun up on another node in the meantime. The same applies if you need to wait for snapshots to complete (especially when these snapshots are supposed to be stored on networked storage, which is basically a Catch-22). Remember, other nodes can't know whether the node is still alive or not.

So the idea here is to keep the watchdog as simple as possible to make it as reliable as possible. You want it to act consistently and quickly. Thus, a hard reset is just the simplest way to get back to a functioning system.
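As a rough illustration of the timing argument, consider the sketch below. All numbers and names are hypothetical (`double_start_possible`, the 120-second recovery delay, the stop durations); the only point is that a stop action with no upper bound on its duration can overlap with recovery on another node, while a near-instant reset cannot.

```python
def double_start_possible(stop_seconds: float, recovery_delay: float) -> bool:
    """True if the old node may still run guests when another node starts them."""
    return stop_seconds >= recovery_delay


# Hypothetical delay before the rest of the cluster recovers the guests.
RECOVERY_DELAY = 120.0

# A hard reset takes effect almost immediately.
hard_reset_risky = double_start_possible(1.0, RECOVERY_DELAY)

# A graceful shutdown that hangs on a blocked process or a snapshot
# to unreachable network storage has no guaranteed upper bound.
hung_shutdown_risky = double_start_possible(300.0, RECOVERY_DELAY)

print("hard reset risky:", hard_reset_risky)
print("hung shutdown risky:", hung_shutdown_risky)
```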

[1]: https://en.wikibooks.org/wiki/The_Linux_Kernel/Softdog_Driver
