PVE 4.x: Hardware or software watchdog preferred?

wosp

What's the preferred watchdog for PVE 4.x, hardware or software based? I'm asking because over the last few days I've had an issue with a Dell iDRAC module in one of my nodes (https://forum.proxmox.com/threads/ipmi-couldnt-get-irq-info-and-runs-to-slow.27846/). The module seems to be faulty and will be replaced. Because of this I'm currently using the software watchdog on this node. This makes me think: why should I use a hardware watchdog instead of a software watchdog? If I had used the software watchdog from day one, I wouldn't have had this issue, although the module would still have been faulty and I probably wouldn't have noticed (yet).

So, what's the preferred watchdog, and what are the cons of the software watchdog that would make me use a hardware watchdog again in the future?
 
I initially thought HW watchdogs were more reliable, but many people have reported problems with them (especially with IPMI). The software watchdog works reliably, and we use it by default.
 
Well, the con of the software watchdog is that it runs in software ;)
That means that in case of a total system failure (think kernel panic or something like that) the watchdog is not executed and cannot reset the system. But as dietmar said, some HW watchdogs are not well implemented either.
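
To make the difference concrete, here is a toy simulation in Python. It is only a sketch of the idea described above, not how the kernel's software watchdog or any IPMI watchdog is actually implemented. The class name `Watchdog`, the flag `needs_kernel_alive`, and the 10-second timeout are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Toy countdown watchdog: resets the host unless it is fed in time."""
    timeout: float             # seconds allowed between feeds (hypothetical value)
    needs_kernel_alive: bool   # True models a timer that the host's own kernel must run
    last_fed: float = 0.0

    def feed(self, now: float) -> None:
        """Record that the host proved it is still healthy at time `now`."""
        self.last_fed = now

    def fires(self, now: float, kernel_alive: bool) -> bool:
        """Return True if the watchdog would reset the host at time `now`."""
        expired = now - self.last_fed > self.timeout
        if self.needs_kernel_alive and not kernel_alive:
            # A software timer cannot run if the kernel itself is dead.
            return False
        return expired


software = Watchdog(timeout=10.0, needs_kernel_alive=True)
hardware = Watchdog(timeout=10.0, needs_kernel_alive=False)

# Total failure: nothing feeds the timers after t=0 and the kernel is gone.
print(software.fires(now=30.0, kernel_alive=False))  # False: no reset happens
print(hardware.fires(now=30.0, kernel_alive=False))  # True: independent timer resets the host

# Milder failure: the feeding process hangs but the kernel still runs its timers.
print(software.fires(now=30.0, kernel_alive=True))   # True: software watchdog still works
```

In this model both variants behave the same as long as the kernel keeps running; they only differ in the total-failure case.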
 
I have used the software version since the devs added it and have had no problems at all with it. The largest cluster was a 5-node setup, though. I'm currently using version 4.2 with no issues.
 
Does this mean that when a total system failure occurs, the VMs running on the crashed node are not moved to another node (because this node can't be fenced)? Or does it only mean the node isn't rebooted automatically?
 
It does mean the node does not reboot, but from the perspective of the other nodes it is gone, so they can start the HA VMs.
(The goal of fencing is that the fenced node cannot access shared resources anymore, and when a node hangs completely, well, it cannot access anything.)
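
The view "from the other nodes" can be sketched roughly like this in Python. This is a simplified illustration under stated assumptions, not the actual Proxmox HA manager logic: the function `recoverable_vms`, the `fence_delay` parameter and its value, and the node and VM names are all hypothetical.

```python
def recoverable_vms(last_seen, ha_vms, now, fence_delay):
    """Return {vm: node} for HA VMs whose node has been silent longer than fence_delay.

    last_seen:   {node: timestamp of the last sign of life seen by the cluster}
    ha_vms:      {vm: node it was running on}
    fence_delay: seconds of silence after which a node is treated as gone (assumed value)
    """
    dead = {node for node, seen in last_seen.items() if now - seen > fence_delay}
    return {vm: node for vm, node in ha_vms.items() if node in dead}


# node2 hung at t=40 and has been silent since; node1 is healthy.
last_seen = {"node1": 100.0, "node2": 40.0}
ha_vms = {"vm100": "node1", "vm101": "node2"}

print(recoverable_vms(last_seen, ha_vms, now=105.0, fence_delay=60.0))
# {'vm101': 'node2'}
```

The decision here depends only on how long the node has been silent, not on whether it actually rebooted, which matches the point above: a completely hung node cannot touch shared resources, so the remaining nodes can take over its HA VMs.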
 
