HA fencing with softdog vs IPMI watchdog — which one is actually used when both are configured?

atlas32

Jun 15, 2026
Hi,

I'm trying to understand the watchdog selection logic in PVE HA. My 3-node
cluster has:

- IPMI/iLO present on all nodes (HPE ProLiant, ipmi_watchdog module loadable)
- softdog also available (default fallback)
- /etc/default/pve-ha-manager: WATCHDOG_MODULE=ipmi_watchdog

Question: if I configure ipmi_watchdog but the IPMI BMC becomes unresponsive
(stuck firmware, network partition affecting the dedicated IPMI port, etc.),
does pve-ha-manager fall back to softdog automatically, or does HA simply not
work until I fix IPMI?

I'd like to understand the failure modes before I rely on this in production.
Reading the source got me partway but I'd love confirmation from people who
have actually seen it fail.
 
Hi!

The default fallback (softdog) is only selected if no other watchdog is configured when the watchdog-mux daemon is started. There is no mechanism to fall back at runtime if a hardware watchdog fails. Such a failure is something you'd probably want to know about and fix rather than being silently moved to softdog, since you're relying on the hardware watchdog as an independent component.
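
Since the choice is made once at startup, it can help to confirm which watchdog module actually got loaded. A minimal sketch, assuming the standard Linux /proc/modules layout (name, size, reference count, ...); the function names are made up for illustration, and a loaded module is only a hint, not proof that watchdog-mux is using it:

```python
# Hypothetical helper: report which of the two watchdog modules discussed
# in this thread are currently loaded, based on /proc/modules.
WATCHDOG_MODULES = ("ipmi_watchdog", "softdog")


def loaded_watchdog_modules(proc_modules_text, candidates=WATCHDOG_MODULES):
    """Return {module_name: refcount or None} for loaded candidate modules."""
    loaded = {}
    for line in proc_modules_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] in candidates:
            # Third column is the reference count; it can be "-" on kernels
            # built without module unloading, so only convert plain digits.
            loaded[parts[0]] = int(parts[2]) if parts[2].isdigit() else None
    return loaded


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's module list (Linux only)."""
    with open(path) as fh:
        return fh.read()
```

On a node you would call loaded_watchdog_modules(read_proc_modules()). If softdog shows up although ipmi_watchdog was configured, that is worth investigating before relying on HA.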
 
I recommend sticking with the soft watchdog. It works well, and you avoid any issues caused by the questionable quality of out-of-band management (OOBM) hardware and software ;)
 
@atlas32, just to add one practical bit on top of what Michael and Aaron
already said, in case you do end up going the ipmi_watchdog route despite
Aaron's (very sensible) recommendation:

If you commit to it, don't just configure it and forget it. Monitor the
BMC-side state from the OS explicitly:

ipmitool mc watchdog get

and alert on two things:
- "Watchdog Timer Is:" showing "Stopped" while HA thinks the node is
active
- "Watchdog Timer Actions:" not matching what you configured (e.g. the
wiki example uses "Hard Reset")
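
A rough sketch of how those two checks could be scripted, assuming ipmitool prints one "Key: value" pair per line with the two field names quoted above. The function names and the default expected action are placeholders, and the script only sees the BMC side: it cannot tell whether HA considers the node active, so a "Stopped" timer is only an alert condition on nodes that run HA services.

```python
import subprocess


def parse_watchdog(text):
    """Turn 'Key: value' lines from ipmitool into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def check_watchdog(fields, expected_action="Hard Reset"):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    state = fields.get("Watchdog Timer Is")
    if state is None:
        problems.append("no 'Watchdog Timer Is' field in output")
    elif state.startswith("Stopped"):
        problems.append("watchdog timer is stopped")
    action = fields.get("Watchdog Timer Actions")
    if action is None:
        problems.append("no 'Watchdog Timer Actions' field in output")
    elif not action.startswith(expected_action):
        problems.append(
            "action is %r, expected %r" % (action, expected_action)
        )
    return problems


def read_watchdog():
    """Run the command from the post above and return its stdout."""
    result = subprocess.run(
        ["ipmitool", "mc", "watchdog", "get"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return result.stdout


def main():
    """Return 0 if healthy, 1 on a failed check, 2 if the BMC is unreachable."""
    try:
        output = read_watchdog()
    except (OSError, subprocess.SubprocessError) as exc:
        print("cannot query BMC watchdog: %s" % exc)
        return 2
    problems = check_watchdog(parse_watchdog(output))
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

Calling main() from cron or a monitoring agent and alerting on a non-zero result covers both bullets. An ipmitool failure or timeout is treated as its own alert, since an unresponsive BMC is exactly the case the original question was about.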

Reason for the second check: some BMC firmware updates silently reset the
watchdog configuration back to defaults. There's a thread on this forum
where an iDRAC9 v6 -> v7 upgrade broke ipmi_watchdog integration entirely,
so it's not a theoretical concern.

That way you're not blindly trusting a component whose health you can't
observe from the OS side, which was, I think, the core of Michael's point
about "you'd want to know about it".

Nothing to add to the architecture side, Michael already nailed it.