Watchdog Timeout hardcoded to 10s

Marius Matei

Jun 23, 2014
Bucharest, Romania
Hello,

We have a cluster of six PowerEdge R730 nodes running Ceph (8 OSDs on each node) and KVM VMs.
I started upgrading to v4.3 yesterday, and whenever any node reboots, all nodes in the cluster go down.
The iDRAC notifies me that the watchdog timer expired.

I've tried increasing the timeout in /etc/modprobe.d/ipmi_watchdog.conf, but no change there seems to take effect. The only parameter that seems to be honored is "action=". The timeout is set correctly when I run

Code:
rmmod ipmi_watchdog
modprobe ipmi_watchdog timeout=300

but is immediately reset when I restart watchdog-mux.
I've downloaded watchdog-mux src code from git and replaced watchdog_timeout default value of 10 with 300 and that did it. I'm not a programmer, so I can't figure out if the 10s timeout is intended and hardcoded or if I'm doing something wrong.
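
One way to watch the value being reset is to read the module's live timeout before and after restarting watchdog-mux. This is only a minimal sketch: it assumes ipmi_watchdog exposes its current timeout at /sys/module/ipmi_watchdog/parameters/timeout, and the function name wdt_timeout is made up for illustration.

```shell
# Print the timeout (in seconds) that the ipmi_watchdog module currently reports.
# The default sysfs path is an assumption; pass another file as $1 to override it.
wdt_timeout() {
    param_file="${1:-/sys/module/ipmi_watchdog/parameters/timeout}"
    if [ -r "$param_file" ]; then
        cat "$param_file"
    else
        echo "cannot read $param_file (is ipmi_watchdog loaded?)" >&2
        return 1
    fi
}
```

Running wdt_timeout right after the modprobe above, and again after "systemctl restart watchdog-mux", should show whether the 300 is being replaced by the daemon's own value.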

I need to increase the timeout so I can figure out what is causing the timer to expire in the first place, and I would also like to leave it at a higher value.

Regards,
Marius
 
Quote:
I've downloaded watchdog-mux src code from git and replaced watchdog_timeout default value of 10 with 300 and that did it. I'm not a programmer, so I can't figure out if the 10s timeout is intended and hardcoded or if I'm doing something wrong.

You should not change that timeout. Instead, try to find out why watchdog-mux fails to update the watchdog within that time. Maybe you just lost quorum (multicast problem)?
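
A quick way to test the quorum theory is to check the cluster status on a surviving node while another one reboots. The helper below is a rough sketch, not an official tool: it assumes the output of "pvecm status" contains a line starting with "Quorate:" followed by "Yes" or "No", and the name is_quorate is hypothetical.

```shell
# Succeed if the text on stdin contains a line of the form "Quorate: Yes".
# Assumption: `pvecm status` prints such a line; check this on your version.
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}
```

Usage would be along the lines of "pvecm status | is_quorate && echo quorate || echo 'no quorum'", repeated every second or so during the reboot.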
 
@marius: Did you find out why watchdog-mux failed to update the watchdog within 10s?

I'm seeing ipmi_watchdog not getting pinged by watchdog-mux on a single-machine setup, which ends in an ugly reset.

watchdog-mux doesn't use mlock() (the regular watchdog daemon does, see packages.debian.org/pl/sid/watchdog), so it can be swapped out and fail to ping the watchdog in time. That's one possible reason.
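
To check whether a running process has any memory locked, the VmLck field in /proc/<pid>/status can be read. A small sketch (the function name locked_kb is made up here, and using pidof to find the daemon is an assumption about how it shows up in the process list):

```shell
# Print the VmLck value (locked memory, in kB) from a /proc/<pid>/status file.
# $1 is the path to the status file, e.g. /proc/1234/status.
locked_kb() {
    awk '/^VmLck:/ { print $2 }' "$1"
}
```

For example, locked_kb "/proc/$(pidof watchdog-mux)/status" printing 0 would mean the process has no pages locked and could in principle be swapped out.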

P.S. If you are using ipmi_watchdog, this should allow stopping watchdog-mux:

Code:
systemctl stop watchdog-mux.service; sleep 1; ipmitool mc watchdog off