Watchdog Timeout hardcoded to 10s

Marius Matei · Oct 26, 2016

Hello,

We have a cluster of six PowerEdge R730 nodes running Ceph, with 8 OSDs on each node, plus KVM VMs.
I started upgrading to v4.3 yesterday, and whenever any node reboots, all nodes in the cluster go down.
The iDRAC notifies me that the watchdog timer expired.

I've tried increasing the timeout in /etc/modprobe.d/ipmi_watchdog.conf, but no change there seems to take effect. The only parameter that seems to be honored is "action=". The timeout is set correctly when I run

Code:

rmmod ipmi_watchdog
modprobe ipmi_watchdog timeout=300

but is immediately reset when I restart watchdog-mux.
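
One way to confirm this is to read the timeout back from the module before and after restarting watchdog-mux. The sketch below assumes the ipmi_watchdog module exposes its timeout parameter under /sys/module/ipmi_watchdog/parameters/timeout; the helper name wd_timeout is hypothetical and not part of any package.

```shell
# Hypothetical helper: print the timeout (in seconds) that the
# ipmi_watchdog module currently reports.
# Assumption: the parameter is readable at the sysfs path below.
# Pass a different file as $1 to override the path.
wd_timeout() {
    param_file="${1:-/sys/module/ipmi_watchdog/parameters/timeout}"
    if [ -r "$param_file" ]; then
        cat "$param_file"
    else
        # Module not loaded, or the parameter is not exposed here.
        echo "unavailable"
    fi
}
```

Running wd_timeout, then systemctl restart watchdog-mux, then wd_timeout again should show whether the value drops back to 10.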
I downloaded the watchdog-mux source code from git, replaced the watchdog_timeout default value of 10 with 300, and that did it. I'm not a programmer, so I can't tell whether the 10s timeout is intentionally hardcoded or whether I'm doing something wrong.

I need to increase the timeout so I can figure out what is causing the timer to expire in the first place, and I would also like to leave it at a higher value.

Regards,
Marius

dietmar · Oct 27, 2016

Marius Matei said:
I downloaded the watchdog-mux source code from git, replaced the watchdog_timeout default value of 10 with 300, and that did it. I'm not a programmer, so I can't tell whether the 10s timeout is intentionally hardcoded or whether I'm doing something wrong.

You should not change that timeout. Instead, try to find out why watchdog-mux fails to update the watchdog within that time. Maybe you just lost quorum (multicast problem)?

spirit · Oct 27, 2016

Do you have OpenManage installed?
If yes, did you follow the wiki:
https://pve.proxmox.com/wiki/High_A...x#Dell_IDrac_.28module_.22ipmi_watchdog.22.29

(OpenManage uses the watchdog socket by default.)

are · Mar 6, 2019

@marius: did you find out why watchdog-mux failed to update the watchdog within 10s?

I'm seeing ipmi_watchdog not getting pinged by watchdog-mux on a single-host setup, and so an ugly reset happens.

watchdog-mux doesn't use mlock() (the regular watchdog daemon does, see packages.debian.org/pl/sid/watchdog), so it can be swapped out and fail to ping in time. That's one possible reason.
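
To check this on a running system, the VmLck and VmSwap fields in /proc/<pid>/status show how much of a process is locked in RAM and how much is currently swapped out. The helper below is a minimal sketch; the name status_kb is hypothetical. A non-zero VmSwap only shows that pages are swapped out right now; it does not prove that swapping caused a missed ping.

```shell
# Hypothetical helper: print the kB value of one field from a
# /proc/<pid>/status style file.
#   $1 = field name without the colon, e.g. VmLck or VmSwap
#   $2 = path to the status file
status_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "$2"
}
```

For example, status_kb VmLck "/proc/$(pidof watchdog-mux)/status" printing 0 would mean nothing is locked, and status_kb VmSwap on the same file shows how much is swapped out.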

PS:
systemctl stop watchdog-mux.service ; sleep 1; ipmitool mc watchdog off
should allow stopping watchdog-mux when using ipmi_watchdog.
