Watchdog Timeout hardcoded to 10s

Marius Matei

Jun 23, 2014
Hello,

We have a cluster of six PowerEdge R730 nodes running Ceph, with 8 OSDs on each node, and KVM VMs.
I started upgrading to v4.3 yesterday, and whenever any node reboots, all nodes in the cluster go down.
The iDRAC notifies me that the watchdog timer expired.

I've tried increasing the timeout in /etc/modprobe.d/ipmi_watchdog.conf, but no change there seems to take effect. The only parameter that seems to be honored is "action=". The timeout is set correctly when I run

Code:
rmmod ipmi_watchdog
modprobe ipmi_watchdog timeout=300

but it is immediately reset when I restart watchdog-mux.
I downloaded the watchdog-mux source code from git and changed the default value of watchdog_timeout from 10 to 300, and that did it. I'm not a programmer, so I can't tell whether the 10s timeout is intended and hardcoded or whether I'm doing something wrong.
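
One way to confirm which timeout is actually in effect is to read it back before and after restarting watchdog-mux. The following is only a sketch: the function name wdt_module_timeout is made up for illustration, and it assumes the ipmi_watchdog module exposes its timeout parameter under /sys/module/ipmi_watchdog/parameters/timeout and that this value follows whatever the watchdog daemon last requested.

```shell
# Print the timeout (in seconds) that the ipmi_watchdog module currently reports.
# The optional argument overrides the parameter file path; the default path is
# an assumption about where the module exposes its "timeout" parameter.
wdt_module_timeout() {
    param_file="${1:-/sys/module/ipmi_watchdog/parameters/timeout}"
    if [ -r "$param_file" ]; then
        cat "$param_file"
    else
        # Module not loaded, or parameter not exposed on this kernel.
        echo "unavailable"
    fi
}
```

Calling wdt_module_timeout right after the modprobe above and again after "systemctl restart watchdog-mux" should show whether the value is being overwritten. The BMC's own view of the timer can be compared with "ipmitool mc watchdog get" (needs root and a working IPMI interface).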

I need to increase the timeout so I can figure out what is causing the timer to expire in the first place, and I would also like to leave it at a higher value.

Regards,
Marius
 
I've downloaded watchdog-mux src code from git and replaced watchdog_timeout default value of 10 with 300 and that did it. I'm not a programmer, so I can't figure out if the 10s timeout is intended and hardcoded or if I'm doing something wrong.

You should not change that timeout. Instead, try to find out why watchdog-mux fails to update the watchdog within that time. Maybe you just lost quorum (multicast problem)?
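
To check the quorum theory, the "Quorate:" line of "pvecm status" can be tested from a script. This is a rough sketch rather than an official tool: the helper name is_quorate is hypothetical, and it assumes the status output contains a line of the form "Quorate: Yes" or "Quorate: No".

```shell
# Read `pvecm status` output on stdin and report whether the node is quorate.
# Exit status: 0 = quorate, 1 = not quorate or no "Quorate:" line found.
# Assumes a line like "Quorate:          Yes" in the input.
is_quorate() {
    awk -F: '
        /^Quorate:/ { gsub(/[[:space:]]/, "", $2); if ($2 == "Yes") ok = 1 }
        END { exit(ok ? 0 : 1) }
    '
}
```

For example, running "pvecm status | is_quorate || echo 'quorum lost'" on each node while another node reboots would show whether quorum drops at the moment the watchdog fires.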
 
@marius: did you find out why watchdog-mux failed to update the watchdog within 10s?

I'm seeing ipmi_watchdog not getting pinged by watchdog-mux on a single-machine setup, which leads to an ugly reset.

watchdog-mux doesn't use mlock() (the regular watchdog daemon does, see packages.debian.org/pl/sid/watchdog), so it can be swapped out and fail to ping the watchdog in time. That's one possible reason.
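
Whether a running process has locked memory, and whether any of it is currently swapped out, can be read from /proc. A small sketch, with the caveat that the helper name proc_mem_lock_info is made up here and that it relies on the VmLck and VmSwap fields of /proc/<pid>/status being present on the running kernel:

```shell
# Print the locked-memory and swapped-out sizes for a PID from /proc.
# VmLck above 0 kB means the process has locked pages (e.g. via mlock/mlockall);
# VmSwap above 0 kB means part of the process is currently swapped out.
proc_mem_lock_info() {
    pid="$1"
    grep -E '^(VmLck|VmSwap):' "/proc/$pid/status"
}
```

Usage would be something like: proc_mem_lock_info "$(pidof watchdog-mux)". A non-zero VmSwap on watchdog-mux would be consistent with the swapping theory, though it would not prove that swapping caused a missed ping.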

PS:
systemctl stop watchdog-mux.service ; sleep 1; ipmitool mc watchdog off
should allow stopping watchdog-mux when ipmi_watchdog is in use.
 
