Hello,
We have a cluster of six PowerEdge R730 nodes running Ceph, with 8 OSDs on each node, plus KVM VMs.
I started upgrading to v4.3 yesterday, and now whenever any single node reboots, all nodes in the cluster go down.
The iDRAC notifies me that the watchdog timer expired.
I've tried increasing the timeout in /etc/modprobe.d/ipmi_watchdog.conf, but no change there seems to take effect. The only parameter that seems to be interpreted is "action=". The timeout is set correctly when I run
Code:
rmmod ipmi_watchdog
modprobe ipmi_watchdog timeout=300
but is immediately reset when I restart watchdog-mux.
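For reference, a persistent entry in /etc/modprobe.d/ipmi_watchdog.conf would look roughly like the sketch below. It assumes the standard modprobe.d "options" syntax, and the value simply mirrors the one passed to modprobe above; it is an illustration, not a copy of the actual file.

```
# /etc/modprobe.d/ipmi_watchdog.conf
# Illustrative only: same timeout as in the manual modprobe call above.
options ipmi_watchdog timeout=300
```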
I downloaded the watchdog-mux source code from git and changed the default value of watchdog_timeout from 10 to 300, and that worked. I'm not a programmer, so I can't tell whether the 10 s timeout is intentionally hardcoded or whether I'm doing something wrong.
I need to increase the timeout so I can figure out what is causing the timer to expire in the first place, and I would also like to leave it at a higher value.
Regards,
Marius