Testing the watchdog

kwinz · Mar 29, 2021

I have Dell-based servers with iDRAC. In the UEFI BIOS I have enabled the setting "Integrated Devices" > "OS Watchdog Timer: Enabled".

I have successfully enabled the hardware watchdog using this guide: https://pve.proxmox.com/wiki/High_A...x#Dell_IDrac_.28module_.22ipmi_watchdog.22.29

Basically, I did the following:

1. Set
WATCHDOG_MODULE=ipmi_watchdog
in /etc/default/pve-ha-manager

2. Edited /etc/default/grub to contain
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
and then ran update-grub

3. Created /etc/modprobe.d/ipmi_watchdog.conf with

Code:

options ipmi_watchdog action=power_cycle panic_wdt_timeout=10
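
To double-check step 3 on several nodes, the options line can be parsed and compared against what is expected. This is only a minimal sketch: the function name parse_modprobe_options is made up for illustration, and it only understands plain "options <module> key=value ..." lines and full-line comments.

```python
def parse_modprobe_options(text, module):
    """Collect key=value options set for `module` in modprobe.d-style text."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        fields = line.split()
        if len(fields) < 3 or fields[0] != "options" or fields[1] != module:
            continue  # not an options line for the module we care about
        for field in fields[2:]:
            key, _, value = field.partition("=")
            options[key] = value
    return options


conf = "options ipmi_watchdog action=power_cycle panic_wdt_timeout=10\n"
print(parse_modprobe_options(conf, "ipmi_watchdog"))
```

On a real node the text would be read from /etc/modprobe.d/ipmi_watchdog.conf instead of the inline string.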

This seems to work, and I can now see that the watchdog is active:

Code:

root@pve3:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      9 sec
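
For checking this on more than one node, the output above can be turned into a dictionary and tested automatically. The following is a rough sketch, not an official tool: parse_watchdog_status and is_armed are names I made up, and the parser simply assumes the "Key: value" layout shown above.

```python
def parse_watchdog_status(text):
    """Turn the 'Key: value' lines of `ipmitool mc watchdog get` into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            status[key.strip()] = value.strip()
    return status


def is_armed(status, action="Power Cycle"):
    """True if the timer is running and will perform the expected action."""
    return (status.get("Watchdog Timer Is") == "Started/Running"
            and status.get("Watchdog Timer Actions", "").startswith(action))


SAMPLE = """\
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      9 sec
"""

print(is_armed(parse_watchdog_status(SAMPLE)))
```

On a live node the text could come from subprocess.run(["ipmitool", "mc", "watchdog", "get"], capture_output=True, text=True, check=True).stdout instead of the pasted sample.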

The old log lines like these, which hinted that the software watchdog was active, are now gone:

Code:

Mar 28 13:38:50 pve3 kernel: [    0.422317] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
Mar 28 13:38:50 pve3 watchdog-mux[849]: Watchdog driver 'Software Watchdog', version 0

and instead I now get:

Code:

Mar 28 21:22:03 pve3 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.78-2-pve root=/dev/mapper/pve-root ro quiet nmi_watchdog=0
Mar 28 21:22:03 pve3 systemd[1]: Started Proxmox VE watchdog multiplexer.
Mar 28 21:22:03 pve3 watchdog-mux[853]: Loading watchdog module 'ipmi_watchdog'
Mar 28 21:22:03 pve3 watchdog-mux[853]: Watchdog driver 'IPMI', version 1
Mar 28 21:22:03 pve3 kernel: [    0.202344] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.78-2-pve root=/dev/mapper/pve-root ro quiet nmi_watchdog=0
Mar 28 21:22:03 pve3 kernel: [    5.521300] IPMI Watchdog: driver initialized
Mar 28 21:22:06 pve3 corosync[1371]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Mar 28 21:22:06 pve3 corosync[1371]:   [WD    ] Watchdog not enabled by configuration
Mar 28 21:22:06 pve3 corosync[1371]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Mar 28 21:25:21 pve3 pve-ha-crm[1674]: watchdog active
Mar 28 21:25:55 pve3 pve-ha-lrm[1737]: watchdog active

No random reboots so far. I tested the HA failover and it also works great. Looks good!

Except that I am not a very trusting person when it comes to servers:
Is there any way to manually stop kicking the watchdog for 10 seconds to test if the server actually power cycles?

Thanks in advance!

kwinz · Mar 29, 2021

I found a way to test what happens when the watchdog stops being kicked: find the process that holds /dev/watchdog open and kill it.

Code:

root@pve4:~# lsof /dev/watchdog
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
watchdog- 861 root    3w   CHR 10,130      0t0  431 /dev/watchdog
root@pve4:~# kill -9 861
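
If lsof is not installed, the same lookup can be done by walking /proc. This is a minimal sketch, assuming a Linux /proc layout; find_holders is a hypothetical helper name, and the proc_root parameter only exists so the function can be tried against a fake directory tree. It needs root to see the file descriptors of other users' processes.

```python
import os


def find_holders(device="/dev/watchdog", proc_root="/proc"):
    """Return the sorted PIDs that have `device` open, via /proc/<pid>/fd."""
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited meanwhile, or access was denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed meanwhile
            if target == device:
                holders.append(int(entry))
                break
    return sorted(holders)


if __name__ == "__main__":
    print(find_holders())
```

On the node above this should report the PID of the watchdog-mux process, which is the one to kill for the test.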

The console on the VGA output showed a warning message, and the watchdog timer kept counting down unimpeded:

Code:

root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      9 sec
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      8 sec
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      7 sec
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      5 sec
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      4 sec
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x04)
Watchdog Timer Is:      Stopped
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      10 sec
Present Countdown:      0 sec

After 10 seconds the server power cycled as expected (cold reset).

The last log lines before the reset were:

Code:

Mar 29 08:50:24 pve4 kernel: [41023.418168] IPMI Watchdog: Unexpected close, not stopping watchdog!
Mar 29 08:50:24 pve4 systemd[1]: watchdog-mux.service: Main process exited, code=killed, status=9/KILL
Mar 29 08:50:24 pve4 systemd[1]: watchdog-mux.service: Failed with result 'signal'.
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [long run of NUL bytes truncated]
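
The "Unexpected close, not stopping watchdog!" line is the Linux watchdog "magic close" behaviour: the driver only stops the timer if the character 'V' was written to the device right before it was closed. A process killed with kill -9 never gets the chance to do that, so the timer keeps running. Here is a minimal sketch of what a clean shutdown looks like from user space. The function names are made up, and it should NOT be pointed at the real /dev/watchdog on a PVE node: watchdog-mux already holds the device, and if the module is loaded with nowayout=1 the 'V' is ignored anyway. The demo therefore takes the path as a parameter so it can be tried on an ordinary file.

```python
import os


def close_watchdog(fd, disarm):
    """Close a watchdog file descriptor, optionally disarming it first."""
    if disarm:
        os.write(fd, b"V")  # magic close: ask the driver to stop the timer
    os.close(fd)


def demo(path):
    """Kick the device at `path` once, then close it cleanly."""
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, b"\0")  # any write to the device counts as a keep-alive
    close_watchdog(fd, disarm=True)
```

Calling close_watchdog(fd, disarm=False) corresponds to what happened in my test: the device is closed without the magic character and the countdown continues.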

And the iDRAC event log contains a new entry:

Code:

Mon Mar 29 2021 08:50:34    The watchdog timer power cycled the system.

Proxmox started as intended. And since you can apparently never be paranoid enough ( https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252863 ), I also checked that the watchdog is counting down again after the power cycle, and it is:

Code:

root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      10 sec
Present Countdown:      9 sec
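
Note that "Timer Expiration Flags" changed from 0x00 before the test to 0x10 afterwards. As far as I understand the IPMI 2.0 "Get Watchdog Timer" response, each bit records which timer use has expired, and bit 4 (0x10) is SMS/OS, which matches the test. A small decoder as a sketch; the bit table is my reading of the spec, so treat it as an assumption and verify it before relying on it.

```python
# Bit positions as I read them in the IPMI 2.0 spec (assumption, please verify).
EXPIRATION_FLAGS = {
    0x02: "BIOS FRB2",
    0x04: "BIOS/POST",
    0x08: "OS Load",
    0x10: "SMS/OS",
    0x20: "OEM",
}


def decode_expiration_flags(value):
    """List the timer uses whose expiration bit is set in `value`."""
    return [name for bit, name in EXPIRATION_FLAGS.items() if value & bit]


print(decode_expiration_flags(0x10))
```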

In summary everything worked perfectly!

kwinz · Aug 26, 2021

I am revisiting this now, 5 months later.
At some point in the last few months, one of the servers apparently switched its watchdog action to "No action (0x00)" by itself:

Code:

root@pve3:~# ipmitool mc watchdog get
Watchdog Timer Use:     Reserved (0x40)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval:   1 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      15 sec
Present Countdown:      14 sec
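
The hex codes in this output can be decoded by hand. In my reading of the IPMI 2.0 spec, the "Timer Use" byte carries a running bit (0x40), a don't-log bit (0x80) and the timer use in the low three bits, and the "Actions" byte carries the timeout action in its low three bits. The sketch below is based on that reading (an assumption to verify, with made-up function names). It shows that 0x40 means "running, but with timer use 0", which ipmitool prints as Reserved. Together with the different countdown (15 sec) and pre-timeout (1 second), the timer seems to have been programmed with different values than the ones from my config, but this does not tell me by what.

```python
# Field layouts as I read them in the IPMI 2.0 spec (assumption, please verify).
TIMER_USE = {1: "BIOS FRB2", 2: "BIOS/POST", 3: "OS Load", 4: "SMS/OS", 5: "OEM"}
ACTIONS = {0: "No action", 1: "Hard Reset", 2: "Power Down", 3: "Power Cycle"}


def decode_timer_use(value):
    """Split the 'Timer Use' byte into its flags and the use field."""
    return {
        "running": bool(value & 0x40),
        "dont_log": bool(value & 0x80),
        "use": TIMER_USE.get(value & 0x07, "Reserved"),
    }


def decode_action(value):
    """Return the timeout action encoded in the low bits of the byte."""
    return ACTIONS.get(value & 0x07, "Reserved")


for code in (0x44, 0x04, 0x40):
    print(hex(code), decode_timer_use(code))
```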

I checked the config files and they were fine. All the other servers, which have identical configs, were fine as well.
A reboot solved the problem: the watchdog on pve3 is back to "Power Cycle (0x03)":

Code:

root@pve3:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      10 sec
Present Countdown:      9 sec

I could not find any related messages in the syslog files or the iDRAC log. Does anyone here have an idea what could cause this?

kwinz · Aug 26, 2021

Another problem I noticed: I get "The watchdog timer expired." messages in iDRAC whenever I reboot any of the servers, whether with the reboot button in the PVE GUI, sudo reboot, or sudo init 6.
So I don't think the servers are restarting cleanly. It looks like they get power cycled by the watchdog during the reboot because the watchdog is not properly disabled beforehand. I need a solution for this problem as well, so I made a new thread for it: https://forum.proxmox.com/threads/m...during-reboot-any-idea-how-to-fix-that.95878/
