Testing the watchdog

kwinz

I have Dell-based servers with iDRAC. In the UEFI BIOS I have enabled the setting "Integrated Devices" - "OS Watchdog Timer: Enabled".

I have successfully enabled the hardware watchdog using this guide: https://pve.proxmox.com/wiki/High_A...x#Dell_IDrac_.28module_.22ipmi_watchdog.22.29

Basically, I did the following:

1. Specified
WATCHDOG_MODULE=ipmi_watchdog
in /etc/default/pve-ha-manager

2. Edited /etc/default/grub to contain
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
and then ran update-grub

3. Created /etc/modprobe.d/ipmi_watchdog.conf with
Code:
options ipmi_watchdog action=power_cycle panic_wdt_timeout=10
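
For anyone who wants to double-check the three settings before rebooting, here is a minimal sketch. The function name check_watchdog_config is made up for illustration, and the patterns only look for the exact values used in this post, so treat it as a rough sanity check rather than a complete validation.

```shell
# Hypothetical helper: check that the three settings from the steps above
# are present. Paths are arguments so the check also works on file copies.
check_watchdog_config() {
    ha_defaults=$1    # e.g. /etc/default/pve-ha-manager
    grub_defaults=$2  # e.g. /etc/default/grub
    modprobe_conf=$3  # e.g. /etc/modprobe.d/ipmi_watchdog.conf
    rc=0

    grep -q '^WATCHDOG_MODULE=ipmi_watchdog' "$ha_defaults" ||
        { echo "WATCHDOG_MODULE=ipmi_watchdog not set in $ha_defaults"; rc=1; }

    grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*nmi_watchdog=0' "$grub_defaults" ||
        { echo "nmi_watchdog=0 missing in $grub_defaults"; rc=1; }

    grep -q '^options ipmi_watchdog .*action=power_cycle' "$modprobe_conf" ||
        { echo "action=power_cycle missing in $modprobe_conf"; rc=1; }

    # 0 if all three settings were found, 1 otherwise
    return "$rc"
}
```

Called with the three paths from the steps above, it prints nothing and returns 0 when all settings are found. Note that it reads /etc/default/grub, not the running kernel's command line, so it cannot tell whether update-grub and a reboot have happened yet.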

This seems to work, and I can now see that the watchdog is active:
Code:
root@pve3:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      9 sec
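
If you would rather check this from a script than by eye, the output can be parsed. This is only a sketch: wd_field and wd_is_armed are hypothetical helper names, and the parsing assumes the "Name: value" layout that ipmitool prints above, which may differ between ipmitool versions.

```shell
# Print the value of one field from `ipmitool mc watchdog get` output (stdin).
wd_field() {
    awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}

# Succeed only if the timer is running and will power cycle on expiry.
# The expected strings are copied from the output shown in this post.
wd_is_armed() {
    status=$(cat)
    [ "$(printf '%s\n' "$status" | wd_field 'Watchdog Timer Is')" = 'Started/Running' ] &&
    [ "$(printf '%s\n' "$status" | wd_field 'Watchdog Timer Actions')" = 'Power Cycle (0x03)' ]
}
```

Usage would be something like: ipmitool mc watchdog get | wd_is_armed && echo armed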

The old log lines like these, which hinted that the software watchdog was active, are now gone:
Code:
Mar 28 13:38:50 pve3 kernel: [    0.422317] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
Mar 28 13:38:50 pve3 watchdog-mux[849]: Watchdog driver 'Software Watchdog', version 0

and instead I now get

Code:
Mar 28 21:22:03 pve3 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.78-2-pve root=/dev/mapper/pve-root ro quiet nmi_watchdog=0
Mar 28 21:22:03 pve3 systemd[1]: Started Proxmox VE watchdog multiplexer.
Mar 28 21:22:03 pve3 watchdog-mux[853]: Loading watchdog module 'ipmi_watchdog'
Mar 28 21:22:03 pve3 watchdog-mux[853]: Watchdog driver 'IPMI', version 1
Mar 28 21:22:03 pve3 kernel: [    0.202344] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.78-2-pve root=/dev/mapper/pve-root ro quiet nmi_watchdog=0
Mar 28 21:22:03 pve3 kernel: [    5.521300] IPMI Watchdog: driver initialized
Mar 28 21:22:06 pve3 corosync[1371]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Mar 28 21:22:06 pve3 corosync[1371]:   [WD    ] Watchdog not enabled by configuration
Mar 28 21:22:06 pve3 corosync[1371]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Mar 28 21:25:21 pve3 pve-ha-crm[1674]: watchdog active
Mar 28 21:25:55 pve3 pve-ha-lrm[1737]: watchdog active
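
To tell quickly which driver watchdog-mux picked up, the relevant log line can be filtered out. A small sketch, assuming the log format shown in this post; watchdog_mux_driver is a made-up helper name.

```shell
# Print the driver name(s) reported by watchdog-mux, one per line.
# Reads syslog/journal text on stdin.
watchdog_mux_driver() {
    sed -n "s/.*watchdog-mux\[[0-9]*]: Watchdog driver '\([^']*\)'.*/\1/p"
}
```

For example: journalctl -b -u watchdog-mux | watchdog_mux_driver
With the hardware watchdog in use this should print IPMI, and Software Watchdog if the module was not loaded.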

No random reboots so far. I tested the HA failover and it also works great. Looks good!

Except that I am not a very trusting person when it comes to servers:
Is there any way to manually stop kicking the watchdog for 10 seconds to test if the server actually power cycles?

Thanks in advance!
 
I found a way to test what happens when the watchdog is no longer kicked:

Code:
root@pve4:~# lsof /dev/watchdog
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
watchdog- 861 root    3w   CHR 10,130      0t0  431 /dev/watchdog
root@pve4:~# kill -9 861
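
Instead of copying the PID by hand, it can be pulled out of the lsof output. A minimal sketch; watchdog_holder_pids is a hypothetical helper name, and it assumes the column layout shown above (header line first, PID in the second column).

```shell
# Print the PID(s) of processes holding the watchdog device open.
# Reads `lsof /dev/watchdog` output on stdin; the first line is the header.
watchdog_holder_pids() {
    awk 'NR > 1 { print $2 }'
}
```

For example: lsof /dev/watchdog | watchdog_holder_pids
Be aware that sending kill -9 to that PID is expected to power cycle the host, so only do this on a node you can afford to lose for a few minutes.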

The terminal on the VGA output showed a warning message.
The watchdog timer then counted down unimpeded:

Code:
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      9 sec
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      8 sec
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      7 sec
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      5 sec
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      4 sec
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x04)
Watchdog Timer Is:      Stopped
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      10 sec
Present Countdown:      0 sec
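
The hex value next to "Watchdog Timer Use" changes from 0x44 to 0x04 in the last output. As far as I understand the IPMI 2.0 "Get Watchdog Timer" response, bit 6 of that byte means "timer running" and the lowest three bits select the timer use. The sketch below decodes it under that assumption; decode_timer_use is a made-up name and I have not checked this against every BMC.

```shell
# Decode the "Watchdog Timer Use" byte that ipmitool shows in parentheses.
# Assumption (my reading of the IPMI 2.0 spec): bit 6 = timer running,
# bits 2..0 = timer use.
decode_timer_use() {
    byte=$(( $1 ))    # accepts hex such as 0x44
    if [ $(( byte & 0x40 )) -ne 0 ]; then state=running; else state=stopped; fi
    case $(( byte & 0x07 )) in
        1) use='BIOS FRB2' ;;
        2) use='BIOS/POST' ;;
        3) use='OS Load' ;;
        4) use='SMS/OS' ;;
        5) use='OEM' ;;
        *) use='Reserved' ;;
    esac
    echo "$use, $state"
}
```

decode_timer_use 0x44 prints "SMS/OS, running" and decode_timer_use 0x04 prints "SMS/OS, stopped", which matches the "Started/Running" and "Stopped" lines in the outputs above.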

After 10 seconds the server power cycles as expected (cold reset).

The last log lines before the reset were:

Code:
Mar 29 08:50:24 pve4 kernel: [41023.418168] IPMI Watchdog: Unexpected close, not stopping watchdog!
Mar 29 08:50:24 pve4 systemd[1]: watchdog-mux.service: Main process exited, code=killed, status=9/KILL
Mar 29 08:50:24 pve4 systemd[1]: watchdog-mux.service: Failed with result 'signal'.
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$

And the iDRAC Event log contains a new line:

Code:
Mon Mar 29 2021 08:50:34    The watchdog timer power cycled the system.

Proxmox starts as intended. And since you can apparently never be paranoid enough ( https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252863 ), I also checked: yes, the watchdog is counting down again after the power-cycle reboot:

Code:
root@pve4:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      10 sec
Present Countdown:      9 sec

In summary, everything worked perfectly!
 
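I am revisiting this now, 5 months later.
One of the servers apparently switched its watchdog action to No action (0x00) by itself at some point in the last few months. The timer use (Reserved), pre-timeout interval (1 second) and initial countdown (15 sec) also differ from what I configured: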

Code:
root@pve3:~# ipmitool mc watchdog get
Watchdog Timer Use:     Reserved (0x40)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: No action (0x00)
Pre-timeout interval:   1 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      15 sec
Present Countdown:      14 sec

I checked the config files and they were fine. All other servers with identical configs were also fine.
A reboot solved the problem: the watchdog on pve3 is back to Power Cycle (0x03):

Code:
root@pve3:~# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      10 sec
Present Countdown:      9 sec
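
Since this kind of change is silent, a periodic check that compares the live status with the expected values might catch it earlier. A rough sketch, not something I have run in production: wd_drift is a hypothetical name, and the expected strings are simply the ones ipmitool prints in this post.

```shell
# Report fields in `ipmitool mc watchdog get` output (stdin) that differ from
# the expected values. Prints nothing and returns 0 when everything matches.
wd_drift() {
    want_action=$1      # e.g. 'Power Cycle (0x03)'
    want_countdown=$2   # e.g. '10 sec'
    awk -F': *' -v a="$want_action" -v c="$want_countdown" '
        $1 == "Watchdog Timer Actions" {
            seen++
            if ($2 != a) { print "action: " $2; bad = 1 }
        }
        $1 == "Initial Countdown" {
            seen++
            if ($2 != c) { print "initial countdown: " $2; bad = 1 }
        }
        END {
            # Missing fields (e.g. ipmitool failed) also count as a problem.
            if (seen < 2) { print "watchdog status incomplete"; bad = 1 }
            exit bad + 0
        }
    '
}
```

Run from cron, something like ipmitool mc watchdog get | wd_drift 'Power Cycle (0x03)' '10 sec' would stay quiet on a healthy node and print the deviating fields otherwise.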

I could not find any related messages in the syslog files or the iDRAC log. Does anyone here have an idea what could cause this?
 
Another problem I noticed: I get "The watchdog timer expired." messages in iDRAC whenever I reboot any of the servers using the reboot button in the PVE GUI, sudo reboot, or sudo init 6.
So I don't think the servers are restarting cleanly. They seem to get power cycled by the watchdog during the reboot because the watchdog is not properly disabled beforehand. I need a solution for this problem as well, so I made a new thread for it: https://forum.proxmox.com/threads/m...during-reboot-any-idea-how-to-fix-that.95878/
 