I have Dell-based servers with iDRAC. In the UEFI BIOS I have enabled the setting "Integrated Devices" - "OS Watchdog Timer: Enabled".
I have successfully enabled the hardware watchdog using this guide: https://pve.proxmox.com/wiki/High_A...x#Dell_IDrac_.28module_.22ipmi_watchdog.22.29
Basically, I did the following:
1. Specified
WATCHDOG_MODULE=ipmi_watchdog
in /etc/default/pve-ha-manager
2. Edited /etc/default/grub to set
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
and then ran update-grub
3. Created /etc/modprobe.d/ipmi_watchdog.conf with
Code:
options ipmi_watchdog action=power_cycle panic_wdt_timeout=10
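After the reboot, step 2 can be double-checked by looking for the parameter on the running kernel's command line. This is only a minimal sketch: the helper name `cmdline_has` is made up for illustration, and it simply matches a whole space-separated parameter in a string.

```shell
# Hypothetical helper (not part of Proxmox): succeeds if the kernel command
# line given as $1 contains the exact space-separated parameter given as $2.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On the node itself it could be used as `cmdline_has "$(cat /proc/cmdline)" nmi_watchdog=0 && echo "nmi_watchdog=0 is set"`.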
The configuration seems to work, and the watchdog is now reported as active:
Code:
root@pve3:~# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x00
Initial Countdown: 10 sec
Present Countdown: 9 sec
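For repeated checks, the two fields that matter in this output (timer state and action) could be tested by a small script. A rough sketch, assuming the output format shown above; the function name `watchdog_ok` is hypothetical and it only reads text from stdin, it does not talk to the BMC itself.

```shell
# Hypothetical helper: reads `ipmitool mc watchdog get` output on stdin and
# succeeds only if the timer is running AND the action is a power cycle.
watchdog_ok() {
    awk -F': *' '
        /^Watchdog Timer Is/      { running = ($2 ~ /^Started\/Running/) }
        /^Watchdog Timer Actions/ { cycle   = ($2 ~ /^Power Cycle/) }
        END { exit !(running && cycle) }
    '
}
```

Usage would look like `ipmitool mc watchdog get | watchdog_ok && echo "hardware watchdog armed"`.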
The old log lines like these, which hinted that the software watchdog was active, are now gone:
Code:
Mar 28 13:38:50 pve3 kernel: [ 0.422317] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
Mar 28 13:38:50 pve3 watchdog-mux[849]: Watchdog driver 'Software Watchdog', version 0
Instead, I now get:
Code:
Mar 28 21:22:03 pve3 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.78-2-pve root=/dev/mapper/pve-root ro quiet nmi_watchdog=0
Mar 28 21:22:03 pve3 systemd[1]: Started Proxmox VE watchdog multiplexer.
Mar 28 21:22:03 pve3 watchdog-mux[853]: Loading watchdog module 'ipmi_watchdog'
Mar 28 21:22:03 pve3 watchdog-mux[853]: Watchdog driver 'IPMI', version 1
Mar 28 21:22:03 pve3 kernel: [ 0.202344] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.78-2-pve root=/dev/mapper/pve-root ro quiet nmi_watchdog=0
Mar 28 21:22:03 pve3 kernel: [ 5.521300] IPMI Watchdog: driver initialized
Mar 28 21:22:06 pve3 corosync[1371]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Mar 28 21:22:06 pve3 corosync[1371]: [WD ] Watchdog not enabled by configuration
Mar 28 21:22:06 pve3 corosync[1371]: [SERV ] Service engine loaded: corosync watchdog service [7]
Mar 28 21:25:21 pve3 pve-ha-crm[1674]: watchdog active
Mar 28 21:25:55 pve3 pve-ha-lrm[1737]: watchdog active
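The line that tells the two cases apart is the `Watchdog driver '...'` message from watchdog-mux. As a small sketch, the driver name could be pulled out of such log lines like this; the function name `watchdog_driver` is made up, and the pattern assumes the log format shown above.

```shell
# Hypothetical helper: reads syslog/journal lines on stdin and prints the
# driver name from each watchdog-mux "Watchdog driver '...'" message.
watchdog_driver() {
    sed -n "s/.*watchdog-mux\[[0-9]*\]: Watchdog driver '\([^']*\)'.*/\1/p"
}
```

It could be fed from the boot log, for example `journalctl -b | watchdog_driver`, which should print `IPMI` here rather than `Software Watchdog`.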
No random reboots so far. I tested the HA failover and it also works great. Looks good!
Except that I am not a very trusting person when it comes to servers:
Is there any way to manually stop kicking the watchdog for 10 seconds to test if the server actually power cycles?
Thanks in advance!