Configure Hardware Watchdog / IPMI Fencing

rafman, Feb 14, 2017
Hi,

We are testing a Proxmox cluster. I want to set up IPMI or hardware watchdog (Intel) fencing. However, the documentation I found at https://pve.proxmox.com/wiki/Fencing seems to be outdated, and the admin guide just refers to "/etc/default/pve-ha-manager", where I could enable the ipmi_watchdog line.

Could anyone point me to more documentation on how to set up the hardware watchdog or IPMI fencing? I suppose this is the preferred way to set up fencing.
 
Thanks! That helped a lot!

How do I temporarily disable fencing altogether, to facilitate updates/reboots of the cluster? During testing it sometimes happened that machines rebooted randomly when others were rebooted as well.
 
Strange, I never had this problem before. I have 3 servers in Ceph here: 2 with the Intel watchdog (Dell) and one with softdog (HP). When I do maintenance, I set the node (HA group) to nofailback. Then I can reboot, shut down, whatever I want.
Or do I misunderstand you?
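
For reference, a minimal sketch of how the nofailback toggle could be scripted, assuming a release where HA groups and `ha-manager groupset` are available. The helper name `nofailback_cmd` and the group name `mygroup` are made up for illustration, and the helper only prints the command so it can be reviewed before running it.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) the ha-manager call that
# switches the nofailback flag of an HA group on (1) or off (0).
# "$1" is the HA group name, "$2" is 1 or 0.
nofailback_cmd() {
    printf 'ha-manager groupset %s --nofailback %s\n' "$1" "$2"
}
```

Running `nofailback_cmd mygroup 1` before maintenance and `nofailback_cmd mygroup 0` afterwards shows the two commands to run as root.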
 
I guess the nofailback option is the solution to the maintenance problem. Thanks again.

The random rebooting happened just once. It would still be nice, however, to have the option to disarm fencing, just in case something goes wrong. Is that currently not possible?
 
Why would you disarm fencing? You are supposed to migrate VMs before rebooting a node.
For HW watchdogs, be aware that a server has several of them nowadays. Spend some time getting to know your hardware and its configuration. For example, here are my notes for an R730 with Proxmox 4.1:

Watchdog
From https://pve.proxmox.com/wiki/High_Availability_Cluster_4.x#Hardware_Watchdogs
3.1.1. iDRAC watchdog: disable
For Dell iDRAC, disable the Automated System Recovery Agent in the iDRAC configuration: Overview → iDRAC Settings → Network / Services tab: untick “Enable” under Automated System Recovery Agent.
After disabling the iDRAC watchdog, you have to reboot the server. Note: this watchdog is disabled by default.
3.1.2. OpenManage watchdog: disable
If OpenManage is installed, you need to disable watchdog management from OpenManage:
/opt/dell/srvadmin/sbin/dcecfg command=removepopalias aliasname=dcifru
3.1.3. APIC NMI watchdog: disable
Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
then run:
update-grub
3.1.4. iTCO watchdog (module "iTCO_wdt"): not enabled
The Proxmox VE default installation does not load the iTCO watchdog module. Check with:
lsmod | grep iTCO_wdt
3.1.5. IPMI watchdog: enable
Edit /etc/default/pve-ha-manager and change:
WATCHDOG_MODULE=ipmi_watchdog
Enable the IPMI watchdog:
echo "options ipmi_watchdog action=power_cycle" > /etc/modprobe.d/ipmi_watchdog.conf
Finally, reboot the server.
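
The edit to /etc/default/pve-ha-manager could be scripted roughly as follows. This is only a sketch: `set_watchdog_module` is a hypothetical helper, not something shipped with Proxmox VE, and it assumes the file uses plain `WATCHDOG_MODULE=...` lines (possibly commented out). It takes the file path as an optional second argument so it can be tried on a copy first.

```shell
#!/bin/sh
# Hypothetical helper: make sure exactly one active WATCHDOG_MODULE line
# exists in a pve-ha-manager style defaults file.
# "$1" is the module name, "$2" the file (defaults to the path from the notes).
set_watchdog_module() {
    module=$1
    conf=${2:-/etc/default/pve-ha-manager}
    tmp=$(mktemp) || return 1
    # Drop any existing WATCHDOG_MODULE line, commented or not; keep the rest.
    grep -v -E '^[[:space:]]*#?[[:space:]]*WATCHDOG_MODULE=' "$conf" > "$tmp" || true
    printf 'WATCHDOG_MODULE=%s\n' "$module" >> "$tmp"
    # Overwrite in place so ownership and permissions of the file are kept.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

On a node this would be called as root with `set_watchdog_module ipmi_watchdog`, followed by the reboot mentioned above.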
Verify that the countdown is 10 s and not overridden by OpenManage:
ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      10 sec
Present Countdown:      9 sec
Note: at reboot, the kernel complains that the watchdog is not being stopped (“IPMI Watchdog: Unexpected close, not stopping watchdog!”). This is normal. Systemd sets the watchdog to 10 minutes before running the shutdown sequence.
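
The verification step in the notes could be automated along these lines. This is a sketch that assumes the `ipmitool mc watchdog get` output looks like the listing in the notes; `check_ipmi_watchdog` is a hypothetical helper name, and the expected countdown (10 by default) is passed as an argument.

```shell
#!/bin/sh
# Hypothetical helper: read `ipmitool mc watchdog get` style text on stdin
# and check that the action is Power Cycle and the initial countdown
# matches the expected number of seconds ("$1", default 10).
check_ipmi_watchdog() {
    awk -F': *' -v want="${1:-10}" '
        /^Watchdog Timer Actions:/ { action = $2 }
        /^Initial Countdown:/      { split($2, a, " "); secs = a[1] }
        END {
            if (action ~ /^Power Cycle/ && secs == want) { print "OK"; exit 0 }
            printf "MISMATCH action=%s countdown=%s\n", action, secs
            exit 1
        }'
}
```

Typical use would be `ipmitool mc watchdog get | check_ipmi_watchdog 10`, which prints OK or a MISMATCH line and sets the exit status accordingly.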
 
Thanks for the good information. I configured the ipmi_watchdog on a Dell R610 and an R720 according to your notes. On one machine the timer is set correctly after reboot, but on the R610 it is not:
# bmc-watchdog --get
Timer Use: SMS/OS
Timer: Running
Logging: Enabled
Timeout Action: None
Pre-Timeout Interrupt: None
Pre-Timeout Interval: 0 seconds
There are no entries for alias name (dcifru)

Timer Use BIOS FRB2 Flag: Clear
Timer Use BIOS POST Flag: Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag: Clear
Timer Use BIOS OEM Flag: Clear
Initial Countdown: 480 seconds
Current Countdown: 479 seconds

The module is loaded.
OpenManage is disabled, I guess: → There are no entries for alias name (dcifru)
The entry in /etc/modprobe.d/ exists.

I could set the parameters via freeipmi commands, but I guess they would be gone after a reboot?
Any ideas why that happens?
 
Update:
I found something on the console: IPMI Watchdog: response: Error d4 on cmd 24
Google didn't really help. Where should I go with this one: the kernel mailing list? Dell TechCenter? I don't have ProSupport or the like.
 
OK, on the Dell R610 the iDRAC watchdog was enabled (not sure if that is the default), so I disabled it, as mentioned by ghusson above.
Trying to run bmc-watchdog --set -a 3 -i 10 then also failed with the same ipmi_watchdog driver error message.
Then I enabled the iDRAC Automated System Recovery Agent again, ran the bmc-watchdog --set command, and now it worked.
So it seems that the iDRAC watchdog and the IPMI watchdog are the same device on an R610, unlike on 12G servers?
After a reboot, the Power Cycle action and the 10 s timeout were also set correctly with the iDRAC Automated System Recovery Agent enabled:
# bmc-watchdog --get
Timer Use: SMS/OS
Timer: Running
Logging: Enabled
Timeout Action: Power Cycle
Pre-Timeout Interrupt: None
Pre-Timeout Interval: 0 seconds
Timer Use BIOS FRB2 Flag: Clear
Timer Use BIOS POST Flag: Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag: Set
Timer Use BIOS OEM Flag: Clear
Initial Countdown: 10 seconds
Current Countdown: 9 seconds
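
To compare nodes such as the R610 and the R720 at a glance, the relevant fields could be condensed into one line. This is a sketch that assumes `bmc-watchdog --get` prints "Name: value" lines as in the listings in this thread; `summarize_bmc_watchdog` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `bmc-watchdog --get` style text on stdin and
# print timer state, timeout action and initial countdown on one line.
summarize_bmc_watchdog() {
    awk -F': *' '
        $1 == "Timer"             { state = $2 }
        $1 == "Timeout Action"    { action = $2 }
        $1 == "Initial Countdown" { split($2, a, " "); secs = a[1] }
        END { printf "timer=%s action=%s initial=%ss\n", state, action, secs }'
}
```

Used as `bmc-watchdog --get | summarize_bmc_watchdog` on each node, a misconfigured machine stands out because its action or countdown differs from the others.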
 
