IPMI watchdog not working

mo_

Hey guys, long time no see, but I'm back with a fresh problem.

I'm setting up a new HA cluster and am having issues with the IPMI watchdog on one of the three servers. Once every second, kern.log shows:

Apr 4 17:42:14 prox5 kernel: [ 5105.588943] IPMI Watchdog: response: Error d5 on cmd 22
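
For reference, the two numbers in that log line are hex bytes: the IPMI completion code and the command that failed. Below is a minimal sketch of a decoder, assuming the usual IPMI 2.0 meanings (0x22 Reset Watchdog Timer, 0x24 Set Watchdog Timer, 0x25 Get Watchdog Timer; completion code 0xd5 meaning the command is not supported in the present state). The function name `decode_ipmi_wdt_line` is made up for illustration, and the lookup tables cover only a handful of values.

```shell
#!/bin/sh
# Decode "IPMI Watchdog: response: Error XX on cmd YY" kernel log lines.
# decode_ipmi_wdt_line is a hypothetical helper name; the tables below
# hold only a few values taken from the IPMI 2.0 specification.

decode_ipmi_wdt_line() {
    line=$1
    # Pull out the two hex bytes; sed prints nothing if the line does not match.
    pair=$(printf '%s\n' "$line" |
        sed -n 's/.*IPMI Watchdog: response: Error \([0-9a-fA-F]\{1,2\}\) on cmd \([0-9a-fA-F]\{1,2\}\).*/\1 \2/p' |
        tr 'A-F' 'a-f')
    [ -n "$pair" ] || return 1
    code=${pair% *}   # completion code, e.g. d5
    cmd=${pair#* }    # command byte, e.g. 22

    case $cmd in
        22) cmd_name="Reset Watchdog Timer" ;;
        24) cmd_name="Set Watchdog Timer" ;;
        25) cmd_name="Get Watchdog Timer" ;;
        *)  cmd_name="not in this table" ;;
    esac

    case $code in
        c0) code_name="node busy" ;;
        c1) code_name="invalid command" ;;
        d5) code_name="not supported in present state" ;;
        ff) code_name="unspecified error" ;;
        *)  code_name="not in this table" ;;
    esac

    printf 'cmd 0x%s (%s): completion code 0x%s (%s)\n' \
        "$cmd" "$cmd_name" "$code" "$code_name"
}
```

Fed the line above, this reports that the BMC rejected a Reset Watchdog Timer request (the keepalive) as not supported in its present state. It does not say why the BMC is in that state.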

From strace I got:

epoll_wait(5, {}, 10, 1000) = 0
ioctl(3, WDIOC_KEEPALIVE, 0) = -1 EINVAL (Invalid argument)
write(2, "watchdog update failed: Invalid "..., 41watchdog update failed:
Invalid argument
) = 41

These are all HP blades, this one being a slightly different version (can only get details about that tomorrow).

lsmod does not show hpwdt, by the way, even though I haven't blacklisted it:


root@prox5:~# lsmod|grep hp
hpilo 20480 0
shpchp 36864 0
hpsa 98304 1
scsi_transport_sas 45056 1 hpsa

Turns out the error spam stops if I stop the watchdog-mux service, but that also stops the IPMI watchdog from setting and resetting timers. Since I couldn't find any documentation for watchdog-mux anywhere: how do I find out which watchdog other than IPMI it is trying to use that's not actually there? And how do I make it stop doing that?
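
One way to narrow this down is to list every loaded module that looks like a watchdog driver, rather than grepping for `hp` only. The sketch below filters `lsmod`-style input by name; the helper name `watchdog_modules` and the name patterns (`wdt`, `watchdog`, `softdog`) are assumptions, a heuristic rather than an exhaustive list.

```shell
#!/bin/sh
# Print module names from lsmod-style input that look like watchdog drivers.
# watchdog_modules is a hypothetical helper; the patterns are a heuristic.

watchdog_modules() {
    # $1 is the module name column; the "Module Size Used by" header never matches.
    awk '$1 ~ /wdt/ || $1 ~ /watchdog/ || $1 == "softdog" { print $1 }'
}
```

Typical use would be `lsmod | watchdog_modules`. To see which process currently holds the device node, `fuser -v /dev/watchdog` (from psmisc) should show it; on a PVE node I would expect that to be watchdog-mux, but that is an expectation, not something verified on this cluster.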

Any hints are greatly appreciated.

/edit: forgot pveversion:
Code:
root@prox5:~# pveversion -v
proxmox-ve: 4.4-86 (running kernel: 4.4.49-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.8-1-pve: 4.4.8-52
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.10-1-pve: 4.4.10-54
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-49
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-97
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
 
>These are all HP blades, this one being a slightly different version (can only get details about that tomorrow).

So... turns out they're all HP BL460C Gen8s. The CPUs are different, though. The problematic prox5 has: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
The other two have: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
 
Things just got weirder still. This error spam from the IPMI watchdog *stops* if you ENable HP ASR (their own watchdog thingie) and then reboot:

Code:
hpasmcli -s "enable asr";reboot

The wiki says to DISable all other watchdogs that may be active, but here you have to ENable the HP watchdog so that ipmi_watchdog then uses it instead of the actual IPMI systems? I'm confused.
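
To rule out a configuration difference between the nodes, it may help to check which watchdog module each one is set up to load. As far as I know, pve-ha-manager reads this from `WATCHDOG_MODULE` in `/etc/default/pve-ha-manager` and falls back to softdog when nothing is set. The sketch below assumes that layout; `configured_watchdog_module` is a made-up helper name.

```shell
#!/bin/sh
# Print the watchdog module configured for pve-ha-manager.
# Assumes a WATCHDOG_MODULE=<name> line in /etc/default/pve-ha-manager and
# a softdog fallback; configured_watchdog_module is a hypothetical helper.

configured_watchdog_module() {
    conf=${1:-/etc/default/pve-ha-manager}
    module=""
    if [ -r "$conf" ]; then
        # Use the last uncommented assignment and strip optional quotes.
        module=$(sed -n 's/^[[:space:]]*WATCHDOG_MODULE=\(.*\)$/\1/p' "$conf" |
            tail -n 1 | tr -d "\"'")
    fi
    printf '%s\n' "${module:-softdog}"
}
```

Running `configured_watchdog_module` on all three nodes and comparing the output with what lsmod actually shows would tell you whether prox5 is configured differently or just behaves differently with the same settings.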
 
Were you able to find a solution or workaround? I'm seeing this error as well on one(!) out of 6 HP DL 360 Gen9 servers:
Code:
kernel: IPMI Watchdog: response: Error d5 on cmd 24
Versions:
Code:
proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.13.13-5-pve: 4.13.13-38
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1