[SOLVED] watchdog-mux fails to set timeout

isti · Oct 16, 2015

Trying to test self fencing in PVE 4. I am using 3 Dell workstations as PoC, which have a HW(?) watchdog: iTCO_wdt.
If I load the module, a new watchdog device /dev/watchdog1 appears. echo 1 > /dev/watchdog1 reboots the node in a few seconds.
But there seems to be another watchdog device, /dev/watchdog0 which causes watchdog-mux to fail.

Code:

strace -f watchdog-mux
...
stat("/run/watchdog-mux.active", 0x7ffee01e0c90) = -1 ENOENT (No such file or directory)
stat("/dev/watchdog", {st_mode=S_IFCHR|0600, st_rdev=makedev(10, 130), ...}) = 0
open("/dev/watchdog", O_WRONLY)         = 3
ioctl(3, WDIOC_SETTIMEOUT, 0x603134)    = -1 EINVAL (Invalid argument)
dup(2)                                  = 4
fcntl(4, F_GETFL)                       = 0x8002 (flags O_RDWR|O_LARGEFILE)
fstat(4, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 1), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5a8eb9a000
lseek(4, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
write(4, "watchdog set timeout: Invalid ar"..., 39watchdog set timeout: Invalid argument
) = 39
close(4)                                = 0
munmap(0x7f5a8eb9a000, 4096)            = 0
write(3, "V", 1)                        = 1
close(3)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++

I have no idea what device /dev/watchdog0 is. I tried to disable all possible watchdogs with the kernel boot parameters nmi_watchdog=0 soft_watchdog=0, by blacklisting modules, and by poking around in /sys/devices/virtual/watchdog/watchdog0, but the device remains and does not seem to like:

Code:

ioctl(3, WDIOC_SETTIMEOUT, 0x603134)

Any idea how to work around this? Is there a way to remove the watchdog0 device, or to make watchdog-mux use a specific watchdog device rather than /dev/watchdog?

Whether PVE 4 is installed on top of jessie or from the ISO does not seem to matter.

Code:

uname -a
Linux proxmox4 4.2.2-1-pve #1 SMP Mon Oct 5 18:23:31 CEST 2015 x86_64 GNU/Linux
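
For anyone trying to work out which driver sits behind each /dev/watchdogN node, here is a minimal sketch (not from the original posters) that walks /sys/class/watchdog. The function name list_watchdogs and its optional sysfs-root argument are made up for illustration. The identity attribute only exists on kernels built with watchdog sysfs support, and a device that lives under /sys/devices/virtual (like the watchdog0 above) has no parent device, so the driver column may well come back empty for exactly the device you are hunting.

```shell
# list_watchdogs [SYSFS_ROOT]
# Prints one line per watchdog device found under SYSFS_ROOT (default: /sys).
# The function name and the optional root argument are illustrative assumptions.
list_watchdogs() {
    root="${1:-/sys}"
    for dev in "$root"/class/watchdog/watchdog*; do
        # Skip the unexpanded glob when no watchdog device exists.
        [ -e "$dev" ] || continue
        name=$(basename "$dev")
        ident="unknown"
        driver="none"
        # identity is only present on kernels with watchdog sysfs support.
        [ -r "$dev/identity" ] && ident=$(cat "$dev/identity")
        # Virtual devices have no parent device, hence no driver link.
        [ -e "$dev/device/driver" ] && driver=$(basename "$(readlink -f "$dev/device/driver")")
        printf '/dev/%s identity=%s driver=%s\n' "$name" "$ident" "$driver"
    done
}
```

Calling list_watchdogs with no argument reads the live /sys tree. A line that still shows identity=unknown driver=none tells you only that sysfs has nothing to say about that node, in which case the kernel log is the next place to look.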

coppercore · Oct 21, 2015

I am also having this issue with watchdog-mux. It's causing the pve-ha-crm service to fail.

Code:

Oct 21 01:38:59 pinkie watchdog-mux[1744]: watchdog set timeout: Invalid argument
Oct 21 01:38:59 pinkie systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Oct 21 01:38:59 pinkie systemd[1]: Unit watchdog-mux.service entered failed state.

This keeps happening, even after reinstalls. It's absolutely infuriating because it's causing softdog HA to NOT FUNCTION properly, not to mention that I keep seeing this repeated hundreds of times in syslog:

Code:

ve-ha-crm[1106]: watchdog update failed - Broken pipe

I have a three-node cluster this is happening on. I'll be happy to provide any logs, crash dumps, etc. that are needed to get this solved.

dietmar · Oct 21, 2015

Did you configure the watchdog in the BIOS somehow? If so, remove those settings.

coppercore · Oct 21, 2015

No, these are three Dell OptiPlex 755s. As far as I know, they have no hardware watchdog to speak of.

dietmar · Oct 21, 2015

Please check your syslog to see what kind of watchdog device gets loaded.
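
One way to do that check, as a rough sketch: filter the kernel log for lines that mention a watchdog. The function name watchdog_log_lines and the search pattern are assumptions, not something Proxmox ships. A driver that never uses the word "watchdog" or "wdt" in its messages will slip through, so treat an empty result as inconclusive.

```shell
# watchdog_log_lines: keep only log lines that mention a watchdog.
# Reads log text on stdin; the function name and the pattern are assumptions.
watchdog_log_lines() {
    # grep exits non-zero when nothing matches; "|| true" keeps that from
    # looking like an error to the caller.
    grep -i -E 'watchdog|wdt' || true
}
```

Typical use would be `dmesg | watchdog_log_lines` or `journalctl -k -b | watchdog_log_lines`.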

coppercore · Oct 21, 2015

Your mention of the BIOS got me thinking.

It turns out there was a second watchdog being loaded!

I'm not sure how or what it was, but these Dell OptiPlex 755s have the Intel Active Management BIOS extension. This apparently has a sort of hardware watchdog.

As soon as I disabled Intel AMT on these three machines all the HA functions started working normally.

If you really need me to, I can turn Intel Management Engine/AMT back on to see which driver it loads for the watchdog.

EDIT: https://www.kernel.org/doc/Documentation/misc-devices/mei/mei.txt
I believe this is what it was loading.

isti · Oct 21, 2015

mei_me and mei were the culprits, thanks for the tip.

BTW, our OptiPlex machines do not have the AMT option in the config (it's selectable when you purchase the machines), nor is there anything about a watchdog in the BIOS.
Nevertheless, the module gets loaded, so some parts of it must be present.

Blacklisting mei and mei_me solved it.
Thanks.

It would still be interesting to be able to tell watchdog-mux to only fiddle with a specific watchdog (/dev/watchdogX), not the /dev/watchdog device.
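
For reference, a sketch of what the blacklisting can look like as a modprobe.d fragment. The function name write_mei_blacklist and the file name blacklist-mei.conf are hypothetical choices for illustration; modprobe reads any *.conf file under /etc/modprobe.d.

```shell
# write_mei_blacklist [CONF_PATH]
# Writes a modprobe.d fragment that blacklists mei and mei_me.
# The function name and the default file name are hypothetical.
write_mei_blacklist() {
    conf="${1:-/etc/modprobe.d/blacklist-mei.conf}"
    printf '%s\n' 'blacklist mei' 'blacklist mei_me' > "$conf"
}
```

Run it as root without arguments to write the default path, then reboot. On Debian-based systems it may also be necessary to rebuild the initramfs (update-initramfs -u) if the modules are loaded from there.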

t.lamprecht · Oct 21, 2015

I updated the "High Availability Cluster 4.x" wiki page accordingly, thanks for the input.

For the question of different watchdogs I quote the watchdog linux mailing list:

The watchdog device node <-> driver mapping is fragile and
can change from one kernel version to the next or even across reboot, so
users shouldn't assume it to be persistent.

This is still the case and we cannot guarantee an assignment of a watchdog in all cases, and even the easier cases would need a lot of fiddling/checking/special cases which most of the time are prone to bugs.
