[SOLVED] watchdog-mux fails to set timeout

isti

Oct 16, 2015
Trying to test self fencing in PVE 4. I am using 3 Dell workstations as PoC, which have a HW(?) watchdog: iTCO_wdt.
If I load the module, a new watchdog device /dev/watchdog1 appears. echo 1 > /dev/watchdog1 reboots the node in a few seconds.
But there seems to be another watchdog device, /dev/watchdog0, which causes watchdog-mux to fail.
Code:
strace -f watchdog-mux
...
stat("/run/watchdog-mux.active", 0x7ffee01e0c90) = -1 ENOENT (No such file or directory)
stat("/dev/watchdog", {st_mode=S_IFCHR|0600, st_rdev=makedev(10, 130), ...}) = 0
open("/dev/watchdog", O_WRONLY)         = 3
ioctl(3, WDIOC_SETTIMEOUT, 0x603134)    = -1 EINVAL (Invalid argument)
dup(2)                                  = 4
fcntl(4, F_GETFL)                       = 0x8002 (flags O_RDWR|O_LARGEFILE)
fstat(4, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 1), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5a8eb9a000
lseek(4, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
write(4, "watchdog set timeout: Invalid ar"..., 39watchdog set timeout: Invalid argument
) = 39
close(4)                                = 0
munmap(0x7f5a8eb9a000, 4096)            = 0
write(3, "V", 1)                        = 1
close(3)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++

I have no idea what device /dev/watchdog0 is. I tried to disable all possible watchdogs with the kernel boot parameters nmi_watchdog=0 soft_watchdog=0, by blacklisting modules, and by poking around in /sys/devices/virtual/watchdog/watchdog0, but the device remains, and it does not seem to like:
Code:
ioctl(3, WDIOC_SETTIMEOUT, 0x603134)

Any idea how to work around the situation: either remove this watchdog0 device, or make watchdog-mux use a specific watchdog device rather than /dev/watchdog?
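For anyone trying to work out which driver sits behind each /dev/watchdogN, here is a minimal sketch that walks sysfs without opening the device nodes (opening them can arm the watchdog). The function name list_watchdogs is made up for illustration, and it assumes the kernel exposes /sys/class/watchdog/watchdogN with a device/driver symlink. A device that lives under /sys/devices/virtual, like the watchdog0 above, has no parent device, so it will probably show up as "unknown"; that at least tells you which node is not iTCO_wdt.

```shell
# Sketch: list watchdog devices and the kernel driver behind each one.
# Assumption: /sys/class/watchdog/watchdogN/device/driver exists for
# devices that have a parent; virtual devices will print "unknown".
# list_watchdogs is a hypothetical helper name, not a PVE tool.
list_watchdogs() {
    sysfs_root="${1:-/sys/class/watchdog}"
    for wd in "$sysfs_root"/watchdog*; do
        [ -e "$wd" ] || continue    # glob matched nothing
        name=$(basename "$wd")
        if [ -e "$wd/device/driver" ]; then
            # The driver symlink points at the bound driver's directory.
            driver=$(basename "$(readlink -f "$wd/device/driver")")
        else
            driver="unknown"
        fi
        echo "$name $driver"
    done
}
```

Call it as list_watchdogs with no argument to read the real /sys/class/watchdog, then compare the result with lsmod to decide which module to look at.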

Whether PVE 4 is installed on top of jessie or from the ISO does not seem to matter:
Code:
uname -a
Linux proxmox4 4.2.2-1-pve #1 SMP Mon Oct 5 18:23:31 CEST 2015 x86_64 GNU/Linux
 

I am also having this issue with watchdog-mux. It's causing the pve-ha-crm service to fail.
Code:
Oct 21 01:38:59 pinkie watchdog-mux[1744]: watchdog set timeout: Invalid argument
Oct 21 01:38:59 pinkie systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Oct 21 01:38:59 pinkie systemd[1]: Unit watchdog-mux.service entered failed state.

This keeps happening, even after reinstalls. It's absolutely infuriating because it's causing softdog HA to NOT FUNCTION properly, not to mention I keep having this repeated hundreds of times in syslog:
Code:
pve-ha-crm[1106]: watchdog update failed - Broken pipe

I have a three-node cluster this is happening on. I'll be happy to provide any logs, crash dumps, etc. that are needed in order to get this solved.
 
No, these are three Dell OptiPlex 755s. As far as I understand, they have no hardware watchdog to speak of.
 
You mentioning the BIOS got me to thinking.

It turns out there was a second watchdog being loaded!


I'm not sure how or what it was, but these Dell OptiPlex 755s have the Intel Active Management BIOS extension. This apparently provides a sort of hardware watchdog.


As soon as I disabled Intel AMT on these three machines all the HA functions started working normally.


If you really need me to, I can turn Intel Management Engine/AMT back on to see which driver it loads for the watchdog.

EDIT: https://www.kernel.org/doc/Documentation/misc-devices/mei/mei.txt
I believe this is what it was loading.
 
mei_me and mei were the culprits, thanks for the tip.

BTW, our OptiPlex machines do not have the AMT option in the config (it's selectable when you purchase the machines), nor is there anything about a watchdog in the BIOS.
Nevertheless, the module gets loaded, so some parts of it must be present.

Blacklisting mei and mei_me solved it.
Thanks.
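For reference, the blacklisting mentioned above could look roughly like the sketch below. The function name write_mei_blacklist and the file name mei-blacklist.conf are arbitrary choices for illustration; modprobe reads any *.conf file under /etc/modprobe.d.

```shell
# Sketch: write a modprobe blacklist for the two MEI modules.
# Assumption: the default path is only an example; pass another path
# as the first argument to write somewhere else.
write_mei_blacklist() {
    conf="${1:-/etc/modprobe.d/mei-blacklist.conf}"
    printf 'blacklist mei\nblacklist mei_me\n' > "$conf"
}
```

Run write_mei_blacklist as root and reboot. On Debian-based installs it may also be necessary to run update-initramfs -u first if the modules are loaded from the initramfs. Afterwards, lsmod | grep mei should print nothing.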

It would still be interesting to be able to tell watchdog-mux to only fiddle with a specific watchdog (/dev/watchdogX), not the /dev/watchdog device.
 
I updated the "High Availability Cluster 4.x" wiki page accordingly, thanks for the input.

On the question of different watchdogs, I quote the Linux watchdog mailing list:
"The watchdog device node <-> driver mapping is fragile and can change from one kernel version to the next or even across reboot, so users shouldn't assume it to be persistent."

This is still the case: we cannot guarantee the assignment of a watchdog in all cases, and even the easier cases would need a lot of fiddling, checking, and special-casing, which tends to be prone to bugs.
 
