[SOLVED] pve-ha-lrm and watchdog-mux services fail to start

I'm running PVE 8.0.4 with ipmi_watchdog configured.

After disabling maintenance mode via ha-manager crm-command node-maintenance disable node3,
ha-manager status shows:
lrm node3 (old timestamp - dead?, [date & time])
...
service vm:XXXX (node3, freeze)

systemctl status watchdog-mux pve-ha-lrm shows that neither service is running; both failed to start.

After attempting to (re)start those services, logs show:
Oct 12 14:59:25 node3 systemd[1]: Started watchdog-mux.service - Proxmox VE watchdog multiplexer.
Oct 12 14:59:25 node3 watchdog-mux[40674]: watchdog open: Device or resource busy
Oct 12 14:59:25 node3 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Oct 12 14:59:25 node3 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 14:59:25 node3 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'.
Oct 12 14:59:26 node3 pve-ha-lrm[40694]: starting server
Oct 12 14:59:26 node3 pve-ha-lrm[40694]: status change startup => wait_for_agent_lock
Oct 12 14:59:26 node3 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: successfully acquired lock 'ha_agent_node3_lock'
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: ERROR: unable to open watchdog socket - No such file or directory
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: restart LRM, freeze all services
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: server stopped
Oct 12 14:59:32 node3 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
Oct 12 14:59:32 node3 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.
I'm not sure what causes the "ERROR: unable to open watchdog socket - No such file or directory" message. Am I missing some additional watchdog configuration, or could it be something else?

Thanks!
 
Hi,
what is the output of systemctl status watchdog-mux.service? If it's not active (running), you can try (re-)starting the service.
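The two errors in the log above are probably linked: pve-ha-lrm cannot open the watchdog socket because watchdog-mux exited first with "Device or resource busy". A minimal sketch of a helper that picks out that first failure from journal text is below. The function name diagnose_watchdog_log and the hint wording are made up for illustration; only the two error strings come from the log in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): reads journal text on stdin and
# prints a one-line hint about the first known watchdog failure it contains.
diagnose_watchdog_log() {
    log=$(cat)
    case "$log" in
        # Checked first: when watchdog-mux cannot open the device, the LRM
        # socket error below is usually just a consequence of it.
        *"watchdog open: Device or resource busy"*)
            echo "busy: watchdog-mux could not open the watchdog device" ;;
        *"unable to open watchdog socket"*)
            echo "no-socket: watchdog-mux is not running, so its socket is missing" ;;
        *)
            echo "ok: no known watchdog error found" ;;
    esac
}

# Usage on the affected node (assumes journalctl is available):
#   journalctl -b -u watchdog-mux.service -u pve-ha-lrm.service | diagnose_watchdog_log
# If the result is "busy", checking what else holds the device may help:
#   fuser -v /dev/watchdog
```

A "busy" result typically means the device is already held by something else or the driver refused to open it, so restarting pve-ha-lrm alone is unlikely to help until watchdog-mux itself starts cleanly.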
 
Fiona,
Thanks for the reply. The logs above already include the journalctl output for both services. However, I'm happy to say that we have likely found both the issue and the solution: a recent upgrade of the iDRAC9 firmware from the v6 to the v7 series appears to have been the culprit. After reverting to v6 and rebooting the OS, everything began working as expected. Hopefully the bug or incompatibility between ipmi_watchdog and the iDRAC9 firmware will be fixed in a future release.
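For anyone checking whether a node is on the affected firmware series, here is a rough sketch that pulls the major version out of `ipmitool mc info` output. The function name bmc_firmware_major is made up, and it assumes the BMC reports a "Firmware Revision : X.Y" line whose major number matches the iDRAC series; verify that against the iDRAC web interface before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: reads `ipmitool mc info` output on stdin and prints
# the major number from the "Firmware Revision" line (e.g. 6 or 7).
# Prints nothing if no such line is present.
bmc_firmware_major() {
    sed -n 's/^Firmware Revision[[:space:]]*:[[:space:]]*\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Usage on the node (assumes ipmitool is installed and the IPMI drivers are loaded):
#   ipmitool mc info | bmc_firmware_major
# The BMC's own view of the watchdog timer can be inspected with:
#   ipmitool mc watchdog get
```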

Thanks,
 