[SOLVED] pve-ha-lrm and watchdog-mux services fail to start

pherrera_tamu · Oct 12, 2023

Running PVE 8.0.4
ipmi_watchdog configured

After disabling maintenance mode via ha-manager crm-command node-maintenance disable node3,
ha-manager status shows:
lrm node3 (old timestamp - dead?, [date & time])
...
service vm:XXXX (node3, freeze)

systemctl status watchdog-mux pve-ha-lrm shows that neither service is running; both failed to start.

After attempting to (re)start those services, logs show:

Oct 12 14:59:25 node3 systemd[1]: Started watchdog-mux.service - Proxmox VE watchdog multiplexer.
Oct 12 14:59:25 node3 watchdog-mux[40674]: watchdog open: Device or resource busy
Oct 12 14:59:25 node3 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Oct 12 14:59:25 node3 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 14:59:25 node3 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'.
Oct 12 14:59:26 node3 pve-ha-lrm[40694]: starting server
Oct 12 14:59:26 node3 pve-ha-lrm[40694]: status change startup => wait_for_agent_lock
Oct 12 14:59:26 node3 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: successfully acquired lock 'ha_agent_node3_lock'
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: ERROR: unable to open watchdog socket - No such file or directory
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: restart LRM, freeze all services
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: server stopped
Oct 12 14:59:32 node3 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
Oct 12 14:59:32 node3 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.

I'm not sure what causes the "ERROR: unable to open watchdog socket - No such file or directory". Am I missing some additional configuration for the watchdog, or could it be something else?
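One way to read the excerpt: the LRM error shows up only after watchdog-mux has already exited with "Device or resource busy", so the missing socket may be a consequence of that first failure rather than a separate problem. This assumes the socket is the one watchdog-mux provides. A minimal Python sketch that pulls the failure lines out in log order; the function name `failure_chain` and the keyword list are illustrative assumptions, not part of any Proxmox tool:

```python
import re

# Relevant lines copied from the journal excerpt above.
JOURNAL = """\
Oct 12 14:59:25 node3 watchdog-mux[40674]: watchdog open: Device or resource busy
Oct 12 14:59:25 node3 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: ERROR: unable to open watchdog socket - No such file or directory
Oct 12 14:59:32 node3 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
"""

# timestamp, host, process name, pid, message
LINE_RE = re.compile(r"^(\w{3} +\d+ [\d:]{8}) (\S+) ([^\[:]+)\[(\d+)\]: (.*)$")


def failure_chain(text):
    """Return (timestamp, process, message) for lines that look like failures, in log order."""
    keywords = ("busy", "ERROR", "status=")  # hypothetical markers, tuned to this excerpt
    chain = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m and any(k in m.group(5) for k in keywords):
            chain.append((m.group(1), m.group(3), m.group(5)))
    return chain


for ts, proc, msg in failure_chain(JOURNAL):
    print(ts, proc, "->", msg)
```

The first entry in the chain is the watchdog-mux open failure, seven seconds before the LRM reports the missing socket.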

Thanks!

fiona · Oct 13, 2023

Hi,
what is the output of systemctl status watchdog-mux.service? If it's not active (running), you can try (re-)starting the service.
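If a restart still reports "Device or resource busy", one thing that may be worth ruling out is another process already holding the watchdog device open. A rough Python sketch that scans the fd symlinks under /proc; the device path `/dev/watchdog` is an assumption (some setups use `/dev/watchdog0`), `watchdog_holders` is a hypothetical helper, and it needs root to see other users' processes:

```python
import os


def watchdog_holders(device="/dev/watchdog", proc_root="/proc"):
    """Return sorted PIDs whose open file descriptors point at `device`."""
    pids = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited meanwhile, or permission denied
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == device:
                    pids.add(int(entry))
            except OSError:
                continue  # descriptor was closed while scanning
    return sorted(pids)

# Example (run as root on the affected node):
#   print(watchdog_holders())
```

An empty result would not rule out a driver or firmware problem; it only shows that no userspace process has the device open.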

pherrera_tamu · Oct 13, 2023

Fiona,
Thanks for the reply. The logs above already included the journalctl output for both services. However, I'm happy to say that we likely found the cause and a fix: a recent upgrade of the iDRAC9 firmware from the v6 to the v7 series appears to have been the culprit. After reverting to v6 and rebooting the OS, everything began to work as expected. Hopefully the bug/incompatibility between ipmi_watchdog and the iDRAC9 firmware will be fixed in a future release.

Thanks,
