[SOLVED] pve-ha-lrm and watchdog-mux services fail to start

Running PVE 8.0.4
ipmi_watchdog configured

After disabling maintenance mode via ha-manager crm-command node-maintenance disable node3,
ha-manager status shows:
lrm node3 (old timestamp - dead?, [date & time])
...
service vm:XXXX (node3, freeze)

systemctl status watchdog-mux pve-ha-lrm shows that neither service is running; both failed to start.

After attempting to (re)start those services, logs show:
Oct 12 14:59:25 node3 systemd[1]: Started watchdog-mux.service - Proxmox VE watchdog multiplexer.
Oct 12 14:59:25 node3 watchdog-mux[40674]: watchdog open: Device or resource busy
Oct 12 14:59:25 node3 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Oct 12 14:59:25 node3 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 14:59:25 node3 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'.
Oct 12 14:59:26 node3 pve-ha-lrm[40694]: starting server
Oct 12 14:59:26 node3 pve-ha-lrm[40694]: status change startup => wait_for_agent_lock
Oct 12 14:59:26 node3 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: successfully acquired lock 'ha_agent_node3_lock'
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: ERROR: unable to open watchdog socket - No such file or directory
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: restart LRM, freeze all services
Oct 12 14:59:32 node3 pve-ha-lrm[40694]: server stopped
Oct 12 14:59:32 node3 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/EXCEPTION
Oct 12 14:59:32 node3 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.
I'm not sure what is causing "ERROR: unable to open watchdog socket - No such file or directory". Am I missing some additional watchdog configuration, or could it be something else?

Thanks!
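For anyone reading the journal excerpt above, the order of the messages suggests a chain: watchdog-mux fails to open the watchdog device ("Device or resource busy") and exits, and only afterwards does pve-ha-lrm complain that the watchdog socket is missing. A minimal Python sketch of that reading follows. The labels, function names and diagnosis strings are made up for illustration, and the idea that the socket error is a follow-on of the watchdog-mux failure is an assumption based on the log order, not something confirmed in this thread.

```python
import re

# Patterns copied from the journal excerpt in this thread.
# The labels on the left are hypothetical names chosen for this sketch.
PATTERNS = [
    ("watchdog_device_busy",
     re.compile(r"watchdog-mux\[\d+\]: watchdog open: Device or resource busy")),
    ("mux_exited",
     re.compile(r"watchdog-mux\.service: Main process exited")),
    ("lrm_socket_missing",
     re.compile(r"pve-ha-lrm\[\d+\]: ERROR: unable to open watchdog socket")),
]


def classify(lines):
    """Return labels of known failure events in order of first appearance."""
    seen = []
    for line in lines:
        for label, pattern in PATTERNS:
            if label not in seen and pattern.search(line):
                seen.append(label)
    return seen


def diagnose(lines):
    """Guess which failure came first (an assumption based on log order only)."""
    events = classify(lines)
    busy = "watchdog_device_busy" in events
    missing = "lrm_socket_missing" in events
    if busy and missing and events.index("watchdog_device_busy") < events.index("lrm_socket_missing"):
        return "watchdog-mux could not open the watchdog device; the LRM socket error looks like a follow-on failure"
    if missing:
        return "LRM could not open the watchdog socket; check whether watchdog-mux is running"
    if busy:
        return "watchdog-mux could not open the watchdog device"
    return "no known watchdog failure pattern found"


SAMPLE = [
    "Oct 12 14:59:25 node3 watchdog-mux[40674]: watchdog open: Device or resource busy",
    "Oct 12 14:59:25 node3 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE",
    "Oct 12 14:59:32 node3 pve-ha-lrm[40694]: ERROR: unable to open watchdog socket - No such file or directory",
]

print(diagnose(SAMPLE))
```

If this reading is right, the thing to chase is why the watchdog device is busy, rather than the missing socket.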
 
Hi,
what is the output of systemctl status watchdog-mux.service? If it's not active (running), you can try (re-)starting the service.
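If you want to script that check across several nodes, here is a rough Python sketch. It assumes `systemctl is-active <unit>` is available on the node and prints the unit state on stdout. The function names and the injectable `run` parameter are hypothetical helpers for this example, not part of any Proxmox tooling.

```python
import subprocess

# The two units discussed in this thread.
UNITS = ["watchdog-mux.service", "pve-ha-lrm.service"]


def unit_state(unit, run=subprocess.run):
    """Return the state string reported by `systemctl is-active`, or "unknown"."""
    try:
        result = run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            check=False,  # a non-zero exit just means the unit is not active
        )
    except OSError:
        # systemctl not found or not executable on this machine
        return "unknown"
    return result.stdout.strip() or "unknown"


def inactive_units(units=UNITS, run=subprocess.run):
    """Return the units whose reported state is anything other than "active"."""
    return [unit for unit in units if unit_state(unit, run) != "active"]
```

Calling `inactive_units()` on the affected node should list whichever of the two services is not running; restarting them is still a manual step.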
 
Fiona,
Thanks for the reply. The logs above already include the journalctl output for both services. However, I'm happy to say that we have likely found both the issue and a solution. A recent upgrade of the iDRAC9 firmware from the v6 to the v7 series appears to have been the culprit. After reverting to v6 and rebooting the OS, everything began to work as expected. Hopefully the bug or incompatibility between ipmi_watchdog and the iDRAC9 firmware will be fixed in the future.

Thanks,
 