pve-ha-lrm keeps failing

Belokan

Active Member
Apr 27, 2016
Hello,

I have a 3-node cluster based on pvetest. Nothing has been installed on the PVE nodes besides the Proxmox distribution.
On one of the nodes, pve-ha-lrm fails all the time. It fails within a few seconds when started from the GUI (or after a reboot), with the following messages in syslog:

May 5 15:26:32 pve2 pvedaemon[1260]: <root@pam> starting task UPID: pve2:00001747:0000AE60:572B4A08:srvstart: pve-ha-lrm:root@pam:
May 5 15:26:32 pve2 pvedaemon[5959]: starting service pve-ha-lrm: UPID: pve2:00001747:0000AE60:572B4A08:srvstart: pve-ha-lrm:root@pam:
May 5 15:26:32 pve2 watchdog-mux[5961]: watchdog active - unable to restart watchdog-mux
May 5 15:26:32 pve2 systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
May 5 15:26:32 pve2 systemd[1]: Unit watchdog-mux.service entered failed state.
May 5 15:26:32 pve2 pve-ha-lrm[5984]: starting server
May 5 15:26:32 pve2 pve-ha-lrm[5984]: status change startup => wait_for_agent_lock
May 5 15:26:34 pve2 pve-ha-lrm[5984]: successfully acquired lock 'ha_agent_pve2_lock'
May 5 15:26:34 pve2 pve-ha-lrm[5984]: ERROR: unable to open watchdog socket - No such file or directory
May 5 15:26:34 pve2 pve-ha-lrm[5984]: restart LRM, freeze all services
May 5 15:26:34 pve2 pve-ha-lrm[5984]: server stopped
May 5 15:26:34 pve2 systemd[1]: pve-ha-lrm.service: main process exited, code=exited, status=255/n/a
May 5 15:26:35 pve2 systemd[1]: Unit pve-ha-lrm.service entered failed state.
May 5 15:27:40 pve2 systemd-timesyncd[638]: interval/delta/delay/jitter/drift 512s/-0.013s/0.036s/0.014s/+40ppm

All nodes are installed the same way; only their hardware differs. Any idea what could cause this issue?

Thanks a lot in advance !
 
I've followed this discussion: https://forum.proxmox.com/threads/4-1-ha-software-watchdog-reset-does-not-work.25474/

It looks like I have the same symptoms. Right after a reboot, the softdog module is not loaded and watchdog-mux has failed as well:

root@pve2:~# lsmod | grep softdog

root@pve2:~# systemctl status watchdog-mux.service
● watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: failed (Result: exit-code) since Thu 2016-05-05 18:45:13 CEST; 1min 36s ago
Process: 1025 ExecStart=/usr/sbin/watchdog-mux (code=exited, status=1/FAILURE)
Main PID: 1025 (code=exited, status=1/FAILURE)

May 05 18:45:13 pve2 watchdog-mux[1025]: watchdog set timeout: Invalid argument
May 05 18:45:13 pve2 systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
May 05 18:45:13 pve2 systemd[1]: Unit watchdog-mux.service entered failed state.
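
For anyone comparing nodes, the `lsmod | grep softdog` check above can be extended to the other modules that come up later in this thread. The following is only a sketch: the function name `module_loaded` is made up for illustration, and the optional second argument exists only so the check can be run against a saved copy of `/proc/modules`.

```shell
#!/bin/sh
# Return 0 if the named kernel module appears in a /proc/modules-style listing.
# module_loaded is a hypothetical helper name, not a Proxmox tool.
module_loaded() {
    mod="$1"
    listing="${2:-/proc/modules}"
    # The first whitespace-separated field of each line is the module name.
    awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$listing"
}

# Report the state of the watchdog-related modules discussed in this thread.
for m in softdog mei mei_me; do
    if module_loaded "$m"; then
        echo "$m: loaded"
    else
        echo "$m: not loaded"
    fi
done
```

Matching on the exact module name avoids the false positives a plain `grep mei` would give (for example `mei` also matching `mei_me`).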
 
The pve-ha-lrm service is running after boot even though softdog is not loaded and watchdog-mux has failed:

root@pve2:~# systemctl status pve-ha-lrm.service
● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled)
Active: active (running) since Thu 2016-05-05 18:45:16 CEST; 4min 22s ago
Process: 1264 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
Main PID: 1272 (pve-ha-lrm)
CGroup: /system.slice/pve-ha-lrm.service
└─1272 pve-ha-lrm

May 05 18:45:16 pve2 pve-ha-lrm[1272]: starting server
May 05 18:45:16 pve2 pve-ha-lrm[1272]: status change startup => wait_for_agent_lock

But as soon as I try to migrate an HA VM:

Executing HA migrate for VM 104 to node pve1
TASK OK

May 05 18:51:41 pve2 pve-ha-crm[1263]: status change wait_for_quorum => slave
May 05 18:51:48 pve2 pve-ha-lrm[1272]: successfully acquired lock 'ha_agent_pve2_lock'
May 05 18:51:48 pve2 pve-ha-lrm[1272]: ERROR: unable to open watchdog socket - No such file or directory
May 05 18:51:48 pve2 pve-ha-lrm[1272]: restart LRM, freeze all services
May 05 18:51:48 pve2 pve-ha-lrm[1272]: server stopped
May 05 18:51:48 pve2 systemd[1]: pve-ha-lrm.service: main process exited, code=exited, status=255/n/a
May 05 18:51:48 pve2 systemd[1]: Unit pve-ha-lrm.service entered failed state.
May 05 18:51:49 pve2 pvedaemon[1256]: <root@pam> starting task UPID: pve2:000016FD:00009D77:572B7A25:hamigrate:104:root@pam:
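
The "unable to open watchdog socket - No such file or directory" error in the log above suggests the LRM is looking for a socket that watchdog-mux never created. A minimal sketch of a check for that, assuming the socket lives at `/run/watchdog-mux.sock` (this path is an assumption and does not appear in the logs; `watchdog_socket_present` is a hypothetical helper name):

```shell
#!/bin/sh
# Return 0 if a UNIX socket exists at the given path.
# The default path is an assumption about where watchdog-mux listens.
watchdog_socket_present() {
    [ -S "${1:-/run/watchdog-mux.sock}" ]
}

if watchdog_socket_present; then
    echo "watchdog-mux socket present"
else
    echo "watchdog-mux socket missing (is watchdog-mux.service running?)"
fi
```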

And then the pve-ha-lrm service fails:

root@pve2:~# systemctl status pve-ha-lrm.service
● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled)
Active: failed (Result: exit-code) since Thu 2016-05-05 18:51:48 CEST; 44s ago
Process: 5864 ExecStop=/usr/sbin/pve-ha-lrm stop (code=exited, status=0/SUCCESS)
Process: 1264 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
Main PID: 1272 (code=exited, status=255)

May 05 18:45:16 pve2 pve-ha-lrm[1272]: starting server
May 05 18:45:16 pve2 pve-ha-lrm[1272]: status change startup => wait_for_agent_lock
May 05 18:51:48 pve2 pve-ha-lrm[1272]: successfully acquired lock 'ha_agent_pve2_lock'
May 05 18:51:48 pve2 pve-ha-lrm[1272]: ERROR: unable to open watchdog socket - No such file or directory
May 05 18:51:48 pve2 pve-ha-lrm[1272]: restart LRM, freeze all services
May 05 18:51:48 pve2 pve-ha-lrm[1272]: server stopped
May 05 18:51:48 pve2 systemd[1]: pve-ha-lrm.service: main process exited, code=exited, status=255/n/a
May 05 18:51:48 pve2 systemd[1]: Unit pve-ha-lrm.service entered failed state.
 
I've tried to stop pve-ha-crm and pve-ha-lrm (which had already failed) in order to start the failed watchdog-mux manually, but:

root@pve2:~# systemctl start watchdog-mux.service

root@pve2:~# systemctl status watchdog-mux.service
● watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: failed (Result: exit-code) since Fri 2016-05-06 11:43:16 CEST; 3min 16s ago
Process: 21922 ExecStart=/usr/sbin/watchdog-mux (code=exited, status=1/FAILURE)
Main PID: 21922 (code=exited, status=1/FAILURE)

May 06 11:43:16 pve2 watchdog-mux[21922]: watchdog set timeout: Invalid argument
May 06 11:43:16 pve2 systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
May 06 11:43:16 pve2 systemd[1]: Unit watchdog-mux.service entered failed state.
 
Replying to myself, as I'm at the office right now and can't check this feature/workaround.

The "failing" node is based on a Core i5 2500M and the 2 "working" ones are based on a Core i3 6100U and an Atom C2538 (the latter is a VM only used as a "virtual" 3rd node for HA).
It appears that only the failing one has Intel vPro (Active Management Technology) embedded and, according to this thread, it may cause issues with softdog, as it is a kind of hardware watchdog implementation ...

https://forum.proxmox.com/threads/watchdog-mux-fails-to-set-timeout.23965/

I hope I'll be able to disable it from BIOS and I'll post the result here.
 
Hello,

So it was too good to be true. AMT was indeed enabled in the node's BIOS, but even after disabling it (and removing the softdog blacklisting in /lib/modprobe.d/blacklist_pve-kernel-4.4.6-1-pve.conf), the softdog module is still not loading at boot and the watchdog-mux + pve-ha-lrm services are failing ...
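
To see which modprobe configuration files still blacklist a given module, something like the sketch below may help. The helper name `find_blacklist` is made up for illustration, and the two directories searched are the usual Debian locations, assumed here rather than taken from the thread.

```shell
#!/bin/sh
# Print every "blacklist <module>" line found under the given paths.
# Returns non-zero when no such line exists.
find_blacklist() {
    mod="$1"
    shift
    # -r: recurse into directories, -s: stay quiet about missing paths.
    grep -rsE "^[[:space:]]*blacklist[[:space:]]+${mod}[[:space:]]*$" "$@"
}

find_blacklist softdog /etc/modprobe.d /lib/modprobe.d \
    || echo "no blacklist entry for softdog found"
```

The pattern is anchored to the whole line, so searching for `mei` does not also report `mei_me` entries.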

Any idea? Do you think removing the node from the cluster, reinstalling it completely and adding it again would be worth trying?

Thanks !
 
Hello,

Some feedback, and finally a solution!
Since the hardware of one of my PVE nodes crashed, I replaced it with the same model as the one with the faulty pve-ha-lrm.
So neither physical PVE node was able to host HA VMs correctly at that time ...

Then I discovered one line on the following page that I had not seen before:

http://pve.proxmox.com/wiki/High_Availability_Cluster_4.x#Hardware_Watchdogs

"Intel AMT (OS Health Watchdog) should be disabled and with it the mei and mei_me modules, as they may cause problems."

I had disabled AMT in the BIOS, but I had not blacklisted the mei* modules. Now that this is done on both physical PVE nodes (the 3rd one being an external VM used for quorum only), I've been able to configure HA resources and migrate them in case of a PVE crash!
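
For reference, blacklisting the two modules named in the wiki quote could look roughly like the sketch below. The file name `blacklist-mei.conf` and the helper name `write_mei_blacklist` are assumptions made for illustration; the thread does not say which file was actually used.

```shell
#!/bin/sh
# Write a modprobe blacklist for the Intel MEI modules named in the wiki quote.
# The default target path is an assumption, not the file used in this thread.
write_mei_blacklist() {
    target="${1:-/etc/modprobe.d/blacklist-mei.conf}"
    # printf repeats the format once per argument, giving one line per module.
    printf 'blacklist %s\n' mei mei_me > "$target"
}

# On the node, as root (deliberately not executed by this sketch):
#   write_mei_blacklist
#   update-initramfs -u   # so the blacklist is also seen in the initramfs
#   reboot
```

After the reboot, `lsmod | grep mei` should return nothing if the blacklist took effect.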

Thanks again for this great product !

Olivier

PS: Would it be possible to distinguish between "basic" and HA VMs in the left panel view?
 
