VM migration sometimes fails on host reboot/shutdown

Nov 11, 2019
Hello,

We have an 8-server Proxmox 6.1-5 cluster (last updated 2 days ago) that is configured for HA with "shutdown_policy=migrate".
However, we noticed that this only works some of the time. When a host running an HA-enabled VM is rebooted or powered off, the result is one of the following:

1. the VM is first migrated to another host, then the host is rebooted or powered off; after the host starts up again, the VM is migrated back to it
or
2. the VM is not migrated, the host is rebooted or powered off, the VM goes into state "fence" on the HA status page, and the VM is then started on another host

Looking through the logs, case 2 seems to be caused by the order in which services are stopped on reboot/power off:
pve-ha-lrm doesn't get to migrate the VM because pve-cluster is stopped before pve-ha-lrm. Here's an excerpt from the system logs:

...
Jan 22 14:48:31 psvirt04 pmxcfs[3010]: [main] notice: teardown filesystem
...
Jan 22 14:48:32 psvirt04 pmxcfs[3010]: [main] notice: exit proxmox configuration filesystem (0)
Jan 22 14:48:32 psvirt04 systemd[1]: pve-cluster.service: Succeeded.
Jan 22 14:48:32 psvirt04 systemd[1]: Stopped The Proxmox VE cluster filesystem.
...
Jan 22 14:48:41 psvirt04 systemd[1]: Stopping PVE Local HA Resource Manager Daemon...
...
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[1] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[1] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[2] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[2] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[3] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[3] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: Unable to load access control list: Connection refused
Jan 22 14:48:42 psvirt04 systemd[1]: pve-ha-lrm.service: Control process exited, code=exited, status=111/n/a
....
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: received signal TERM
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: got shutdown request with shutdown policy 'migrate'
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: reboot LRM, doing maintenance, removing this node from active list
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: Can't locate object method "log" via package "PVE::HA::LRM" at /usr/share/perl5/PVE/HA/LRM.pm line 121.
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: updating service status from manager failed: Connection refused
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: lost lock 'ha_agent_psvirt04_lock - can't create '/etc/pve/priv/lock' (pmxcfs not mounted?)
Jan 22 14:48:54 psvirt04 pve-ha-lrm[4058]: status change active => lost_agent_lock
Jan 22 14:48:54 psvirt04 pve-ha-lrm[4058]: get shutdown request in state 'lost_agent_lock' - detected 3 running services
Jan 22 14:48:59 psvirt04 pve-ha-lrm[4058]: updating service status from manager failed: Connection refused
Jan 22 14:48:59 psvirt04 pve-ha-lrm[4058]: get shutdown request in state 'lost_agent_lock' - detected 3 running services
Jan 22 14:49:04 psvirt04 pve-ha-lrm[4058]: updating service status from manager failed: Connection refused
Jan 22 14:49:04 psvirt04 pve-ha-lrm[4058]: get shutdown request in state 'lost_agent_lock' - detected 3 running services
...

So it seems like if pve-cluster is stopped before pve-ha-lrm the VM migration can't happen.
On the same host as above, here's an example of a successful migration before the host reboots:

...
Jan 22 14:40:51 psvirt04 pve-ha-lrm[4131]: received signal TERM
Jan 22 14:40:51 psvirt04 pve-ha-lrm[4131]: got shutdown request with shutdown policy 'migrate'
Jan 22 14:40:51 psvirt04 pve-ha-lrm[4131]: reboot LRM, doing maintenance, removing this node from active list
...
Jan 22 14:41:01 psvirt04 pve-ha-lrm[4131]: status change active => maintenance
Jan 22 14:41:01 psvirt04 pve-ha-lrm[5936]: <root@pam> starting task UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam:
Jan 22 14:41:01 psvirt04 pve-ha-lrm[5935]: <root@pam> starting task UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam:
Jan 22 14:41:01 psvirt04 pve-ha-lrm[5937]: <root@pam> starting task UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam:
...
Jan 22 14:41:06 psvirt04 pve-ha-lrm[5936]: Task 'UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam:' still active, waiting
Jan 22 14:41:06 psvirt04 pve-ha-lrm[5935]: Task 'UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam:' still active, waiting
Jan 22 14:41:06 psvirt04 pve-ha-lrm[5937]: Task 'UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam:' still active, waiting
Jan 22 14:41:11 psvirt04 pve-ha-lrm[5936]: Task 'UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam:' still active, waiting
Jan 22 14:41:11 psvirt04 pve-ha-lrm[5935]: Task 'UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam:' still active, waiting
Jan 22 14:41:11 psvirt04 pve-ha-lrm[5937]: Task 'UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam:' still active, waiting
Jan 22 14:41:12 psvirt04 pve-ha-lrm[5935]: <root@pam> end task UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam: OK
Jan 22 14:41:12 psvirt04 pve-ha-lrm[5936]: <root@pam> end task UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam: OK
Jan 22 14:41:13 psvirt04 pve-ha-lrm[5937]: <root@pam> end task UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam: OK
Jan 22 14:41:21 psvirt04 pve-ha-lrm[4131]: watchdog closed (disabled)
Jan 22 14:41:21 psvirt04 pve-ha-lrm[4131]: server stopped
Jan 22 14:41:21 psvirt04 systemd[1]: pve-ha-lrm.service: Succeeded.
...
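
To compare many reboots without reading each log by hand, the two cases can be told apart by which message appears first. The following is only a rough sketch: it assumes the log has been saved as one entry per line in chronological order, it matches the message texts exactly as they appear in the excerpts above, and the function name and return labels are made up for illustration.

```python
import re

# Message texts taken from the excerpts in this post; other versions may word them differently.
CLUSTER_STOPPED = re.compile(r"systemd\[\d+\]: Stopped The Proxmox VE cluster filesystem")
LRM_SHUTDOWN = re.compile(r"pve-ha-lrm\[\d+\]: got shutdown request")


def classify_shutdown(log_lines):
    """Report which event shows up first in a chronologically ordered log.

    Returns one of three hypothetical labels:
      "cluster-stopped-first"  pve-cluster was stopped before the LRM got its shutdown request
      "lrm-first"              the LRM got its shutdown request first (or no cluster stop was seen)
      "no-lrm-shutdown-seen"   the LRM shutdown request is not in the given lines
    """
    cluster_idx = None
    lrm_idx = None
    for i, line in enumerate(log_lines):
        if cluster_idx is None and CLUSTER_STOPPED.search(line):
            cluster_idx = i
        if lrm_idx is None and LRM_SHUTDOWN.search(line):
            lrm_idx = i
    if lrm_idx is None:
        return "no-lrm-shutdown-seen"
    if cluster_idx is not None and cluster_idx < lrm_idx:
        return "cluster-stopped-first"
    return "lrm-first"


if __name__ == "__main__":
    import sys

    # Usage (file name is an example): python3 classify.py saved-journal.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        print(classify_shutdown(fh))
```

Running this over the saved log of each reboot should make it easier to see how often the cluster filesystem goes down first, but it only looks at message order and says nothing about why the order differs.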


Any clues on how to further look into this issue?
 
hi,

indeed, if the cluster service stops first, the VM cannot be migrated.

is it possible that the shutdown was triggered differently in the two cases? maybe from the GUI in the successful case (i'd imagine) and from a command in the unsuccessful one? that could narrow down the issue
 
Hi,

The reboots were done either from the Proxmox web interface (Reboot button in the upper right) or from the command line of the host by running the "reboot" command.
I did a new test, this time rebooting the same host twice from the web interface.
The first reboot went as expected: the VM migration was triggered before the host rebooted, and the VM was migrated back after the host was up.
The second reboot didn't migrate the VM. The logs show the same behaviour as above: the cluster service is stopped before the pve-ha-lrm service.
 
Can anyone confirm they have the same issue in 6.1-5?

Looking into it a bit further, it seems that the order in which services are stopped on reboot/poweroff is inconsistent from one reboot/poweroff to the next. Is there a way to guarantee that the pve-cluster service is always stopped after pve-ha-lrm has stopped?
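
One thing that can be checked on the node is whether systemd has an ordering dependency between the two units at all. systemd stops units in the reverse of their start order, so if pve-ha-lrm.service is ordered After= pve-cluster.service, it should be stopped before pve-cluster on shutdown. Below is a small sketch that reads the After= property via `systemctl show`; the helper names are invented for illustration, and it only looks at the direct After= list of one unit. What the units actually declare on 6.1-5 is not stated anywhere in this thread, so treat the result as a starting point only.

```python
import subprocess


def parse_after(show_output):
    """Extract unit names from the 'After=...' line printed by `systemctl show`."""
    for line in show_output.splitlines():
        if line.startswith("After="):
            return line[len("After="):].split()
    return []


def is_ordered_after(show_output, other_unit):
    """True if other_unit is listed in the After= property of the queried unit."""
    return other_unit in parse_after(show_output)


def query_after(unit):
    """Run `systemctl show --property=After <unit>` and return its raw output."""
    result = subprocess.run(
        ["systemctl", "show", "--property=After", unit],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    output = query_after("pve-ha-lrm.service")
    if is_ordered_after(output, "pve-cluster.service"):
        print("pve-ha-lrm.service is ordered After= pve-cluster.service")
    else:
        print("no direct After= ordering on pve-cluster.service found")
```

If the ordering turns out to be declared and the stop order still varies, the cause is probably somewhere else, and the full journal of a failed reboot would be the next thing to look at.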
 
I have the same problem.
Servers take a long time to reboot or shut down too.
 
