VM migration sometimes fails on host reboot/shutdown

givan · Jan 22, 2020

Hello,

We have an 8-server Proxmox 6.1-5 cluster (last updated 2 days ago) that is configured for HA with "shutdown_policy=migrate".
However, we noticed that this works only some of the time. When a host running an HA-enabled VM is rebooted or powered off, the result is one of the following:

1. the VM is first migrated to another host, then the host is rebooted or powered off; after the host starts up again, the VM is migrated back to it
or
2. the VM is not migrated and the host is rebooted or powered off; the VM goes into state "fence" on the HA status page and is then started on another host

Looking through the logs, case 2 seems to be caused by the order in which services are stopped on reboot/power off:
pve-ha-lrm doesn't get to migrate the VM because pve-cluster is stopped before pve-ha-lrm. Here's an excerpt from the system logs:

...
Jan 22 14:48:31 psvirt04 pmxcfs[3010]: [main] notice: teardown filesystem
...
Jan 22 14:48:32 psvirt04 pmxcfs[3010]: [main] notice: exit proxmox configuration filesystem (0)
Jan 22 14:48:32 psvirt04 systemd[1]: pve-cluster.service: Succeeded.
Jan 22 14:48:32 psvirt04 systemd[1]: Stopped The Proxmox VE cluster filesystem.
...
Jan 22 14:48:41 psvirt04 systemd[1]: Stopping PVE Local HA Resource Manager Daemon...
...
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[1] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[1] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[2] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[2] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[3] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[3] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: Unable to load access control list: Connection refused
Jan 22 14:48:42 psvirt04 systemd[1]: pve-ha-lrm.service: Control process exited, code=exited, status=111/n/a
....
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: received signal TERM
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: got shutdown request with shutdown policy 'migrate'
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: reboot LRM, doing maintenance, removing this node from active list
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: Can't locate object method "log" via package "PVE::HA::LRM" at /usr/share/perl5/PVE/HA/LRM.pm line 121.
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: updating service status from manager failed: Connection refused
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: lost lock 'ha_agent_psvirt04_lock - can't create '/etc/pve/priv/lock' (pmxcfs not mounted?)
Jan 22 14:48:54 psvirt04 pve-ha-lrm[4058]: status change active => lost_agent_lock
Jan 22 14:48:54 psvirt04 pve-ha-lrm[4058]: get shutdown request in state 'lost_agent_lock' - detected 3 running services
Jan 22 14:48:59 psvirt04 pve-ha-lrm[4058]: updating service status from manager failed: Connection refused
Jan 22 14:48:59 psvirt04 pve-ha-lrm[4058]: get shutdown request in state 'lost_agent_lock' - detected 3 running services
Jan 22 14:49:04 psvirt04 pve-ha-lrm[4058]: updating service status from manager failed: Connection refused
Jan 22 14:49:04 psvirt04 pve-ha-lrm[4058]: get shutdown request in state 'lost_agent_lock' - detected 3 running services
...

So it seems like if pve-cluster is stopped before pve-ha-lrm the VM migration can't happen.
On the same host as above, here's an example of a successful migration before the host reboots:

...
Jan 22 14:40:51 psvirt04 pve-ha-lrm[4131]: received signal TERM
Jan 22 14:40:51 psvirt04 pve-ha-lrm[4131]: got shutdown request with shutdown policy 'migrate'
Jan 22 14:40:51 psvirt04 pve-ha-lrm[4131]: reboot LRM, doing maintenance, removing this node from active list
...
Jan 22 14:41:01 psvirt04 pve-ha-lrm[4131]: status change active => maintenance
Jan 22 14:41:01 psvirt04 pve-ha-lrm[5936]: <root@pam> starting task UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam:
Jan 22 14:41:01 psvirt04 pve-ha-lrm[5935]: <root@pam> starting task UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam:
Jan 22 14:41:01 psvirt04 pve-ha-lrm[5937]: <root@pam> starting task UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam:
...
Jan 22 14:41:06 psvirt04 pve-ha-lrm[5936]: Task 'UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam:' still active, waiting
Jan 22 14:41:06 psvirt04 pve-ha-lrm[5935]: Task 'UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam:' still active, waiting
Jan 22 14:41:06 psvirt04 pve-ha-lrm[5937]: Task 'UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam:' still active, waiting
Jan 22 14:41:11 psvirt04 pve-ha-lrm[5936]: Task 'UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam:' still active, waiting
Jan 22 14:41:11 psvirt04 pve-ha-lrm[5935]: Task 'UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam:' still active, waiting
Jan 22 14:41:11 psvirt04 pve-ha-lrm[5937]: Task 'UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam:' still active, waiting
Jan 22 14:41:12 psvirt04 pve-ha-lrm[5935]: <root@pam> end task UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam: OK
Jan 22 14:41:12 psvirt04 pve-ha-lrm[5936]: <root@pam> end task UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam: OK
Jan 22 14:41:13 psvirt04 pve-ha-lrm[5937]: <root@pam> end task UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam: OK
Jan 22 14:41:21 psvirt04 pve-ha-lrm[4131]: watchdog closed (disabled)
Jan 22 14:41:21 psvirt04 pve-ha-lrm[4131]: server stopped
Jan 22 14:41:21 psvirt04 systemd[1]: pve-ha-lrm.service: Succeeded.
...
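The two excerpts above differ in whether systemd reports the cluster filesystem as stopped before pve-ha-lrm receives its TERM signal. A minimal Python sketch for classifying a log excerpt that way is shown below. The function name `shutdown_order` and its return labels are hypothetical, and the two message strings it matches are copied from the excerpts in this post, so they may differ on other versions.

```python
import re

# Syslog-style prefix as seen in the excerpts:
# "Jan 22 14:48:32 psvirt04 systemd[1]: message"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+([^\s\[:]+)(?:\[\d+\])?:\s+(.*)$"
)

# Message strings taken verbatim from the excerpts in this thread.
CLUSTER_STOPPED_MSG = "Stopped The Proxmox VE cluster filesystem."
LRM_TERM_MSG = "received signal TERM"


def shutdown_order(lines):
    """Classify a log excerpt (hypothetical helper, not part of Proxmox).

    Returns "cluster-first" if pve-cluster was reported stopped before
    pve-ha-lrm received TERM, "lrm-first" if the LRM received TERM first,
    and "unknown" if the LRM TERM line is not in the excerpt at all.
    """
    cluster_stopped = None
    lrm_term = None
    for idx, line in enumerate(lines):
        match = LINE_RE.match(line)
        if not match:
            continue  # skip "..." elisions and anything that is not a log line
        _, _, proc, msg = match.groups()
        if proc == "systemd" and msg == CLUSTER_STOPPED_MSG and cluster_stopped is None:
            cluster_stopped = idx
        elif proc == "pve-ha-lrm" and msg == LRM_TERM_MSG and lrm_term is None:
            lrm_term = idx
    if lrm_term is None:
        return "unknown"
    if cluster_stopped is not None and cluster_stopped < lrm_term:
        return "cluster-first"
    return "lrm-first"
```

Feeding it the lines of a saved log excerpt (for example `open(path).read().splitlines()`) gives a quick way to tally how often each ordering occurs across several reboots.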

Any clues on how to further look into this issue?

oguz · Jan 22, 2020

hi,

indeed if the cluster service ends first the VM cannot migrate

is there a possibility that the shutdown is done in a different way in two cases? maybe in the GUI for the successful case (i'd imagine) and maybe a command for the unsuccessful? that could narrow down the issue

givan · Jan 22, 2020

Hi,

The reboots were done either from the Proxmox web interface (Reboot button in the upper right) or from the command line of the host by running the "reboot" command.
I did a new test, this time rebooting the same host twice from the web interface.
The first reboot went as expected, with VM migration triggered before the host is rebooted and then VM migrated back after the host was up.
The second reboot didn't migrate the VM; the logs show the same behaviour as above, with the cluster service stopped before the pve-ha-lrm service.

givan · Jan 24, 2020

Can anyone confirm they have the same issue in 6.1-5?

Looking into it a bit further, it seems that the order in which services are stopped on reboot/poweroff is inconsistent between reboots/poweroffs. Is there a way to guarantee that the pve-cluster service is always stopped after pve-ha-lrm?
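One thing that could be checked here is whether systemd already has an ordering dependency between the two units. systemd stops units in the reverse of their start order, so if pve-ha-lrm.service is ordered After= pve-cluster.service, it should be stopped before pve-cluster.service on shutdown. The Python sketch below reads the `After` property with `systemctl show`. The helper names `parse_after` and `stops_before` are hypothetical, and whether this ordering is actually present on a given Proxmox installation is an assumption to verify, not something established in this thread.

```python
import subprocess


def parse_after(show_output):
    """Extract unit names from the 'After=' line printed by `systemctl show -p After`."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "After":
            return set(value.split())
    return set()


def stops_before(unit, dependency, show_output=None):
    """Return True if `unit` is ordered After= `dependency`.

    With that ordering, systemd stops `unit` before `dependency` at shutdown.
    `show_output` can be passed in directly (e.g. for testing); otherwise
    systemctl is queried on the local host.
    """
    if show_output is None:
        show_output = subprocess.run(
            ["systemctl", "show", "-p", "After", unit],
            capture_output=True,
            text=True,
            check=True,
        ).stdout
    return dependency in parse_after(show_output)
```

On a cluster node, `stops_before("pve-ha-lrm.service", "pve-cluster.service")` would report whether the ordering is declared. If it returns True and the problem still occurs, the cause is likely something other than a missing ordering between these two units.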

badji · Jan 24, 2020

I have the same problem.
Servers take a long time to reboot or shut down too.

givan · Jan 24, 2020

@badji - I experienced the long reboot/shutdown times too before installing the ifupdown2 package. Check the Proxmox admin guide at:

https://pve.proxmox.com/pve-docs/pve-admin-guide.pdf

Read the "Reload Network with ifupdown2" section carefully; there are some notes and warnings that you should check before installing ifupdown2.
