Hello,
We have an 8-server Proxmox 6.1-5 cluster (last updated 2 days ago) that is configured for HA with "shutdown_policy=migrate".
However, we noticed that this only works some of the time. When a host running an HA-enabled VM is rebooted or powered off, one of two things happens:
1. The VM is first migrated to another host, then the host is rebooted or powered off. After the host starts up again, the VM is migrated back to it.
or
2. The VM is not migrated, and the host is rebooted or powered off. The VM goes into state "fence" on the HA status page and is then started on another host.
Looking through the logs, case 2 seems to be caused by the order in which the services are stopped on reboot/power off: pve-ha-lrm doesn't get to migrate the VM because pve-cluster is stopped before pve-ha-lrm. Here's an excerpt from the system logs:
...
Jan 22 14:48:31 psvirt04 pmxcfs[3010]: [main] notice: teardown filesystem
...
Jan 22 14:48:32 psvirt04 pmxcfs[3010]: [main] notice: exit proxmox configuration filesystem (0)
Jan 22 14:48:32 psvirt04 systemd[1]: pve-cluster.service: Succeeded.
Jan 22 14:48:32 psvirt04 systemd[1]: Stopped The Proxmox VE cluster filesystem.
...
Jan 22 14:48:41 psvirt04 systemd[1]: Stopping PVE Local HA Resource Manager Daemon...
...
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[1] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[1] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[2] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[2] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[3] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: ipcc_send_rec[3] failed: Connection refused
Jan 22 14:48:42 psvirt04 pve-ha-lrm[5896]: Unable to load access control list: Connection refused
Jan 22 14:48:42 psvirt04 systemd[1]: pve-ha-lrm.service: Control process exited, code=exited, status=111/n/a
....
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: received signal TERM
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: got shutdown request with shutdown policy 'migrate'
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: reboot LRM, doing maintenance, removing this node from active list
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: Can't locate object method "log" via package "PVE::HA::LRM" at /usr/share/perl5/PVE/HA/LRM.pm line 121.
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: updating service status from manager failed: Connection refused
Jan 22 14:48:49 psvirt04 pve-ha-lrm[4058]: lost lock 'ha_agent_psvirt04_lock - can't create '/etc/pve/priv/lock' (pmxcfs not mounted?)
Jan 22 14:48:54 psvirt04 pve-ha-lrm[4058]: status change active => lost_agent_lock
Jan 22 14:48:54 psvirt04 pve-ha-lrm[4058]: get shutdown request in state 'lost_agent_lock' - detected 3 running services
Jan 22 14:48:59 psvirt04 pve-ha-lrm[4058]: updating service status from manager failed: Connection refused
Jan 22 14:48:59 psvirt04 pve-ha-lrm[4058]: get shutdown request in state 'lost_agent_lock' - detected 3 running services
Jan 22 14:49:04 psvirt04 pve-ha-lrm[4058]: updating service status from manager failed: Connection refused
Jan 22 14:49:04 psvirt04 pve-ha-lrm[4058]: get shutdown request in state 'lost_agent_lock' - detected 3 running services
...
So it seems that if pve-cluster is stopped before pve-ha-lrm, the VM migration can't happen.
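To check this ordering on other hosts without reading the journal by hand, something like the following Python sketch could help. It assumes syslog-style lines in the same format as the excerpt above and simply looks at which of the two messages appears first. The function names and the choice of marker strings are my own, taken from the log lines shown here, so they may need adjusting for other versions.

```python
import re

# Syslog-style line: "Jan 22 14:48:32 host unit[pid]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<unit>[^\[\s:]+)(?:\[(?P<pid>\d+)\])?:\s+(?P<msg>.*)$"
)

# Marker strings copied from the excerpt above (assumed stable across hosts).
CLUSTER_STOPPED = "Stopped The Proxmox VE cluster filesystem."
LRM_GOT_TERM = "received signal TERM"


def first_match(lines, unit, text):
    """Return (index, timestamp) of the first line from `unit` containing `text`."""
    for index, line in enumerate(lines):
        m = LINE_RE.match(line.strip())
        if m and m.group("unit") == unit and text in m.group("msg"):
            return index, m.group("ts")
    return None


def cluster_stopped_before_lrm(lines):
    """True if pve-cluster was stopped before pve-ha-lrm logged SIGTERM.

    Relies on the lines being in chronological order, as in the journal.
    Returns None when either event is missing from the excerpt.
    """
    cluster = first_match(lines, "systemd", CLUSTER_STOPPED)
    lrm = first_match(lines, "pve-ha-lrm", LRM_GOT_TERM)
    if cluster is None or lrm is None:
        return None
    return cluster[0] < lrm[0]
```

Passing it the lines of the previous boot's journal (for example the saved output of `journalctl -b -1`, if the journal is persistent) should return True for the failing case shown above.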
On the same host as above, here's an example of a successful migration before the host reboots:
...
Jan 22 14:40:51 psvirt04 pve-ha-lrm[4131]: received signal TERM
Jan 22 14:40:51 psvirt04 pve-ha-lrm[4131]: got shutdown request with shutdown policy 'migrate'
Jan 22 14:40:51 psvirt04 pve-ha-lrm[4131]: reboot LRM, doing maintenance, removing this node from active list
...
Jan 22 14:41:01 psvirt04 pve-ha-lrm[4131]: status change active => maintenance
Jan 22 14:41:01 psvirt04 pve-ha-lrm[5936]: <root@pam> starting task UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam:
Jan 22 14:41:01 psvirt04 pve-ha-lrm[5935]: <root@pam> starting task UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam:
Jan 22 14:41:01 psvirt04 pve-ha-lrm[5937]: <root@pam> starting task UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam:
...
Jan 22 14:41:06 psvirt04 pve-ha-lrm[5936]: Task 'UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam:' still active, waiting
Jan 22 14:41:06 psvirt04 pve-ha-lrm[5935]: Task 'UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam:' still active, waiting
Jan 22 14:41:06 psvirt04 pve-ha-lrm[5937]: Task 'UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam:' still active, waiting
Jan 22 14:41:11 psvirt04 pve-ha-lrm[5936]: Task 'UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam:' still active, waiting
Jan 22 14:41:11 psvirt04 pve-ha-lrm[5935]: Task 'UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam:' still active, waiting
Jan 22 14:41:11 psvirt04 pve-ha-lrm[5937]: Task 'UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam:' still active, waiting
Jan 22 14:41:12 psvirt04 pve-ha-lrm[5935]: <root@pam> end task UPID:psvirt04:00001733:000086E7:5E2850ED:qmigrate:900:root@pam: OK
Jan 22 14:41:12 psvirt04 pve-ha-lrm[5936]: <root@pam> end task UPID:psvirt04:00001734:000086E7:5E2850ED:qmigrate:901:root@pam: OK
Jan 22 14:41:13 psvirt04 pve-ha-lrm[5937]: <root@pam> end task UPID:psvirt04:00001732:000086E7:5E2850ED:qmigrate:902:root@pam: OK
Jan 22 14:41:21 psvirt04 pve-ha-lrm[4131]: watchdog closed (disabled)
Jan 22 14:41:21 psvirt04 pve-ha-lrm[4131]: server stopped
Jan 22 14:41:21 psvirt04 systemd[1]: pve-ha-lrm.service: Succeeded.
...
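For the successful case, a similar sketch can confirm that every qmigrate task the LRM started also logged an end status. It assumes the "starting task" / "end task" wording from the excerpt above. The function name is hypothetical.

```python
import re

START_RE = re.compile(r"starting task (UPID:\S+)")
END_RE = re.compile(r"end task (UPID:\S+) (.*)$")


def migration_results(lines):
    """Map each qmigrate UPID to its logged end status, or None if no end was seen."""
    results = {}
    for line in lines:
        start = START_RE.search(line)
        if start and ":qmigrate:" in start.group(1):
            # Remember the task; keep an earlier result if one is already recorded.
            results.setdefault(start.group(1), None)
            continue
        end = END_RE.search(line.rstrip())
        if end and ":qmigrate:" in end.group(1):
            results[end.group(1)] = end.group(2)
    return results
```

Any UPID still mapped to None would be a migration that started but never reported an end before the host went down.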
Any clues on how to look into this issue further?