Hi,
I'm experimenting with Proxmox VE and HA: 3 nodes (proxmoxmf01 - proxmoxmf03), running Proxmox VE 5.2.
If I understand the documentation at https://pve.proxmox.com/wiki/High_Availability#ha_manager_start_failure_policy correctly, I would expect an HA VM to migrate to a different node if it fails to start.
If the VM is assigned to a HA group containing multiple nodes of the same priority, this works as expected.
But if the failing node has a higher priority than the other nodes in the HA group, the VM never migrates away. Instead, it keeps trying to start on the same node; apparently it does that (max_relocate+1)*(max_restart+1) times. The same happens with an unrestricted HA group containing just the failing node.
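As a rough check of that formula: max_relocate is set to 1 for vm:108, and max_restart is not set in resources.cfg, so the sketch below assumes its default of 1. The variable names are only for illustration.

```shell
# Values for vm:108. max_relocate comes from resources.cfg;
# max_restart is not set there, so the default of 1 is an assumption.
max_relocate=1
max_restart=1

# Number of start attempts suggested by the observed behaviour.
attempts=$(( (max_relocate + 1) * (max_restart + 1) ))
echo "expected start attempts: ${attempts}"
```

Under these assumptions the result is 4, which matches the four "starting service vm:108" entries in the pve-ha-lrm journal.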
In this scenario, node proxmoxmf03 is up and online in the cluster; only the VM is unable to start. To provoke the failure, I assigned an ISO image to the VM that is not available on node proxmoxmf03.
In group prefer_03_byprio, all cluster nodes are listed, with node proxmoxmf03 having the highest priority.
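To double-check which node a group prefers, the "nodes" value from /etc/pve/ha/groups.cfg can be parsed as in the following minimal sketch. highest_prio_nodes is a hypothetical helper, not part of ha-manager, and treating a node without an explicit priority as priority 0 is an assumption.

```shell
# Print the node(s) with the highest priority from a "nodes" value
# (format: name[:priority],name[:priority],...).
# Hypothetical helper; a missing priority is assumed to mean 0.
highest_prio_nodes() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F: '
        {
            name[NR] = $1
            prio[NR] = $2 + 0          # empty priority becomes 0
            if (NR == 1 || prio[NR] > max) max = prio[NR]
        }
        END {
            for (i = 1; i <= NR; i++)
                if (prio[i] == max) print name[i]
        }'
}

# The nodes line of group prefer_03_byprio:
highest_prio_nodes "proxmoxmf02:1,proxmoxmf01:1,proxmoxmf03:2"
```

For the group used here this prints proxmoxmf03 only; for a group where all nodes share one priority it prints every node.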
Situation before attempted VM start:
Code:
root@proxmoxmf03:~# VMID=108; echo; ha-manager status | grep "^service vm:${VMID} "; echo; vmres=`sed -ne "/^vm: ${VMID}$/,/^$/p" /etc/pve/ha/resources.cfg;`; echo "${vmres}"; echo; vmgroup=`echo "${vmres}" | sed -ne 's/^[[:space:]]*group //p'`; sed -ne "/^group: ${vmgroup}$/,/^$/p" /etc/pve/ha/groups.cfg; journalctl -u pve-ha-lrm | egrep "( VM |:)${VMID}(:|$)"
service vm:108 (proxmoxmf03, stopped)

vm: 108
        group prefer_03_byprio
        max_relocate 1
        state stopped

group: prefer_03_byprio
        nodes proxmoxmf02:1,proxmoxmf01:1,proxmoxmf03:2
        nofailback 0
        restricted 0
VM start attempt and situation after some time:
Code:
root@proxmoxmf03:~# ha-manager set vm:108 --state started
root@proxmoxmf03:~# VMID=108; echo; ha-manager status | grep "^service vm:${VMID} "; echo; vmres=`sed -ne "/^vm: ${VMID}$/,/^$/p" /etc/pve/ha/resources.cfg;`; echo "${vmres}"; echo; vmgroup=`echo "${vmres}" | sed -ne 's/^[[:space:]]*group //p'`; sed -ne "/^group: ${vmgroup}$/,/^$/p" /etc/pve/ha/groups.cfg; journalctl -u pve-ha-lrm | egrep "( VM |:)${VMID}(:|$)"
service vm:108 (proxmoxmf03, error)

vm: 108
        group prefer_03_byprio
        max_relocate 1
        state started

group: prefer_03_byprio
        nodes proxmoxmf02:1,proxmoxmf01:1,proxmoxmf03:2
        nofailback 0
        restricted 0
May 23 17:59:30 proxmoxmf03 pve-ha-lrm[3379]: starting service vm:108
May 23 17:59:30 proxmoxmf03 pve-ha-lrm[3379]: <root@pam> starting task UPID:proxmoxmf03:00000D36:00009569:5B058FE2:qmstart:108:root@pam:
May 23 17:59:30 proxmoxmf03 pve-ha-lrm[3382]: start VM 108: UPID:proxmoxmf03:00000D36:00009569:5B058FE2:qmstart:108:root@pam:
May 23 17:59:30 proxmoxmf03 pve-ha-lrm[3379]: <root@pam> end task UPID:proxmoxmf03:00000D36:00009569:5B058FE2:qmstart:108:root@pam: volume 'ISO:iso/ubuntu-16.04.2-server-amd64.iso' does not exist
May 23 17:59:40 proxmoxmf03 pve-ha-lrm[3409]: starting service vm:108
May 23 17:59:40 proxmoxmf03 pve-ha-lrm[3409]: <root@pam> starting task UPID:proxmoxmf03:00000D53:00009966:5B058FEC:qmstart:108:root@pam:
May 23 17:59:40 proxmoxmf03 pve-ha-lrm[3411]: start VM 108: UPID:proxmoxmf03:00000D53:00009966:5B058FEC:qmstart:108:root@pam:
May 23 17:59:41 proxmoxmf03 pve-ha-lrm[3409]: <root@pam> end task UPID:proxmoxmf03:00000D53:00009966:5B058FEC:qmstart:108:root@pam: volume 'ISO:iso/ubuntu-16.04.2-server-amd64.iso' does not exist
May 23 17:59:50 proxmoxmf03 pve-ha-lrm[3438]: starting service vm:108
May 23 17:59:50 proxmoxmf03 pve-ha-lrm[3438]: <root@pam> starting task UPID:proxmoxmf03:00000D70:00009CFD:5B058FF6:qmstart:108:root@pam:
May 23 17:59:50 proxmoxmf03 pve-ha-lrm[3440]: start VM 108: UPID:proxmoxmf03:00000D70:00009CFD:5B058FF6:qmstart:108:root@pam:
May 23 17:59:50 proxmoxmf03 pve-ha-lrm[3438]: <root@pam> end task UPID:proxmoxmf03:00000D70:00009CFD:5B058FF6:qmstart:108:root@pam: volume 'ISO:iso/ubuntu-16.04.2-server-amd64.iso' does not exist
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3467]: starting service vm:108
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3467]: <root@pam> starting task UPID:proxmoxmf03:00000D8D:0000A0F8:5B059000:qmstart:108:root@pam:
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3469]: start VM 108: UPID:proxmoxmf03:00000D8D:0000A0F8:5B059000:qmstart:108:root@pam:
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3467]: <root@pam> end task UPID:proxmoxmf03:00000D8D:0000A0F8:5B059000:qmstart:108:root@pam: volume 'ISO:iso/ubuntu-16.04.2-server-amd64.iso' does not exist
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3467]: unable to start service vm:108
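The number of start attempts can be counted from the journal instead of by eye. A minimal sketch: count_start_attempts is a hypothetical helper that reads pve-ha-lrm journal lines on stdin; the sample input repeats the relevant lines from the log above.

```shell
# Count "starting service <sid>" lines on stdin.
# Hypothetical helper; the trailing $ keeps vm:108 from matching vm:1080.
count_start_attempts() {
    grep -c "starting service $1\$"
}

# On the node itself this would be:
#   journalctl -u pve-ha-lrm | count_start_attempts vm:108
# Demo with lines copied from the log above:
count_start_attempts vm:108 <<'EOF'
May 23 17:59:30 proxmoxmf03 pve-ha-lrm[3379]: starting service vm:108
May 23 17:59:40 proxmoxmf03 pve-ha-lrm[3409]: starting service vm:108
May 23 17:59:50 proxmoxmf03 pve-ha-lrm[3438]: starting service vm:108
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3467]: starting service vm:108
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3467]: unable to start service vm:108
EOF
```

This gives 4 attempts, all on proxmoxmf03, with no relocation in between.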
Is this a bug?
Regards
Matthias Ferdinand