HA: VM startup failure, but no migration (HA group with priorities)

mf14v

New Member
May 15, 2018
Hi,

I am experimenting with HA on Proxmox VE 5.2, using a 3-node cluster (proxmoxmf01 to proxmoxmf03).

If I understand the documentation at https://pve.proxmox.com/wiki/High_Availability#ha_manager_start_failure_policy correctly, I would expect an HA VM to migrate to a different node if it fails to start.

If the VM is assigned to an HA group containing multiple nodes of the same priority, this works as expected.

But if the failing node has a higher priority within the HA group, the VM never migrates away. Instead it keeps trying to start on the same node, apparently (max_relocate+1)*(max_restart+1) times. The same happens with an unrestricted HA group containing just the failing node.
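As a quick check of that formula, here is a minimal sketch of the arithmetic. It uses max_relocate 1 from the resource configuration shown further down; max_restart is not set there, so the value 1 is an assumption about the default.

```shell
# Values for vm:108
max_relocate=1   # taken from resources.cfg
max_restart=1    # assumption: not set in resources.cfg, default assumed to be 1

# Observed pattern: (max_relocate+1)*(max_restart+1) start attempts on the same node
attempts=$(( (max_relocate + 1) * (max_restart + 1) ))
echo "expected start attempts: ${attempts}"
```

Under these assumptions the result is 4, which matches the four "starting service vm:108" entries in the journal output further down.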

In this scenario, node proxmoxmf03 is up and online in the cluster; only the VM is unable to start. To provoke the failure, I assigned the VM an ISO image that is not available on node proxmoxmf03.

In group prefer_03_byprio, all cluster nodes are listed, with node proxmoxmf03 having the highest priority.


Situation before attempted VM start:
Code:
root@proxmoxmf03:~# VMID=108; echo; ha-manager status | grep "^service vm:${VMID} "; echo; vmres=`sed -ne "/^vm: ${VMID}$/,/^$/p" /etc/pve/ha/resources.cfg;`; echo "${vmres}"; echo; vmgroup=`echo "${vmres}" | sed -ne 's/^[[:space:]]*group //p'`; sed -ne "/^group: ${vmgroup}$/,/^$/p" /etc/pve/ha/groups.cfg; journalctl -u pve-ha-lrm | egrep "( VM |:)${VMID}(:|$)"

service vm:108 (proxmoxmf03, stopped)

vm: 108 
        group prefer_03_byprio
        max_relocate 1
        state stopped

group: prefer_03_byprio
        nodes proxmoxmf02:1,proxmoxmf01:1,proxmoxmf03:2
        nofailback 0
        restricted 0

VM start attempt and situation after some time:
Code:
root@proxmoxmf03:~# ha-manager set vm:108 --state started

root@proxmoxmf03:~# VMID=108; echo; ha-manager status | grep "^service vm:${VMID} "; echo; vmres=`sed -ne "/^vm: ${VMID}$/,/^$/p" /etc/pve/ha/resources.cfg;`; echo "${vmres}"; echo; vmgroup=`echo "${vmres}" | sed -ne 's/^[[:space:]]*group //p'`; sed -ne "/^group: ${vmgroup}$/,/^$/p" /etc/pve/ha/groups.cfg; journalctl -u pve-ha-lrm | egrep "( VM |:)${VMID}(:|$)"

service vm:108 (proxmoxmf03, error)

vm: 108 
        group prefer_03_byprio
        max_relocate 1
        state started

group: prefer_03_byprio
        nodes proxmoxmf02:1,proxmoxmf01:1,proxmoxmf03:2
        nofailback 0
        restricted 0

May 23 17:59:30 proxmoxmf03 pve-ha-lrm[3379]: starting service vm:108
May 23 17:59:30 proxmoxmf03 pve-ha-lrm[3379]: <root@pam> starting task UPID:proxmoxmf03:00000D36:00009569:5B058FE2:qmstart:108:root@pam:
May 23 17:59:30 proxmoxmf03 pve-ha-lrm[3382]: start VM 108: UPID:proxmoxmf03:00000D36:00009569:5B058FE2:qmstart:108:root@pam:
May 23 17:59:30 proxmoxmf03 pve-ha-lrm[3379]: <root@pam> end task UPID:proxmoxmf03:00000D36:00009569:5B058FE2:qmstart:108:root@pam: volume 'ISO:iso/ubuntu-16.04.2-server-amd64.iso' does not exist
May 23 17:59:40 proxmoxmf03 pve-ha-lrm[3409]: starting service vm:108
May 23 17:59:40 proxmoxmf03 pve-ha-lrm[3409]: <root@pam> starting task UPID:proxmoxmf03:00000D53:00009966:5B058FEC:qmstart:108:root@pam:
May 23 17:59:40 proxmoxmf03 pve-ha-lrm[3411]: start VM 108: UPID:proxmoxmf03:00000D53:00009966:5B058FEC:qmstart:108:root@pam:
May 23 17:59:41 proxmoxmf03 pve-ha-lrm[3409]: <root@pam> end task UPID:proxmoxmf03:00000D53:00009966:5B058FEC:qmstart:108:root@pam: volume 'ISO:iso/ubuntu-16.04.2-server-amd64.iso' does not exist
May 23 17:59:50 proxmoxmf03 pve-ha-lrm[3438]: starting service vm:108
May 23 17:59:50 proxmoxmf03 pve-ha-lrm[3438]: <root@pam> starting task UPID:proxmoxmf03:00000D70:00009CFD:5B058FF6:qmstart:108:root@pam:
May 23 17:59:50 proxmoxmf03 pve-ha-lrm[3440]: start VM 108: UPID:proxmoxmf03:00000D70:00009CFD:5B058FF6:qmstart:108:root@pam:
May 23 17:59:50 proxmoxmf03 pve-ha-lrm[3438]: <root@pam> end task UPID:proxmoxmf03:00000D70:00009CFD:5B058FF6:qmstart:108:root@pam: volume 'ISO:iso/ubuntu-16.04.2-server-amd64.iso' does not exist
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3467]: starting service vm:108
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3467]: <root@pam> starting task UPID:proxmoxmf03:00000D8D:0000A0F8:5B059000:qmstart:108:root@pam:
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3469]: start VM 108: UPID:proxmoxmf03:00000D8D:0000A0F8:5B059000:qmstart:108:root@pam:
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3467]: <root@pam> end task UPID:proxmoxmf03:00000D8D:0000A0F8:5B059000:qmstart:108:root@pam: volume 'ISO:iso/ubuntu-16.04.2-server-amd64.iso' does not exist
May 23 18:00:00 proxmoxmf03 pve-ha-lrm[3467]: unable to start service vm:108

Is this a bug?

Regards
Matthias Ferdinand
 
This is expected behavior. The algorithm only tries to restart the service on the available/online nodes with the highest priority, so as long as proxmoxmf03 is online, it remains the only candidate in this group.
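To illustrate the selection rule described above, here is a rough sketch that extracts the highest-priority nodes from a "nodes" line as it appears in groups.cfg. The function top_priority_nodes is a hypothetical helper, not part of ha-manager; it ignores whether nodes are online, and it assumes that a node listed without a priority counts as priority 0.

```shell
# Hypothetical helper (not part of ha-manager): print the nodes that share
# the highest priority in a groups.cfg "nodes" value such as "a:1,b:1,c:2".
top_priority_nodes() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F: '
        {
            # assumption: a missing priority counts as 0
            prio = ($2 == "" ? 0 : $2) + 0
            if (NR == 1 || prio > max) { max = prio; list = $1 }
            else if (prio == max)      { list = list " " $1 }
        }
        END { print list }'
}

# Group prefer_03_byprio from this thread
candidates=$(top_priority_nodes "proxmoxmf02:1,proxmoxmf01:1,proxmoxmf03:2")
echo "restart candidates: ${candidates}"
```

For prefer_03_byprio this leaves only proxmoxmf03, so there is no other node to relocate to. If all three nodes had the same priority, all three would be candidates, which fits the case that worked as expected.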
 
That means using different priority levels in an HA group (or using unrestricted groups that don't contain the full set of nodes) can actually reduce availability. The documentation should warn about that.

I come from a Pacemaker background, so I had assumed a resource (VM) would have per-node fail counts, which would eventually make the node ineligible to run the failed resource.

Regards
Matthias
 
