HA or migration of VMs that are turned off on a node that is shut down or rebooted

Hi, I have a cluster with two Proxmox nodes at version 7.1 and a Raspberry-Pi-like device acting as a corosync QDevice to provide quorum for HA.

The nodes have replicated ZFS storage and a tiny Ceph cluster with min_size 1 (this Ceph setup is mostly for testing and for holding ISOs and stuff).

I'm seeing that HA works fine on VMs that are powered on, and the Proxmox hosts fence correctly.
If a host loses contact with the other host and with the QDevice, it reboots, and the VMs that were online are restarted on the other node by HA.

My problem is with VMs that are powered off but still "owned" by (assigned to) a host that is powered off or rebooting. These same VMs are handled fine by HA when they are turned on, and can also be live-migrated manually.

The https://pve.proxmox.com/wiki/High_Availability page says: "The CRM tries to keep the resource in stopped state, but it still tries to relocate the resources on node failures." So I put them in the same HA group as the others, with request state "stopped" as indicated.
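(As an aside, the same can be done from the CLI with ha-manager; the VM ID and group name below are just examples matching my setup:)

Code:
ha-manager add vm:101 --state stopped --group HA-group --max_restart 5 --max_relocate 5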

But HA moves these VMs only if the host fails (i.e. I disconnect the network cable and the host fences itself and reboots), not if I request a shutdown or reboot of the host from the web interface or the console.

Powered-off VMs on the node that is shut down or rebooted are simply locked out and can't be started, edited, or migrated. A migration can be requested, but it fails with a timeout (since the other host is not responding).

I don't know if this is intended; maybe the logic behind this choice was "HA acts only if there is a failure, and an admin requesting a shutdown or reboot isn't a failure".

But hear me out: what if the host never comes back up after I turn it off, because it fails on boot or during hardware maintenance? This isn't impossible, especially with older hardware.

I would be fine with either HA moving these VMs around automatically (like when I pull the network cable), OR with allowing manual migration even when the node is offline, so I can at least request the migration of a VM off a dead node when it is still technically possible (i.e. the VM is on shared or replicated storage).

For the moment, the workaround is to SSH into the node that is still online and run:

Code:
# move the VM configs from the dead node's directory to the online node's directory
mv /etc/pve/nodes/pve-big2/qemu-server/*.conf /etc/pve/nodes/pve-big1/qemu-server/

This moves the VM config files within the corosync/cluster-shared filesystem (/etc/pve) from the directory of the "dead" node (called "pve-big2" in my example) to the directory of the "online" node (called "pve-big1" in my example).

After a few seconds, the VMs reappear in the web interface and I can start them or manage them as normal.
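(On the surviving node, qm list should now show them too:)

Code:
qm list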

As I said multiple times, this works only for VMs whose disks are on shared or replicated storage, since I'm only moving the VM config file, not the virtual disks.
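If you want to sanity-check that before moving a config, something like this should list the disk entries so you can confirm every referenced storage is shared or replicated (101 is just an example VMID):

Code:
# print the disk lines of the VM config; the storage name is the part
# before the first colon in each value (e.g. local-zfs, ceph-pool)
grep -E '^(scsi|sata|ide|virtio|efidisk|tpmstate)[0-9]+:' /etc/pve/nodes/pve-big2/qemu-server/101.conf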
 
It's set to migrate, obviously, as that's what happens with VMs that are powered on. Here is my /etc/pve/datacenter.cfg:
Code:
console: vv
ha: shutdown_policy=migrate
keyboard: it
migration: insecure,network=192.168.111.222/24

My issue is with VMs that are powered off.

Those VMs are already in the HA config with request state "stopped"; that's why HA migrates them when I pull the network cable.

My problem is that HA does not migrate them when I shut down or reboot the node: only powered-on VMs are migrated (or restarted).

A snippet of /etc/pve/ha/resources.cfg; note VMs 101, 104, and 102:
Code:
vm: 106
	group HA-group
	max_relocate 5
	max_restart 5
	state started

vm: 101
	group HA-group
	max_relocate 5
	max_restart 5
	state stopped

vm: 104
	group HA-group
	max_relocate 5
	max_restart 5
	state stopped

vm: 108
	group HA-group
	max_relocate 5
	max_restart 5
	state stopped

vm: 100
	group HA-group
	max_relocate 5
	max_restart 5
	state started

vm: 102
	group HA-group
	max_relocate 5
	max_restart 5
	state stopped
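For completeness, you can cross-check what the HA stack currently sees (quorum, master, and the state of each resource) with:

Code:
ha-manager status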

Btw, the "migrate" shutdown policy only talks about *running* services (VMs or containers), so not moving shut-down VMs seems intentional. From the docs: "The LRM will try to delay the shutdown process, until all running services get moved away. But, this expects that the running services can be migrated to another node."
 
THANK YOU.

I have a cluster of nodes and one of them had a complete hardware failure (it will never come online again), and this was the only solution that worked. THANK YOU.

Code:
mv /etc/pve/nodes/HPDL350/qemu-server/*.conf /etc/pve/nodes/m93p/qemu-server/
 
