Hi, I have a cluster with two Proxmox nodes at version 7.1 and a Raspberry Pi-like device acting as a corosync QDevice to provide quorum for HA.
The nodes have replicated ZFS storage and a tiny Ceph cluster with "min size 1" (this Ceph setup is mostly for testing and holding ISOs and stuff).
HA works fine on VMs that are powered on, and the Proxmox hosts implement fencing correctly.
If a host loses contact with both the other host and the QDevice, it fences itself and reboots, and the VMs that were running on it are restarted on the other node by HA.
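For context, I check quorum (including the QDevice vote) and the HA resource states from either node with the standard commands:
Code:
pvecm status
ha-manager status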
My problem is with VMs that are powered off but still "owned" by (assigned to) a host that has been powered off or rebooted. These same VMs are handled fine by HA when they are turned on, and can also be migrated manually with live migration.
The documentation at https://pve.proxmox.com/wiki/High_Availability says "The CRM tries to keep the resource in stopped state, but it still tries to relocate the resources on node failures." So I put them in the same HA group as the others and set their requested state to "stopped", as indicated.
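For reference, what I did is roughly equivalent to this on the CLI (VMID 101 and the group name "ha-group1" are just placeholders):
Code:
ha-manager add vm:101 --state stopped --group ha-group1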
But HA moves these VMs only if the host is disconnected (i.e. I unplug the network cable and the host fences itself and reboots), not if I request a host shutdown or reboot from the web interface or the console.
Powered-off VMs on the node that was shut down or rebooted are simply locked out: they can't be started, edited, or migrated. A migration can be requested, but it fails with a timeout because the other host is not responding.
I don't know if this is intended; maybe the logic behind this choice was "HA acts only if there is a failure, and an admin requesting a shutdown or reboot isn't a failure".
But hear me out: what if the host never comes back up after I turn it off, because it fails on boot or during hardware maintenance? This isn't impossible, especially with older hardware.
I would be fine either with HA moving these VMs automatically (as it does when I pull the network cable), OR with manual migration being allowed while the node is offline, so that I could at least request migration of a VM away from a dead node when it is technically possible (i.e. the VM is on shared or replicated storage).
For the moment, the workaround is to SSH into the node that is still online and run:
Code:
mv /etc/pve/nodes/pve-big2/qemu-server/*.conf /etc/pve/nodes/pve-big1/qemu-server/
This moves the VM config files inside the cluster-shared filesystem (/etc/pve, which is synced across nodes via corosync) from the folder of the "dead" node (called "pve-big2" in my example) to the folder of the "online" node (called "pve-big1" in my example).
In a few seconds, the VMs will reappear in the web interface and I can start them or do anything as normal.
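If only one guest needs to be rescued, you can move just its config file instead of all of them (VMID 101 is just an example here):
Code:
mv /etc/pve/nodes/pve-big2/qemu-server/101.conf /etc/pve/nodes/pve-big1/qemu-server/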
As mentioned above, this works only for VMs whose disks are on shared or replicated storage, since I'm only moving the VM config file, not the virtual disks.
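A quick way to double-check that before moving a config is to look at the disk lines in it; the storage name before the colon tells you where each disk lives (the path and VMID here are examples):
Code:
grep -E '^(scsi|virtio|sata|ide|efidisk)[0-9]+:' /etc/pve/nodes/pve-big2/qemu-server/101.conf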