Automatic/Unattended Failover HA

sclementi

New Member
Jan 28, 2025
New to Proxmox, but not to virtualization. I installed a three-node cluster last week, configured Ceph for shared storage, and created some VMs, and everything seems to work normally. I can migrate VMs from host to host with no issue, and when I put a host into maintenance mode the VMs move off automatically.

Issue: If I lose power on a node, it takes about 2 minutes for the VMs to automatically re-register on other hosts in the cluster, but their OSes never load. Three minutes later the VMs are still offline, so I turn the failed host back on; it rejoins the cluster, and about 5 minutes after that the VMs finally start up on the host they moved to and are back online. From that point I can move them around with no issue.

What am I doing wrong, and what additional info can I provide to help? Or is there simply no disaster failover? (I can't see how that could be.)
 
I configured an HA group with the three hosts and configured the VMs to use the group. As stated, failover works when a node is placed in maintenance mode: the VMs migrate off according to the group settings.
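For reference, this is roughly the CLI equivalent of what I set up through the GUI; the group name (ha-group1) and the node names here are placeholders, not my actual config:

# create the HA group containing all three nodes (run on any cluster node)
ha-manager groupadd ha-group1 --nodes node1,node2,node3
# register a VM as an HA resource and assign it to the group
ha-manager add vm:100 --group ha-group1 --state started
# review the resulting HA resource configuration
ha-manager config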
 
Removing power from the server so it goes offline.

Shutdown Policy is Migrate. I also tried Failover.
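For what it's worth, that setting ends up in /etc/pve/datacenter.cfg (it can also be changed in the GUI under Datacenter → Options), e.g. with my current choice:

# /etc/pve/datacenter.cfg (excerpt)
ha: shutdown_policy=migrate

As far as I understand it, this policy only governs what happens on a clean shutdown or reboot of a node; a hard power loss should be handled by fencing and HA recovery regardless of the policy.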

As an update to my original post: the VMs do seem to try to start up, but they never get to the point of loading an OS, or even displaying a console for that matter.
 
Two of the VMs show as "running" (green icon; 103 and 105, originally on node 2), but they are not online, nor can I get to the console, and one of them doesn't come up at all (TestVM, 100).
The VMs are configured to use the HA group.

In this test I powered off node 2 if it wasn't apparent.
 
As a test, I failed node 1 this time... and the VMs all came online in under 5 minutes. The only reason I tested this was that I saw the LRM on node 2 was idle, while it was active on nodes 1 and 3. The failure, I guess, forced the LRM on node 2 to go active, as it is active now.

After bringing node 1 back online, the LRM states are all active.

Could that have something to do with it?
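In case it helps anyone checking the same thing, the LRM/CRM states and quorum can also be inspected from the shell, for example:

# HA overview: quorum, current master, and each node's lrm state
ha-manager status
# corosync quorum / cluster membership details
pvecm status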
 
Will test that tomorrow, since we have a dead disk in one PVE node: a power cut of that node without maintenance mode, in a 5-node 8.3.3 cluster, because the new disk is still waiting on the desk :)
 
Manual "powercut" on node with failed disk, after around 2min the 9vm+1lxc auto-started on other node as defined for ha prefered host group.
After bring back the pseudo failed node the 10 machines auto-migrated back as ha was defined, so all works fine and as desired.
@sclementi : unhappily there's somethink wrong in your cluster / ha configuration.
 
We have the Shutdown Policy left at the default (conditional), so try that instead of your choice of Migrate or Failover.
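If you want to try that, the corresponding line in /etc/pve/datacenter.cfg would look something like:

ha: shutdown_policy=conditional

(or simply remove the shutdown_policy entry, since conditional is the default).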