Automatic/Unattended Failover HA

sclementi

New Member
Jan 28, 2025
New to Proxmox, but not to virtualization. I installed a three-node cluster last week, configured Ceph for shared storage, and created some VMs, and everything seems to work normally. I can migrate VMs from host to host with no issue, and when I put a host into maintenance mode the VMs move off automatically.

Issue: If I lose power on a node, it takes about 2 minutes for the VMs to automatically re-register on other hosts in the cluster, but their OSes never load. Three minutes later the VMs are still offline, so I turn the failed host back on; it rejoins the cluster, and about 5 minutes after that the VMs finally start up on the host they moved to and are back online. From that point I can move them around with no issue.

What am I doing wrong, and what additional info can I provide to help? Or is there simply no disaster failover? (I can't see how that could be.)
 
I configured an HA group with the three hosts and configured the VMs to use the group. As stated, failover works when a node is placed in maintenance mode: the VMs migrate off according to the group settings.
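For reference, this is roughly the CLI equivalent of what I set up through the GUI; the group name (ha-group1) and the node names here are placeholders, not my actual config:

# create the HA group containing all three nodes (run on any cluster node)
ha-manager groupadd ha-group1 --nodes node1,node2,node3
# register a VM as an HA resource and assign it to the group
ha-manager add vm:100 --group ha-group1 --state started
# review the resulting HA resource configuration
ha-manager config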
 
Removing power from the server so it goes offline.

Shutdown Policy is Migrate. I also tried Failover.
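For what it's worth, that setting ends up in /etc/pve/datacenter.cfg (it can also be changed in the GUI under Datacenter → Options), e.g. with my current choice:

# /etc/pve/datacenter.cfg (excerpt)
ha: shutdown_policy=migrate

As far as I understand it, this policy only governs what happens on a clean shutdown or reboot of a node; a hard power loss should be handled by fencing and HA recovery regardless of the policy.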

As an update to my original post: the VMs do seem to try to start up, but they never get to the point of loading an OS, or even displaying a console for that matter.
 
Two of the VMs show as "running" (green icon; 103 and 105, originally on node 2), but they are not online, nor can I get to the console, and one of them doesn't come up at all (TestVM, 100).
The VMs are configured to use the HA group.

In this test I powered off node 2 if it wasn't apparent.
 
As a test, I failed node 1 this time... and the VMs all came online in under 5 minutes. The only reason I tested this was that I saw the LRM on node 2 was idle, while it was active on nodes 1 and 3. The failure, I guess, forced the LRM on node 2 to go active, as it is active now.

After bringing node 1 back online, the LRM states are all active.

Could that have something to do with it?
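In case it helps anyone checking the same thing, the LRM/CRM states and quorum can also be inspected from the shell, for example:

# HA overview: quorum, current master, and each node's lrm state
ha-manager status
# corosync quorum / cluster membership details
pvecm status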
 
Will test that tomorrow, since we have a dead disk in one PVE node: a power cut of that node without maintenance mode, in a 5-node 8.3.3 cluster, because the new disk is still waiting on the desk :)
 
Manual "powercut" on node with failed disk, after around 2min the 9vm+1lxc auto-started on other node as defined for ha prefered host group.
After bring back the pseudo failed node the 10 machines auto-migrated back as ha was defined, so all works fine and as desired.
@sclementi : unhappily there's somethink wrong in your cluster / ha configuration.
 
We have the Shutdown Policy left at the default (conditional), so try that instead of your choice of Migrate or Failover.
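If you want to try that, the corresponding line in /etc/pve/datacenter.cfg would look something like:

ha: shutdown_policy=conditional

(or simply remove the shutdown_policy entry, since conditional is the default).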