Automatic/Unattended Failover HA

sclementi

Jan 28, 2025
New to Proxmox, but not to virtualization. I installed a three-node cluster last week, configured Ceph for shared storage, created some VMs, and everything seems to work normally. I can migrate VMs from host to host with no issues, and when I put hosts into maintenance mode the VMs move automatically.

Issue: If I lose power on a node, it takes about 2 minutes for the VMs to automatically re-register on other hosts in the cluster, and their OSes never load. Three minutes later the VMs are still offline, so I turn the failed host back on; it comes back online, and about 5 minutes later the VMs finally start up on the host they moved to and are back online. At that point I can move them around with no issue.

What am I doing wrong... and what additional info can I give to help? Or is there simply no disaster failover? (I can't see how that could be.)
 
Configured an HA group with the three hosts and configured the VMs to use the group. As stated, failover works when a node is placed in maintenance mode: the VMs migrate off according to the group settings.
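
For reference, this is roughly the CLI equivalent of what I set up in the GUI (the group name here is just a placeholder for whatever I called it):

Code:
# create an HA group spanning the three nodes
ha-manager groupadd ha-group1 --nodes "xxxx-tmppmox01,xxxx-tmppmox02,xxxx-tmppmox03"

# put a VM under HA management and assign it to the group
ha-manager add vm:100 --group ha-group1 --state started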
 
Removing power from the server so it goes offline.

Shutdown Policy is Migrate. I also tried Failover.
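
If it helps, the policy is just the ha: line in /etc/pve/datacenter.cfg; with my current setting it looks like the snippet below (conditional is the default when the line is absent, and freeze/failover/migrate are the other options):

Code:
# /etc/pve/datacenter.cfg
ha: shutdown_policy=migrate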

As an update to my original post... the VMs do seem to try to start up, but they never get to the point where they load an OS, or even display a console for that matter.
 
Two of the VMs show as "running" (green icon; 103 and 105, originally on node 2), but they are not online and I can't get to their consoles, and one of them doesn't come up at all (TestVM, 100):
[screenshot attached]
The VMs are configured to use the HA group:
[screenshot attached]

In this test I powered off node 2, if that wasn't apparent.
 
As a test, I failed node 1 this time... and the VMs all came online in under 5 minutes. The only reason I tested this was that I saw the LRM on node 2 was idle, while it was active on nodes 1 and 3. The failure, I guess, forced the LRM on node 2 to go active, as it is active now.

After bringing node 1 back online, the LRM states are all active.

Could that have something to do with it?
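
For anyone following along, the idle/active LRM states I'm referring to are what these report (standard commands, nothing custom):

Code:
# quorum / membership view
pvecm status

# CRM master, per-node LRM state (idle/active) and HA resource placement
ha-manager status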
 
Will test that tomorrow, as we have a dead disk in one PVE node and can do a power cut on it without maintenance mode in our 5-node 8.3.3 cluster, since the new disk is still waiting on the desk :)
 
Manual "powercut" on node with failed disk, after around 2min the 9vm+1lxc auto-started on other node as defined for ha prefered host group.
After bring back the pseudo failed node the 10 machines auto-migrated back as ha was defined, so all works fine and as desired.
@sclementi : unhappily there's somethink wrong in your cluster / ha configuration.
 
We have the shutdown policy left at the default/conditional setting, so try that instead of your choice of migrate or failover.
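
For us that just means there is no ha: override in /etc/pve/datacenter.cfg, or equivalently:

Code:
ha: shutdown_policy=conditional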
 
Sorry for the delay. The setup is simple: three nodes, with a single IP address for management, storage, etc.

I am not in a position to post the actual files, but I can give you a rundown. All three nodes are exactly the same, triple-checked:

Hosts file:
line 1: 127.0.0.1 localhost.localdomain localhost
line 2: xx.xx.xx.101 <proper fqdn> <host name>

The remaining lines are the default IPv6 info.

Interfaces file (just the important parts, not all of the offline adapters; this was transcribed by hand, so any typos might just be typos):
auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto enp4s0
iface enp4s0 inet manual

auto enp129s0
iface enp129s0 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode balance-rr

auto bond1
iface bond1 inet manual
    bond-slaves enp4s0 enp129s0
    bond-miimon 100
    bond-mode balance-tlb

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 32

auto vmbr0.32
iface vmbr0.32 inet static
    address xx.xx.32.181/24
    gateway xx.xx.32.1

auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 13,132

auto vmbr1.13
iface vmbr1.13 inet manual

auto vmbr1.132
iface vmbr1.132 inet manual

source /etc/network/interfaces.d/*

And lastly, corosync.conf (again, hand transcribed, so any typos might just be transcription errors):
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: xxxx-tmppmox01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: xx.xx.32.181
  }
  node {
    name: xxxx-tmppmox02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: xx.xx.32.182
  }
  node {
    name: xxxx-tmppmox03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: xx.xx.32.183
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: xxxx-tmppmoxc1
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

______________________________________________________________________________________________

After looking at the cluster logs, it was apparent that there were some issues with the shared storage and with updates. I switched the repositories to the no-subscription ones, rebooted the nodes, and then updated all of them. Rebooted again after updating, and things seem "normal", but I have not tested anything yet. I wanted to post this before anything else.

I still see this on the Ceph screen and it is concerning, but I don't think it is an issue right now:

! mon.xxxx-tmppmox01 has 26% avail
mon.xxxx-tmppmox02 has 6% avail
mon.xxxx-tmppmox03 has 6% avail

What is this in reference to?
I know it is not storage space, since I have 14TB and only a handful of 50GB VMs.
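
For what it's worth, this is how I plan to dig into that warning (the mon store path below is the default, so it may differ if the monitors were created elsewhere):

Code:
# more detail on the warning
ceph health detail

# free space on the filesystem holding the monitor stores
df -h /var/lib/ceph/mon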
 
Don't use vmbr subinterfaces. That's not how it works.

instead of

Code:
auto vmbr0.32
iface vmbr0.32 inet static
    address xx.xx.32.181/24
    gateway xx.xx.32.1

use

Code:
auto bond0.32
iface bond0.32 inet manual

auto vmbr0
iface vmbr0 inet static
    address xx.xx.32.181/24
    gateway xx.xx.32.1
    bridge-ports bond0.32
    bridge-stp off
    bridge-fd 0

Next, your corosync interface(s) don't need to be bonded or bridged. You're better off using eno1 and eno2 directly for corosync ring0 and ring1.
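
Something along these lines in /etc/pve/corosync.conf (the addresses are placeholders for whatever subnets you put eno1 and eno2 on, and remember to bump config_version when you edit the file):

Code:
nodelist {
  node {
    name: xxxx-tmppmox01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.0.181   # eno1 subnet, placeholder
    ring1_addr: 10.10.1.181   # eno2 subnet, placeholder
  }
  # ...same pattern for nodes 2 and 3
}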

Also, there is no need to make so many bridges unless you intend to attach virtual machines to them; and if you are, it's simpler to make one bridge and allow VLAN use, like so:

Code:
auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
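
Guests then pick their VLAN per virtual NIC instead of per bridge; for example (VMID and tag are just examples):

Code:
qm set 100 --net0 virtio,bridge=vmbr1,tag=132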

I expect once you fix your networking your problems will go away.
 
Running through my testing now without making any of the changes you suggest, and things are working as expected. I believe the issues had more to do with a service that was hung up prior to testing and was fixed by the reboot or the updates.

This is just a POC/temp solution until I can reconfigure the production equipment. I'll have more flexibility in the networking there when the time comes and can simplify things. With regards to the bridges, though... I am not sure I understand. I have vmbr0, which uses bond0, backed by eno1 and eno2: two 1GbE interfaces tagged for multiple VLANs, though in reality I only need access to VLAN 32, which is not the native VLAN. vmbr1 is my main traffic bridge for VM networking, using bond1 and backed by two 10GbE interfaces. Ideally, I would want my storage traffic to use those as well, and I realize now that I never configured a separate storage network, so it is using the 1GbE interfaces. The 10GbE interfaces don't have access to the mgmt VLAN (32), so I can't use vmbr1 for the mgmt interface. I could probably change this and just use the 10GbE NICs only, but again, this is a POC/temp/swing environment to offload prod and rebuild prod with Proxmox.
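
If/when I move storage onto the 10GbE side, my understanding is that it mostly comes down to the network definitions in /etc/pve/ceph.conf, something like the excerpt below (subnet made up; and since the monitors bind to the public network, changing it on a running cluster is not trivial, so I'd only do it during the prod rebuild):

Code:
# /etc/pve/ceph.conf (excerpt)
[global]
    public_network = 10.10.2.0/24
    cluster_network = 10.10.2.0/24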

Nonetheless, thank you for the assistance thus far.