HA does not trigger when node loses LAN but remains in quorum

zenoprax

Dec 24, 2024
Hello all!

The setup: I’m running a small three-node Proxmox cluster with Ceph. Each node has a single NVMe OSD; Ceph itself is stable and behaving correctly.
Cluster and corosync traffic runs over a dedicated 2.5 GbE network (directly connected between nodes as a mesh - no switch). VM/LXC traffic uses vmbr0, backed by a bond consisting of the onboard 1 GbE NIC plus a USB Ethernet adapter on each node. Unfortunately, due to a kernel bug I cannot yet rely on the motherboard NIC, so the USB NIC is currently the primary link and 'eth0' the secondary in an active-backup bond (bond0).


The issue is failure detection and HA behavior when the LAN-facing interface fails.
If 'bond0' (the LAN side backing vmbr0) goes down, the node remains fully reachable over the cluster network, corosync stays healthy, and the cluster remains quorate. From Proxmox’s perspective this is correct, but the practical result is that VMs/LXCs lose external connectivity while HA never triggers a failover. I cannot manually initiate migrations remotely because management access depends on the same failed LAN path. The only remaining option is to pull the power cord and wait for the HA churn as everything finally gets moved around.

Adding USB NICs has reduced the likelihood of failure but doesn’t address the underlying design problem: loss of LAN connectivity is not treated as a node failure condition. This is something I didn’t account for in the original plan and I have since learned how it works through trial-and-error! The approaches I’ve considered so far are:
  1. A local health check that validates LAN reachability and, on failure, forces the node into maintenance mode or reboots it to trigger HA (a rough sketch of what I mean is below this list).
  2. Routing LAN traffic over the existing cluster/mesh network as a fallback (e.g., node-to-node forwarding into another node’s vmbr0).
  3. Collapsing cluster/corosync and LAN traffic into a single, highly redundant bond (2×2.5 GbE + onboard 1 GbE + USB), accepting shared fate.
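Roughly what I have in mind for option 1 - just a sketch, where the gateway address, intervals and thresholds are placeholders, and I'm assuming 'ha-manager crm-command node-maintenance enable' is available (it only appeared in newer PVE releases as far as I can tell):

Code:
#!/usr/bin/env python3
# Sketch of a local LAN health check (option 1). Nothing here is an official
# Proxmox mechanism; GATEWAY and the thresholds are placeholders to tune.
import socket
import subprocess
import time

GATEWAY = "192.168.1.1"   # assumption: the LAN gateway to probe
CHECK_INTERVAL = 10       # seconds between probes
MAX_FAILURES = 6          # ~1 minute of sustained LAN loss before reacting

def lan_ok() -> bool:
    """Single ICMP probe towards the gateway (i.e. out via bond0/vmbr0)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", GATEWAY],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def react() -> None:
    node = socket.gethostname()
    # Prefer HA maintenance mode so guests are migrated away cleanly
    # (assumes a PVE version that supports this crm-command).
    rc = subprocess.run(
        ["ha-manager", "crm-command", "node-maintenance", "enable", node]
    ).returncode
    if rc != 0:
        # Blunt fallback: reboot so the HA stack recovers guests elsewhere.
        subprocess.run(["systemctl", "reboot"])

def main() -> None:
    failures = 0
    while True:
        failures = 0 if lan_ok() else failures + 1
        if failures >= MAX_FAILURES:
            react()
            failures = 0
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()

The idea being that maintenance mode drains guests over the still-working cluster network, with the reboot only as a last resort - but I'd much rather there were a supported way to express this.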
Before I go too far down any of these paths, I’d like to sanity-check the overall approach. Is there a recommended way in Proxmox to handle “node is alive but LAN is unreachable” so that HA can respond appropriately? I'm not opposed to a more complicated "overlay" solution but I don't know enough to trust or verify anything that the LLM chatbots are telling me. For example:
> "Run a routing daemon (FRR) on each node [...] and advertise a default route “upstream available” only from nodes whose bond is up and has gateway reachability."


Any guidance or pointers would be appreciated.


Thanks,
Corey
 
> From Proxmox’s perspective this is correct, but the practical result is that VMs/LXCs lose external connectivity while HA never triggers a failover.
Yes, you're right. It works as intended.

The same is true for vanished storage: losing all virtual disks is, surprisingly, not considered a problem either.

If you want this problem area to be more visible to the developers you could open a Feature Request using the official bug tracker: https://bugzilla.proxmox.com/enter_bug.cgi?product=pve

Please cross-link from there to here and from here to there.
 
Did you ever open a feature request for this? I've been testing failure scenarios on a Proxmox cluster backed by FC storage and hit a similar issue: if all paths to the LUN backing an LVM volume go offline on a single host, the VMs running on that host are not failed over to other nodes.
 
No, unfortunately I don't have time to work on anything associated with this problem right now. In addition to your scenario, there is the rather simple case of not having a connection to the management interface. It would be nice to be able to configure much more complex failure conditions for HA.
 
> it is very hard to determine a good set of robust rules enabling this to be automatic ...

I feel like they are missing the point of your proposal, which was essentially "add the ability to define some additional failure conditions". Deploying an ad-hoc system to determine whether a node is up or not seems like a great way to add unpredictable behaviour to a system that is already under PVE's control.

https://pve.proxmox.com/wiki/High_Availability#_how_it_works
> Virtualization environments like Proxmox VE make it much easier to reach high availability because they remove the “hardware” dependency. They also support the setup and use of redundant storage and network devices, so if one host fails, you can simply start those services on another host within your cluster.
>
> Better still, Proxmox VE provides a software stack called ha-manager, which can do that automatically for you. It is able to automatically detect errors and do automatic failover.
>
> Proxmox VE ha-manager works like an “automated” administrator. First, you configure what resources (VMs, containers, …) it should manage. Then, ha-manager observes the correct functionality, and handles service failover to another node in case of errors. ha-manager can also handle normal user requests which may start, stop, relocate and migrate a service.

Reading this on its own gives a rather different impression of its capabilities than my current assessment: "HA Manager determines a node's health by proxy: if a given node is quorate then it is deemed to be available."

Thanks for submitting the idea anyway! I'll probably end up creating a service that reboots the node if it can't ping the router (but only arms itself once there has been network connectivity N minutes after boot, the NIC has been restarted, and so on).
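Something along these lines is what I'm picturing - again just a sketch, with the router address, delays and thresholds as placeholders:

Code:
#!/usr/bin/env python3
# Sketch of the "reboot if the router is unreachable" watchdog.
# Guard rails: do nothing until the node has been up for ARM_AFTER seconds
# AND the router has been seen at least once since boot, so a cold start
# with an unplugged cable doesn't turn into a reboot loop.
import subprocess
import time

ROUTER = "192.168.1.1"   # assumption: the LAN gateway/router to probe
ARM_AFTER = 600          # only arm 10 minutes after boot
CHECK_INTERVAL = 15      # seconds between probes
MAX_FAILURES = 8         # ~2 minutes of sustained failures before rebooting

def uptime_seconds() -> float:
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

def router_ok() -> bool:
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", ROUTER],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def main() -> None:
    armed = False
    failures = 0
    while True:
        if not armed:
            # Arm only after the grace period AND once the LAN has actually
            # worked at least once since boot.
            armed = uptime_seconds() > ARM_AFTER and router_ok()
        else:
            failures = 0 if router_ok() else failures + 1
            if failures >= MAX_FAILURES:
                subprocess.run(["systemctl", "reboot"])
                failures = 0
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()

Wrapped in a simple systemd service it should cover the cases I listed, though it's obviously a workaround rather than a fix.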