Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

fabian · Oct 7, 2022

dlasher said:
Not if PMX thinks we need to reboot. So far, none of the failures have taken down CEPH, it's pmx/HA that gets offended. (Ironic, because corosync/totem has (4) rings, and CEPH sits on a single vlan, but I digress)

The concept of "reboot makes everything better" is fine if the issue is a software failure, but it is the absolute wrong action if there's a hardware failure. You'll reboot, and be in the exact same state. Is there a "fence in place" option that shuts down all the VMs and throws errors (syslog/email/etc) but doesn't reboot the underlying OS? That would be FAR preferred for troubleshooting whatever issue occurred, while preserving the integrity of the rest of the cluster. And if the failure is temporary, that would be a much better recovery strategy.

I'd like to be able to choose the fence behavior, something other than "reboot". Actually, to be more clear, I *never* want it to reboot upon corosync failure. How do I make that happen?

Fencing is not configurable at the moment. It is done via a watchdog that is armed as soon as HA is active; if the watchdog expires, the node is fenced. If the "source" of the fencing was a hardware event (like a broken network cable), the node will not be able to rejoin the cluster after rebooting, so you are not in the same state at all: no guests/HA resources are running (and there is no option to start them either), and PVE is read-only.
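To make the watchdog behaviour concrete, here is a toy simulation in Python. It is not Proxmox code: the class and function names are made up for illustration, the timeout value is arbitrary, and it assumes the watchdog is only refreshed while the node is quorate. It only shows the general pattern of "armed once, must be refreshed, expiry means fencing".

```python
class SimulatedWatchdog:
    """Toy model of a fencing watchdog (hypothetical, not the PVE implementation)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.armed = False
        self.last_refresh = None

    def arm(self, now: float) -> None:
        # Arming starts the countdown (in the thread: "as soon as HA is active").
        self.armed = True
        self.last_refresh = now

    def refresh(self, now: float) -> None:
        # Refreshing an armed watchdog restarts the countdown.
        if self.armed:
            self.last_refresh = now

    def expired(self, now: float) -> bool:
        # An unarmed watchdog never expires, so a non-HA node is never fenced.
        return self.armed and (now - self.last_refresh) >= self.timeout_s


def simulate(quorate_by_second, timeout_s=60.0):
    """Return the second at which the node would be fenced, or None.

    quorate_by_second: iterable of booleans, one per simulated second.
    Assumption: the watchdog is refreshed only in seconds where the node is quorate.
    """
    wd = SimulatedWatchdog(timeout_s)
    wd.arm(now=0.0)
    for second, quorate in enumerate(quorate_by_second, start=1):
        if quorate:
            wd.refresh(now=float(second))
        if wd.expired(now=float(second)):
            return second
    return None
```

For example, `simulate([True, True] + [False] * 10, timeout_s=5)` models a node that loses quorum after two seconds and never regains it; a short outage that ends before the timeout does not trigger fencing in this model.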

Fencing has a single purpose: to ensure that the node and any leftover guests running there are killed for sure, so that we can recover the guests on another node. It comes with the downside that an unstable cluster (for whatever reason: misconfiguration, hardware issues, software bugs) can take down the entire cluster if every node loses quorum. So if you don't need the automatic recovery/failover part, you are probably better off with non-HA clustering and manual recovery in case of an outage (you can basically do the same steps our HA stack takes, but you need to ensure you don't violate any invariants while doing so).
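The "every node loses quorum" case can be sketched with a simple majority check. This is an illustration only, assuming one vote per node and a plain majority rule (more than half of all votes); the function names are hypothetical and do not correspond to any corosync or PVE API.

```python
def has_quorum(visible_votes: int, total_votes: int) -> bool:
    """Simple majority: strictly more than half of all votes must be visible."""
    return visible_votes >= total_votes // 2 + 1


def nodes_without_quorum(partitions):
    """Return the sorted names of nodes that sit in a non-quorate partition.

    partitions: list of lists of node names, where each inner list is a group
    of nodes that can still see each other. Assumption: one vote per node.
    """
    total = sum(len(part) for part in partitions)
    lost = []
    for part in partitions:
        if not has_quorum(len(part), total):
            lost.extend(part)
    return sorted(lost)
```

With four nodes, `nodes_without_quorum([["a", "b", "c"], ["d"]])` reports only `d`, while a network problem that isolates every node from every other one, `[["a"], ["b"], ["c"], ["d"]]`, leaves all four nodes without quorum. With HA active, that second case is the one where the whole cluster can go down.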


fabian

Proxmox Staff Member