Are nodes inside an HA group the only ones that get fenced?

DC-CA1 · Feb 8, 2023

Hi, let's say we have a 16-node cluster

and we have an HA group for a few VMs on 3 nodes.

One of those nodes fails and, similar to what we faced recently, another one crashes for no apparent reason.

Will the HA fencing mechanism only try to fence the nodes that are part of this specific group, or might HA try to fence all the other nodes of the 16-node cluster as well?

We like HA, but we faced a situation we can't explain yet, described here:
https://forum.proxmox.com/threads/3...n-1000-email-in-5-minutes.122354/#post-531832

and we don't want to use HA anywhere until we figure out this question.

DC-CA1 · Feb 10, 2023

Anyone?

t.lamprecht · Feb 10, 2023

Hi,

DC-CA1 said:
Will the HA fencing mechanism only try to fence the nodes that are part of this specific group, or might HA try to fence all the other nodes of the 16-node cluster as well?

Nodes fence themselves by activating a watchdog, which resets the node unless the watchdog timer itself gets reset frequently.
The HA stack stops resetting that timer either actively, if cluster quorum has been lost for 60s+, or indirectly, if the node hangs up/freezes.

The self-fencing watchdog is active on a node in the following cases:

The node's Cluster Resource Manager (CRM) service is the active master of the cluster
The node's Local Resource Manager (LRM) hosts active services now, or did so in the past ~10 minutes; active means any HA service that is not in the (request) state of ignored, error, or stopped

Note that in the case of a complete freeze, the watchdog might trigger anyway, but then the node and the guests it hosts normally wouldn't have been responsive in the first place.

So, if you configure HA groups with a subset of nodes, and all services are allocated to such groups, then the LRMs of the other nodes won't activate the watchdog.
The CRM master lock acquisition doesn't consider any service node membership; it's first come, first served (we could actually make this a bit more intelligent, so that a node without HA services tries to release the manager lock to another standby master).
So, currently it might happen that one extra node, which doesn't host any HA services due to group membership constraints, still has the watchdog active for fencing, but not more than that.
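
To make the rules above concrete, here is a minimal sketch that models them as a small Bash function. It is not Proxmox code: the function name `watchdog_expected_active`, its arguments, and the hard 10-minute cutoff (the post only says "~10 minutes") are assumptions made purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical model of the rules described in the post; NOT part of Proxmox VE.
# Arguments (all names are illustrative assumptions):
#   $1  is_crm_master    "yes" if this node currently holds the CRM master role
#   $2  active_services  number of active HA services on this node's LRM
#   $3  idle_minutes     minutes since the LRM last hosted an active service
#                        (pass a large value such as 9999 if it never did)
watchdog_expected_active() {
    local is_crm_master=$1 active_services=$2 idle_minutes=$3

    # Rule 1: the active CRM master always has the watchdog armed.
    if [ "$is_crm_master" = "yes" ]; then
        echo yes
        return
    fi

    # Rule 2: the LRM hosts active services now, or did so recently.
    # The post says "~10 minutes"; the exact cutoff here is an assumption.
    if [ "$active_services" -gt 0 ] || [ "$idle_minutes" -lt 10 ]; then
        echo yes
        return
    fi

    echo no
}

# 16-node example from the question:
watchdog_expected_active no 2 0       # node in the HA group hosting services
watchdog_expected_active yes 0 9999   # node outside the group that won the master lock
watchdog_expected_active no 0 9999    # any other node outside the group
```

Under these assumptions, only the three group nodes plus at most one extra node (the CRM master, if it happens to sit outside the group) would be candidates for self-fencing.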

Check the logs and you should be able to see which nodes had active watchdogs. E.g., the following journalctl filter line should give you the logs from the most relevant services; the first three are directly HA related and the last two are the main ones from the cluster (quorum) stack:

Bash:

journalctl -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux -u corosync -u pve-cluster

You might want to throw in a --since=-1week or -1month or so to limit the time frame and make querying the logs faster. If you need help analyzing what happened, I'd recommend our enterprise support offerings; if you have eligible support for that cluster (level Basic or higher), you might want to head directly to the enterprise support portal.
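
As a rough sketch, the filter line and the time limit mentioned above could be combined as follows. The variable names `units`, `since`, and `args` are just illustrative; the snippet only prints the resulting command so it can be reviewed before being run on a cluster node.

```shell
#!/usr/bin/env bash
# Units named in the post: three HA-related ones, then the cluster (quorum) stack.
units=(pve-ha-lrm pve-ha-crm watchdog-mux corosync pve-cluster)

# Time frame to query; -1week and -1month are the values suggested in the post.
since="-1week"

# Build the journalctl argument list: one --since plus one -u per unit.
args=(--since="$since")
for u in "${units[@]}"; do
    args+=(-u "$u")
done

# Print the command for review. To actually query the journal on a node,
# run:  journalctl "${args[@]}"
echo journalctl "${args[@]}"
```

Running the printed command on each node in turn and comparing the results should show which nodes had HA services or the CRM master role around the time of the incident.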

DC-CA1 · Feb 10, 2023

Hi, thanks for the excellent explanation.

Is there a way to define a preferred master?

That way I could select nodes that are not part of an HA group as the preferred master. Or would that not reduce the risk of a potential failure?
