Are nodes inside an HA group the only ones that get fenced?

Jan 16, 2022
Hi, let's say we have a 16-node cluster

and an HA group for a few VMs on 3 of those nodes.

One of those nodes fails and, similar to what we faced recently, another one crashes for no apparent reason.

Will the HA fencing mechanism only try to fence the nodes that are part of this specific group, or might HA also try to fence the remaining nodes of the 16-node cluster?

We like HA, but we faced a situation we can't explain yet, described here:
https://forum.proxmox.com/threads/3...n-1000-email-in-5-minutes.122354/#post-531832

and we don't want to use HA anywhere until we figure out this question.
 
Hi,
Will the HA fencing mechanism only try to fence the nodes that are part of this specific group, or might HA also try to fence the remaining nodes of the 16-node cluster?
Nodes fence themselves by activating a watchdog, which resets the node unless the watchdog itself gets reset (updated) frequently.
The HA stack stops those watchdog updates either actively, if cluster quorum is down for 60s+, or indirectly, if the node hangs up/freezes.

The self-fencing watchdog is active on a node in the following cases:
  • The node's Cluster Resource Manager (CRM) service is the active master of the cluster
  • The node's Local Resource Manager (LRM) hosts active services now, or did so in the past ~10 minutes; "active" means any HA service that is not in the (request) state ignored, error, or stopped
Note that in the case of a complete freeze the watchdog might trigger anyway, but then the node and its hosted guests normally wouldn't have been responsive in the first place.

So, if you configure HA groups with a subset of nodes, and all services are allocated to such groups, then the LRMs of the other nodes won't activate the watchdog.
The CRM master lock acquisition doesn't consider any service node membership; it's first come, first served (we could actually make this a bit more intelligent, so that a node without HA services tries to release the manager lock to another standby master).
So, currently it might happen that one extra node, which doesn't host any HA services due to group membership constraints, still has the watchdog active for fencing, but not more than that.
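
To make the rule above concrete for the 16-node example, here is a minimal sketch. It is a toy model only, not a Proxmox command or API: the function name watchdog_armed, its arguments, and the 600-second value standing in for "~10 minutes" are all assumptions made for illustration.

```shell
#!/bin/sh
# Toy model of the rule described above (NOT a Proxmox tool).
# GRACE_SECS is an assumed stand-in for the "~10 minutes" mentioned in the post.
GRACE_SECS=600

# watchdog_armed IS_MASTER ACTIVE_SERVICES IDLE_SECS   (hypothetical helper)
#   IS_MASTER        "yes" if the node holds the CRM master lock
#   ACTIVE_SERVICES  number of active HA services hosted by the node's LRM
#   IDLE_SECS        seconds since the LRM last hosted one, or "never"
watchdog_armed() {
    is_master="$1"
    active="$2"
    idle_secs="$3"
    if [ "$is_master" = "yes" ]; then
        echo armed
    elif [ "$active" -gt 0 ]; then
        echo armed
    elif [ "$idle_secs" != "never" ] && [ "$idle_secs" -lt "$GRACE_SECS" ]; then
        echo armed
    else
        echo idle
    fi
}

# 16 nodes: nodes 1-3 form the HA group and host services,
# node 7 (outside the group) happens to hold the CRM master lock.
armed=0
n=1
while [ "$n" -le 16 ]; do
    master=no
    active=0
    if [ "$n" -le 3 ]; then active=2; fi
    if [ "$n" -eq 7 ]; then master=yes; fi
    if [ "$(watchdog_armed "$master" "$active" never)" = "armed" ]; then
        armed=$((armed + 1))
    fi
    n=$((n + 1))
done
echo "nodes with active watchdog: $armed"   # prints 4
```

In this model only the three group nodes plus the one extra CRM master can self-fence; the other twelve nodes never arm the watchdog.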

Check the logs and you should be able to see which nodes had active watchdogs. E.g., the following journalctl filter line should give you the logs from the most relevant services; the first three are directly HA related and the last two are the main ones from the cluster (quorum) stack:

Bash:
journalctl -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux -u corosync -u pve-cluster

You might want to throw in a --since=-1week or --since=-1month or so to limit the time frame and make querying the logs faster. If you need help analyzing what happened, I'd recommend our enterprise support offerings; if you have eligible support for that cluster (level Basic or higher), you might want to head directly to the enterprise support portal.
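
As a small sketch of combining the unit filter with a time window: the helper below only builds and prints the journalctl command line, so it can be inspected before running it on a node. The helper name ha_journal_cmd and the UNITS variable are made up for this example; the unit names and the --since values are the ones from the post.

```shell
#!/bin/sh
# Units from the post above: three HA-related, two from the cluster (quorum) stack.
UNITS="pve-ha-lrm pve-ha-crm watchdog-mux corosync pve-cluster"

# ha_journal_cmd [SINCE]   (hypothetical helper)
# Prints a journalctl command limited to the given time window (default: -1week).
ha_journal_cmd() {
    since="${1:--1week}"
    cmd="journalctl --since=$since"
    for u in $UNITS; do
        cmd="$cmd -u $u"
    done
    printf '%s\n' "$cmd"
}

ha_journal_cmd            # last week
ha_journal_cmd -1month    # last month
```

Running the printed command on each node and, if wanted, piping it through grep -i watchdog narrows the output to watchdog-related lines; the exact log messages depend on the installed version, so treat that filter as a starting point.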
 
Hi, thanks for the excellent explanation.

Is there a way to define a preferred master?

That way I could select nodes that are not part of an HA group as preferred master, or would that not help with any potential failure?
 
