Fencing

Aug 17, 2020
Hello,

I have installed a 2-node cluster with an external corosync voter. If I have HA enabled, do I need to configure anything related to fencing? I'm currently using the softdog watchdog, as it's the default. Is that sufficient, or should I try to switch to a hardware-based watchdog?

Thank you,
Manuel
 
The softdog is usually okay, but it depends on how much reliability you need. A hardware watchdog will always be more reliable since it is completely independent of the software, but that only makes a difference in rare cases where the software (including the kernel) is buggy or broken beyond repair.

Read our documentation on fencing for more info.
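
To see which watchdog driver a node is currently using, something like the following could help. It is only a minimal sketch, assuming the kernel exposes watchdog attributes under /sys/class/watchdog (this depends on the kernel configuration). The function name and the `base` parameter are made up for illustration and are not part of any Proxmox tool.

```python
from pathlib import Path


def watchdog_identities(base="/sys/class/watchdog"):
    """Return {device name: identity string} for each watchdog found under base."""
    result = {}
    root = Path(base)
    if not root.is_dir():
        return result  # no watchdog sysfs class visible on this system
    for dev in sorted(root.iterdir()):
        ident = dev / "identity"
        if ident.is_file():
            # The identity attribute holds the name the driver reports for itself.
            result[dev.name] = ident.read_text().strip()
    return result


if __name__ == "__main__":
    for name, identity in watchdog_identities().items():
        print(f"{name}: {identity}")
```

A software watchdog and a hardware driver should report different identity strings, so the output gives a quick hint about which one is in use.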
 
What's the best way to simulate a "hardware failure" in terms of fencing? Killing pve-ha-lrm on one node?

Does the fencing mechanism also take some "system state" into account, or is it a plain, dumb mechanism? In other words, is it possible for a node to be fenced while the node and its core services are still (more or less) responsive and working, because some system parameters are abnormal or some processes do not react properly to certain signals or commands?

For example, with the HA Simulator, the node is fenced when I turn off the network. Is that also true for the latest version of Proxmox using self-fencing?
 
For example, with the HA Simulator, the node is fenced when I turn off the network. Is that also true for the latest version of Proxmox using self-fencing?
Yes. Corosync is the mechanism used to decide whether fencing is necessary. Once a node that has (or has had since the last reboot) HA-enabled guests loses its connection to the quorate part of the cluster, it will fence itself if it cannot re-establish that connection within 2 minutes. The other nodes will start the HA guests that used to run on that node 3 minutes after the node was last seen.
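
To make the timing concrete, here is a small sketch of that timeline. It is purely illustrative: the 2 and 3 minute values are taken from this reply and are not read from any Proxmox configuration, the function name is hypothetical, and it simplifies by treating "quorum lost" and "node last seen" as the same instant.

```python
from datetime import datetime, timedelta

# Durations as quoted in the reply; not read from any Proxmox config.
SELF_FENCE_AFTER = timedelta(minutes=2)
RECOVERY_AFTER = timedelta(minutes=3)


def fencing_timeline(quorum_lost_at, had_ha_guests=True):
    """Return (self_fence_time, recovery_time), or (None, None) if no fencing applies."""
    if not had_ha_guests:
        # Per the reply, only nodes that have (or had since the last reboot)
        # HA-enabled guests fence themselves.
        return None, None
    return quorum_lost_at + SELF_FENCE_AFTER, quorum_lost_at + RECOVERY_AFTER


if __name__ == "__main__":
    lost = datetime(2020, 8, 17, 12, 0, 0)
    fence_at, recover_at = fencing_timeline(lost)
    print("node fences itself at:", fence_at)
    print("guests restarted elsewhere at:", recover_at)
```

The gap between the two times is what ensures the failed node is already down before its guests are started somewhere else.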
 
Are there any other situations than lost quorum / lost network connectivity when a node is considered "defunct" and fenced?
No, Corosync is the only criterion used to decide whether a node is still part of the cluster.

It is therefore important to have consistently low latency on the physical network that Corosync runs on. Heavy load from other services can increase Corosync's latency to the point where it loses the connection. Likely culprits are backup or storage services running on the same physical network as Corosync.
 
