fenced cluster node

Binary Bandit

Hi All,

Today our three-node cluster restarted cold, and I received an email from ait3, one of the nodes.
Code:
The node 'ait3' failed and needs manual intervention.

The PVE HA manager tries  to fence it and recover the
configured HA resources to a healthy node if possible.

Current fence status:  FENCE
Try to fence node 'ait3'

All of the VMs on ait3 were in the fenced state.

HA in the GUI under "Datacenter" looked like the one in this thread: https://forum.proxmox.com/threads/master-old-timestamp-dead.26489/, with ait3 being the node marked "old timestamp dead?".

To resolve this I made sure that the cluster was up and that quorum was present, and then I removed all of the VMs from HA. Once this was done, ait3 seemed to "un-fence" itself. I could then add all of the VMs back to the HA list, as well as start/stop them ... something that I couldn't do before.
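For reference, these are roughly the commands involved (vm:100 is just a placeholder ID; I repeated the remove/add for each VM):

Code:
# confirm the cluster is up and quorate
pvecm status

# drop a VM from HA management ...
ha-manager remove vm:100

# ... and, once the node had un-fenced itself, add it back
ha-manager add vm:100

# check the HA state afterwards
ha-manager status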

I did this after a lot of searching on the forums.

My question is: why did this happen? Also, is there a better way to resolve the issue? I suspect that I could have removed only the VMs marked "fence" from HA?

best,

James
 
Today our three-node cluster restarted cold.

Just to make things clear: was this cold start something you triggered, or the result of a fencing operation due to some sort of outage (network or the like)?

My question is: why did this happen? Also, is there a better way to resolve the issue? I suspect that I could have removed only the VMs marked "fence" from HA?

Hmm, the VMs get their "fence" state because they are currently on a node marked as "fence", i.e., a failed node. But yes, if a node comes up again it does not try to get its cluster HA node lock until it has no fenced services anymore, as the current master should be in the process of recovering the failed node's services.
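If you want to see that state directly, something like this should show it (the command and the pmxcfs path below are from a current PVE setup, so double check on your version):

Code:
# per-node and per-service HA state as the current manager sees it
ha-manager status

# the raw state file the manager writes out (readable on any cluster member)
cat /etc/pve/ha/manager_status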

What was the rest of your HA state during this? Was there an active manager at this point? I mean, you talk about a total cluster cold start, but to mark nodes, and thus services, as "to be fenced", a manager _must_ have been there, else nobody would have written that state out and sent you the fencing mail. This email should include the manager state at the time of fencing. Could you please take a look at the syslog of the manager node during the time window of interest and check what HA is doing (or not doing)?
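Something along these lines on the node that was the active manager, with the time window adjusted to when the fencing happened:

Code:
# CRM (manager) and LRM logs around the incident
journalctl -u pve-ha-crm -u pve-ha-lrm --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM"

# or grep the classic syslog instead
grep -E 'pve-ha-(crm|lrm)' /var/log/syslog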
 
Just to make things clear: was this cold start something you triggered, or the result of a fencing operation due to some sort of outage (network or the like)?
My actions started the events that led to the cold start. Here are the details, to be sure that my calling it a cold start is correct.

Cluster node 3 was running on an older kernel and needed to be rebooted to load the newer kernel as part of troubleshooting. Troubleshooting details here. I moved several live VMs off of node 3 and onto node 2, then manually triggered a soft reboot of node 3. All good so far.
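The migrations were along these lines (the VM ID and target node name are placeholders):

Code:
# live-migrate a running VM away from node 3
qm migrate 100 ait2 --online

# once node 3 was empty, trigger the soft reboot on it
reboot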

Shortly after this another node hard rebooted on its own, likely due to the IPMI watchdog timer running out. I assume that the last remaining node then rebooted as well, since no corosync communication remained to prevent the countdown of its watchdog timer.
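For reference, the watchdog setup can be checked with something like this (assuming the stock pve-ha-manager packaging, where softdog is used if no hardware module is configured):

Code:
# which hardware watchdog module, if any, PVE HA is configured to use
grep WATCHDOG_MODULE /etc/default/pve-ha-manager

# which watchdog kernel modules are actually loaded
lsmod | grep -E 'softdog|ipmi_watchdog'

# watchdog-mux messages since boot
journalctl -b -u watchdog-mux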

When everything came up, node 3 showed "master: old timestamp dead" and the VMs that I had moved off of it prior to the soft reboot were all marked as "fence".

Hmm, the VMs get their "fence" state because they are currently on a node marked as "fence", i.e., a failed node. But yes, if a node comes up again it does not try to get its cluster HA node lock until it has no fenced services anymore, as the current master should be in the process of recovering the failed node's services.
OK, so that explains it ... node 3 very likely came up first and had no master to recover its services.

What was the rest of your HA state during this? Was there an active manager at this point? I mean, you talk about a total cluster cold start, but to mark nodes, and thus services, as "to be fenced", a manager _must_ have been there, else nobody would have written that state out and sent you the fencing mail. This email should include the manager state at the time of fencing. Could you please take a look at the syslog of the manager node during the time window of interest and check what HA is doing (or not doing)?
Given your explanation of how this all works, I think that the VMs didn't fully migrate before I rebooted the node. I'm not entirely sure. While the entire cluster was certainly down at one point, it didn't all happen at once. More like: node 3 reboot ... wait for it ... node 2 reboot ... wait for it ... node 1 reboot ... everything down ... then they came back up in the same order, maybe 10ish seconds apart, as each node worked through its BIOS initialization.

Looking into syslog on the master node is a great idea. My questions are answered, though. That is, you explained how the email was sent, why my third node wasn't getting its node lock, and that clearing its fenced services would cause it to go after its lock. I'll continue troubleshooting in the original thread; I don't think it makes sense to troubleshoot in this one.

Thank you for helping me understand Proxmox further.
 
