Proxmox cluster reboots on network loss?

BunkerHosted

Our team has experienced this twice now: when switches are rebooted during a network upgrade, all nodes in the cluster reboot. I have been unable to find any information about this.

Note that the watchdog is turned off.
 
This sounds like you have (or had) HA-enabled guests on the nodes at some point. If a node with HA-enabled guests (currently, or at any point since its last reboot) loses the connection to the quorate part of the cluster for more than 2 minutes, it will fence (hard reset) itself.

You can check this in the Datacenter -> HA panel. Each node for which the `lrm` service is shown as `active` will fence itself if it loses the connection to the cluster.
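If you prefer the CLI, the same information should be visible there as well (a quick sketch; the exact output layout may differ between versions):
Code:
ha-manager status
Nodes whose `lrm` line shows `active` (rather than `idle`) are the ones that will fence themselves on loss of quorum.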
 
Sorry to necro this thread, but it's one of *many* that come up with this title, and it goes directly to the core issue.

Proxmox needs a configurable option for fencing behavior. Rebooting an entire cluster upon the loss of a networking element is the sledgehammer; we need the scalpel.
 
If you plan to make changes to the network that could cause problems for Corosync, you can temporarily disable the HA stack (which handles the fencing). First stop the LRM on all nodes:
Code:
systemctl stop pve-ha-lrm.service
then the CRM:
Code:
systemctl stop pve-ha-crm.service

This way, the HA stack is completely disabled.

Once you are done, start them in the same order, first the LRM on all nodes, then the CRM.
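For reference, re-enabling the stack afterwards uses the same units in the same order, for example:
Code:
# run on every node first
systemctl start pve-ha-lrm.service
# then, once the LRM is running everywhere
systemctl start pve-ha-crm.service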

Alternatively, you can set all HA resources to ignored. After about 10 minutes, all nodes should show "idle" in the HA overview. This way they also won't fence if they lose the connection to the cluster.
To automate that, you could run a one-liner like this (quick and dirty, could probably be improved):
Code:
for service in $(ha-manager status | grep service | awk '{print $2}'); do ha-manager set $service --state ignored; done
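When the maintenance is done, a similar loop can flip the resources back (assuming they should all return to the `started` state and that ignored resources still show up in `ha-manager status`; adjust as needed):
Code:
for service in $(ha-manager status | grep service | awk '{print $2}'); do ha-manager set "$service" --state started; done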
 
@aaron: Thank you for providing a specific workaround. However, this feels like one of those "old" ideas in desperate need of an update.

I challenge you: when is rebooting the entire cluster on the loss of a network element the preferred behavior? You're taking a communication issue and setting off a nuke. This might have been an acceptable way to handle watchdog timeouts in the past, but it's no longer the desired behavior, especially in what's probably your largest install base of customers.

HA on Proxmox must mature, or people will just avoid using it and build diversity instead of redundancy. (potato/potato, says the architect)
 
fencing is a necessary prerequisite for HA recovery - the recovering node (or rather, the quorate partition of the cluster) needs to be sure that the failed node (non-quorate partition) is gone, else you are in a split-brain situation and will corrupt your data/guests. it's not possible for a node in the non-quorate partition to tell whether there is a quorate partition (fencing required) or not (no fencing required), since if it could communicate with all other nodes, it would not be non-quorate.

TL;DR: no fencing, no HA. it's as simple as that.

if you want clustering without fencing, you can do that (just don't enable HA features). if you want to automate some sort of failover without our HA stack, you can do that. but you get to keep the pieces as well when (not if) something breaks as a result.
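As a side note (a sketch, not part of the post above): you can check whether a node currently considers itself part of a quorate partition with:
Code:
pvecm status
The "Quorate" line in the votequorum section of the output tells you whether the node sees quorum; the exact layout may vary by version.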
 
Greetings!

We came across the same 'issue'.

I totally understand why the node reboots - BUT: why doesn't it shut down the VMs gracefully first?
We now have an inconsistent database inside one of the guest VMs that needs to be restored - just because a switch was rebooted. :confused:
 
because it needs to go down as fast as possible, to keep the period that other nodes have to wait before they can take over as short as possible. waiting for (all) guests to shut down cleanly is not an option, both because there is no (failsafe) way to signal "I am done shutting down, you can take over", and because we actually want to be as fast as possible. your HA VMs should always be configured in a way that keeps them crash-consistent.
 
Greetings.

Thank you for your response.

So that I understand the behavior correctly:
* Does this only happen if one configures HA (WebUI: Datacenter - HA) resources that include the affected PVE host?
* Or is this the default behavior for any node inside a 'Datacenter', even if it has no HA resources configured?

Failover on application-level is out-of-scope for our use-case.
 
Yes, this only happens on hosts that have HA resources running.
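For reference (a sketch, not part of the original reply): `ha-manager config` lists which guests are configured as HA resources, which is a quick way to confirm whether a given VM is HA-managed at all:
Code:
ha-manager config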
Ok, thank you.

Then we'll just make sure no VMs that cannot be 'crash-consistent' are placed on HA-enabled nodes.

One more question:
Wouldn't it make sense for the fenced node to at least perform a graceful shutdown of the VMs that are not configured as 'HA resources'? (They only exist on this node and are not relevant to HA operations.)
Of course after hard-killing the HA-enabled VMs first, so the shutdown delay would not have a (major) negative impact.
That is, if this information is even available/cached on the PVE node...

Context: In our case the VM serving the database was not configured as HA-resource.
 
