Proxmox cluster reboots on network loss?

BunkerHosted

Our team has experienced this twice now: when switches are rebooted during a network upgrade, all nodes in the cluster reboot. I have been unable to find any information about this.

Note that the watchdog is turned off.
 
This sounds like you have (or had) HA-enabled guests on the nodes at some point. If a node with HA-enabled guests (currently, or at any point since its last reboot) loses the connection to the quorate part of the cluster for more than 2 minutes, it will fence (hard reset) itself.

You can check this in the Datacenter -> HA panel. Each node for which the `lrm` service is shown as `active` will fence itself if it loses the connection to the cluster.
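If you prefer the CLI, the same information should be visible there as well (a quick sketch; the exact output layout may differ between versions):
Code:
ha-manager status
Nodes whose `lrm` line shows `active` (rather than `idle`) are the ones that will fence themselves on loss of quorum.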
 
Sorry to necro this thread, but it's one of *many* that come up with this title, and it goes directly to the core issue.

Proxmox needs a configurable option for fencing behavior. Rebooting an entire cluster upon the loss of a networking element is the sledgehammer; we need the scalpel.
 
If you plan to make changes to the network that could cause problems for Corosync, you can temporarily disable the HA stack (which handles the fencing). First stop the LRM on all nodes:
Code:
systemctl stop pve-ha-lrm.service
then the CRM:
Code:
systemctl stop pve-ha-crm.service

This way, the HA stack is completely disabled.

Once you are done, start them in the same order, first the LRM on all nodes, then the CRM.
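For reference, re-enabling the stack afterwards uses the same units in the same order, for example:
Code:
# run on every node first
systemctl start pve-ha-lrm.service
# then, once the LRM is running everywhere
systemctl start pve-ha-crm.service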

Alternatively, you can set all HA resources to ignored. After about 10 minutes, all nodes should show "idle" in the HA overview. This way they also won't fence if they lose the connection to the cluster.
To automate that, you could run a one-liner like this (quick and dirty, could probably be improved):
Code:
for service in $(ha-manager status | grep service | awk '{print $2}'); do ha-manager set $service --state ignored; done
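When the maintenance is done, a similar loop can flip the resources back (assuming they should all return to the `started` state and that ignored resources still show up in `ha-manager status`; adjust as needed):
Code:
for service in $(ha-manager status | grep service | awk '{print $2}'); do ha-manager set "$service" --state started; done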
 
@aaron: Thank you for providing a specific workaround. However, this feels like one of those "old" ideas in desperate need of an update.

I challenge you: when is rebooting the entire cluster on the loss of a network element the preferred behavior? You're taking a communication issue and setting off a nuke. This might have been an acceptable way to handle watchdog timeouts in the past, but it's no longer the desired behavior, especially in what's probably your largest install base of customers.

HA on Proxmox must mature, or people will just avoid using it and build diversity instead of redundancy. (potato/potato, says the architect)
 
fencing is a necessary prerequisite for HA recovery - the recovering node (or rather, the quorate partition of the cluster) needs to be sure that the failed node (non-quorate partition) is gone, else you are in a split-brain situation and will corrupt your data/guests. it's not possible for a node in the non-quorate partition to tell whether there is a quorate partition (fencing required) or not (no fencing required), since if it could communicate with all other nodes, it would not be non-quorate.

TL;DR: no fencing, no HA. it's as simple as that.

if you want clustering without fencing, you can do that (just don't enable HA features). if you want to automate some sort of failover without our HA stack, you can do that. but you get to keep the pieces as well when (not if) something breaks as a result.
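As a side note (a sketch, not part of the post above): you can check whether a node currently considers itself part of a quorate partition with:
Code:
pvecm status
The "Quorate" line in the votequorum section of the output tells you whether the node sees quorum; the exact layout may vary by version.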
 
Greetings!

We came across the same 'issue'.

I totally understand why the node reboots - BUT: why doesn't it shut down the VMs gracefully first?
We now have an inconsistent database inside one of the guest VMs that needs to be restored - just because a switch was rebooted. :confused:
 
because it needs to go down as fast as possible, to keep the period that other nodes have to wait before they can take over as short as possible. waiting for (all) guests to shut down cleanly is not an option, both because there is no (failsafe) way to signal "I am done shutting down, you can take over", and because we actually want to be as fast as possible. your HA VMs should always be configured in a way that keeps them crash-consistent.
 
Greetings.

Thank you for your response.

So that I understand the behavior correctly:
* Does this only happen if one configures HA (WebUI: Datacenter - HA) resources that include the affected PVE host?
* Or is this the default behavior for any node inside a 'Datacenter', even if it has no HA resources configured?

Failover on application-level is out-of-scope for our use-case.
 
Yes, this only happens on hosts that have HA resources running.
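For reference (a sketch, not part of the original reply): `ha-manager config` lists which guests are configured as HA resources, which is a quick way to confirm whether a given VM is HA-managed at all:
Code:
ha-manager config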
Ok, thank you.

Then we'll just make sure no VMs that cannot be 'crash-consistent' are placed on HA-enabled nodes.

One more question:
Wouldn't it make sense for the fenced node to at least perform a graceful shutdown of the VMs that are not configured as 'HA resources'? (They only exist on this node and are not relevant to HA operations.)
Of course after hard-killing the HA-enabled VMs first, so the shutdown delay would not have a (major) negative impact.
That is, if this information is even available/cached on the PVE node...

Context: In our case the VM serving the database was not configured as HA-resource.
 
