Restarting one cluster node triggers HA fencing on others

FireStormOOO

I have a 3-node PVE 8 + Ceph cluster using the community update repository. I've previously seen some instability with restarts and network disruption, but during my last batch of updates, every reboot I commanded on one server crashed the other nodes in the cluster. So far I've failed to pull anything helpful out of my logs, though I'm unclear whether that's because I need to configure additional logging to catch this. The servers mostly just crash and reboot, though I've also seen them hang fully unresponsive with no video output. Customization is minimal, beyond enabling root ZFS encryption and securing Ceph and migration traffic with IPSec.

I have a 4th non-clustered server with an otherwise very similar config which has been perfectly stable. The crashing seems to be exacerbated by network issues; I've previously seen crashes when a network fault causes links to flap. Crashing got much worse after I set up link aggregation on all 3 nodes, though notably the stable server is connected to the same switches with the same config. All 3 of the problem servers have ConnectX-4 NICs; the stable one has a ConnectX-3. Various AMD Ryzen CPUs on a smattering of consumer ASUS boards, with limited hardware commonality aside from that. I've also reproduced the crash by pulling network cables - this is a little less consistent and doesn't typically crash all nodes. The crash happens maybe 30-60 seconds after the network disruption or after initiating the reboot on another node.

Could my choice to re-use the cluster-managed certificates for IPSec be causing this? The cluster filesystem mounted at /etc/pve/ becomes at least partially unavailable when quorum is lost, IIRC, and I don't see a more stable location to reference the root certificate from than /etc/pve/pve-root-ca.pem.
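If that dependency turns out to matter, one workaround I'm considering is keeping a local copy of the CA outside of pmxcfs and pointing the IPSec config at that instead. Rough sketch only; the paths below assume strongSwan's swanctl layout, not my actual config:
Code:
# Keep a local copy of the cluster CA so strongSwan never has to read
# /etc/pve while quorum is lost (paths are assumptions, not my live setup):
install -m 0644 /etc/pve/pve-root-ca.pem /etc/swanctl/x509ca/pve-root-ca.pem
swanctl --load-creds    # reload credentials without restarting the daemon
# A small systemd timer or cron job could refresh the copy if the CA is renewed.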
 
Hello FireStormOOO! Could you please:
  1. Post the output of pveversion -v
  2. Post the journal around the time of the crash, e.g. a few hours before and some time after that (journalctl --since <TIME> --until <TIME>), ideally from all nodes in the cluster.
  3. Also, please tell us a bit more about the servers in the cluster and the network configuration for Ceph.
Maybe you are already aware, but there's a chapter on recommendations for a Healthy Ceph Cluster, including a section on network recommendations.
 
Code:
pveversion -v (after updates, all 3 nodes):
proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8: 6.8.12-10
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
ceph: 18.2.7-pve1
ceph-fuse: 18.2.7-pve1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx11
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0

ksmtuned: 4.20150326+b1 (Node 2 only)
zfsutils-linux: 2.2.7-pve2 (Nodes 2,3)
zfsutils-linux: 2.3.1-1~bpo12+1 (Node 4)

The network on all 3 cluster nodes is a dual-port 25GbE NIC configured as an LACP bond0. bond0 is attached to a VLAN-aware bridge. I've allocated a VLAN and subnet for all recommended networks for both Proxmox and Ceph. Migration traffic and Ceph data share a VLAN sub-interface which has been configured with strongSwan to protect the traffic with IPSec using the cluster's self-signed certificates/root CA. All nodes are also reachable from their onboard GbE NICs as a sort of recovery network; neither Proxmox nor Ceph is configured to use it. I've set some bandwidth limits on things like migrations, but haven't bothered configuring QoS since I've got 25-50Gb/s between nodes.
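For reference, a trimmed-down sketch of the ifupdown2 config on the cluster nodes (NIC names, VLAN ID, and addresses here are illustrative placeholders rather than my exact values):
Code:
auto bond0
iface bond0 inet manual
    bond-slaves enp1s0f0np0 enp1s0f1np1    # placeholder 25GbE port names
    bond-mode 802.3ad
    bond-lacp-rate slow                    # see the note on LACP slow below
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

auto vmbr0.50
iface vmbr0.50 inet static                 # placeholder VLAN for Ceph/migration
    address 172.19.5.1/24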

The network switches are more than a little temperamental, and changes to one bond sometimes set off a fit of link flapping affecting all nodes; so in addition to the obvious changes, restarting or disconnecting a node may briefly cause the cluster to lose quorum while all the links drop and get renegotiated. This seems to be a known issue with these switch chips and MCLAG. At some point I'll likely move to a VXLAN overlay instead, since layer 3 features seem better supported on these switches (SONiC on some older Broadcom-based white-box switches). I'd consider this a minor issue if it weren't crashing my nodes somehow. I stopped having issues during normal operation once I configured everything to use LACP slow; it seems to be some kind of race condition when the switch control plane is under load. Possibly relevant: the switch control plane is also Debian 12.
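When the flapping is happening, the bond state is easy to check on the hosts themselves; nothing Proxmox-specific here, just the standard Linux bonding driver and the kernel log:
Code:
# Shows negotiated LACP rate, partner details, and per-slave link state;
# the churn counters increment when LACP negotiation is unstable.
cat /proc/net/bonding/bond0
# Link flaps also show up in the kernel log:
journalctl -k | grep -i bond0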

Code:
Excerpt from the hosts file with the cluster nodes:

172.19.7.120 fire-srv-vmhost2.lan.firestorm.space fire-srv-vmhost2
172.19.7.120 fss-srv-vmhost2.lan.firestorm.space fss-srv-vmhost2
172.19.7.120 pbs.lan.firestorm.space backup.firestorm.space
172.19.5.1 fire-srv-vmhost2.sana.firestorm.space
172.19.5.1 fss-srv-vmhost2.sana.firestorm.space
#172.20.10.2 fire-srv-vmhost2.lan.firestorm.space fire-srv-vmhost2
#172.20.10.34 fire-srv-vmhost2.lan.firestorm.space fire-srv-vmhost2

172.19.7.124 fss-srv-vmhost3.lan.firestorm.space fss-srv-vmhost3
#172.20.10.3 fss-srv-vmhost3.lan.firestorm.space fss-srv-vmhost3
#172.20.10.35 fss-srv-vmhost3.lan.firestorm.space fss-srv-vmhost3
172.19.5.3 fss-srv-vmhost3.sana.firestorm.space

172.19.7.125 fss-srv-vmhost4.lan.firestorm.space fss-srv-vmhost4
172.19.5.4 fss-srv-vmhost4.sana.firestorm.space
#172.20.10.4 fss-srv-vmhost4.lan.firestorm.space fss-srv-vmhost4
#172.20.10.36 fss-srv-vmhost4.lan.firestorm.space fss-srv-vmhost4

I seem to be missing some logs from around the time of the crashes; I don't think the journal is successfully getting flushed to disk when a node crashes. In any case, I'll attach what I've got.
Code:
journalctl --since "2025-05-18 22:00" --until "2025-05-19 02:00" | grep -v ledmon | grep -v postfix
All updates were done by 01:00.
I deliberately tested some network partitions after that, which caused some further crashes.
ledmon is excluded because it's chatty and irrelevant; postfix has logged PII and doesn't seem relevant either.
vmhost4 appears to have logged nothing since 22:11, before the update (it was the first node updated), despite being up and seemingly running normally for the last 8 hours. I bounced the journal service and now it's logging again.
Code:
-- Boot c8c92a37b46f4d579a7d9d1167305400 --
May 19 10:51:24 FSS-SRV-VMHOST4 systemd[1]: systemd-journald.service: Deactivated successfully.
May 19 10:51:24 FSS-SRV-VMHOST4 systemd[1]: Stopped systemd-journald.service - Journal Service.
May 19 10:51:24 FSS-SRV-VMHOST4 systemd[1]: Starting systemd-journald.service - Journal Service...
May 19 10:51:24 FSS-SRV-VMHOST4 systemd-journald[341839]: Journal started
May 19 10:51:24 FSS-SRV-VMHOST4 systemd-journald[341839]: System Journal (/var/log/journal/70af026eeee445c0ad713a25dadaf54d) is 27.2M, max 4.0G, 3.9G free.
May 19 10:51:24 FSS-SRV-VMHOST4 systemd[1]: Started systemd-journald.service - Journal Service.
May 19 10:52:22 FSS-SRV-VMHOST4 cat[342420]: test log message

Let me know anything else you'd like described at more length. I omitted an exhaustive description of subnets and VLANs as it's currently all running through the same bond.

ETA: IPSec configs, vmhost2 network page screenshot.
 

Hello and sorry for the delay. The logs show that the node lost quorum and restarted itself. Also, there are many Ceph heartbeat_check errors around the same time.

Note that while it is possible to achieve quorum with only 3 nodes, if one node is offline (2 left), no other node may go offline, or the remaining one will lose quorum (at least in the default configuration). I'm not saying this is your issue, just that you might want an additional vote in the cluster for such cases.
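If adding a full fourth node is not practical, a QDevice on any small external machine can provide that extra vote; roughly like this (see the Cluster Manager chapter for the full procedure, the IP below is just a placeholder):
Code:
# On the external host (outside the cluster):
apt install corosync-qnetd
# On all cluster nodes:
apt install corosync-qdevice
# Then, on one cluster node:
pvecm qdevice setup <QDEVICE-IP>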

Corosync is very sensitive to latency, and the behavior we see in the logs is exactly what is explained in the documentation chapter I linked to above (see Network section):
The volume of traffic, especially during recovery, will interfere with other services on the same network, especially the latency sensitive Proxmox VE corosync cluster stack can be affected, resulting in possible loss of cluster quorum. Moving the Ceph traffic to dedicated and physical separated networks will avoid such interference, not only for corosync, but also for the networking services provided by any virtual guests.

My guess at this point is that disconnecting one node caused additional network traffic, which destabilized corosync, caused it to lose quorum, and ultimately led to the node restarting.
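You can watch this happening live on the nodes while reproducing the issue, for example:
Code:
# Quorum and membership as seen by Proxmox VE / corosync:
pvecm status
corosync-quorumtool -s
# Per-link connectivity of the corosync/knet links:
corosync-cfgtool -s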
 
I'm fairly sure the temporary loss of quorum isn't a bug on your side; it's seemingly a known issue with my switches, and it also consistently resolves itself in under a minute. HA is a nice-to-have for me here; if the cluster sorts itself out in under 5 minutes, that's still far better than nothing.

I'm confused how we get from loss of quorum to the node crashing. I looked at the logs again and I'm not seeing anything relevant from the nodes that crashed. I can see the restarts I initiated for updates on each node (~22:09 on vmhost4, ~22:45 on vmhost3, ~00:13 on vmhost2), but I don't see any corresponding logs from the other hosts that outright crashed at the same time.

On the nodes that crash, it seems like the journal isn't getting written to disk because the node isn't shutting down cleanly. I imagine I'm going to have to do something to capture logs from the crashing nodes, especially for that 30-second window between the first node being partitioned or restarted and the other two crashing. I'm just a little at a loss as to how to set that up; I've never seen anything quite like this. As I understand it, this would have to be a kernel panic; I'd have logs if it were related to a userspace service.
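In case it helps, the options I'm looking at for capturing that window are making journald flush to disk more aggressively and/or forwarding logs off-box; a rough sketch using standard systemd-journald settings (the drop-in path is just the usual convention):
Code:
# /etc/systemd/journald.conf.d/crash-debug.conf
[Journal]
Storage=persistent
SyncIntervalSec=5s     # default is 5m; flush far more often before a fence hits
#ForwardToSyslog=yes   # optionally ship logs off-box so the last seconds survive

# Apply with: systemctl restart systemd-journald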

Unfortunately (or fortunately for debugging), I have a fairly reliable reproduction of this, so I can trigger the issue again after configuring whatever logging you think might help.
 
I'm confused how we get from loss of quorum to the node crashing.
This behavior is explained by the Fencing chapter of the Proxmox VE documentation:
During normal operation, ha-manager regularly resets the watchdog timer to prevent it from elapsing. If, due to a hardware fault or program error, the computer fails to reset the watchdog, the timer will elapse and trigger a reset of the whole server (reboot).
This is also confirmed by the logs from vmhost2:
May 18 22:12:17 fire-srv-vmhost2 watchdog-mux[2585]: client watchdog expired - disable watchdog updates
A few seconds after this, the server rebooted. In other words, it worked as intended.
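If you want to see the fencing machinery on a given node, the relevant pieces are the HA services and the watchdog multiplexer:
Code:
# Which services/nodes the HA stack is currently managing:
ha-manager status
# The resource managers and the watchdog multiplexer that does the fencing:
systemctl status pve-ha-lrm pve-ha-crm watchdog-mux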
 
This is also confirmed by the logs from vmhost2:
May 18 22:12:17 fire-srv-vmhost2 watchdog-mux[2585]: client watchdog expired - disable watchdog updates
A few seconds after this, the server rebooted. In other words, it worked as intended.
Got it, now I'm following. The documentation doesn't mention what the key time thresholds for fencing are, which I think was part of my confusion. I suspect that if I could double the allowed recovery time in the HA manager, this problem would go away. Network recovery currently takes about as long as the watchdog timer, which would explain why nodes only sometimes restart. I suppose I could also add corosync as a lowest-priority link on that fully separate network.

It's not spelled out in the documentation, but I assume that if a node has no HA resources on it, it's also exempt from fencing?
And just to verify: Ceph has no bearing on the fencing or HA recovery logic?

ETA: One of the other reasons I had assumed I was looking at a crash is that the fencing restart breaks SSH sessions, whereas a normal restart closes them gracefully. Is that also expected?

ETA2: I did end up adding the additional cluster link over the gigabit management/recovery interface, and that does seem to have stabilized things significantly. Thanks for the help!
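For anyone finding this later, the change amounted to adding a second corosync link in /etc/pve/corosync.conf, roughly as below (addresses are illustrative, and config_version in the totem section needs to be bumped when editing):
Code:
nodelist {
  node {
    name: fss-srv-vmhost2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.19.7.120   # existing link (addresses here are illustrative)
    ring1_addr: 192.168.1.120  # placeholder: gigabit recovery network
  }
  # ...same ring1_addr addition for the other nodes...
}

totem {
  # ...
  interface {
    linknumber: 0
    knet_link_priority: 10     # preferred link
  }
  interface {
    linknumber: 1
    knet_link_priority: 5      # lower-priority fallback over GbE
  }
}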
 