Restarting one cluster node crashes others

FireStormOOO

I have a 3-node PVE 8 + Ceph cluster using the community update repository. I've seen some instability with restarts and network disruption before, but during my last batch of updates, every node I commanded to reboot crashed every other node in the cluster. So far I've failed to pull anything helpful out of my logs, though I'm unclear whether that's because I need to configure additional logging to catch this. The servers mostly just crash and reboot, though I've also seen them hang fully unresponsive with no video output. Customization is minimal beyond enabling root ZFS encryption and securing Ceph and migration traffic with IPSec.

I have a 4th non-clustered server with an otherwise very similar config which has been perfectly stable. The crashing seems to be exacerbated by network issues; I've previously seen crashes when a network fault causes links to flap. Crashing got much worse after I set up link aggregations for all 3 nodes, though notably the stable server is connected to the same switches with the same config. All 3 of the problem servers have ConnectX-4 NICs, while the stable one has a ConnectX-3; various AMD Ryzen CPUs in a smattering of consumer ASUS boards, with limited hardware commonality aside from that. I've also reproduced the crash by pulling network cables - this is a little less consistent and doesn't typically crash all nodes. The crash happens maybe 30-60 seconds after the network disruption or after initiating the reboot on another node.

Could my choice to re-use the cluster-managed certificates for IPSec be causing this? IIRC the cluster filesystem mounted at /etc/pve/ becomes at least partially unavailable when quorum is lost, and I don't see a more stable location to reference the root certificate from than /etc/pve/pve-root-ca.pem.
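To make that dependency concrete, the strongswan side looks roughly like the sketch below - the connection name, peer addresses, and child selectors are placeholders rather than my exact config - the relevant point being that every certificate and key path lives on the pmxcfs mount:
Code:
# swanctl.conf fragment (sketch; names and addresses are placeholders)
connections {
    ceph-peer {
        local_addrs  = 172.19.5.1
        remote_addrs = 172.19.5.3
        local {
            auth  = pubkey
            certs = /etc/pve/local/pve-ssl.pem      # node certificate, lives on pmxcfs
        }
        remote {
            auth    = pubkey
            cacerts = /etc/pve/pve-root-ca.pem      # cluster root CA, also on pmxcfs
        }
        children {
            ceph {
                start_action = trap
            }
        }
    }
}
secrets {
    private-node {
        file = /etc/pve/local/pve-ssl.key           # node private key, also on pmxcfs
    }
}
If that's a plausible failure mode, copying the CA and node certificate/key onto local storage (e.g. under /etc/swanctl/) so strongswan never reads from pmxcfs might be worth trying, though I haven't confirmed it's related to the crashes.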
 
Hello FireStormOOO! Could you please:
  1. Post the output of pveversion -v
  2. Post the journal around the time of the crash, e.g. a few hours before and some time after that (journalctl --since <TIME> --until <TIME>), ideally from all nodes in the cluster.
  3. Also, please tell us a bit more about the servers in the cluster and the network configuration for Ceph.
Maybe you are already aware, but there's a chapter on recommendations for a Healthy Ceph Cluster, including a section on network recommendations.
 
Code:
pveversion -v (after updates, all 3 nodes):
proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8: 6.8.12-10
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
ceph: 18.2.7-pve1
ceph-fuse: 18.2.7-pve1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx11
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0

ksmtuned: 4.20150326+b1 (Node 2 only)
zfsutils-linux: 2.2.7-pve2 (Nodes 2,3)
zfsutils-linux: 2.3.1-1~bpo12+1 (Node 4)

Network on all 3 cluster nodes is a dual-port 25GbE NIC configured as an LACP bond (bond0). bond0 is attached to a VLAN-aware bridge. I've allocated a VLAN and subnet for all recommended networks for both Proxmox and Ceph. Migration traffic and Ceph data share a VLAN sub-interface which is configured with strongswan to protect the traffic with IPSec using the cluster's self-signed certificates/root CA. All nodes are also reachable via their onboard GbE NICs as a sort of recovery network; neither Proxmox nor Ceph is configured to use it. I've set some bandwidth limits on things like migrations, but haven't bothered configuring QoS since I've got 25-50Gb/s between nodes.
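A rough sketch of what that looks like in /etc/network/interfaces on one node - the physical port names, VLAN ID, and prefix lengths are placeholders rather than my exact values:
Code:
auto bond0
iface bond0 inet manual
    bond-slaves enp65s0f0np0 enp65s0f1np1   # placeholder names for the two ConnectX-4 ports
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-lacp-rate slow

auto vmbr0
iface vmbr0 inet static
    address 172.19.7.120/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

# VLAN sub-interface carrying Ceph data + migration traffic (VLAN ID is a placeholder)
auto vmbr0.5
iface vmbr0.5 inet static
    address 172.19.5.1/24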

The network switches are more than a little temperamental, and changes to one bond sometimes set off a fit of link flapping affecting all nodes; so in addition to the obvious changes, restarting or disconnecting a node may briefly cause the cluster to lose quorum while all the links drop and get renegotiated. This seems to be a known issue with these switch chips and MCLAG. At some point I'll likely move to a VXLAN overlay instead, since layer 3 features seem better supported on these switches (SONiC on some older Broadcom-based white-box switches). I'd consider this a minor issue if it weren't crashing my nodes somehow. I stopped having issues during normal operation once I configured everything to use the slow LACP rate - it seems to be some kind of race condition when the switch control plane is under load. Possibly relevant: the switch control plane is also Debian 12.
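For watching what the bond actually does while another node reboots, something like this is enough (a rough sketch, not the exact commands from my notes):
Code:
# Negotiated LACP rate, MII status, and churn counters for the bond
grep -iE 'lacp|mii status|churn|partner' /proc/net/bonding/bond0

# Follow kernel messages for link flaps on the bond / ConnectX-4 (mlx5 driver) ports
journalctl -k -f | grep -iE 'bond0|mlx5|link'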

Code:
Excerpt from the hosts file with the cluster nodes:

172.19.7.120 fire-srv-vmhost2.lan.firestorm.space fire-srv-vmhost2
172.19.7.120 fss-srv-vmhost2.lan.firestorm.space fss-srv-vmhost2
172.19.7.120 pbs.lan.firestorm.space backup.firestorm.space
172.19.5.1 fire-srv-vmhost2.sana.firestorm.space
172.19.5.1 fss-srv-vmhost2.sana.firestorm.space
#172.20.10.2 fire-srv-vmhost2.lan.firestorm.space fire-srv-vmhost2
#172.20.10.34 fire-srv-vmhost2.lan.firestorm.space fire-srv-vmhost2

172.19.7.124 fss-srv-vmhost3.lan.firestorm.space fss-srv-vmhost3
#172.20.10.3 fss-srv-vmhost3.lan.firestorm.space fss-srv-vmhost3
#172.20.10.35 fss-srv-vmhost3.lan.firestorm.space fss-srv-vmhost3
172.19.5.3 fss-srv-vmhost3.sana.firestorm.space

172.19.7.125 fss-srv-vmhost4.lan.firestorm.space fss-srv-vmhost4
172.19.5.4 fss-srv-vmhost4.sana.firestorm.space
#172.20.10.4 fss-srv-vmhost4.lan.firestorm.space fss-srv-vmhost4
#172.20.10.36 fss-srv-vmhost4.lan.firestorm.space fss-srv-vmhost4

I seem to be missing some logs from around the time of the crashes; I don't think they're getting successfully flushed to disk when a node crashes. In any case, I'll attach what I've got.
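If the gap is just journald not syncing in time, one thing I may try is a drop-in that forces persistent storage and more frequent syncs - a sketch, not something I've applied yet:
Code:
# /etc/systemd/journald.conf.d/crash-logging.conf  (hypothetical drop-in)
[Journal]
Storage=persistent
# Non-critical messages are only synced to disk every 5 minutes by default;
# shortening that should let more of the journal survive a hard crash.
SyncIntervalSec=15s
followed by a systemctl restart systemd-journald on each node.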
Code:
journalctl --since "2025-05-18 22:00" --until "2025-05-19 02:00" | grep -v ledmon | grep -v postfix
All updates were done by 01:00.
I deliberately tested some network partitions after that, which caused some further crashes.
ledmon is excluded because it's chatty and irrelevant; postfix because it has logged PII and doesn't seem relevant.
vmhost4 appears to have logged nothing since 22:11, which is before the update (it was the first node updated), despite being up and seemingly running normally for the last 8 hours. I bounced the journald service and it's logging again now.
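"Bounced the service" was roughly the following (from memory, so treat it as a sketch):
Code:
systemctl restart systemd-journald
journalctl --flush                              # push anything still in /run/log/journal to /var/log/journal
echo "test log message" | systemd-cat -t cat    # confirm new entries are recorded (the cat[...] line below)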
Code:
-- Boot c8c92a37b46f4d579a7d9d1167305400 --
May 19 10:51:24 FSS-SRV-VMHOST4 systemd[1]: systemd-journald.service: Deactivated successfully.
May 19 10:51:24 FSS-SRV-VMHOST4 systemd[1]: Stopped systemd-journald.service - Journal Service.
May 19 10:51:24 FSS-SRV-VMHOST4 systemd[1]: Starting systemd-journald.service - Journal Service...
May 19 10:51:24 FSS-SRV-VMHOST4 systemd-journald[341839]: Journal started
May 19 10:51:24 FSS-SRV-VMHOST4 systemd-journald[341839]: System Journal (/var/log/journal/70af026eeee445c0ad713a25dadaf54d) is 27.2M, max 4.0G, 3.9G free.
May 19 10:51:24 FSS-SRV-VMHOST4 systemd[1]: Started systemd-journald.service - Journal Service.
May 19 10:52:22 FSS-SRV-VMHOST4 cat[342420]: test log message

Let me know if there's anything else you'd like described at more length. I omitted an exhaustive description of subnets and VLANs since it's currently all running through the same bond.

Edited to add: IPSec configs and a screenshot of the vmhost2 network page are attached.
 
