Entire cluster down when 1 of 5 is down

barndoor101

New Member
Jan 24, 2025
Hi, I run a 5-node cluster on disparate hardware: one node is a NAS, one is a high-power node (8-core/16-thread AMD Ryzen), and three are low-power N100 machines.

I am currently away from home and have no way of getting back for several days, and the high-power node appears to be down. I can log in remotely to each of the other four nodes, but services on the entire cluster are down. I don't know whether the high-power node has blown up or what.

My question: how can I stop a single node failure taking down the cluster?

Any help much appreciated.
 
Hi @barndoor101,
There is no reason for a single node outage to take down a healthy 5-node cluster. As such, there is no procedure to "stop a single node failure taking down the cluster".
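
For reference, here is a rough sketch of the quorum arithmetic behind that statement. It assumes the default of one vote per node and no QDevice, which may not match this particular cluster; the function names are made up for illustration and are not part of any Proxmox or Corosync tool.

```shell
#!/bin/sh
# Sketch only: assumes every node carries exactly one vote (the default).

# quorum_needed TOTAL_VOTES
# Prints the number of votes required for quorum (a strict majority).
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# is_quorate TOTAL_VOTES VOTES_ONLINE
# Succeeds (exit 0) if the online votes reach the quorum threshold.
is_quorate() {
    [ "$2" -ge "$(quorum_needed "$1")" ]
}

# The situation in this thread: 5 expected votes, 1 node offline.
if is_quorate 5 4; then
    echo "4 of 5 votes: quorate (need $(quorum_needed 5))"
else
    echo "4 of 5 votes: NOT quorate (need $(quorum_needed 5))"
fi
```

Under those assumptions a 5-node cluster needs 3 votes, so it should stay quorate with one node down, and even with two down. If `pvecm status` on the surviving nodes reports the cluster as not quorate, something other than the single failed node is involved.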

You need to examine logs and system state of each surviving node in the cluster. There must be a secondary concurrent reason that, in combination with the failed node, caused your cluster issue.

pvecm status
pvecm nodes
corosync-quorumtool
corosync-cfgtool -s
systemctl status corosync
journalctl -u corosync -b
journalctl -u corosync -f
ip a
ip r
ping <peer-ip>
systemctl status pve-cluster
journalctl -u pve-cluster -b
ls -l /etc/pve
df -h
mount | grep pve
systemctl status pvedaemon
systemctl status pveproxy
systemctl status pvestatd

journalctl -u pvedaemon -b
journalctl -u pveproxy -b
journalctl -u pvestatd -b
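
If it helps, the read-only commands above can be captured into one file per node so the output is easy to compare or post. This is only a sketch: the function names `run_and_log` and `collect_cluster_diag` are hypothetical, `--no-pager` and the `sh -c` wrapper around the `mount | grep pve` pipeline are my additions, and `journalctl -f` and `ping <peer-ip>` are left out because one follows the log indefinitely and the other needs a peer address filled in.

```shell
#!/bin/sh
# Sketch: capture read-only cluster diagnostics from one node into a file.

# run_and_log FILE CMD [ARGS...]
# Appends a header, the command's stdout and stderr, and its exit code to FILE.
# A missing command is recorded as skipped instead of aborting the run.
run_and_log() {
    _log=$1
    shift
    printf '\n===== %s =====\n' "$*" >> "$_log"
    if ! command -v "$1" >/dev/null 2>&1; then
        printf '[skipped: %s not found]\n' "$1" >> "$_log"
        return 0
    fi
    "$@" >> "$_log" 2>&1
    printf '[exit code: %s]\n' "$?" >> "$_log"
}

# collect_cluster_diag FILE
# Runs the non-interactive commands from the list above, in roughly that order.
collect_cluster_diag() {
    _report=$1
    : > "$_report"    # start with an empty report file
    run_and_log "$_report" pvecm status
    run_and_log "$_report" pvecm nodes
    run_and_log "$_report" corosync-quorumtool
    run_and_log "$_report" corosync-cfgtool -s
    run_and_log "$_report" ip a
    run_and_log "$_report" ip r
    run_and_log "$_report" ls -l /etc/pve
    run_and_log "$_report" df -h
    run_and_log "$_report" sh -c 'mount | grep pve'
    for _svc in corosync pve-cluster pvedaemon pveproxy pvestatd; do
        run_and_log "$_report" systemctl --no-pager status "$_svc"
        run_and_log "$_report" journalctl --no-pager -b -u "$_svc"
    done
}
```

On each surviving node, something like `collect_cluster_diag "/tmp/diag-$(hostname).txt"` would then produce one report. Nothing here changes cluster state; it only reads status and logs.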



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox