Entire cluster rebooted

mwarchut

New Member
Jan 29, 2025
Yesterday we had an entire cluster of 8 nodes reboot at what appears to be the exact same time. There was no power loss to the servers. Has anyone ever seen this happen before? Is there anything in particular I should be looking at? Everything came back up cleanly, thankfully.

TIA
 
Sounds like every node lost quorum (with HA enabled) at the same time. This suggests a network hiccup of some kind, which appears to have been temporary. Check the system logs of the nodes (just before the reboot) and your switches.
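
For example, on each node you could look at the tail end of the previous boot's journal and at the Corosync unit specifically (assuming a standard Proxmox VE install with systemd journald; the timestamps below are a hypothetical incident window, adjust to yours):

Code:
# last 200 entries of the boot that ended in the reboot
journalctl -b -1 -n 200

# Corosync messages around the incident (example time window)
journalctl -u corosync --since "2025-01-28 09:50" --until "2025-01-28 10:10"

Watch for TOTEM/KNET messages about links going down or the token not being received just before the nodes went down.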
 
Yeah, that's what I was suspecting happened. Is a full reboot the normal SOP? Can it be disabled?
 
As @leesteken pointed out, if the network handling Corosync fails for long enough, each node becomes isolated from the others and loses quorum. To avoid a split-brain situation (i.e., node1 attempting to recover services that are still running on node2), an isolated node self-fences via its watchdog and reboots to release its resources, on the assumption that the remaining nodes are still operating normally. In your case, every node was isolated, so every node rebooted.
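
Note that the watchdog is only armed while HA resources are active on a node; without configured HA services, nodes will not self-fence on quorum loss. If you want to verify what HA was managing at the time, a quick check with the standard PVE tooling:

Code:
# shows quorum state, the HA master, per-node LRM state, and configured HA services
ha-manager status

If no services are listed there, a quorum loss alone should not trigger a reboot.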

Additionally, your cluster has eight nodes. This introduces another potential failure mode: a 4/4 split. Neither side would have quorum, so both sides would relinquish their resources, leading to a full cluster reboot. This is why an odd number of nodes is generally recommended for cluster designs.
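You can see the current vote math on any node (output fields from a typical run; details vary by version):

Code:
pvecm status

Check the "Votequorum information" section: with 8 nodes you have Expected votes: 8 and Quorum: 5, so a clean 4/4 split leaves both halves below quorum.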


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks for the input. I've set up an alternate physical path for Corosync connections and added a QDevice so there's an odd number of votes.
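
In case it helps anyone else, the QDevice setup is roughly as follows (assuming corosync-qnetd is already running on the external arbiter host and corosync-qdevice is installed on the nodes; 10.0.0.9 is a placeholder IP):

Code:
# run once from any cluster node, pointing at the qnetd host
pvecm qdevice setup 10.0.0.9

# verify: Expected votes should now be 9 and a Qdevice entry should appear
pvecm status

With 8 node votes plus the QDevice vote you get 9 total, so a 4/4 node split can still be broken by the arbiter.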