Entire cluster rebooted

mwarchut

New Member
Jan 29, 2025
Yesterday we had an entire cluster of 8 nodes reboot at what appears to be the exact same time. There was no power loss to the servers. Has anyone ever seen this happen before? Is there anything in particular I should be looking at? Everything came back up cleanly, thankfully.

TIA
 
Sounds like every node lost quorum (with HA enabled) at the same time. This suggests a network hiccup of some kind, which appears to have been temporary. Check the system logs of the nodes (just before the reboot) and your switches.
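
For example, on each node you could look at the tail end of the previous boot's journal and at the Corosync unit specifically (assuming a standard Proxmox VE install with systemd journald; the timestamps below are a hypothetical incident window, adjust to yours):

Code:
# last 200 entries of the boot that ended in the reboot
journalctl -b -1 -n 200

# Corosync messages around the incident (example time window)
journalctl -u corosync --since "2025-01-28 09:50" --until "2025-01-28 10:10"

Watch for TOTEM/KNET messages about links going down or the token not being received just before the nodes went down.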
 
Yeah, that's what I was suspecting happened. Is a full reboot the normal SOP? Can it be disabled?
 
As @leesteken pointed out, if the network handling Corosync fails for long enough, each node becomes isolated from the others and loses quorum. To avoid a split-brain situation (i.e., node1 attempting to recover services that are still running on node2), an isolated node self-fences via its watchdog and reboots to release its resources, on the assumption that the remaining nodes are still operating normally. In your case, every node was isolated, so every node rebooted.
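
Note that the watchdog is only armed while HA resources are active on a node; without configured HA services, nodes will not self-fence on quorum loss. If you want to verify what HA was managing at the time, a quick check with the standard PVE tooling:

Code:
# shows quorum state, the HA master, per-node LRM state, and configured HA services
ha-manager status

If no services are listed there, a quorum loss alone should not trigger a reboot.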

Additionally, your cluster has eight nodes. This introduces another potential failure mode: a 4/4 split. Neither side would have quorum, so both sides would relinquish their resources, leading to a full cluster reboot. This is why an odd number of nodes is generally recommended for cluster designs.
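You can see the current vote math on any node (output fields from a typical run; details vary by version):

Code:
pvecm status

Check the "Votequorum information" section: with 8 nodes you have Expected votes: 8 and Quorum: 5, so a clean 4/4 split leaves both halves below quorum.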


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks for the input. I've set up an alternate physical path for Corosync connections and added a QDevice so there's an odd number of votes.
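
In case it helps anyone else, the QDevice setup is roughly as follows (assuming corosync-qnetd is already running on the external arbiter host and corosync-qdevice is installed on the nodes; 10.0.0.9 is a placeholder IP):

Code:
# run once from any cluster node, pointing at the qnetd host
pvecm qdevice setup 10.0.0.9

# verify: Expected votes should now be 9 and a Qdevice entry should appear
pvecm status

With 8 node votes plus the QDevice vote you get 9 total, so a 4/4 node split can still be broken by the arbiter.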