Hi everyone,
Last Friday we experienced a major outage in our Proxmox cluster. While we were performing maintenance on one of the nodes, that node unexpectedly became quorate on its own, which triggered HA to restart VMs. As a result, multiple instances of the same VMs ended up running simultaneously on different nodes, leading to data corruption.
We’re currently trying to understand how and why Corosync determined that this single node had quorum.
I’ve attached the relevant Corosync logs from the incident.
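In case anyone wants more context than the excerpt itself, the surrounding messages can be pulled from the corosync unit's journal on the affected node (assuming the default Proxmox VE setup where Corosync logs via journald); the time window below is only a placeholder, not the actual maintenance window:

    # Pull the Corosync messages around the incident from the systemd journal.
    # Adjust the placeholder window to the actual maintenance window.
    journalctl -u corosync.service --since "12:40:00" --until "13:00:00"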
What we know so far:
- Corosync was trying to reach the other cluster members, so we believe the nodes were correctly listed in the Corosync configuration (see the configuration sketch below this list).
- The network/heartbeat interfaces were flapping, but only on the node that was under maintenance. All other nodes remained stable and could still reach each other.
- At 12:51:28, Corosync logs state:
  "This node is within the primary component and will provide service."
  This appears to be the point where the trouble began, but how can this happen with only a single active node?
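For reference, the two sections of corosync.conf we are referring to above are sketched below; the node names, IDs, and addresses are placeholders rather than our real values, but the structure matches what Proxmox generates:

    nodelist {
      node {
        name: pve1            # placeholder node name
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.0.0.1  # placeholder heartbeat address
      }
      node {
        name: pve2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 10.0.0.2
      }
      node {
        name: pve3
        nodeid: 3
        quorum_votes: 1
        ring0_addr: 10.0.0.3
      }
    }

    quorum {
      provider: corosync_votequorum
    }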
Our questions:
- What could cause Corosync to falsely assume quorum on a single node?
- How can we prevent such a situation in the future?
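If it helps with the analysis, we can also post the runtime quorum view from the remaining nodes; something along these lines shows the vote counts and quorum flags (assuming the standard votequorum setup):

    # Runtime quorum state as seen by votequorum: expected votes, total votes,
    # the quorate flag, and flags such as 2Node or WaitForAll if they are set
    corosync-quorumtool -s

    # Proxmox wrapper that shows the same quorum information plus cluster membership
    pvecm status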
Best regards,
Julian