Hi everyone,
I'm facing quite a tricky quorum/partitioning issue in a stretched Proxmox cluster across 3 datacenters, and I'd really appreciate some insights from people with experience in similar setups.
Cluster topology
- 7-node Proxmox cluster
- DC1: 3 nodes
- DC2: 3 nodes
- DC3: 1 node (used as tie-breaker / quorum site)
- Corosync configured across all sites
- Ceph properly configured in stretch mode with quorum monitors distributed correctly
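For reference, this is roughly the vote layout I'd expect corosync to be working with. A minimal sketch of the relevant corosync.conf sections, assuming default settings (node names are placeholders, one vote per node, no qdevice):

```
nodelist {
  node { ring0_addr: pve-dc1-1  nodeid: 1  quorum_votes: 1 }   # DC1
  node { ring0_addr: pve-dc1-2  nodeid: 2  quorum_votes: 1 }
  node { ring0_addr: pve-dc1-3  nodeid: 3  quorum_votes: 1 }
  node { ring0_addr: pve-dc2-1  nodeid: 4  quorum_votes: 1 }   # DC2
  node { ring0_addr: pve-dc2-2  nodeid: 5  quorum_votes: 1 }
  node { ring0_addr: pve-dc2-3  nodeid: 6  quorum_votes: 1 }
  node { ring0_addr: pve-dc3-1  nodeid: 7  quorum_votes: 1 }   # DC3 tie-breaker
}

quorum {
  provider: corosync_votequorum
}
```

So 7 expected votes total and a quorum threshold of 4.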
Observed behavior
Most failure scenarios are handled correctly:
- Full DC failure (DC1 or DC2) → cluster behaves as expected
- Loss of connectivity between DC1 ↔ DC3 → handled correctly
- Loss of connectivity between DC2 ↔ DC3 → handled correctly
Failure scenario: DC1 ↔ DC2 link down
- DC1 and DC2 lose communication with each other
- Both DC1 and DC2 still have connectivity to DC3
- Nodes form unexpected / almost random partitions
- Quorum decisions are not deterministic
- Cluster behavior is unstable and hard to predict
Expected behavior (my understanding)
In this topology:
- DC1 partition = 3 nodes + DC3 (1) → 4 votes
- DC2 partition = 3 nodes + DC3 (1) → 4 votes
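To make the arithmetic explicit: with 7 total votes the quorum threshold is 4, and both candidate partitions reach exactly 4 votes, but only if DC3 is counted on their side. Since DC3's single node can only be part of one totem membership at a time, whichever side it happens to join becomes quorate. A quick sketch of that math (plain Python, not a Proxmox tool):

```python
# Standard votequorum math for this cluster: one vote per node, no qdevice.
TOTAL_VOTES = 3 + 3 + 1              # DC1 + DC2 + DC3 = 7
QUORUM = TOTAL_VOTES // 2 + 1        # floor(7/2) + 1 = 4

# Votes each side would hold IF the DC3 node joined its membership:
dc1_side = 3 + 1                     # DC1 nodes + DC3
dc2_side = 3 + 1                     # DC2 nodes + DC3

print(QUORUM)              # 4
print(dc1_side >= QUORUM)  # True
print(dc2_side >= QUORUM)  # True -> whichever side wins DC3 is quorate
```

Nothing in plain votequorum makes that choice deterministic, which would be consistent with the "almost random" partitions observed.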
My expectation is that:
- Corosync / votequorum should deterministically choose one partition
- Or enforce a consistent tie-break mechanism via DC3
Questions
- Is this behavior expected when two partitions have equal vote weight (4 vs 4)?
- How does Corosync/votequorum handle this kind of “dual-majority via shared tie-breaker” scenario?
- Is it correct to assume that this topology is inherently ambiguous without additional constraints?
- Should DC3 be configured differently (e.g. qdevice/qnetd instead of a full node)?
- Are there recommended best practices for this kind of 3-site stretched cluster to avoid this ambiguity?
- Could this be related to timing/race conditions in membership formation?
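On the qdevice question: if the DC3 full node were replaced by a corosync-qnetd arbiter, the cluster would have 6 voting nodes plus a quorum device, and the `ffsplit` algorithm is designed exactly for deterministic 50/50-split tie-breaking (it requires an even node count, which this change would satisfy). A sketch of the quorum section, assuming a qnetd daemon runs at DC3 (hostname is a placeholder):

```
quorum {
  provider: corosync_votequorum
  device {
    model: net
    votes: 1
    net {
      host: qnetd-dc3.example.com   # placeholder: qnetd server at DC3
      algorithm: ffsplit            # deterministic 50/50-split tie-breaker
      tie_breaker: lowest           # prefer partition with lowest node id
    }
  }
}
```

With `ffsplit`, qnetd grants its vote to exactly one partition, so only one side can ever reach quorum. I'd be interested whether others run stretched Proxmox clusters this way.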
Additional notes
- Ceph stretch mode behaves correctly in all scenarios
- The issue seems isolated to Corosync quorum / cluster membership
- No fencing/STONITH currently configured (not sure if relevant in this case)
Any guidance, design recommendations, or similar experiences would be very helpful.
Thanks!