Stretched Proxmox cluster across 3 datacenters

willybong

Apr 22, 2020
Hi everyone,

I'm facing a tricky quorum/partitioning issue in a stretched Proxmox cluster across 3 datacenters, and I'd really appreciate insights from people with experience in similar setups.

Cluster topology

  • 7-node Proxmox cluster
  • DC1: 3 nodes
  • DC2: 3 nodes
  • DC3: 1 node (used as tie-breaker / quorum site)
  • Corosync configured across all sites
  • Ceph properly configured in stretch mode with quorum monitors distributed correctly
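For context, the quorum setup is the stock Proxmox one, with every node carrying a single vote; roughly (node names and addresses hypothetical):

```
quorum {
    provider: corosync_votequorum
}

nodelist {
    node {
        name: pve-dc1-1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.0.1.11
    }
    # ...six more entries like this, one per node across DC1/DC2/DC3
}
```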

Observed behavior

Most failure scenarios are handled correctly:

  • Full DC failure (DC1 or DC2) → ✅ cluster behaves as expected
  • Loss of connectivity between DC1 ↔ DC3 → ✅ handled correctly
  • Loss of connectivity between DC2 ↔ DC3 → ✅ handled correctly

However, I consistently hit problems in this specific scenario:

❌ Failure scenario: DC1 ↔ DC2 link down

  • DC1 and DC2 lose communication with each other
  • Both DC1 and DC2 still have connectivity to DC3

In this situation, the cluster becomes highly inconsistent:

  • Nodes form unexpected / almost random partitions
  • Quorum decisions are not deterministic
  • Cluster behavior is unstable and hard to predict

This is the only scenario where the system behaves incorrectly.

Expected behavior (my understanding)

In this topology:

  • If DC3 sides with DC1: 3 nodes + DC3 (1) → 4 votes
  • If DC3 sides with DC2: 3 nodes + DC3 (1) → 4 votes

So whichever partition DC3 ends up in reaches exactly the majority threshold, and both partitions have an equal claim on the tie-breaker.
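A minimal sketch of that arithmetic, assuming the Proxmox default of one vote per node:

```python
# Votequorum arithmetic for this 7-vote cluster, as I understand
# the defaults (every node has quorum_votes: 1).

def quorum_threshold(expected_votes: int) -> int:
    """Simple majority: more than half of the expected votes."""
    return expected_votes // 2 + 1

EXPECTED_VOTES = 7                       # 3 (DC1) + 3 (DC2) + 1 (DC3)
threshold = quorum_threshold(EXPECTED_VOTES)
print("threshold:", threshold)           # threshold: 4

# Whichever side DC3's node joins gets 3 + 1 = 4 votes and is quorate;
# the other side has 3 votes and is not. Both outcomes are "valid",
# so nothing in the vote math itself prefers DC1 over DC2.
with_dc3, without_dc3 = 3 + 1, 3
print(with_dc3 >= threshold, without_dc3 >= threshold)   # True False
```

The vote math alone can't break the tie; which side DC3's node joins seems to be decided by totem membership timing, not by votequorum.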

My expectation is that:

  • Corosync / votequorum should deterministically choose one partition
  • Or enforce a consistent tie-break mechanism via DC3

But instead, I observe split and inconsistent cluster states, rather than a clean decision.

Questions

  1. Is this behavior expected when two partitions have equal vote weight (4 vs 4)?
  2. How does Corosync/votequorum handle this kind of “dual-majority via shared tie-breaker” scenario?
  3. Is it correct to assume that this topology is inherently ambiguous without additional constraints?
  4. Should DC3 be configured differently (e.g. qdevice/qnetd instead of a full node)?
  5. Are there recommended best practices for this kind of 3-site stretched cluster to avoid this ambiguity?
  6. Could this be related to timing/race conditions in membership formation?
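To make question 4 concrete: what I have in mind is replacing the DC3 node with a corosync-qnetd host, so the remaining 6 nodes plus the qdevice could use the ffsplit algorithm, which (as I understand it) deterministically grants the tie-breaker vote to exactly one side of a 50/50 split. A rough sketch of the corosync.conf addition (host address hypothetical):

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        net {
            tls: on
            # qnetd daemon running in DC3
            host: 10.0.3.10
            # ffsplit grants the vote to exactly one partition
            algorithm: ffsplit
        }
    }
}
```

Does that match how others have solved this?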

Additional notes

  • Ceph stretch mode behaves correctly in all scenarios
  • The issue seems isolated to Corosync quorum / cluster membership
  • No fencing/STONITH currently configured (not sure if relevant in this case)

Any guidance, design recommendations, or similar experiences would be very helpful.

Thanks!