Stretched Proxmox cluster across 3 datacenters

willybong

Apr 22, 2020
Hi everyone,

I'm facing a tricky quorum/partitioning issue in a stretched Proxmox cluster across 3 datacenters, and I'd really appreciate insights from people with experience in similar setups.

Cluster topology

  • 7-node Proxmox cluster
  • DC1: 3 nodes
  • DC2: 3 nodes
  • DC3: 1 node (used as tie-breaker / quorum site)
  • Corosync configured across all sites
  • Ceph properly configured in stretch mode with quorum monitors distributed correctly
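For context, the quorum setup is the stock Proxmox one, with every node carrying a single vote; roughly (node names and addresses hypothetical):

```
quorum {
    provider: corosync_votequorum
}

nodelist {
    node {
        name: pve-dc1-1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.0.1.11
    }
    # ...six more entries like this, one per node across DC1/DC2/DC3
}
```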

Observed behavior

Most failure scenarios are handled correctly:

  • Full DC failure (DC1 or DC2) → ✅ cluster behaves as expected
  • Loss of connectivity between DC1 ↔ DC3 → ✅ handled correctly
  • Loss of connectivity between DC2 ↔ DC3 → ✅ handled correctly

However, I consistently hit problems in this specific scenario:

❌ Failure scenario: DC1 ↔ DC2 link down

  • DC1 and DC2 lose communication with each other
  • Both DC1 and DC2 still have connectivity to DC3

In this situation, the cluster becomes highly inconsistent:

  • Nodes form unexpected / almost random partitions
  • Quorum decisions are not deterministic
  • Cluster behavior is unstable and hard to predict

This is the only scenario where the system behaves incorrectly.

Expected behavior (my understanding)

In this topology:

  • If DC3 sides with DC1: 3 nodes + DC3 (1) → 4 votes
  • If DC3 sides with DC2: 3 nodes + DC3 (1) → 4 votes

So whichever partition DC3 ends up in reaches exactly the majority threshold, and both partitions have an equal claim on the tie-breaker.
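A minimal sketch of that arithmetic, assuming the Proxmox default of one vote per node:

```python
# Votequorum arithmetic for this 7-vote cluster, as I understand
# the defaults (every node has quorum_votes: 1).

def quorum_threshold(expected_votes: int) -> int:
    """Simple majority: more than half of the expected votes."""
    return expected_votes // 2 + 1

EXPECTED_VOTES = 7                       # 3 (DC1) + 3 (DC2) + 1 (DC3)
threshold = quorum_threshold(EXPECTED_VOTES)
print("threshold:", threshold)           # threshold: 4

# Whichever side DC3's node joins gets 3 + 1 = 4 votes and is quorate;
# the other side has 3 votes and is not. Both outcomes are "valid",
# so nothing in the vote math itself prefers DC1 over DC2.
with_dc3, without_dc3 = 3 + 1, 3
print(with_dc3 >= threshold, without_dc3 >= threshold)   # True False
```

The vote math alone can't break the tie; which side DC3's node joins seems to be decided by totem membership timing, not by votequorum.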

My expectation is that:

  • Corosync / votequorum should deterministically choose one partition
  • Or enforce a consistent tie-break mechanism via DC3

But instead, I observe split and inconsistent cluster states, rather than a clean decision.

Questions

  1. Is this behavior expected when two partitions have equal vote weight (4 vs 4)?
  2. How does Corosync/votequorum handle this kind of “dual-majority via shared tie-breaker” scenario?
  3. Is it correct to assume that this topology is inherently ambiguous without additional constraints?
  4. Should DC3 be configured differently (e.g. qdevice/qnetd instead of a full node)?
  5. Are there recommended best practices for this kind of 3-site stretched cluster to avoid this ambiguity?
  6. Could this be related to timing/race conditions in membership formation?
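To make question 4 concrete: what I have in mind is replacing the DC3 node with a corosync-qnetd host, so the remaining 6 nodes plus the qdevice could use the ffsplit algorithm, which (as I understand it) deterministically grants the tie-breaker vote to exactly one side of a 50/50 split. A rough sketch of the corosync.conf addition (host address hypothetical):

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        net {
            tls: on
            # qnetd daemon running in DC3
            host: 10.0.3.10
            # ffsplit grants the vote to exactly one partition
            algorithm: ffsplit
        }
    }
}
```

Does that match how others have solved this?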

Additional notes

  • Ceph stretch mode behaves correctly in all scenarios
  • The issue seems isolated to Corosync quorum / cluster membership
  • No fencing/STONITH currently configured (not sure if relevant in this case)

Any guidance, design recommendations, or similar experiences would be very helpful.

Thanks!