Routed stretched cluster

Jun 18, 2025
We are currently setting up two separate PVE clusters. Both are HCI clusters running Ceph. The servers are distributed across two colocation facilities in the same city, connected via a low-latency data centre interconnect.

One cluster consists of 12 nodes in each data centre, while the second cluster consists of 3+3 nodes.

We also have access to a third data centre where we would like to deploy the tiebreaker nodes. As I understand it from the Corosync and Ceph documentation, connecting the tiebreaker nodes via routed networks should be possible.

However, the Proxmox configuration dialogs appear to require that the Corosync and Ceph IP addresses reside within the same Layer 2 network.

This raises several questions:

1. Is a stretched Layer 2 network across all three sites actually required?
2. Has anyone successfully deployed tiebreaker nodes over routed connections?
3. Are there any plans or recommendations from the Proxmox development team to officially support or simplify this setup?
 
Thank you very much! So technically it works, but the stretched cluster topology is not officially supported.

One more question on this: Is your Corosync also routed? Are any special settings required for this as well?

Does the Proxmox team have any plans to support such scenarios? I think that clusters reaching a certain size have often been, and will continue to be, designed in such topologies with a tiebreaker node that is only accessible via routing.

And as Proxmox VE increasingly becomes a serious alternative to vSphere, Nutanix and many others, such configurations will become more and more common.
 
Hey @timo.reimann

The Q-Device and the witness Ceph monitor both reside on the same node. Both are accessed via routing.
So yes, Corosync is also routed between my witness and the cluster nodes.

As long as you have not added the witness node as a full cluster / Corosync member, routing should be fine.
According to the Proxmox Wiki, the latency requirement for Corosync is <= 5 ms. I would normally aim for 2 ms.
The Q-Device tolerates far more latency and can even sit on the Internet: as long as the connection is reliable when a node fails and latency stays below ~100 ms (I think it can even be more), you should be fine. The Q-Device vote only becomes decisive when the votes are split evenly, i.e. when the sites lose connection to each other. Otherwise it does not really matter for quorum, and as long as the connection between the sites is fine, a downtime of the Q-Device has no impact.
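To make the vote arithmetic concrete, here is a minimal sketch in Python. It assumes one vote per node, a single vote for the Q-Device, and a strict-majority quorum rule. The function names are made up for illustration and are not part of any Proxmox or Corosync API.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for a strict majority of the expected votes."""
    return expected_votes // 2 + 1


def site_is_quorate(site_votes: int, total_node_votes: int,
                    qdevice_votes: int = 0, gets_qdevice: bool = False) -> bool:
    """Check whether one isolated site still holds a majority.

    Assumption: every node has one vote and the Q-Device (if present)
    adds `qdevice_votes` to the expected total.
    """
    expected = total_node_votes + qdevice_votes
    votes = site_votes + (qdevice_votes if gets_qdevice else 0)
    return votes >= quorum_threshold(expected)


# 12+12 cluster, sites isolated from each other, no Q-Device:
# neither half reaches 13 of 24 votes.
print(site_is_quorate(12, 24))

# Same split with a one-vote Q-Device: the site it sides with has 13 of 25.
print(site_is_quorate(12, 24, qdevice_votes=1, gets_qdevice=True))
print(site_is_quorate(12, 24, qdevice_votes=1, gets_qdevice=False))
```

With all 24 nodes connected and the Q-Device down, the cluster still has 24 of 25 votes, which matches the point that a Q-Device outage alone does no harm.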

A big BUT is still present. If you have two sites with the nodes and a third site with the witness (Q-Device and Ceph monitor), you currently have two separate quorum / witness systems...

If the two sites lose connection to each other, but both can still reach the witness (the asymmetric network topology described here: https://pve.proxmox.com/wiki/Stretch_Cluster#Scope), you could potentially run into a cluster-down scenario IF the Q-Device decides for site A and the Ceph monitor decides for site B.
The cluster will eventually recover on its own (if all the nodes reset on the site opposite to the one the Q-Device decided for, Ceph will eventually fail over to the remaining site), but potentially with all VMs restarted. That is not guaranteed, however, and it could cause issues again if the nodes come back afterwards. Normally you take one site offline manually if the connection is down for longer.
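The mismatch can be enumerated in a small, purely abstract sketch in Python. It treats the Q-Device and the Ceph tiebreaker monitor as two independent deciders that each pick site "A" or "B". It does not model how either component actually makes its choice, and the function name is hypothetical.

```python
from itertools import product
from typing import Optional


def usable_site(qdevice_choice: str, ceph_choice: str) -> Optional[str]:
    """Return the site that keeps both PVE quorum and Ceph quorum, or None.

    Assumption: a site can only keep running VMs if both witnesses
    sided with it.
    """
    return qdevice_choice if qdevice_choice == ceph_choice else None


# Enumerate all four combinations of the two independent decisions.
outcomes = {(q, c): usable_site(q, c) for q, c in product("AB", repeat=2)}

for (q, c), site in outcomes.items():
    status = f"site {site} keeps running" if site else "cluster down"
    print(f"Q-Device -> {q}, Ceph monitor -> {c}: {status}")
```

Two of the four combinations leave no site with both quorums, which is why the behaviour cannot be called deterministic without coupling the two decisions.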

I have not found a way to reliably get deterministic behavior in such a scenario, as this would require either coupling the decisions of the Q-Device and the Ceph monitor, or a priority option for which site the Ceph monitor should decide for (which goes against how the quorum system works in Ceph).
I am not a software developer, so this is sadly out of my scope.
You can set which site (via Proxmox node ID) the Q-Device decides for (by default, the one with the lowest node ID), but sadly not for the Ceph monitor.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_tie_breaking

Does the Proxmox team have any plans to support such scenarios?

If you can live with these limitations and the possible downtime, this solution works perfectly fine, despite only being halfway officially supported.
(An external Q-Device outside the LAN is supported and mentioned here. The external Ceph monitor should be a disallowed_leader; this is also officially supported by Ceph, but it is not neatly implemented in Proxmox and requires manual configuration via the shell.)
You will still receive support from Proxmox if you have a suitable subscription. They won't decline to help you just because you have a more advanced setup.

Cheers.
 