3 node cluster with nodes in two different server rooms

brucexx

Does anybody have any experience with putting cluster nodes in different server rooms? I have several buildings and was wondering what latency is acceptable for a cluster to operate without any issues. The buildings are connected via 10 Gbps fiber and latency is very low, 1-2 ms. What is the maximum latency that still allows the cluster to operate properly? The nodes would be within the same network/subnet/VLAN. The placement of the nodes is meant to prevent a power problem in one building from bringing everything down. I am planning for 3 nodes with replication.

Thank you
 
We have a stretched cluster in production with two nodes located in the main data center room and one node in a separate building. All NICs are directly connected to the core switches, with each server equipped with 8 x SFP+ interfaces.
We use Ceph for storage with a replication factor of 3. Latency is consistently below 2 ms, and the system operates reliably. We are aware that, in the event of a failure of two nodes, production would be affected; data integrity, however, would remain intact. Acceptable latency should be below 5 ms.
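If you want to verify your own links before trusting them, a minimal sketch (the hostname is a placeholder; corosync is the latency-sensitive component here):

    # Measure sustained round-trip latency to the node in the other room:
    ping -c 100 -i 0.2 node-b.example.com

    # On a running cluster, show corosync link status per ring:
    corosync-cfgtool -s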
 
Ok, and what if there is no "primary" server room? Take a scenario with two identical locations (e.g. two different rooms in two buildings next to each other) where I want either room to be able to take over immediately. Imagine a power outage in one building, a fire, water damage, whatever. Or maybe a scheduled downtime in one room: an electrician comes over to work on the electrical panel and has to switch off everything.

How would I design such a setup? (Proxmox HCI with Ceph)

Do I put at least 2 servers in each room, or 3? Do I have to set up a witness device in one of the rooms?
What if the room with the witness device has a power outage? How does the room without the witness know it has to take over?
Can I control it manually?
 
You would need a third server room and split the nodes across them, one node per server room (and at least 3 fire compartments). I would not mess with the internal corosync and Ceph logic; just build the setup so that it is sufficient.
 
How would I design such a setup? (Proxmox HCI with Ceph)
Ceph needs at least 3 working servers, AFAIK: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/
Do I put at least 2 servers in each room, or 3? Do I have to set up a witness device in one of the rooms?
You need more than half of the servers in the room that did not have the accident. Or put half of the servers in each room and put the QDevice in a third location (near the local router, for example, or another location where the networks of both rooms connect to each other).
What if the room with the witness device has a power outage? How does the room without the witness know it has to take over?
If more than half of the servers are running then you don't need the QDevice (which gives the deciding vote when exactly half are running/reachable) to have quorum. (The room without the witness will never automatically take over because that could cause a split-brain.)
Can I control it manually?
Yes you can and this can cause problems (like split-brain) if you set the expected number of votes too low.
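To make the vote arithmetic concrete, an illustrative example: with 3 nodes per room plus a QDevice there are 7 votes in total, and quorum requires a majority of 4. If one room fails, the 3 surviving nodes plus the QDevice still hold 4 votes and keep quorum. Without the QDevice, the surviving 3 of 6 votes would be exactly half, which is not a majority.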
 
I have two rooms and each of them CAN fail if there is a fire or flood. Not both, because they are in separate buildings.
If I had a crystal ball and knew beforehand, I could grab my fire extinguisher and prevent the fire from destroying the room. :-D

But of course I don't know which room will fail (except for a scheduled downtime).

So how would I design it?

Building a third room is not an option....
 
So how would I design it?

Building a third room is not an option....
Then you cannot use PVE and/or CEPH for this. If you have 2 of the 3 nodes in one room and that room "fails", you will no longer have a working setup: the remaining node cannot act without quorum.

If you only run PVE without CEPH, you could run a QDevice in a third server room or closet or wherever, as long as it is in a third fire compartment relative to the two other server rooms. Another question then arises: what storage would you use in such a setup?

A QDevice does not exist for ceph, so three is the minimum here and you need an odd number of nodes.
 
I know dozens of companies that have split their server rooms across two separate locations. All major storage providers (NetApp, DataCore, HPE, Dell, etc.) have solutions for that. I think they all have some sort of synchronous mirror that can span two locations.

I can't believe that you cannot use PVE/CEPH, not because there are not enough resources available to run all the VMs (that is just a matter of how many hosts you buy), but simply because there has to be an odd number of hosts? O_o
 
I can't believe that you cannot use PVE/CEPH, not because there are not enough resources available to run all the VMs (that is just a matter of how many hosts you buy), but simply because there has to be an odd number of hosts? O_o
You need a majority (half is not good enough) of the nodes being able to reach each other (with low latency): https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_quorum

EDIT: If half (or more) of the nodes disappear, you can manually set the expected number of votes lower (and remove those nodes). For safety, this is not done automatically.
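As a concrete sketch of that manual step (the vote count is illustrative; run this on a node in the surviving room only once you are sure the other nodes are really gone):

    # Lower the expected vote count so the surviving nodes regain quorum:
    pvecm expected 3

    # Verify quorum and membership afterwards:
    pvecm status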
 
You need a majority (half is not good enough) of the nodes being able to reach each other (with low latency): https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_quorum

EDIT: If half (or more) of the nodes disappear, you can manually set the expected number of votes lower (and remove those nodes). For safety, this is not done automatically.

That is something I could live with. I mean, if the building is on fire, there has to be some "manual intervention" anyway; firefighting and the like :p

So imagine a setup with 6 hosts in total and 3 hosts on each side.

3 hosts alone have enough CPU+RAM to run all my VMs.

If one room goes dark, I would have to make sure that all my data not only has at least two copies on two hosts, but that there is also at least one copy in each room, I guess?

Is there a way to tell ceph to keep copies of my data that way?

If that is the case, I would be able to lower the number of votes in my surviving room to 3 and start up all my VMs again without losing any data?
 
That is something I could live with. I mean, if the building is on fire, there has to be some "manual intervention" anyway; firefighting and the like :p
Proxmox automatically handles losing a few nodes due to hardware failures. If you lose many nodes at once (like in your scenario), then manual intervention might be needed. Please note that losing lots of nodes at once is most commonly caused by a switch failure (preventing nodes from reaching each other) or by a human making a configuration mistake on a managed switch, which is much more common than a fire or flood.
Building a third room is not an option....
I doubt that your two rooms connect to the internet via different modems. Most likely, there is a local router connected to a modem (or maybe two for failover) in a closet somewhere. You could put a Raspberry Pi that runs the QDevice inside that same cupboard. It does not need to run VMs; it just needs to detect which room did not burn down.
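For reference, the QDevice setup is lightweight; a minimal sketch, assuming a Debian-based Pi reachable at 192.168.1.10 (the address is a placeholder):

    # On the Raspberry Pi: run the external vote daemon.
    apt install corosync-qnetd

    # On every cluster node: install the QDevice client.
    apt install corosync-qdevice

    # On one cluster node: register the Pi as the tie-breaking vote.
    pvecm qdevice setup 192.168.1.10

    # Confirm the QDevice vote is counted:
    pvecm status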
 
Which sort of clustering? There are several clustering technologies in PVE. Most clustering technologies that rely on consensus or replication need three or more nodes, and you have that.

You can create a PVE "cluster" and not use HA. Is it a cluster?

You have already had some answers here about specific latency limits, which is odd because we have not even agreed on what we are talking about!

Broad brush: you mention 10 Gb/s links, so assuming the hosts also have 10 Gb links, you will be fine. I run several "stretched clusters" over 1 Gb/s copper for replication purposes. I have run live storage migrations over that too, and all was good.

Yes I am dealing with ex-VMware folk! Thankfully, PVE delivers.

Just to be clear: there is no stipulated maximum latency for any of the clustering technologies per se. However, depending on which one you use and what you are doing, you might want to impose your own limits. By default, a Ceph cluster will mark an absent OSD out after ten minutes (mon_osd_down_out_interval defaults to 600 seconds), so that is the effective outage tolerance for Ceph.
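If you want to inspect or tune that timeout, something like this should work (a sketch; option names may vary between Ceph releases):

    # Show the current interval (in seconds) before a down OSD is marked out:
    ceph config get mon mon_osd_down_out_interval

    # Example: raise it to 15 minutes for a planned maintenance window.
    ceph config set mon mon_osd_down_out_interval 900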

You'll be fine.
 
EDIT: If half (or more) of the nodes disappear, you can manually set the expected number of votes lower (and remove those nodes). For safety, this is not done automatically.
This would work for PVE, yet not for CEPH, which uses its own cluster technology. CEPH becomes read-only if it loses quorum on the data, and fails entirely if all nodes holding the data go offline.

If it allowed operation with only one copy, what would happen when the other, currently-down part of the cluster comes back online (e.g. after a switch misconfiguration) and now holds the majority vote on the data? The more recent data would be overwritten by the majority vote.


I know dozens of companies that have split their server rooms across two separate locations. All major storage providers (NetApp, DataCore, HPE, Dell, etc.) have solutions for that. I think they all have some sort of synchronous mirror that can span two locations.
Sure, then use one of those instead of CEPH and you're golden. CEPH is not designed to do this, so it is the wrong tool for the job given your requirements.


If one room goes dark, I would have to make sure that all my data not only has at least two copies on two hosts, but that there is also at least one copy in each room, I guess?
If you set it up like that in your CRUSH map, this could be doable. It is not part of the PVE UI; you need to manually tune the CRUSH map.
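Roughly, the idea is to add room buckets to the CRUSH hierarchy and write a rule that places replicas per room. A minimal sketch, assuming two rooms named room-a and room-b and a replicated pool of size 4 (all names are placeholders):

    # Create room buckets and hang them under the default root:
    ceph osd crush add-bucket room-a room
    ceph osd crush add-bucket room-b room
    ceph osd crush move room-a root=default
    ceph osd crush move room-b root=default

    # Move each host into its room (repeat per host):
    ceph osd crush move node1 room=room-a

    # Export, edit, and re-import the CRUSH map to add a rule like:
    #   rule replicated_rooms {
    #       id 1
    #       type replicated
    #       step take default
    #       step choose firstn 0 type room
    #       step chooseleaf firstn 2 type host
    #       step emit
    #   }
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # ... edit crush.txt ...
    crushtool -c crush.txt -o crush-new.bin
    ceph osd setcrushmap -i crush-new.bin

With pool size 4 and that rule, each room holds two complete copies of every object.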


If that is the case, I would be able to lower the number of votes in my surviving room to 3 and start up all my VMs again without losing any data?
Yes, a read-only copy (if the CRUSH map keeps the data on both "sub-clusters"), but writing to it and working with it: probably not.