Okay, so the overall layout will be something like this:
Code:
            ┌──────────┐
            │  Room 3  │
            │ ┌──────┐ │
     ┌──────┤ │Node 5│ ├──────┐
     │      │ └──────┘ │      │
     │      │          │      │
     │      └──────────┘      │
┌────┴─────┐            ┌─────┴────┐
│  Room 1  │            │  Room 2  │
│ ┌──────┐ │            │ ┌──────┐ │
│ │Node 1│ │            │ │Node 3│ │
│ └──────┘ │            │ └──────┘ │
│          ├────────────┤          │
│ ┌──────┐ │            │ ┌──────┐ │
│ │Node 2│ │            │ │Node 4│ │
│ └──────┘ │            │ └──────┘ │
│   ...    │            │   ...    │
└──────────┘            └──────────┘
Whether you use room, site or datacenter as the separation level in the Crush map doesn't really matter, as long as you are consistent.
To create the hierarchy in the Crush map, first create the rooms:
Code:
ceph osd crush add-bucket {room name} room
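For the example layout above, that could be (room names are just examples; the tie-breaker node usually does not hold any OSDs, so it does not need a room bucket):
Code:
ceph osd crush add-bucket room1 room
ceph osd crush add-bucket room2 room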
Then move the rooms underneath the root (default) of the Crush map:
Code:
ceph osd crush move {room name} root=default
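Continuing with the example room names:
Code:
ceph osd crush move room1 root=default
ceph osd crush move room2 root=default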
And last, move the nodes into their respective rooms:
Code:
ceph osd crush move {node} room={room}
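Assuming the hostnames match the diagram (node1 and node2 in room 1, node3 and node4 in room 2):
Code:
ceph osd crush move node1 room=room1
ceph osd crush move node2 room=room1
ceph osd crush move node3 room=room2
ceph osd crush move node4 room=room2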
After that, you should see the hierarchy in the Ceph->OSD panel, or in the output of the following command:
Code:
ceph osd df tree
Then edit the Crush map (the Stretch cluster documentation covers how). You can either use the Crush rule shown there, or the following, which also works:
Code:
rule replicated_room_host {
    id {RULE ID}
    type replicated
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 2 type host
    step emit
}
What both rules do is make sure that 2 replicas are stored per site/room.
The pools will then need a size/min_size of 4/2.
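The rough workflow to get the rule into the cluster and point a pool at it could look like this (file and pool names are just examples; the stretch cluster documentation describes the Crush map editing in more detail):
Code:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# add the rule to crushmap.txt with a text editor, then recompile and inject it
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin

ceph osd pool set {pool name} crush_rule replicated_room_host
ceph osd pool set {pool name} size 4
ceph osd pool set {pool name} min_size 2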
Now, Proxmox VE specifics: we don't want any VMs to end up on the tie-breaker node. There are a few ways to accomplish that. You could limit the storages to the non-tie-breaker nodes. This way, even if HA were to place a VM on the tie-breaker node, it wouldn't be able to start there, and HA would relocate it to another node.
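As a sketch, assuming an RBD storage called ceph-vms on a pool vm_pool and hostnames node1 to node4, the entry in /etc/pve/storage.cfg could look like this (the node restriction can also be set in the GUI under Datacenter -> Storage):
Code:
rbd: ceph-vms
        pool vm_pool
        content images,rootdir
        nodes node1,node2,node3,node4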
Additionally, you can make use of HA groups to "restrict" the VMs to the nodes in each room/site. By assigning priorities, you can control on which half of the cluster the VMs should preferably run.
So you could have 2 HA groups, each with higher priorities set for the nodes in one of the rooms.
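A minimal sketch with the example node names (group names are made up; higher numbers mean higher priority):
Code:
ha-manager groupadd prefer_room1 --nodes "node1:2,node2:2,node3:1,node4:1"
ha-manager groupadd prefer_room2 --nodes "node1:1,node2:1,node3:2,node4:2"
# assign a VM, e.g. VM 100, to one of the groups
ha-manager add vm:100 --group prefer_room1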
After you have set it up, you can confirm that the replicas are split between the rooms by investigating the placement groups.
Code:
ceph pg ls
This will list all PGs and the OSDs they are on. You can then verify that the OSDs in use are as expected: 2 OSDs from room 1 and 2 from room 2.
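To narrow it down to one pool and to map OSD IDs back to their host (and therefore room), something like this should work (pool name is again just an example):
Code:
ceph pg ls-by-pool {pool name}
ceph osd find {osd id}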
And before you take it into production, test the different failure scenarios, including failures of the network connection between the two locations in all possible variants, so you get a good understanding of how the Proxmox VE + Ceph cluster behaves. Some half-broken network constellations can, for example, show up as weird behavior in Ceph.