Ceph configuration "which node is on which UPS" for ideal replication?

Apollon77

Well-Known Member
Sep 24, 2018
Hi All,

I have a 6-node Proxmox setup with Intel NUCs in my house as an HA cluster. Currently I use GlusterFS to provide a shared FS, but I am thinking about changing to Ceph.
The NUCs are spread over 3 places in the house, each with its own UPS, and together they provide one big "distributed replicated storage" - meaning I use differently sized disks that add up to 1TB per location.
So to simplify it imagine:
* location 1 (3 nodes): 250GB+250GB+500GB
* location 2 (2 nodes): 500GB+500GB
* location 3 (1 node): 1TB

I understood that Ceph works fine with differently sized storage devices as a basis and will normally handle all of that itself.
If I changed that setup to Ceph and still wanted a 3x replicated storage ... how (and where) would I tell Ceph about the "3 places with UPSes"? In fact, with this I am defining a kind of "location-separated storage groups", and Ceph should ideally respect that when deciding how to replicate the data.
I was reading the Ceph docs and yes, it is mentioned that you can provide information about rack, datacenter room and the like in one of "the maps" it has ... but I was not able to find out how to really define such a structure (and also not how/where to do that in the Proxmox UI) :)

it would be awesome if anyone could show me the right direction or provide examples.

Thank you,

Ingo
 
The default behavior of Ceph is to have each replica on a different node to make sure that you can tolerate the loss of a full node.

The CRUSH map tells Ceph how the cluster is arranged. By default, all nodes reside directly under the default root, but you can create more complicated hierarchies that reflect your cluster. Then you need to create a new rule that uses that hierarchy to decide where to place the replicas.
Which hierarchy level you use is up to you. Room would be a fitting one.
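Just to visualize it: with a room level added, the hierarchy would look roughly like this (the room and host names here are only placeholders for illustration):
Code:
root default
├── room room1
│   ├── host nuc1
│   ├── host nuc2
│   └── host nuc3
├── room room2
│   ├── host nuc4
│   └── host nuc5
└── room room3
    └── host nuc6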

There are a few things to consider and be aware of. First, you will be running two clusters: the Proxmox VE cluster and the Ceph cluster, which is managed by PVE.

The Ceph Monitors (MONs) work similarly to PVE in that they also form a quorum (majority). You can place one MON in each room. That way you can lose one room and the Ceph cluster will keep working, MON-wise and replica-wise.
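If you let PVE manage Ceph, creating a MON is a one-liner run on the node that should host it (on older PVE versions the command was pveceph createmon):
Code:
pveceph mon create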

On the PVE side though, things are a bit more complicated. Each node has a vote in the cluster and therefore, if you lose half of the nodes, you have a problem. How big a problem depends on whether you use HA or not.
If you lose room 3 or room 2, you still have more than 50% of the votes, but room 1 contains half the nodes in the cluster...
I know you placed them this way to have roughly the same disk space available in each room, but you should also consider the available votes.
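You can always check the current vote and quorum situation on the PVE side with:
Code:
pvecm status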


Creating the CRUSH map:
Check out the Ceph docs about the CRUSH map: https://docs.ceph.com/en/quincy/rados/operations/crush-map/#add-a-bucket

Add your rooms:
Code:
ceph osd crush add-bucket <room> room
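For example, with three hypothetical room names:
Code:
ceph osd crush add-bucket room1 room
ceph osd crush add-bucket room2 room
ceph osd crush add-bucket room3 room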

Move them to be part of root=default:
Code:
ceph osd crush move <room> root=default
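Continuing the example:
Code:
ceph osd crush move room1 root=default
ceph osd crush move room2 root=default
ceph osd crush move room3 root=default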

Move your nodes into their rooms:
Code:
ceph osd crush move <node> room=<room>
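For example, assuming the hosts are named nuc1 through nuc6 (substitute your actual node names):
Code:
ceph osd crush move nuc1 room=room1
ceph osd crush move nuc2 room=room1
ceph osd crush move nuc3 room=room1
ceph osd crush move nuc4 room=room2
ceph osd crush move nuc5 room=room2
ceph osd crush move nuc6 room=room3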

You should be able to see the changed relations in the CRUSH map itself (Ceph->Configuration in the GUI), when you check the OSDs in the GUI, or when you run ceph osd df tree.

Next, you need a matching CRUSH rule. This is where it gets a bit more complicated.
To edit the CRUSH map directly as a text file:
Code:
ceph osd getcrushmap > crush.map.bin
crushtool -d crush.map.bin -o crush.map.txt

Then you can add a new, more complicated rule directly. One that should do what you want (place one replica per room) could look like this:
Code:
rule replicated_rooms {
    id <X>                              # next free rule ID
    type replicated
    min_size 2
    max_size 3
    step take default                   # start at the default root
    step choose firstn 0 type room      # pick as many rooms as there are replicas
    step chooseleaf firstn 1 type host  # in each room, pick one host and an OSD on it
    step emit
}

Make sure that you change the "id" of the rule to the next free number!
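If you are not sure which rule IDs are already taken, you can list the existing rules with:
Code:
ceph osd crush rule dump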
Once you are done, convert the text file back to binary format and apply it:
Code:
crushtool -c crush.map.txt -o crush2.map.bin
ceph osd setcrushmap -i crush2.map.bin
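Optionally, crushtool can also simulate where the replicas would end up with the new rule. You can run this against the compiled map (crush2.map.bin) even before applying it, with <X> being the rule ID:
Code:
crushtool -i crush2.map.bin --test --rule <X> --num-rep 3 --show-mappings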

Apply the rule to your pool(s) - either in the GUI when editing the pool, or on the CLI - and once Ceph is done rebalancing, check whether the OSDs used by the PGs really are in different rooms.
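A CLI sketch for assigning the rule (the pool name is a placeholder):
Code:
ceph osd pool set <pool> crush_rule replicated_rooms

To then check the PG placement: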
Code:
ceph pg ls
or any of the other ls-by-xxx variants. The UP and ACTING columns are what you are interested in here.
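For a single pool, for example:
Code:
ceph pg ls-by-pool <pool>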
 
Thank you @aaron for all the details!

For PVE: the reality is a 3+3+1 setup, so quorum-wise PVE works as expected (I just simplified it for the Ceph question above) - and yes, I already have HA here on this basis (just with GlusterFS). And yes, the goal is to allow one "room" (which is also its own power phase and UPS) to fail and still have a working system. More than that I will not really be able to achieve :)

With your details in mind, I would also simply put a Ceph MON on each PVE host ... because besides "room" I still have the "single NUC" level, and losing a big part of the MONs just because one NUC goes offline doesn't feel that great. Or is there anything speaking against that?

In general that sounds like a plan :)
So maybe I will tackle that after upgrading all PVEs to 7 :) (yes, I have collected a "backlog")
 
Or is there anything speaking against that?
Yes: it is not needed from a performance perspective, and each additional MON service will eat CPU cycles and memory. If the MON on one NUC fails, you still have 2. If the outage persists for longer, you can easily create a new MON on another machine (in the same room).
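Replacing a failed MON with the PVE tooling is roughly (a sketch - destroy the old one, then run the create command on the node that should take over):
Code:
pveceph mon destroy <nodename>
pveceph mon create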
 
