Understanding CEPH in a 3 Node Cluster with 12 OSDs

Mar 23, 2024
Hi,

I would like to understand Ceph better, especially the usable storage size with a 3:2 replica setting. On a customer system we have this total Ceph usage:
CleanShot 2025-02-04 at 13.28.50@2x.png


The Ceph calculator tells me that I could only use 10.24TB in this configuration (3 nodes, 15.36TB per node, 3 replicas):
CleanShot 2025-02-04 at 13.35.05@2x.png


Does this mean that we're already in the "unsure area" in the customer example above? The OSD nearfull ratio recommended here is 0.67 - meaning we should get worried as soon as usage hits 67%? That would be my interpretation, but the usable storage size is much lower...

Thanks!
 
How many OSDs do you have in each host?

If you are only using size=3 replicated pools, your net space would be ~41.92TiB/3 = 13.97TiB. Keep in mind that, in general, Ceph will start complaining when an OSD is 85% full, so the net usable capacity while still within safe margins would be around 75% of 13.97TiB, i.e. ~10.5TiB. This space is further reduced if you want to allow Ceph to self-heal when a disk breaks: by default, it will start recreating the third replica stored on the failed drive in the remaining drive(s) of the same host (if any, hence my first question), so you need to account for that overhead too, or disable self-healing to force Ceph to wait for an admin to decide what to do. Whatever you choose, make sure you don't end up with your OSDs too full, as it can get tricky to recover if you can't add drives or delete data.
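As a quick sanity check, that arithmetic can be sketched in Python (a toy calculation; the 41.92 TiB raw figure and the 85%/75% thresholds are just the numbers from this thread):

```python
# Toy usable-capacity estimate for a size=3 replicated pool, using the
# figures from this thread: 41.92 TiB raw across all OSDs, and filling
# OSDs to ~75% on average to stay clear of the default 85% nearfull
# warning threshold.
raw_tib = 41.92
replicas = 3
target_fill = 0.75

net_tib = raw_tib / replicas          # capacity after 3x replication
safe_tib = net_tib * target_fill      # capacity while staying within margins

print(f"net: {net_tib:.2f} TiB, safe: {safe_tib:.2f} TiB")
# net: 13.97 TiB, safe: 10.48 TiB
```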
 
Thanks a lot!

We have 4 OSDs in each host, so 12 in total.

If you are only using size=3 replicated pools, your net space would be ~41.92TiB/3 = 13.97TiB. Keep in mind that, in general, Ceph will start complaining when an OSD is 85% full, so the net usable capacity while still within safe margins would be around 75% of 13.97TiB, i.e. ~10.5TiB.
Ah, so the usable storage is per host/node!? Meaning in our example we're talking about 31.5TB in total for the cluster, or am I missing something?

Self-healing should of course be allowed.
 
Ah, so the usable storage is per host/node!? Meaning in our example we're talking about 31.5TB in total for the cluster or am I missing something?
If you mean the total Ceph usage, then yes, 10.5TiB * 3 = ~31.5TiB. The usable capacity is for the whole cluster: you store 3 copies of every bit, one copy on each of your 3 servers. That's why I divided the whole gross capacity of all drives by 3. If you store 1GB of VM data, that will use 3GB in the Ceph OSDs, 1GB on an OSD of each host.

If one server breaks, Ceph will not self-heal, as it has nowhere to create a third replica on a third host (you only have 2 left). But if one of those 3.84T drives breaks, Ceph will automatically rebalance to the remaining 3 disks of that host. This is effectively the same as having just 3 disks in each host.

Math would be like this for your 3 node, 4 OSD per host, cluster:

Usable total capacity: ~41.92TiB/3 = 13.97TiB
Leave a 25% headroom to allow OSDs to fill up to 75% on average: 13.97TiB * 0.75 = ~10.5TiB usable capacity
To allow Ceph to self-heal in case one of the four drives fails and remain under 75% usage: 10.5TiB * 0.75 = ~7.86TiB usable capacity

Yes, you can only use ~45% of the total capacity with your configuration. More nodes (5+, as using four nodes doesn't really help capacity-wise) and/or more drives and/or disabling self-healing change the math and allow for more usable capacity.
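The same math as one Python snippet (again just restating the numbers above, nothing cluster-specific):

```python
# 3 nodes, 4 OSDs per host, 41.92 TiB raw, size=3 replicated pool.
raw_tib = 41.92
net_tib = raw_tib / 3            # ~13.97 TiB after replication
headroom = net_tib * 0.75        # ~10.48 TiB: OSDs at 75% on average
self_heal = headroom * 0.75      # ~7.86 TiB: survives losing 1 of 4 OSDs
                                 # in a host and re-replicating onto the
                                 # remaining 3 OSDs while staying under 75%

print(f"{self_heal:.2f} TiB usable with self-heal headroom")
# 7.86 TiB usable with self-heal headroom
```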
 
If one server breaks, Ceph will not self-heal, as it has nowhere to create a third replica on a third host (you only have 2 left). But if one of those 3.84T drives breaks, Ceph will automatically rebalance to the remaining 3 disks of that host. This is effectively the same as having just 3 disks in each host.
No, Ceph will self-heal and start to create a 3rd copy of everything on the remaining 2 nodes (if noout was not set in Ceph). So, with 3 nodes, a maximum of ~66% is usable, because if a complete node is lost, the remaining ~33% of capacity will be used for 3rd replicas. And with a 3:2 setting, the pool will be read-only until the 3rd replicas are fully created on the remaining 2 hosts.

Math would be like this for your 3 node, 4 OSD per host, cluster:

Usable total capacity: ~41.92TiB/3 = 13.97TiB
Leave a 25% headroom to allow OSDs to fill up to 75% on average: 13.97TiB * 0.75 = ~10.5TiB usable capacity
To allow Ceph to self-heal in case one of the four drives fails and remain under 75% usage: 10.5TiB * 0.75 = ~7.86TiB usable capacity

Yes, you can only use ~45% of the total capacity with your configuration. More nodes (5+, as using four nodes doesn't really help capacity-wise) and/or more drives and/or disabling self-healing change the math and allow for more usable capacity.
I think this calculation is also not fully correct. You must leave a total of 33% empty because of the above, so you have 33% of space for self-healing, to lose an OSD or a complete node (exactly: any 4 OSDs can be lost at the same time if all hosts have 4 OSDs each). If you lose a node _and_ one OSD on the remaining 2 nodes (5 OSDs lost in total), the pool will be inaccessible because of not enough replicas (3 replicas : 2 minimum).

So with 3 nodes, 15 TB each, the total usable capacity with 3 replicas will be ~10 TB, safe nearfull ratio at 0.67, efficiency at 22%.
If you add a 4th node with 15 TB, the usable capacity will be ~15 TB and nearfull ratio can be raised to 0.75, efficiency at 25%.
With a 5th node with 15 TB, the usable capacity will be ~20 TB and the nearfull ratio can be raised to 0.8, efficiency at 27%.
Of course, with the addition of the 4th and 5th nodes, the replicas stay at 3. And you can still only lose 1 complete node.
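If it helps, the pattern behind these three data points can be written as a small helper. Note the formula is my own generalization of the figures above (nearfull ratio = (N-1)/N, so one full node's worth of third replicas fits in the survivors), not anything official:

```python
# Usable capacity of a size=3 replicated pool on n equal nodes, keeping
# enough free space to re-create one lost node's replicas on the rest.
# Assumption (my generalization of this thread's figures):
# safe nearfull ratio = (n - 1) / n.
def usable_tb(nodes: int, tb_per_node: float, replicas: int = 3) -> float:
    nearfull = (nodes - 1) / nodes
    return nodes * tb_per_node / replicas * nearfull

for n in (3, 4, 5):
    print(f"{n} nodes: ~{usable_tb(n, 15):.0f} TB usable")
# 3 nodes: ~10 TB usable
# 4 nodes: ~15 TB usable
# 5 nodes: ~20 TB usable
```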

Higher storage efficiency can be achieved with Erasure Coded pools, but that's not really an option below 7 nodes, and EC pools have other drawbacks compared to replicated pools that must be factored in.
 
I think this calculation is also not fully correct. You must leave a total of 33% empty because of the above, so you have 33% of space for self-healing,
The calculation is correct, and with three nodes it's not possible to self-heal in the way you describe, since there wouldn't be three living nodes available. In any case, you don't need to operate your cluster with the expectation of node removal; at least I never heard of such a requirement. If you're going to take a node down for an extended period, simply set noout.
 
The calculation is correct, and with three nodes it's not possible to self-heal in the way you describe, since there wouldn't be three living nodes available. In any case, you don't need to operate your cluster with the expectation of node removal; at least I never heard of such a requirement. If you're going to take a node down for an extended period, simply set noout.
I don't want to argue with you; I only want to understand what you said and why:
  • So, with 3 nodes, Ceph automatically sets the failure domain to node, not to OSD. You say this is not designed to lose a complete node...
  • The calculation I said was not correct states that the OP must plan with a total 50% capacity loss (25% + 25% reduction) of one node's capacity (with 3 nodes and 3 replicas). If this is the correct calculation, nobody would be willing to use Ceph to get only 7 TB of total usable space on 3 nodes from 46 TB raw... That would be only 15% efficiency.
Bennet Gallein's Ceph capacity calculator
Another Ceph capacity calculator with explanation

We use Ceph with/because of failures in mind. We plan on losing a node (this is why we use a minimum of 3 of them); this is the goal of planning redundancy - to survive this loss.
Ceph uses failure domains to distribute data to ensure that, if a failure domain entity is completely lost, the data remains safe and the pool functional. You can set the failure domain not only to OSD or node, but to rack, row or whole datacenter. The failure domain is the block you plan on losing. So, if you plan for datacenter and you have 3 datacenters and Ceph runs with a replica count of 3 across them, you can lose a complete datacenter (for example, a broken optical link) and Ceph will self-heal in the remaining 2 datacenters.
But when we talk about self-healing, Ceph is not interested in what failed and where (OSDs from different nodes or one whole node, etc.) when doing the healing. Ceph only creates the required 3rd replica for all PGs in the remaining free space. So Ceph will create the 3rd replicas on the remaining 2 nodes (if one complete node fails in a 3 node cluster) to restore the required number of replicas. Because of this, we must set the nearfull and full ratios to be safe with free space if this were to happen.

Chapter 14. Handling a node failure

Because of this, Ceph will not require more free capacity for healing than what is calculated from the node count and the required replicas. So we don't lose, and don't need to plan for losing, half of the replicated capacity (45 TB raw with 3 replicas = 15 TB; we don't plan on losing 7.5 TB of those 15 TB) just to meet healing requirements... If a 15 TB node fails (which holds the safe maximum of 10 TB of data with a 0.67 nearfull ratio), there will be 5 TB of free space on each of the remaining 2 nodes, and the lost 10 TB of 3rd replicas can be recreated in those 2 nodes' 5+5 TB of free capacity (in this case the 2 nodes will be completely full and must not be used like that, of course, but we don't lose the whole pool to out-of-space errors).
If we plan for 50% instead of 67%, we lose 17% of safely usable capacity for nothing. It's not prohibited to lose more storage by over-planning, of course, it's simply not profitable.

But I'm very interested in links that support the other calculation method, because if I understand this wrong, I have been constantly under-speccing my pools.
 
So, with 3 nodes, Ceph automatically sets the failure domain to node, not to OSD. You say this is not designed to lose a complete node
It's only automatic in the sense that the default CRUSH rule provided by pveceph does this, but you have to consider all the rest of the implications: with a 3:2 replicated pool, this means that you MUST have 3 nodes to have a completely healthy PG. If you only have two nodes, you cannot be whole.

As for losing nodes: a 3 node cluster cannot sustain a node failure, since there is nowhere for the replicas to redistribute to make the PGs whole. Also, "losing" a node in this context means PERMANENTLY; a node can be down for maintenance or a fault, and the cluster will continue to function degraded but cannot heal, which is what the "3:2" rule means (3 OSDs are required per PG to be whole, 2 OSDs are the minimum required for write operations).

The calculation I said was not correct states that the OP must plan with a total 50% capacity loss (25% + 25% reduction) of one node's capacity (with 3 nodes and 3 replicas). If this is the correct calculation, nobody would be willing to use Ceph to get only 7 TB of total usable space on 3 nodes from 46 TB raw... That would be only 15% efficiency.
Again, you cannot "regain" the capacity of a lost node with ONLY TWO SURVIVORS. It doesn't matter how much space it contained. If each node has 13.97T of space and the usable capacity is 13.97TB total, this will remain the same with a node down. If you haven't tripped your high water mark (typically 85% per OSD), you will continue to function normally with the entire usable capacity as before.

We plan on losing a node
A 3:2 CRUSH rule provides FAULT TOLERANCE for a down node (and possibly 2), but that doesn't mean it's meant for removal of nodes without consequence. The implication is that a downed node is down for a short while and then returned to service. If you do mean to permanently remove a node, that really should be done in a planned, deliberate manner or you will have a bad day. Even with more than three nodes, a redistribution of a node's worth of capacity doesn't just need remaining room in the surviving OSDs to absorb the data; your cluster also needs to be able to sustain the resulting rebalance storm.
 
3 node Ceph works beautifully, but has very limited recovery options and requires a lot of spare space in the OSDs if you want it to self-heal. I will try to explain myself to complement @alexskysilk's replies:

No, Ceph will self-heal and start to create a 3rd copy of everything on the remaining 2 nodes
It can't: the default CRUSH rule tells Ceph to create each replica on an OSD of a different host. If there are just two hosts, a third replica can't be created.
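For reference, the default replicated rule looks roughly like this when you decompile the CRUSH map (exact fields vary between Ceph releases); the `chooseleaf ... type host` step is what places each replica on a different host:

```
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```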

And with a 3:2 setting, the pool will be read-only until the 3rd replicas are fully created on the remaining 2 hosts.
All PGs will be left in "active+undersized" state and fully accessible. If any other OSD fails while 1 host is down, some PGs will be left with just one replica and will become inaccessible, pausing I/O on the VMs using those PGs (probably most if not all). After the default 10 minutes, the downed OSD will be marked OUT and Ceph will start recovery, creating a second replica on the remaining OSDs of that host.

So with 3 nodes, 15 TB each, the total usable capacity with 3 replicas will be ~10 TB, safe nearfull ratio at 0.67, efficiency at 22%.
Not if you want Ceph to self-heal on the remaining OSDs of that host. To clarify, I mean without replacing the failed drive. On all my 3 host Ceph clusters I disable self-heal, so I can start it manually once the failed drive is replaced and/or if there's enough free space in the OSDs.
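In case it's useful to others: the usual way to pause automatic healing is via the standard cluster flags (run on any node with the admin keyring), e.g.:

```
# don't auto-mark failed OSDs "out" (prevents automatic re-replication):
ceph osd set noout

# or block recovery/backfill entirely until re-enabled:
ceph osd set norecover
ceph osd set nobackfill

# undo once the failed drive has been replaced:
ceph osd unset noout
ceph osd unset norecover
ceph osd unset nobackfill
```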

If you add a 4th node with 15 TB, the usable capacity will be ~15 TB and nearfull ratio can be raised to 0.75, efficiency at 25%.
Not if you want Ceph to self-heal on the remaining 3 nodes. The fourth node is usually added just to provide full host redundancy, with the bonus of adding some usable capacity. Capacity will be equivalent to a 3 node cluster, but it adds the chance to keep 3 full replicas through a full host failure. Also, planning for a single OSD failure will have much less impact on the available capacity of the whole cluster, as those third replicas may be created on the remaining disks of the host with the failed OSD and on the fourth node. If you plan on self-healing while one OSD is down and one host is down, capacity will be exactly as with 3 nodes.

With a 5th node with 15 TB, the usable capacity will be ~20 TB and nearfull ratio can be raised to 0.8, efficiency at 27%
A fifth and further nodes do add capacity to a 3:2 replica pool, but the available space would be similar to what you calculated with 4 nodes (to allow recovery if one full node fails).

The calculation I said was not correct states that the OP must plan with a total 50% capacity loss (25% + 25% reduction) of one node's capacity (with 3 nodes and 3 replicas). If this is the correct calculation, nobody would be willing to use Ceph to get only 7 TB of total usable space on 3 nodes from 46 TB raw... That would be only 15% efficiency.
It seems none of those calculators take the number of OSDs per host and their size into account, nor the space required for recovery if one or more of them fails. They just seem to provide the recommended capacity during normal usage.

But when we talk about self-healing, Ceph is not interested in what failed and where (OSDs from different nodes or one whole node, etc.) when doing the healing
It absolutely cares about where to place the replicas, with whatever you set in the CRUSH map. To get the behavior you describe, the failure domain must be set to OSD (and that opens the chance of having all 3 replicas on the same host, which defeats the whole purpose of Ceph unless you really know what you are doing). Again, Ceph by default sets the failure domain to host: it creates one replica on different hosts and will never place more than one replica on the same host.

If a 15 TB node fails (which holds the safe maximum of 10 TB of data with a 0.67 nearfull ratio), there will be 5 TB of free space on each of the remaining 2 nodes, and the lost 10 TB of 3rd replicas can be recreated in those 2 nodes' 5+5 TB of free capacity (in this case the 2 nodes will be completely full and must not be used like that, of course, but we don't lose the whole pool to out-of-space errors).
On a 3 node cluster, no recovery will be started automatically if one node fails, due to the default CRUSH rule of "one replica on an OSD of a different host", so there is no chance for any other OSD to become full.

But I'm very interested in links that support the other calculation method, because if I understand this wrong, I have been constantly under-speccing my pools.
Okay, let's do it with an example to show why you must reserve space to plan for a single OSD failure, or disable automatic self-healing, on 3 node Ceph clusters:

- 3 hosts
- 2 OSDs each of 2TB (host1: OSD.0, OSD.1; host2: OSD.2, OSD.3; host3: OSD.4, OSD.5)
- 1 pool with 3 replicas, min_size 2
- 12TB gross space (3 hosts x 2 OSDs x 2TB), 4TB total usable space (12TB / 3 replicas).
- Ceph is 50% full, so 6TB used in the OSDs and 2TB of data.
- OSD usage will probably be around 45-55%.
  • Imagine that OSD.0 from host1 fails. The OSD service will stop and it will get marked DOWN.
  • Pool is available even if ~half of its PGs will be marked "undersized" because one of the three replicas is not available. Impact will be 1/2 of the data of the cluster.
  • Ceph will not recover anything just yet.
  • After mon_osd_down_out_interval (default 600 secs), OSD.0 will be marked OUT. A new CRUSH map will be generated with just one OSD on host1 and two OSDs in host2 and host3.
  • All OSDs will know that ~half the PGs are missing a third replica that must be created. Where shall it be created? On OSDs of 3 different hosts. Given that host2 and host3 already have a replica, the third replica must be created on the only working OSD of host1: OSD.1.
  • Given that OSD.1 was already at ~50% capacity, when creating the third replicas it will reach nearfull (85%), backfill_full (90%) and probably become 95% full, because OSD.0 was also at ~50% capacity. When an OSD is full, no write is possible to any PG for which that OSD is primary, which essentially means that most if not all VMs will halt I/O, taking down the service and rendering the cluster useless.
Having more OSDs per host will allow you to get more usable space: i.e. with 8 OSDs per host, a single OSD failure will impact just ~1/8 of the data and recovery can be done on the 7 remaining OSDs. It's quite late for me to do the math, but I hope you get the idea.
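The failure scenario above, as a tiny Python model (my own toy; it only tracks per-OSD fill levels, not actual PG placement):

```python
# Toy model of the example above: 2 OSDs of 2TB per host, all ~50% full.
# When OSD.0 is marked OUT, CRUSH still wants one replica per host, so
# OSD.0's data must be re-created on host1's only other OSD: OSD.1.
osd_size_tb = 2.0
start_fill = 0.50

osd1_used = osd_size_tb * start_fill   # data already on OSD.1
osd0_data = osd_size_tb * start_fill   # data that lived on failed OSD.0

osd1_fill = (osd1_used + osd0_data) / osd_size_tb
print(f"OSD.1 after recovery: {osd1_fill:.0%}")
# prints "OSD.1 after recovery: 100%", far past nearfull (85%) and full (95%)
```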