CEPH: small cluster with multiple OSDs per one NVMe drive

daubner

New Member
Jan 7, 2025
Hello community!

We have deployed our first small Proxmox cluster along with Ceph, and so far we've had a great experience with it. We're running a traditional VM workload (most VMs are idling and most of the Ceph load comes from bursts of small files, with the exception of a few SQL servers that can create some bigger transactions during peak business hours).

We're thinking about adding two more nodes to our three-node cluster. The HW is almost the same, with the exception of the NVMe drives, which are bigger.
On the current three-node cluster we have two OSDs created per NVMe (3 NVMes per node, 18 OSDs total). The additional NVMes would be 2-3 times the size of the current ones.

I'd like to hear your feedback and experience on whether or not we should keep setting up multiple OSDs per drive, and if so, how many OSDs per drive (especially in terms of CPU and RAM overhead).

Since we currently have 2 OSDs per drive and the new drives will be ~2.5 times the size of the current NVMes, we were thinking about setting up 5 OSDs on each of the new drives. Our concern is that this could eat up a lot of RAM and CPU, so loading the cluster with more VMs would lead to contention for those resources.
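As a rough way to frame that concern, here is a minimal sketch of the daemon overhead alone. The 4 GiB figure is the default osd_memory_target; the 2 cores per OSD is only an assumed planning allowance, and it assumes the new nodes also carry 3 NVMe drives:

```python
# Rough per-node overhead of the OSD daemons alone, assuming the default
# osd_memory_target of 4 GiB and an assumed allowance of 2 cores per OSD
# (both planning assumptions, not measurements).
OSD_MEMORY_TARGET_GIB = 4
CORES_PER_OSD = 2

def daemon_overhead(osds_per_node: int) -> tuple[int, int]:
    """Return (RAM in GiB, CPU cores) reserved for OSD daemons on one node."""
    return osds_per_node * OSD_MEMORY_TARGET_GIB, osds_per_node * CORES_PER_OSD

print(daemon_overhead(3 * 2))   # current nodes: 3 NVMe x 2 OSDs -> (24, 12)
print(daemon_overhead(3 * 5))   # proposed new nodes: 3 NVMe x 5 OSDs -> (60, 30)
```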

Current setup for reference:
72 x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (2 Sockets), 512GB of RAM, 3x Intel P4610 3.2TB

Thank you very much, looking forward to reading any of your replies and have a nice rest of the day!
 
During testing we noticed some performance gain with this setup (but this could be due to other factors, like osd_memory_target).
 
two OSDs created per NVMe (3 NVMes per node, 18 OSDs total)
Using more than one OSD per drive used to allow working around internal processing limits in the OSD services themselves. Your Intel P4610s aren't the fastest drives ever, so chances are you won't hit those limits. With recent versions of Ceph this adds very little real-world benefit, and what little there is usually gets negated by network capacity [1]. I stopped using such setups about two years ago because they added almost nothing for general use, felt like wasting RAM on the extra OSD services, and in the end the network was still the limiting factor.

We're thinking about adding two more nodes to our three-node cluster. The HW is almost the same, with the exception of the NVMe drives, which are bigger.
Bad idea for two reasons:
  • If you are using size 3/min_size 2, you should have the same capacity across your servers, or at least across every group of 3 servers. Think about what will happen when one of your new servers breaks and Ceph recovers all the replicas that were on its disks to the remaining hosts in the cluster. Depending on how full the cluster is, this may end up filling the remaining OSDs.
  • Ceph sets each OSD's weight based on disk capacity. A bigger disk means more PGs will be allocated to it to use the extra space, which means more I/O will hit that drive; in small clusters this can limit overall cluster performance.
Try to spread the drives across all five hosts so they all have the same capacity.

[1] https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/
 
Currently we have the default osd_memory_target (4GB), so that would explain the performance gain we saw going from 1 to 2 OSDs per NVMe.

Because our cluster is so small, we're currently running size 2/min_size 1. We know the dangers of this, but we have tested that it tolerates the failure of one node (with the caveat that we can safely use only 66% of the total storage, because of the backfill replication that kicks in if one node fails). Once we add the two nodes we're planning to increase size to 3 and min_size to 2.
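A quick worked check of that 66% figure (a sketch assuming three equally sized hosts and full recovery onto the survivors):

```python
# With size=2 on 3 equal hosts, after one host fails the remaining two
# must hold both replicas of all data, so only ~2/3 of the "normal"
# usable capacity can be filled safely.
def safe_fill_fraction(hosts: int, size: int) -> float:
    usable_total = hosts / size                # usable data, all hosts healthy (normalised)
    usable_after_failure = (hosts - 1) / size  # what still fits with one host gone
    return usable_after_failure / usable_total

print(safe_fill_fraction(hosts=3, size=2))     # 0.666... -> the ~66% figure
```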

With 5 nodes, 9x3.2TB P4610 drives and 8x8TB P4510 drives I have calculated the optimal balance as follows:
Nodes 1,2,3 (19.2TB)
1x 3.2TB P4610
2x 8TB P4510

Nodes 4,5 (17.6TB)
3x 3.2TB P4610
1x 8TB P4510

totalling 17 OSDs and 92.8TB (with size = 3 that would mean usable 30TB storage)
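A quick check of those totals (nominal TB, just a sketch):

```python
# Verify raw capacity, OSD count and usable space at size=3 for the layout above.
nodes = [
    [3.2, 8.0, 8.0],            # nodes 1-3
    [3.2, 8.0, 8.0],
    [3.2, 8.0, 8.0],
    [3.2, 3.2, 3.2, 8.0],       # nodes 4-5
    [3.2, 3.2, 3.2, 8.0],
]
osds = sum(len(n) for n in nodes)                    # 17 OSDs
raw_tb = sum(sum(n) for n in nodes)                  # 92.8 TB raw
print(osds, round(raw_tb, 1), round(raw_tb / 3, 1))  # 17, 92.8, ~30.9 TB at size=3
```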

RAM:
osd_memory_target = 10GB
expected load: 40GB OSDs, 2GB MON, 2GB MGR, 4GB PVE totaling 48GB

CPU:
documentation states 6 cores per OSD
https://docs.ceph.com/en/squid/start/hardware-recommendations/
expected load: 6 cores * 4 OSDs, 1 MON, 1 MGR totalling 26 cores

which leaves us with (each node has the same CPUs - 2x Xeon Gold 6240 and at least 512GB of RAM - two have 768GB):
RAM: 512 - 48 = 464GB for VMs per node
CPU: 72 - 26 = 46 cores for VMs per node

A safe factor would be to plan for 3 nodes running at all times (with only 2 nodes Ceph won't have quorum anyway), so 1392GB of RAM and 138 cores for VMs.
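The same budget written out (a sketch using the planning figures above; 10GB per OSD and 6 cores per OSD are the assumptions from this post, and nodes 4/5 with 4 OSDs are taken as the worst case):

```python
# Per-node headroom left for VMs, using the worst-case node (4 OSDs).
OSDS            = 4
RAM_PER_OSD     = 10            # GB, planned osd_memory_target
RAM_MON_MGR_PVE = 2 + 2 + 4     # GB for MON + MGR + PVE
CORES_PER_OSD   = 6             # from the Ceph hardware recommendations
CORES_MON_MGR   = 1 + 1

node_ram, node_cores = 512, 72

ceph_ram   = OSDS * RAM_PER_OSD + RAM_MON_MGR_PVE    # 48 GB
ceph_cores = OSDS * CORES_PER_OSD + CORES_MON_MGR    # 26 cores
vm_ram, vm_cores = node_ram - ceph_ram, node_cores - ceph_cores  # 464 GB, 46 cores

# Plan VM capacity for only 3 nodes running at all times.
print(3 * vm_ram, 3 * vm_cores)   # 1392 GB RAM, 138 cores for VMs
```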

(for networking we have 2x10Gb LACP bond, both for cluster and public networks each on separate physical interfaces)

The only thing I haven't looked into yet is binding ceph to specific cores. I will do that as per the provided blog post
https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/
I couldn't find a direct performance comparison between bound and unbound cores, but I think it will be significant.
 
it provides fault tolerance of one node
It does not. At least not under every circumstance. There's zero guarantee that the second copy ever gets written to any OSD: if the primary OSD fails/breaks/can't reach/whatever and/or the secondary OSD is unable to write the data, and the primary fails before that second copy is done, you will end up with inconsistent/inactive PGs.

Other edge cases worth mentioning:
  • You won't be able to run I/O on the PGs held by a failed OSD while recovery is going on (i.e. when the failed OSD is marked OUT and those copies are re-created elsewhere).
  • If there's metadata corruption at the rocksdb level or bit corruption at the disk level, Ceph won't be able to repair itself in a 2/1 configuration.
Simply don't use 2/1 pools except for the most expendable lab tests. Try Ceph's per-pool online compression if capacity is currently an issue, and/or add disks.


With 5 nodes, 9x3.2TB P4610 drives and 8x8TB P4510 drives I have calculated the optimal balance as follows:
Nodes 1,2,3 (19.2TB)
1x 3.2TB P4610
2x 8TB P4510

Nodes 4,5 (17.6TB)
3x 3.2TB P4610
1x 8TB P4510

totalling 17 OSDs and 92.8TB (with size = 3 that would mean usable 30TB storage)
Usable capacity is nowhere near 30TB... at least not in a 3/2 pool, which is what you should use.

Think about how many disk failures you want Ceph to tolerate while still being able to recover the lost copies onto the surviving OSDs. Let's suppose we want Ceph to self-heal from any single host failure. We can use something like this:

Sum of the sizes of all the OSDs in your biggest host × (number of nodes - 1) / replica size

That would be 19.2 × 4 / 3 = ~25.6TB maximum addressable capacity.

Once that recovery happens, all your OSDs would be 100% full. Ceph won't allow you to fill an OSD over 95% (full_ratio) and it suspends I/O when that happens. In fact, Ceph starts complaining once an OSD is over 85% (nearfull_ratio) and stops creating replicas on an OSD that is over 90% full (backfillfull_ratio). To prevent all this, you must keep a capacity margin so the OSDs remain under ~80% even if any host fails. We could use something like this:

Maximum addressable capacity × ((number of nodes - 1) / number of nodes)

That translates to 25.6 × 4 / 5 = ~20.48TB of maximum fully usable capacity that you could fill with your data and still allow Ceph to self-heal from any host failure while remaining under nearfull_ratio. In this case it's similar to subtracting 20% from the maximum addressable capacity, but on clusters with 6+ hosts it's better to use the formula above, as you have more hosts to recover replicas to with the same size=3, so you can fill them more and still allow recovery.
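Both rules, written out (a sketch in nominal TB):

```python
# Sizing rules from above, applied to the planned 5-node cluster.
def max_addressable_tb(biggest_host_tb: float, hosts: int, size: int) -> float:
    """Data that can still be fully re-replicated after losing the biggest host."""
    return biggest_host_tb * (hosts - 1) / size

def max_fully_usable_tb(addressable_tb: float, hosts: int) -> float:
    """Keep a margin so surviving OSDs stay below the full ratios after recovery."""
    return addressable_tb * (hosts - 1) / hosts

addressable = max_addressable_tb(19.2, hosts=5, size=3)   # ~25.6 TB
usable      = max_fully_usable_tb(addressable, hosts=5)   # ~20.5 TB
print(round(addressable, 1), round(usable, 1))
```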


osd_memory_target = 10GB
That would benefit rocksdb/BlueStore reads only, not object data reads (i.e. the data of your VMs). I leave it at the default 4GB, and only if the storage becomes read-I/O constrained do I plan to check whether increasing it could help.


documentation states 6 cores per OSD
That is/was true for standalone Ceph clusters with tens of hosts and dozens of disks each. I've seldom seen OSD processes use more than 4 cores on PVE/Ceph clusters, and only during heavy recovery/rebalance tasks: we are usually limited by the network capacity (just 2x10G in your case) of just a handful of hosts, so the processing power needed by the OSD processes is lower (i.e. you can't push that many IOPS at once because they don't fit through the network).
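To put that network ceiling into rough numbers (a back-of-the-envelope sketch; it ignores LACP hashing, protocol overhead and the read/write mix, so treat it as an optimistic upper bound):

```python
# Optimistic IOPS ceiling imposed by a 2x10 Gbit bond on one node.
link_bits_per_s = 2 * 10 * 10**9
bytes_per_s     = link_bits_per_s / 8         # ~2.5 GB/s
io_size_bytes   = 4 * 1024                    # 4 KiB client I/O

print(int(bytes_per_s / io_size_bytes))       # ~610k small reads/s at best
# Replicated writes additionally generate replica traffic on the cluster
# network, so the effective write ceiling is considerably lower.
```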


The only thing I haven't looked into yet is binding ceph to specific cores. I will do that as per the provided blog post
https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/
I couldn't find a direct performance comparison between bound and unbound cores, but I think it will be significant.
I haven't tested this with real-world PVE workloads, but it would require a lot of fine-tuning to get meaningful results, and you would have to take into account that your VMs will use those cores too unless you pin them to different ones. Not sure if it's worth it, but let us know if you get some numbers!
 
if the primary OSD fails/breaks/can't reach/whatever and/or...
I've removed the rest of the quote where I believe you're no longer describing a single-node failure.

Here's how I'm looking at the situation:
1 node failure = anything from 1 drive up to the whole host failing AND all of the remaining nodes working 100% correctly. (I'm aware that drives tend to fail around the same time and that a one-node failure would trigger recovery replication.)

We're not a corporation. We cannot afford a 5-node cluster where effectively only 2 nodes' worth of resources can be used. We need 3/5 of the total resources to be in active use for this to be viable as a business.
We're treating Ceph as a storage solution, not a backup solution. We have multiple backups and recovery plans in place in case we lose the data in Ceph. Not all Ceph features and best practices make sense for our use case: our customers don't expect their VMs to be running 100% or even 99% of the time, as long as we can get them running again in a reasonable time (mostly hours; in case of a catastrophic failure, even days). All of these servers run in our own datacenters, so getting physical access to them within an hour (including driving there) is not an issue.

Here are my use cases, in layman's terms (for the 5-node cluster):
If one node fails (anything from a drive failure up to a whole-host failure), I don't want to be woken up at night because some VM is not working.
If two nodes fail, I want to be able to change something (for example reconfigure 3/2 pools to 2/1 pools), get it running again (even with degraded performance), go back to sleep, and continue solving the situation in the morning (if Ceph supports this, which I believe it should based on my own tests).

Planned settings for 5 nodes:
full_ratio 0.95
backfillfull_ratio 0.6
nearfull_ratio 0.45
 
I've removed the rest of the quote where I believe you're no longer describing a single-node failure.
In fact I gave you a lot of knowledge and experience regarding Ceph, describing much more than a one-node failure: the general rules, so you can estimate N failures and dimension the cluster appropriately.


We cannot afford a 5-node cluster where effectively only 2 nodes' worth of resources can be used
I never said such a thing, and I don't know what made you think it works that way.


We're treating Ceph as a storage solution, not a backup solution.
Backups have absolutely nothing to do with Ceph or any other storage you use to run VMs.


We have multiple backups and recovery plans in place in case we lose the data in Ceph.
The best recovery plan is the one you never have to use :) and that starts by using Ceph appropriately with 3/2 pools. But hey, it's your business: I'm fine with whatever you feel is OK, just don't say you weren't warned ;)


If one node fails (anything from a drive failure up to a whole-host failure), I don't want to be woken up at night because some VM is not working.
If two nodes fail, I want to be able to change something (for example reconfigure 3/2 pools to 2/1 pools), get it running again (even with degraded performance), go back to sleep, and continue solving the situation in the morning (if Ceph supports this, which I believe it should based on my own tests).
Fine. For that to work you need 3/2 pools, for which you must spare the amount of space I described above so Ceph can do its job.


Planned settings for 5 nodes:
full_ratio 0.95
backfillfull_ratio 0.6
nearfull_ratio 0.45
For 3/2 pools with five hosts, backfillfull_ratio could be much higher, and nearfull_ratio could well be ~0.6 if you want Ceph to warn you well in advance. Remember that those ratios are per OSD, not per pool or per cluster.
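For reference, a sketch of where those per-OSD ratios sit relative to the post-failure fill level (assuming evenly balanced data, nominal TB, and the ~20.5TB "fully usable" figure from earlier in the thread):

```python
# Average fill of the surviving OSDs after one host fails and recovery completes.
def fill_after_host_failure(data_tb: float, size: int,
                            host_tb: list[float], failed: int) -> float:
    raw_used_tb = data_tb * size                        # raw space consumed by all replicas
    surviving_tb = sum(c for i, c in enumerate(host_tb) if i != failed)
    return raw_used_tb / surviving_tb

hosts = [19.2, 19.2, 19.2, 17.6, 17.6]
print(round(fill_after_host_failure(20.48, 3, hosts, failed=0), 2))
# ~0.83 (a bit over the ~80% target because nodes 4 and 5 are smaller)

# A backfillfull_ratio of 0.6 would stop recovery long before the OSDs are
# actually in danger; nearfull_ratio ~0.6 merely warns early, as noted above.
```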