CEPH Erasure Coded Configuration: Review/Confirmation

Sep 16, 2025
First, let me contextualize our setup: we have a 3-node cluster where we will be using CEPH for storage hyperconvergence.
We are familiarizing ourselves with CEPH and would love to have someone more experienced chime in.
All of our storage hardware is SSDs (24x 2TB NVMe, 8 per server).

We want to be able to tolerate 1 server going down and have no downtime for our VMs.
The question I've been working on answering is: what is the most storage-efficient configuration we can go with to maximize our available storage space?

After diving through the CEPH documentation, this is what I found regarding Erasure Coded Pools:

K is the number of OSDs' worth of available storage we will have, and we can afford to lose M OSDs, the total OSD count being (K+M).
min_size should be set to K+1, and if we go below min_size, we can no longer write to the CEPH RBDs.

If we aim for a 4+2-ratio Erasure Code pool (16+8 across our 24 OSDs, 66% efficiency), we can afford to lose 1/3 of our drives and recover from that without data loss.
But we would have downtime, because of the min_size parameter (K+1 would total 17, and losing one server leaves only 16 OSDs).

Following this logic, I am assuming that the most efficient CEPH configuration possible for a 3-node cluster with 24 OSDs is K=15 and M=9, with 62.5% storage efficiency, allowing us to operate normally with one server down because of min_size=16 (K+1).
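For concreteness, this is roughly how I picture creating such a pool, going off the examples in the docs (the profile and pool names are placeholders I made up):
Code:
ceph osd erasure-code-profile set bigec k=15 m=9 crush-failure-domain=osd
ceph osd pool create vm-ec 128 128 erasure bigec
ceph osd pool set vm-ec allow_ec_overwrites true   # needed for RBD on EC pools
ceph osd pool set vm-ec min_size 16                # K+1, as the docs recommend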

Are any of my assumptions here wrong? Have I misinterpreted the CEPH docs in any way?
Is anyone else out there running a 3-node cluster with CEPH?

I would love to hear some other opinions regarding my setup.
Thank you in advance,
 
Each K and M chunk must be on a different host, because you want your fault domain to be host (the default), not disk. If the fault domain were disk, you could end up with too many K or M chunks (or both!) of some PGs on the same host, and if that host goes down (e.g. for a simple reboot) your VMs will hang, because some PGs won't be available until either the host comes back or Ceph recovers from the remaining replicas (if that is even possible, as they may all have been on drives of the same host).

Erasure coded pools make sense with an absolute minimum of 5 nodes (K=3, M=2), but to get something reasonably performant you would need at least 8+ nodes. The way EC pools work makes them quite underperforming for general VM workloads. This is expected to improve in the next release (AFAIR called Tentacle) with partial-write support and some minor enhancements.

With just 3 nodes, the most efficient configuration would be 3/2 replicated pool(s) with inline compression where appropriate. It would be ideal if you could add a fourth node to allow Ceph to self-heal and recover its 3 replicas if a full host fails.
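For reference, a minimal sketch of such a pool using plain Ceph commands (the pool name is just an example; adjust pg_num and the compression algorithm to your workload):
Code:
ceph osd pool create vm-repl 128 128 replicated
ceph osd pool set vm-repl size 3                       # 3 copies, one per host
ceph osd pool set vm-repl min_size 2                   # keep serving I/O with one host down
ceph osd pool set vm-repl compression_mode aggressive
ceph osd pool set vm-repl compression_algorithm lz4
ceph osd pool application enable vm-repl rbd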
 
The max you can do is 2+1, which gives 66% efficiency, and this is your only EC option since you only have 3 servers.
 
I started at my last employer with 3 nodes, then expanded to 7 once we were past the 12-month POC stage.
It works fine, but you can only ever have 1 server out; taking 2 out will cause Ceph to take all the EC pool placement groups offline.

I had an employee who did it twice after she was told not to, and we let her go, as the users didn't like their VMs just stalling for the duration of the reboots.
 
Each K and M chunk must be on a different host, because you want your fault domain to be host (the default), not disk. If the fault domain were disk, you could end up with too many K or M chunks (or both!) of some PGs on the same host, and if that host goes down (e.g. for a simple reboot) your VMs will hang, because some PGs won't be available until either the host comes back or Ceph recovers from the remaining replicas (if that is even possible, as they may all have been on drives of the same host).

Erasure coded pools make sense with an absolute minimum of 5 nodes (K=3, M=2), but to get something reasonably performant you would need at least 8+ nodes. The way EC pools work makes them quite underperforming for general VM workloads. This is expected to improve in the next release (AFAIR called Tentacle) with partial-write support and some minor enhancements.

With just 3 nodes, the most efficient configuration would be 3/2 replicated pool(s) with inline compression where appropriate. It would be ideal if you could add a fourth node to allow Ceph to self-heal and recover its 3 replicas if a full host fails.
I see, I have been calculating with the fault domain as OSD, but what I really want and need is a HOST fault domain. That will work for what I want (being able to take one host down safely without interrupting operations), but it limits my options for how I want this redundancy to take place.

If I go with replication on the 3/2 scheme, I will be able to take one host down without losing operations OR data, and I would be able to tolerate 2 servers going down; in that case there would be downtime involved, although I would not lose data (it would recover when the servers came back up).


The max you can do is 2+1, which gives 66% efficiency, and this is your only EC option since you only have 3 servers.
And this is a risky setup, because if any 2 faults happen simultaneously in different nodes, I have nowhere to run and I am looking at data loss and downtime. BUT it is possible. If I take one node down for maintenance, the other 2 are running with no redundancy whatsoever. And I would have to run with min_size=2, going against what is recommended in the documentation to avoid data loss due to "split brain".
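Just to make that 2+1 option concrete, I believe it would look something like this (names are placeholders; the min_size override at the end is exactly what the docs warn against):
Code:
ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=host
ceph osd pool create vm-ec21 128 128 erasure ec21
ceph osd pool set vm-ec21 allow_ec_overwrites true
ceph osd pool set vm-ec21 min_size 2   # recommended would be K+1 = 3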



I started at my last employer with 3 nodes, then expanded to 7 once we were past the 12-month POC stage.
It works fine, but you can only ever have 1 server out; taking 2 out will cause Ceph to take all the EC pool placement groups offline.

I had an employee who did it twice after she was told not to, and we let her go, as the users didn't like their VMs just stalling for the duration of the reboots.
So you ran it for a year on a 2+1 EC setup with 66% storage efficiency?


At this point I am considering going back to the 3-way replication scheme (3/2) to be on the safer side. But it would be nice to have better storage efficiency by doing EC 2+1. I was hoping to find some middle ground, but it seems that my node count is limiting that possibility. Thank you for your quick replies; if anyone has anything else to say, I am all ears.
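For reference, with our 24x 2TB drives (48 TB raw), the rough math before Ceph overhead and recovery headroom would be: 3/2 replication gives about 48 / 3 = 16 TB usable, while EC 2+1 would give about 48 x 2/3 = 32 TB usable.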
 
In my current role, for production sizing of Ceph for customers, I recommend 6 nodes with an EC of 4+2, which is still 66% usable; you could run 3+2 with 5 servers at 60% usable.

If you could add 2 servers you would get the resilience you need.

The production cluster I had at 2+1 encoding is still like that, but we had full daily backups of all systems, so it was not a big issue, and all systems were built by Ansible so they could be rebuilt quickly. Nothing on the cluster was long-term, though we did build another Ceph cluster for long-term retention as an external cluster to Proxmox.
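For reference, the 4+2 profile mentioned above would be created with something like this (names are just examples):
Code:
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create vm-ec42 128 128 erasure ec42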
 
Following this logic, I am assuming that the most efficient CEPH configuration possible for a 3-node cluster with 24 OSDs is K=15 and M=9, with 62.5% storage efficiency, allowing us to operate normally with one server down because of min_size=16 (K+1).
The placement logic is per node, not per drive. The only sane EC config possible with 3 nodes is 2+1. But bear in mind that while you CAN do this, it's not really supportable; with a node down you're operating with no parity at all, and under normal circumstances the pool would go read-only in that condition. You CAN override this behavior, but therein lies data corruption.
You should also be aware that EC performance is really poor for virtualization workloads, even in a more supported 4+2 configuration, and should only be considered for bulk storage. @UdoB's link is well worth a read.
 
Running the replicated 3/2 setup will still provide us with more than enough storage for our use case. Our applications will be more I/O-heavy, and because of this we have good specs on the CPU and RAM.

Thank you all for the information. This thread was really useful in illuminating the spots I hadn't understood properly from the documentation.
 
I would recommend running Ceph on its own NICs, at least 40Gb, and on NVMe if you are I/O-heavy, so that you have enough bandwidth for the Ceph inter-OSD traffic: for every 1Gb written, another 2x 1Gb is generated to create the additional replicas.
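As a rough example with 3 replicas: 1 GB/s of sustained VM writes means the primary OSDs push about another 2 GB/s (roughly 16 Gb/s) to the replica OSDs over the Ceph network, on top of the client traffic itself.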
 

I am running CEPH on its own dedicated NICs over 25Gb/s links. I have an RSTP mesh right now (https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server) during my testing phase.

I am acquiring a pair of switches with 25Gb/s ports to handle these connections from the new Proxmox nodes.
 
It is absolutely possible to place multiple chunks on each server while guaranteeing resiliency to hosts going down. You just need to create your pools manually instead of using the Proxmox UI. I've been doing this for years on a 3-node homelab. I have an EC pool with k=7,m=5. I definitely wouldn't recommend such a wide EC for any sort of production setup, but I needed to maximize space efficiency while still being able to take down a host while maintaining redundancy. With a placement rule that puts 4 chunks on each host, I can take a host down for maintenance and still be able to survive a drive failure.

I used to do this with a custom CRUSH rule[*1]:
Code:
rule 3host4osd {
    id 3
    type erasure
    # allow extra retries so CRUSH can find a valid placement
    step set_chooseleaf_tries 1000
    step set_choose_tries 1000
    # start at the default root, restricted to the hdd device class
    step take default class hdd
    # pick 3 hosts...
    step choose indep 3 type host
    # ...then 4 OSDs within each of those hosts (3 x 4 = 12 chunks)
    step chooseleaf indep 4 type osd
    step emit
}
This worked well, but obviously you need to know what you're doing as it's easy to mess it up.

But in Squid, this is much easier. You just use the new Multi-Step Retry CRUSH rules[*2]. No need to mess around writing your own CRUSH rules and injecting them into the mons. Just create an EC profile like this, and use it to create your pool:
Code:
 ceph osd erasure-code-profile set 3host4osd \
    k=7 \
    m=5 \
    crush-failure-domain=host \
    crush-osds-per-failure-domain=4 \
    crush-num-failure-domains=3

One thing to keep in mind is that MSR rules require client support, and the in-kernel CephFS client doesn't support them. So make sure you're using the FUSE client instead.
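For completeness, creating the pool from that profile then looks roughly like this (the pool name is just an example; allow_ec_overwrites is needed if you want RBD or CephFS data on it):
Code:
ceph osd pool create bulk-ec 128 128 erasure 3host4osd
ceph osd pool set bulk-ec allow_ec_overwrites true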

[*1] https://docs.ceph.com/en/latest/rados/operations/crush-map/#custom-crush-rules
[*2] https://docs.ceph.com/en/latest/dev/crush-msr/#msr
 
3-node homelab. I have an EC pool with k=7,m=5.
As stated, I am not a Ceph specialist. I do not understand what you want to achieve with this construct, and because I am curious I am writing this reply ;-)

but I needed to maximize space efficiency while still being able to take down a host
In my understanding the usable space is up to 7/12 = 58.3 percent. Right?

The "m=5" makes five calculated strips to be present on five OSD - two times two of them on the same node and the fifth one on the third node. The "k=7" will put two strips on a node and one time three strips on the third node.

When one node fails you may lose three data strips and two parity strips at the same time.

In my limited understanding of "k=7,m=5", losing five elements should switch the pool to read-only, stalling all (or at least several) running VMs.

Which detail of my (as said: limited) understanding is wrong?
 
Your usable space calculation is correct, but not the placement.

Ceph doesn't place k and m chunks separately; it's all just chunks to the CRUSH algorithm. For each PG it will choose three hosts, and then for each of those hosts it will choose four OSDs. With four chunks on each host, any host going down will leave 8 chunks online. I can tolerate another drive failure even in that state.
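A quick way to sanity-check the placement (pool name is just an example) is to list the PGs and look at the UP/ACTING columns, which should show 12 OSDs per PG, four from each host:
Code:
ceph pg ls-by-pool bulk-ec | head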
 
This alone doesn't provide enough information: how many OSDs do you have in each of your hosts? Are they full disks, or did you use partitions on each disk?
I'm not sure why that matters? I only wanted to point out that it's possible to configure more complicated placement rules than "this many copies, this is the failure domain" contrary to what some comments here say.

But since you ask, I use a full disk for each OSD. I have 24 drives spread out over these 3 nodes. This is obviously not good for performance, as each read and write will touch half the drives in the cluster. It also means I can't use snapshots on large pools since the snaptrim process will absolutely murder performance. But it works for my needs.
 
For each PG it will choose three hosts, and then for each of those hosts it will choose four OSDs. With four chunks on each host, any host going down will leave 8 chunks online.
I have tried to test it - and it works as you describe :-)

Code:
I have a virtual test cluster with six nodes, called pna, pnb... pnf. Three of them got 4 OSDs each for the following test.

root@pna:~# ceph osd erasure-code-profile set 3host4osd \
    k=7 \
    m=5 \
    crush-failure-domain=host \
    crush-osds-per-failure-domain=4 \
    crush-num-failure-domains=3
    
Unrelated but required in my test:
root@pna:~# ceph osd set-require-min-compat-client squid

root@pna:~# pveceph pool create ec75b --erasure-coding k=7,m=5,profile=3host4osd
pool ec75b-data: applying allow_ec_overwrites = true
pool ec75b-data: applying application = rbd
pool ec75b-data: applying pg_autoscale_mode = warn
skipping 'pg_num', did not change
skipping 'size', did not change
pool ec75b-metadata: applying application = rbd
skipping 'min_size', did not change
pool ec75b-metadata: applying pg_autoscale_mode = warn
skipping 'pg_num', did not change

This results in a "ec75b-data" pool with "Size/min" = "12/8". The "Autoscaler Mode" tells me "warn", which may be expected. It took less than half an hour to scale down from 128 to 32, automatically.

root@pna:~# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         0.37427  root default                             
 -3               0      host pna                             
 -5         0.12476      host pnb                             
  0    hdd  0.03119          osd.0       up   1.00000  1.00000
  1    hdd  0.03119          osd.1       up   1.00000  1.00000
  2    hdd  0.03119          osd.2       up   1.00000  1.00000
  4    hdd  0.03119          osd.4       up   1.00000  1.00000
 -7               0      host pnc                             
 -9         0.12476      host pnd                             
  3    hdd  0.03119          osd.3       up   1.00000  1.00000
  6    hdd  0.03119          osd.6       up   1.00000  1.00000
  7    hdd  0.03119          osd.7       up   1.00000  1.00000
  9    hdd  0.03119          osd.9       up   1.00000  1.00000
-11               0      host pne                             
-13         0.12476      host pnf                             
  5    hdd  0.03119          osd.5       up   1.00000  1.00000
  8    hdd  0.03119          osd.8       up   1.00000  1.00000
 10    hdd  0.03119          osd.10      up   1.00000  1.00000
 11    hdd  0.03119          osd.11      up   1.00000  1.00000

I am running only one small test-VM = Trixiepup64:Wayland. Without installation.

Inside the VM I can write to sda, which lives on the ec75b pool, of course.
For simplicity I run something like "while true; do date; dd if=/dev/urandom of=/dev/sda count=1 bs=1M; sleep 5; done".

Watching this live, I shut down one of the nodes which owns four of those OSDs. I got a 10-second "hiccup", but then writing data still worked! Now I am at "Degraded data redundancy: 5791/17373 objects degraded (33.333%), 40 pgs degraded, 97 pgs undersized"...

Works better than expected :-)
 
Cool :)

I think this stuff is not well-known, because the Ceph tooling for the longest time only let you specify a failure domain and all the CRUSH rules were created with a single chooseleaf. And creating custom CRUSH rules isn't well documented.

With Squid this all became much easier since the tooling can now create those kinds of rules. But everything you find online still talks about placement only happening on one level based on the failure domain.

I'm also excitedly waiting for the Tentacle release, as the new FastEC improvements should make these kinds of wide EC stripes much more performant.
 
  • Like
Reactions: UdoB
Interesting solution. Nearly the same storage efficiency as 2+1, but with enough parity to survive a node out. Are you using this for virtualization workloads? Can you share some performance metrics?
I only use this for bulk file storage using CephFS, stored on HDDs. For VM images I use 3-way replication on NVMe.

Performance metrics wouldn't be representative right now, as the pool is heavily loaded with PG splitting, but they would not be impressive. There is a definite cost in both latency and throughput for EC vs replication, and that cost is higher the wider the EC stripe is. It works very well for this use case, though.

There are some extremely significant EC optimizations in the new Tentacle (20.2) release though, which should make this a viable option for virtualization, at least for VMs with low IO requirements. Huge improvements across the board, and especially with wide EC (where it goes from larger K resulting in significant slowdown to larger K improving performance, at least up to k=6).

There are even more EC performance optimizations coming in the next major release (Umbrella). Direct read support is said to make read performance the same for EC as for replication.
https://ceph.io/en/news/blog/2025/tentacle-fastec-performance-updates/