CEPH Erasure Coded Configuration: Review/Confirmation

Sep 16, 2025
First, let me contextualize our set-up: We have a 3 node cluster, where we will be using CEPH for storage hyperconvergence.
We are familiarizing ourselves with CEPH and would love to have someone more experienced chiming in:
All of our storage devices are SSDs (24x 2TB NVMe, 8 per server).

We want to be able to tolerate 1 server going down, and have no downtime for our VMs.
The question I've been working on answering is: what is the most storage-efficient configuration we can go with to maximize our available storage space?

After diving through the CEPH documentation, this is what I found regarding Erasure Coded Pools:

K is the number of OSDs worth of available storage we will have, and we can afford to lose M OSDs, the total OSD count being (K+M).
min_size should be set to K+1, and if we go below min_size, we cannot write to the CEPH RBDs any longer
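For reference, this is roughly how I understand such a pool would be created; the profile and pool names below are just placeholders I made up, and 4+2 with 128 PGs is only an example:

```
# Sketch only: profile/pool names and the k/m values are example placeholders.
ceph osd erasure-code-profile set ec-example k=4 m=2

# Create an erasure coded pool that uses the profile (128 PGs as an example).
ceph osd pool create ec-example-pool 128 128 erasure ec-example

# The resulting min_size (K+1 per the docs) can be checked with:
ceph osd pool get ec-example-pool min_size
```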

If we aim for a 4+2 (16+8, 66% efficiency) Erasure Code pool, we can afford to lose a third of our drives and recover from that without data loss.
But we would have downtime because of the min_size parameter (K+1 would total 17).

Following this logic, I am assuming that the most efficient CEPH configuration possible for a 3 node cluster with 24 OSDs is to have K=15 and M=9, with 62.5% storage efficiency, allowing us to operate normally with one server being down due to min_size=16 (K+1).

Are any of my assumptions here wrong? Have I misinterpreted the CEPH Docs in any way?
Is anyone else running a 3 node cluster out there with CEPH?

I would love to hear some other opinions regarding my setup.
Thank you in advance,
 
Each K and M chunk must be on a different host, because you want your failure domain to be host (the default), not disk. If the failure domain were disk, you could end up with too many K or M chunks (or both!) for some PGs on the same host, and if that host goes down (e.g. for a simple reboot) your VMs will hang, because some PGs won't be available until either the host comes back or Ceph recovers from the remaining chunks (if that is possible at all, as they may all have been on drives of the same host).
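As a sketch of what I mean (the profile name and k/m values here are placeholders), the failure domain is a property of the EC profile, and you can check where the chunks of a PG actually ended up:

```
# crush-failure-domain=host: every chunk of a PG goes to a different host (the default).
# crush-failure-domain=osd:  chunks only need to be on different OSDs, so several
#                            chunks of one PG can end up on the same host.
ceph osd erasure-code-profile set ec-host-demo k=2 m=1 crush-failure-domain=host
ceph osd erasure-code-profile get ec-host-demo

# Show which OSDs hold the chunks of a given placement group (replace <pgid>):
ceph pg map <pgid>
```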

Erasure coded pools make sense with an absolute minimum of 5 nodes (K=3, M=2), but to get something reasonably performant you would need at least 8+ nodes. The way EC pools work makes them perform quite poorly for general VM workloads. This is expected to improve in the next release (AFAIR called Tentacle) with partial write support and some minor enhancements.

With just 3 nodes the most efficient configuration would be 3/2 replicated pool(s) with inline compression where appropriate. It would be ideal if you could add a fourth node to allow Ceph to self-heal and recover its 3 replicas if a full host fails.
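To make that concrete, here is a minimal sketch of a 3/2 replicated pool with inline compression; the pool name, PG count and compression algorithm are placeholder choices, and on Proxmox you would likely create the pool via the GUI or pveceph instead:

```
ceph osd pool create vm-pool 128 128 replicated
ceph osd pool set vm-pool size 3          # 3 copies, one per host
ceph osd pool set vm-pool min_size 2      # keep serving I/O with one host down
ceph osd pool set vm-pool compression_mode aggressive
ceph osd pool set vm-pool compression_algorithm lz4
ceph osd pool application enable vm-pool rbd
```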
 
The max you can do is 2+1, which gives 66% efficiency, and this is your only EC option since you only have 3 servers.
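To put rough numbers on that for this cluster (ignoring Ceph overhead and full ratios), with 24 x 2TB = 48TB raw:

```
raw_tb=48                                             # 24 OSDs x 2 TB
echo "EC 2+1 usable:      $(( raw_tb * 2 / 3 )) TB"   # k/(k+m) = 2/3 of raw
echo "3/2 replica usable: $(( raw_tb / 3 )) TB"       # 1/size  = 1/3 of raw
```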
 
I started at my last employer with 3 nodes, then expanded to 7 once we were past the 12-month POC stage.
It works fine, but you can only ever have 1 server out; taking 2 out will cause Ceph to take all the EC pool placement groups offline.

I had an employee who did it twice when she was told not to, and we let her go, as the users didn't like their VMs just stalling for the duration of the reboots.
 
Each K and M chunk must be on a different host, because you want your failure domain to be host (the default), not disk. If the failure domain were disk, you could end up with too many K or M chunks (or both!) for some PGs on the same host, and if that host goes down (e.g. for a simple reboot) your VMs will hang, because some PGs won't be available until either the host comes back or Ceph recovers from the remaining chunks (if that is possible at all, as they may all have been on drives of the same host).

Erasure coded pools make sense with an absolute minimum of 5 nodes (K=3, M=2), but to get something reasonably performant you would need at least 8+ nodes. The way EC pools work makes them perform quite poorly for general VM workloads. This is expected to improve in the next release (AFAIR called Tentacle) with partial write support and some minor enhancements.

With just 3 nodes the most efficient configuration would be 3/2 replicated pool(s) with inline compression where appropriate. It would be ideal if you could add a fourth node to allow Ceph to self-heal and recover its 3 replicas if a full host fails.
I see, I have been calculating with the fault domain as OSD, but what I really want and need is a HOST fault domain. That will work for what I want (being able to take one host down safely without interrupting operations), but it limits my options for how this redundancy can take place.

If I go with replication on the 3/2 scheme, I will be able to take one host down without losing operations OR data, and I would be able to tolerate 2 servers going down; in that case there would be downtime involved, although I would not lose data (it would recover when the servers came back up).


The max you can do is 2+1, which gives 66% efficiency, and this is your only EC option since you only have 3 servers.
And this is a risky setup, because if any 2 faults happen simultaneously in different nodes I have nowhere to run and I am looking at data loss and downtime. BUT it is possible. If I take one node down for maintenance, the other 2 are running with no redundancy whatsoever, and I would have to run with min_size=2, going against what is recommended in the documentation to avoid data loss due to "split brain".
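For what it's worth, this is the kind of thing I mean; the pool name is a placeholder, and the min_size change is exactly the override the docs warn about:

```
# Before a planned reboot, stop Ceph from marking OSDs out and rebalancing:
ceph osd set noout
# ... do the maintenance, then:
ceph osd unset noout

# The risky part: dropping min_size from K+1 to K so a 2+1 pool keeps accepting
# writes with one host down. A further failure in that window can mean data loss.
ceph osd pool set ec21-pool min_size 2
```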



I started at my last employer with 3 nodes, then expanded to 7 once we were past the 12-month POC stage.
It works fine, but you can only ever have 1 server out; taking 2 out will cause Ceph to take all the EC pool placement groups offline.

I had an employee who did it twice when she was told not to, and we let her go, as the users didn't like their VMs just stalling for the duration of the reboots.
So you ran it for a year on a 2+1 EC setup with 66% storage efficiency?


At this point I am considering going back to the 3-way replication scheme (3/2) to be on the safer side, but it would be nice to have better storage efficiency by doing EC 2+1. I was hoping to find some middle ground, but it seems that my node count is limiting that possibility. Thank you for your quick replies; if anyone has anything else to say, I am all ears.
 
In my current role, when sizing Ceph for customers' production use, I recommend 6 nodes with an EC of 4+2, which is still 66% usable; you could run 3+2 with 5 servers at 60% usable.
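For reference, those two profiles would look roughly like this (profile names are placeholders), keeping the failure domain at host so each chunk lands on a different server:

```
# 6 nodes, 4+2: 4/(4+2) = 66% usable, tolerates 2 failed chunks per PG
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
# 5 nodes, 3+2: 3/(3+2) = 60% usable, tolerates 2 failed chunks per PG
ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
```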

If you could add 2 servers you would get the resilience you need.

The production cluster I had at 2+1 encoding is still running like that, but we had full daily backups of all systems, so it was not a big issue, and all systems were built by Ansible, so they could be rebuilt quickly. Nothing on the cluster was long-term, though we did build another Ceph cluster for long-term retention as an external cluster to Proxmox.
 
Following this logic, I am assuming that the most efficient CEPH configuration possible for a 3 node cluster with 24 OSDs is to have K=15 and M=9, with 62.5% storage efficiency, allowing us to operate normally with one server being down due to min_size=16 (K+1).
The placement logic is per node, not per drive. The only sane EC config possible with 3 nodes is 2+1. But bear in mind that while you CAN do this, it's not really supportable; with a node down you're operating with no redundancy at all, and under normal circumstances the pool would go read-only in that condition. You CAN override this behavior, but that way lies data corruption.
You should also be aware that EC performance is really poor for virtualization workloads, even in a more supported 4+2 configuration, and it should only be considered for bulk storage. @UdoB's link is well worth a read.
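For completeness, a sketch of what using an EC pool for bulk RBD storage involves anyway; pool and image names are placeholders. The EC pool needs overwrites enabled, and the image metadata still lives in a replicated pool, with only the data objects going to the EC pool:

```
# Allow RBD to overwrite objects in the EC pool:
ceph osd pool set ec-bulk allow_ec_overwrites true

# Create the image in a replicated pool ("rbd") and point its data at the EC pool:
rbd create --size 100G --data-pool ec-bulk rbd/bulk-image
```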
 
Running the replicated 3/2 setup will still provide us with more than enough storage for our use case. Our applications will be more I/O-heavy, and because of this we have good CPU and RAM specs.

Thank you all for the information. This thread was really useful in illuminating the spots I hadn't understood properly from the documentation.
 
I would recommend running Ceph on its own NICs, at least 40Gb, and NVMe if you are I/O heavy, so that you have enough bandwidth for the Ceph inter-OSD traffic: for every 1Gb written, another 2x 1Gb is generated for the additional replicas to be made.
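As a sketch of the split I mean (the subnets below are made-up placeholders), Ceph lets you separate client traffic from the inter-OSD replication/recovery traffic so the latter can live on its own NICs:

```
# Example only: pick subnets that match your dedicated Ceph NICs.
# (OSDs need a restart to pick up network changes.)
ceph config set global public_network  10.10.10.0/24   # client / monitor traffic
ceph config set global cluster_network 10.10.20.0/24   # inter-OSD replication and recovery
```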
 

I am running CEPH on its own dedicated NICs over 25Gb/s links. I have an RSTP mesh right now (https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server) during my testing phase.

I am acquiring a pair of switches with 25Gb/s ports to handle these connections from the new Proxmox nodes.