CEPH Erasure Coded Configuration: Review/Confirmation

It is absolutely possible to place multiple chunks on each server while guaranteeing resiliency to hosts going down. You just need to create your pools manually instead of using the Proxmox UI. I've been doing this for years on a 3-node homelab: I have an EC pool with k=7, m=5. I definitely wouldn't recommend such a wide EC profile for any sort of production setup, but I needed to maximize space efficiency while still being able to take down a host without losing redundancy. With a placement rule that puts 4 of the 12 chunks on each host, taking a host down for maintenance costs me 4 chunks, and with m=5 I can still survive one more drive failure.

I used to do this with a custom CRUSH rule[*1]:
Code:
rule 3host4osd {
    id 3
    type erasure
    step set_chooseleaf_tries 1000
    step set_choose_tries 1000
    step take default class hdd
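    # pick 3 host buckets, then 4 OSDs under each host: 3 x 4 = 12 = k+m chunks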
    step choose indep 3 type host
    step chooseleaf indep 4 type osd
    step emit
}
This worked well, but obviously you need to know what you're doing as it's easy to mess it up.
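
For reference, getting a hand-written rule like that into the cluster and building a pool on top of it went roughly like this (file names, the k7m5 profile, mypool and the PG counts are just placeholders):
Code:
# Export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ...paste the rule into crushmap.txt, then recompile and inject it
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
# Create a matching EC profile and a pool that uses the hand-written rule
ceph osd erasure-code-profile set k7m5 k=7 m=5 crush-failure-domain=osd
ceph osd pool create mypool 128 128 erasure k7m5 3host4osd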

But in Squid, this is much easier. You just use the new Multi-Step Retry CRUSH rules[*2]. No need to mess around writing your own CRUSH rules and injecting them into the mons. Just create an EC profile like this, and use it to create your pool:
Code:
 ceph osd erasure-code-profile set 3host4osd \
    k=7 \
    m=5 \
    crush-failure-domain=host \
    crush-osds-per-failure-domain=4 \
    crush-num-failure-domains=3
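
From there, creating the pool from that profile is a single command; if you want to put RBD or CephFS data on it, you also need to enable EC overwrites (pool name and PG count here are just examples):
Code:
# Create an EC pool from the MSR-enabled profile
ceph osd pool create ecpool 128 128 erasure 3host4osd
# RBD and CephFS data pools need overwrites enabled on EC pools
ceph osd pool set ecpool allow_ec_overwrites true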

One thing to keep in mind is that MSR rules require client support, and the in-kernel CephFS client doesn't support them. So make sure you're using the FUSE client instead.
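
For CephFS that means mounting with ceph-fuse rather than mount -t ceph; a minimal example (cephx user and mountpoint are placeholders):
Code:
# Mount CephFS via the userspace FUSE client instead of the kernel client
ceph-fuse -n client.myuser /mnt/cephfs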

[*1] https://docs.ceph.com/en/latest/rados/operations/crush-map/#custom-crush-rules
[*2] https://docs.ceph.com/en/latest/dev/crush-msr/#msr
Hi there!

First off, thanks so much for sharing your experience with custom CRUSH rules and MSR in Ceph—this is super valuable for someone like me who’s still new to Ceph (total beginner here).

I set up a 4-node test cluster (Squid 19.2.2) back in April this year by following the official docs, with 4 OSDs per node. I created erasure-coded (EC) pools to test RGW, CephFS, and RBD separately, using this EC profile:

Code:
ceph osd erasure-code-profile set 4n16d \
  plugin=isa \
  technique=reed_sol_van \
  k=5 \
  m=3 \
  crush-failure-domain=host \
  crush-osds-per-failure-domain=2 \
  crush-num-failure-domains=4


However, I ran into a persistent issue: I could never mount CephFS or use RBD with the in-kernel client – I kept getting errors about missing feature support. I even upgraded my system kernel to 6.17, but the problem remained. I posted about this online for help but didn’t get any useful responses. Eventually I deleted the EC pool and switched to a replicated pool, and then kernel-based mounting of CephFS/RBD worked perfectly.
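
In case it helps anyone reproduce this, the useful places to look seem to be the kernel log and the CRUSH rule attached to the pool (pool and rule names below are placeholders):
Code:
# What the kernel client logged when the mount/map failed
dmesg | grep -i ceph
# Which CRUSH rule the EC pool is using, and what that rule contains
ceph osd pool get mypool crush_rule
ceph osd crush rule dump mypool_rule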

Your note about the in-kernel CephFS client not supporting MSR rules really resonated with me! I’m wondering if this limitation is explicitly stated in the official Ceph documentation? I remember reading in the docs that the FUSE client has poor performance and is only recommended for testing purposes, so I was confused why the kernel client (which I assumed was the "production-grade" option) wouldn’t work with my EC pool setup.

Again, thanks for sharing your homelab experience—your custom CRUSH rule and MSR insights are really helpful as I try to wrap my head around how EC pools and CRUSH work in practice. Any additional pointers for a newbie like me would be hugely appreciated!
 

No problem, I'm glad it helped someone. :)

I think the best way to learn ceph is to use it. I've learned a lot since I set up my little homelab - enough that these days my day job involves maintaining three mission-critical ceph clusters. The experience I've gained from testing different features and setups and diagnosing the resulting performance issues has been very useful there. Wish I had a more specific recommendation but nothing comes to mind right now.

Regarding the kernel ceph client, it's not just MSR. Any time you consider using a new ceph feature you need to verify if it will work with the kernel client, as it can take a *very* long time for feature support to make it into the kernel tree - if it happens at all. The kernel client is a separate implementation, developed on a separate release cycle from ceph and the kernel maintainers tend to be conservative in what they accept. Getting a new feature into a state where it's accepted for inclusion takes a ton of work.
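
If you want to sanity-check this on a live cluster, something like the following shows what the currently connected clients report and which minimum client release the cluster requires:
Code:
# Feature bits / release level reported by connected clients and daemons
ceph features
# Minimum client release the cluster currently requires
ceph osd dump | grep require_min_compat_client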

The FUSE client is definitely slower than a native mount using the kernel client. Whether that's a problem depends on your setup and workload. In my case I use CephFS as storage for fileservers. I don't want to let end clients speak to Ceph directly, partially for security isolation but also because I know from experience that a badly-behaved CephFS client can mess up access for everyone else. Had a buggy client once that didn't properly release locks to the MDS and it would cause the whole FS to lock up for several minutes until they timed out. So the only hosts that speak directly to Ceph are the proxmox hosts and the fileservers. The fileservers expose SMB and NFS to the end clients, and samba and nfs-ganesha both have VFS/FSAL modules that can talk directly to Ceph so there would be no benefit from using the kernel client.
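
As a rough sketch (share name, path and cephx user are made up), a Samba share that talks to CephFS directly through vfs_ceph looks something like this:
Code:
[projects]
    # path is relative to the CephFS root when going through vfs_ceph
    path = /projects
    vfs objects = ceph
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba
    kernel share modes = no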

For RBD use in VMs, since qemu talks directly to Ceph with librbd, I don't think there is a big performance difference between that and having the in-kernel client expose a device and letting qemu use it. I haven't done any benchmarks, but we have loads of fast NVMe storage on the work clusters and qemu+librbd performs very well. Using the kernel client in my opinion just introduces additional complexity that's not worth it.
 
Thanks again for sharing your experience and insights. This really helps a Ceph newbie like me avoid a lot of detours.

I haven’t used VFS/FSAL modules or qemu+librbd yet, and I always thought they relied on the kernel driver (especially librbd – I seem to have seen it mentioned in kernel mount errors).

From what I’ve read online and heard from other practitioners, replicated pools have long been the de facto production configuration for Ceph, while EC pools in production setups still seem relatively rare.

For my current Ceph setup, I’m still prioritizing erasure-coded pools – I plan to configure them for 62.5% to 66.6% storage efficiency (i.e. k/(k+m), something like k=5,m=3 or k=4,m=2) – and only a small portion of more critical data will use 3-way replicated pools.

Perhaps things will improve with the next release (Tentacle), just as you mentioned.
 
I haven’t used VFS/FSAL modules or qemu+librbd yet, and I always thought they relied on the kernel driver (especially librbd – I seem to have seen it mentioned in kernel mount errors).

Proxmox uses qemu+librbd by default for VMs. It only uses the kernel RBD client if you set krbd=1 on the storage pool.

Containers always use the kernel driver though.
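
For reference, that switch lives on the RBD storage definition in /etc/pve/storage.cfg; a minimal entry looks roughly like this (storage ID and pool name are just examples):
Code:
rbd: ceph-vm
        pool vm_pool
        content images,rootdir
        krbd 0

Flipping krbd to 1 there makes VM disks go through the kernel RBD driver as well; containers map their volumes through it either way, as noted above.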