CEPH Erasure Coded Configuration: Review/Confirmation

It is absolutely possible to place multiple chunks on each server while staying resilient to host failures. You just need to create your pools manually instead of using the Proxmox UI. I've been doing this for years on a 3-node homelab, where I have an EC pool with k=7, m=5. I definitely wouldn't recommend such a wide EC profile for any sort of production setup, but I needed to maximize space efficiency while still being able to take down a host without losing redundancy. With a placement rule that puts 4 chunks on each host, I can take a host down for maintenance and still survive a drive failure (4 chunks offline plus 1 failed drive is still within m=5).

I used to do this with a custom CRUSH rule[*1]:
Code:
rule 3host4osd {
    id 3
    type erasure
    step set_chooseleaf_tries 1000
    step set_choose_tries 1000
    step take default class hdd
    # pick 3 host buckets...
    step choose indep 3 type host
    # ...then 4 OSDs inside each of them: 3 x 4 = 12 = k+m placements
    step chooseleaf indep 4 type osd
    step emit
}
This worked well, but obviously you need to know what you're doing as it's easy to mess it up.
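
For anyone who wants to go that route, the usual workflow is to decompile the CRUSH map, paste the rule in, recompile and inject it, then point an EC pool at the rule. A rough sketch (profile, pool and file names are just examples, as are the PG counts):

Code:
# dump and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit crushmap.txt and add the rule above, then recompile and inject it
crushtool -c crushmap.txt -o crushmap.new.bin
ceph osd setcrushmap -i crushmap.new.bin

# optional sanity check: does rule id 3 actually produce 12 distinct placements?
crushtool -i crushmap.new.bin --test --rule 3 --num-rep 12 --show-mappings

# create a profile with matching k/m and a pool that uses the custom rule
ceph osd erasure-code-profile set ec7p5 k=7 m=5
ceph osd pool create ecpool 128 128 erasure ec7p5 3host4osd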

But in Squid, this is much easier. You just use the new Multi-Step Retry (MSR) CRUSH rules[*2]. No need to mess around writing your own CRUSH rules and injecting them into the mons. Just create an EC profile like this and use it to create your pool:
Code:
ceph osd erasure-code-profile set 3host4osd \
    k=7 \
    m=5 \
    crush-failure-domain=host \
    crush-osds-per-failure-domain=4 \
    crush-num-failure-domains=3
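
From there it's just a normal pool create; Ceph generates the MSR rule from the profile for you. A rough sketch (pool name and PG counts are only examples; allow_ec_overwrites is needed if the pool will back RBD or CephFS):

Code:
# create the pool straight from the profile
ceph osd pool create ecpool-msr 128 128 erasure 3host4osd

# needed if the pool will hold RBD images or CephFS data
ceph osd pool set ecpool-msr allow_ec_overwrites true

# see which CRUSH rule the pool ended up with, and list all rules
ceph osd pool get ecpool-msr crush_rule
ceph osd crush rule ls

# check where chunks actually land; map OSD ids to hosts with 'ceph osd tree'
ceph pg ls-by-pool ecpool-msr
ceph osd tree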

One thing to keep in mind is that MSR rules require client support, and the in-kernel CephFS client doesn't support them. So make sure you're using the FUSE client instead.
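
For CephFS that means mounting with ceph-fuse instead of mount -t ceph. Something like this (client name, mount point and the fstab line are just examples):

Code:
# the in-kernel client doesn't understand MSR rules, so use the FUSE client
apt install ceph-fuse
ceph-fuse --id admin /mnt/cephfs

# or the equivalent fstab entry
none  /mnt/cephfs  fuse.ceph  ceph.id=admin,_netdev,defaults  0 0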

[*1] https://docs.ceph.com/en/latest/rados/operations/crush-map/#custom-crush-rules
[*2] https://docs.ceph.com/en/latest/dev/crush-msr/#msr
Hi there!

First off, thanks so much for sharing your experience with custom CRUSH rules and MSR in Ceph. This is super valuable for someone like me who’s still new to Ceph (total beginner here).

I set up a 4-node test cluster (Squid 19.2.2) back in April this year by following the official docs, with 4 OSDs per node. I created erasure-coded (EC) pools to test RGW, CephFS, and RBD separately, using this EC profile:

Code:
ceph osd erasure-code-profile set 4n16d \
  plugin=isa \
  technique=reed_sol_van \
  k=5 \
  m=3 \
  crush-failure-domain=host \
  crush-osds-per-failure-domain=2 \
  crush-num-failure-domains=4
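
For context, the CephFS side of the test was set up roughly like this (pool names, PG counts and the fs name are just what I used in the lab, from memory):

Code:
ceph osd pool create cephfs_meta 32 32 replicated
ceph osd pool create cephfs_data 64 64 erasure 4n16d
# EC data pools need overwrites enabled
ceph osd pool set cephfs_data allow_ec_overwrites true
# --force because an EC pool as the default data pool is normally discouraged
ceph fs new testfs cephfs_meta cephfs_data --force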


However, I ran into a persistent issue: I could never mount CephFS or map RBD images with the in-kernel client; I kept getting errors about missing features in the kernel client. I even upgraded my kernel to 6.17, but the problem remained. I posted about this online for help, but didn’t get any effective responses. Eventually I deleted the EC pool and switched to a replicated pool, and then kernel-based CephFS mounting and RBD mapping worked perfectly.
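
If it helps to narrow this down, I can re-run checks like these and post the output (generic pool name below):

Code:
# feature bits advertised by connected clients and daemons
ceph features

# minimum client release the cluster currently requires
ceph osd dump | grep require_min_compat_client

# which CRUSH rule the EC pool was using
ceph osd pool get <ec-pool-name> crush_rule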

Your note about the in-kernel CephFS client not supporting MSR rules really resonated with me! Is this limitation explicitly stated anywhere in the official Ceph documentation? I remember reading in the docs that the FUSE client has poor performance and is only recommended for testing purposes, so I was confused why the kernel client (which I assumed was the "production-grade" option) wouldn’t work with my EC pool setup.

Again, thanks for sharing your homelab experience—your custom CRUSH rule and MSR insights are really helpful as I try to wrap my head around how EC pools and CRUSH work in practice. Any additional pointers for a newbie like me would be hugely appreciated!