Is anyone dynamically updating CRUSH rules to maintain storage/compute proximity in HCI setups?

Apr 30, 2025
I'm just learning about all this stuff, and from what I can tell Ceph and Proxmox have no communication regarding where a VM is in relation to its storage.

As I understand it, this is what CRUSH rules are for, but the default rule knows nothing about where a VM runs, and rules are static, so they would need to be updated after each VM migration.

Perhaps they could even be maintained by some sort of daemon that regularly polls and keeps the CRUSH rules in line with the VM state.

Is anyone doing this? Are there 3rd party or 1st party tools to do it?
 
Interesting way to think about that, never gave it a thought really.

First, I don't think CRUSH rules can be based on where the volumes of the VMs are located. A rule applies to a whole pool, and the lowest level it can select is the OSD, the daemon that directly manages the physical storage device. I am not aware of any feature that lets you manually decide which data goes into which PG, especially since Ceph rebalances data whenever the cluster changes, so the placement won't stay static.
I think you can look up which PGs / primary OSDs the objects of a volume map to, though.
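A rough sketch of how that lookup could work, in Python. It assumes format 2 RBD images, whose data objects are named `<block_name_prefix>.<16-digit hex object number>`, and it assumes you collect the JSON printed by `ceph osd map <pool> <object> --format json` yourself and that this JSON carries an `acting_primary` field (check this against your release). The helper names are made up for illustration.

```python
import json
from collections import Counter


def rbd_object_names(block_name_prefix: str, image_size: int, object_size: int) -> list[str]:
    """List the RADOS object names that can back an RBD image.

    block_name_prefix, image_size and object_size are assumed to come from
    `rbd info --format json <pool>/<image>`.
    """
    count = -(-image_size // object_size)  # ceiling division
    return [f"{block_name_prefix}.{i:016x}" for i in range(count)]


def primary_histogram(osd_map_outputs: list[str]) -> Counter:
    """Count how often each OSD is the acting primary.

    Each entry is assumed to be the JSON text of one
    `ceph osd map <pool> <object> --format json` call.
    """
    return Counter(json.loads(out)["acting_primary"] for out in osd_map_outputs)
```

Feeding every object name of one image through `ceph osd map` and then through `primary_histogram` would show how the primaries for that volume are spread over the OSDs. On any pool with a reasonable number of PGs, expect them to be spread over most of the cluster.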

Let's say we still do it...

Honestly, I don't think this would be suitable for any cluster setup. With VM migrations in particular it could cause heavy additional traffic, either from Ceph moving data (if we change the placement of the PGs) or from live-migrating VMs (if we move each VM to the node that holds most of the primary OSDs for its volumes). That traffic would compete with normal Ceph replication and rebalancing traffic. So basically you should just size your network generously enough.

With a replication factor of 3, each PG has one primary OSD (which serves client I/O by default) and two replica OSDs. In a 3-node cluster every node holds a copy of each PG, but the primary is local to the VM's host for only about a third of the PGs, and in larger clusters the share drops further. And even where the primary is local, it won't necessarily stay that way: the primary can shift between OSDs whenever the cluster rebalances, so any locality you had can disappear without warning. You'd need to constantly re-pin PG primaries as VMs migrate, which Ceph doesn't natively support at that granularity, as far as I know. In a larger cluster there's also an upside to this. A volume's data is spread across many OSDs on many nodes, so multiple nodes contribute to serving it, potentially delivering more aggregate throughput than any single "local" OSD ever could. Either way, the real answer is the same:
Scale your storage network appropriately for your workload, and let Ceph do its job.
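To put a number on the locality argument, here is a small Python sketch. It assumes one replica per host and primaries spread evenly across hosts, which is a simplification of what CRUSH does. The function names are invented for this post.

```python
import random


def primary_local_probability(nodes: int) -> float:
    """Chance that a PG's primary sits on one given node (even spread assumed)."""
    return 1.0 / nodes


def simulate_primary_local(nodes: int, replicas: int = 3, pgs: int = 100_000, seed: int = 0) -> float:
    """Estimate the same probability by placing PGs at random.

    Requires nodes >= replicas. The VM is assumed to run on node 0.
    """
    rng = random.Random(seed)
    local = 0
    for _ in range(pgs):
        hosts = rng.sample(range(nodes), replicas)  # one replica per host
        if hosts[0] == 0:  # the first pick plays the role of the primary
            local += 1
    return local / pgs
```

Under these assumptions a 3-node cluster ends up with roughly one third of the primaries local to any given VM, and a 6-node cluster with roughly one sixth.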

That's just my opinion, but I'm open to discussion, and I'd be happy to be proven wrong ^^
 
"That's just my opinion, but I'm open to discussion, and I'd be happy to be proven wrong"
AFAIK that's all right. Which OSD you talk to for each chunk of data is supposed to kinda look random, so you spread the load out around the cluster. The more disks you have the more simultaneous reads you can do with only some small percentage of reads needing to wait in line.

But, assuming I've understood what you want to accomplish by changing the CRUSH rules, there is an option that could get you close to what you are asking for: rbd_read_from_replica_policy (see https://docs.ceph.com/en/pacific/rbd/rbd-config-ref/).

You can set it to "localize". I don't think that does anything for EC pools, but in a replicated pool it changes how reads are done. Instead of going to the primary OSD for the PG, the client will try to find the replica that is "closest". Presumably that means reading from an OSD in the local machine instead of going over the network whenever possible. This won't really change anything for writes: a write still goes to the primary OSD and isn't done until the primary says it is, which happens only after the replicas have committed it too, so writes are always going to spend some time on the network.

Also you should carefully consider the side-effects of a setting like this. It might make reads faster and lower latency while the cluster is healthy, but when an OSD has failed your data could be forced over the network again. And if you've come to rely on the local read, this degraded state could give you a bad day. Not saying don't do it, just saying you should keep that in mind and maybe occasionally test.
 
I did not know about this setting! Thanks. But even though it sounds promising at first, I can't quite get my head around it.

This setting only affects RBD reads and has no influence on how the PGs are balanced.
So in the best case you might get some read improvement by always choosing the local copy of a PG, or rather the "closest" OSD, whatever that means. Probably an OSD in the same bucket, or the nearest bucket, according to the CRUSH map (same host, same rack, etc.). That can be the same host: in a 3-node cluster with 3 replicas it always is. But who says the closest OSD for a PG even resides on the node where the data is wanted? As soon as you have 4 nodes or more, some reads will use the network as well. Sure, you can serve part of the reads from the local node, but in bigger clusters most of them will go over the network anyway. I think this setting is rather designed for stretched clusters, or nodes that are far apart, to keep reads nearby and minimize latency.
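To make the "closest" idea concrete, here is a toy model in Python. It is not Ceph's implementation, only a sketch of the guess above: compare the client's position in the hierarchy with each replica's position and prefer the replica that shares the lowest-level bucket. The level names and the location dictionaries are hypothetical.

```python
# Hierarchy levels from closest to farthest (hypothetical, modelled on common CRUSH bucket types).
LEVELS = ("host", "rack", "datacenter")


def distance(client: dict, osd: dict) -> int:
    """Return 0 for the same host, 1 for the same rack, and so on.

    Returns len(LEVELS) when client and OSD share no bucket at all.
    """
    for rank, level in enumerate(LEVELS):
        if level in client and client.get(level) == osd.get(level):
            return rank
    return len(LEVELS)


def pick_read_osd(client: dict, replicas: dict) -> int:
    """Pick the OSD id closest to the client.

    `replicas` maps OSD id -> location and is assumed to list the primary first,
    so ties fall back to the primary.
    """
    return min(replicas, key=lambda osd_id: distance(client, replicas[osd_id]))
```

With a client on host pve1 and replicas on pve1, pve2 and pve5, this model picks the pve1 OSD. If no replica lives on pve1, the read leaves the node no matter what the policy says, which is exactly the 4-node case.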

Also, a big BUT that I can't resolve in my head: what happens when you have several high-load VMs and they are not evenly distributed over the cluster?
This could lead to performance issues from overloading the local node / closest OSDs. Ceph normally distributes the primary OSDs over the whole cluster for a reason: to balance the load as well as possible. With this setting you are more likely to build yourself a bottleneck than to see a big improvement, especially in small clusters.
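A back-of-the-envelope version of that concern in Python. It assumes a cluster where every node holds a copy of all data (for example 3 nodes with 3 replicas), evenly spread primaries, and a local-read policy that serves every read from the VM's own node. The function name and the numbers are invented.

```python
def read_load_per_node(vm_reads: dict, nodes: list, localize: bool) -> dict:
    """Return the read IOPS each node's OSDs have to serve.

    vm_reads maps node -> read IOPS issued by the VMs running on that node.
    """
    total = sum(vm_reads.values())
    if localize:
        # Every read is served by the node the VM runs on.
        return {n: float(vm_reads.get(n, 0)) for n in nodes}
    # Primaries are spread evenly, so each node serves an equal share.
    return {n: total / len(nodes) for n in nodes}


nodes = ["pve1", "pve2", "pve3"]
vm_reads = {"pve1": 9000, "pve2": 600, "pve3": 300}
spread = read_load_per_node(vm_reads, nodes, localize=False)
pinned = read_load_per_node(vm_reads, nodes, localize=True)
```

In this made-up example the even spread puts 3300 read IOPS on each node, while local reads put all 9000 on the OSDs of pve1 and leave the other two nodes nearly idle.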

I might only use this setting to work around bad network design, but then, as you already mentioned @tomservo:
"Also you should carefully consider the side-effects of a setting like this. It might make reads faster and lower latency while the cluster is healthy, but when an OSD has failed your data could be forced over the network again. And if you've come to rely on the local read, this degraded state could give you a bad day. Not saying don't do it, just saying you should keep that in mind and maybe occasionally test."