Proxmox / Ceph / Backups & Replica Policy

Hello everyone!

We've recently upgraded our backbone to 50G and have made some interesting findings in our 3-node cluster. We're running the latest Proxmox 8.3 with Ceph 18.2.
The Ceph VM pool is configured with 3x replication across all 3 nodes (so one copy resides on each node).

When we run backups (both LXC and KVM), Ceph reads the VM image from the block device / placement group that has been set as primary. This primary may reside on either the local server or one of the other two.

To prevent this behaviour, we've now set rbd_read_from_replica_policy to localize. The default behaviour prefers the primary placement group, while the localize setting prefers the replica closest to the server the VM resides on.

For a 3-node cluster with 3x replication this eliminates any network usage while doing backups (all reads are done locally). On our bigger clusters (20-50 nodes) we see noticeably lower network usage while doing backups.
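
As a rough back-of-the-envelope check of these numbers, here is a minimal Python sketch. It assumes one replica per host, uniform placement, and a backup client running on one of the storage hosts; the function name and the model are illustrative and not taken from Ceph.

```python
from fractions import Fraction


def remote_read_fraction(nodes: int, replicas: int = 3, localize: bool = False) -> Fraction:
    """Expected fraction of reads that cross the network (simplified model).

    Assumes one replica per host, uniform placement, and a client running
    on one of the `nodes` hosts. This is only the arithmetic, not Ceph code.
    """
    if replicas > nodes:
        raise ValueError("need at least as many hosts as replicas")
    # default policy: a read is local only if the primary sits on this host
    # localize policy: a read is local if any replica sits on this host
    local_copies = replicas if localize else 1
    return 1 - Fraction(local_copies, nodes)


for n in (3, 20, 50):
    print(n, remote_read_fraction(n), remote_read_fraction(n, localize=True))
```

In this model the remote share on 3 nodes drops from 2/3 to 0, and on 20 nodes from 19/20 to 17/20. The model only counts host-local reads; it ignores any saving from picking a replica in the same rack or datacenter.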

Question: Why is this setting left at "default" and not set to localize out of the box? @fabian (sorry for tagging you directly here but we're doing awesome playing ping-pong together) ;-)

Cheerio

Florian
 
the default one is probably better at distributing the load across disks, but I am not a ceph expert. @aaron ? ;)
 
Please file a feature request at https://bugzilla.proxmox.com, ideally with some numbers that you have seen in your cluster(s).
We can then think about either making this the default or easy to enable/disable from the Proxmox VE tooling.
 
Does the change affect (increase/decrease) the performance of the VM (bandwidth/throughput)?
 
I'll be able to supply a post-mortem here soon. I've been testing this quite extensively in our test lab (development center).
 
@fstrankowski I'm looking to add the option to the Proxmox GUI. Just to be sure, how do you set the value?

"ceph config set client.admin rbd_read_from_replica_policy localize"

?
Code:
rbd config pool set POOLNAME rbd_read_from_replica_policy localize

Regarding the post-mortem: I had to delay my work on that because I have to deal with lots of other stuff with higher in-house priority at the moment. Hopefully I'll be able to prepare something within Q4/2025. I didn't forget you guys ;)
 
Ah, you can also do it on the pool, great :)
If you have a pull request for Proxmox, please be so kind as to link it here so I can review/improve it before there is a chance Proxmox will merge it.
I'd add the option to the Ceph pool configuration UI because it is set on a per-pool basis and not globally.

Ceph -> Pool -> <Poolname> -> Advanced Config

That's where I would put it.
 
Isn't the issue here that the client process requesting the data from the OSDs needs to know where in the CRUSH topology it runs?
In a hyperconverged cluster with only three nodes this may be a very good optimization, because there all data is stored on the local node.
In larger clusters this is not the case any more.
 
Exactly why I proposed the change back in 2025.
 
Hijacking the thread with almost the same question as above.

We have an external Ceph cluster in a stretched setup and PVE, also in a stretched setup, across two datacenters, and I was looking into setting read_from_local_replica (which is no issue on the Ceph side). But if I understand the Ceph documentation and this thread correctly, the client needs to specify its location, which afaik is currently not possible in PVE?

Is there any workaround as of now, or is it implemented in the patch mentioned above? Will the patch work both for HCI clusters and for those running external Ceph?
 
This isn't a PVE question. PVE doesn't do anything outside of what Ceph can do.

As long as you build your CRUSH rule to do what you want, the client will respect it too. Assuming you have the following hierarchy:

DC
  Node
    OSD

you just need to stipulate

SITE A ceph.conf:

[global]
rbd_read_from_replica_policy = localize
[client]
# Match the bucket type and name exactly as defined in your external Ceph CRUSH map
crush_location = datacenter=dcA

SITE B ceph.conf:

[global]
rbd_read_from_replica_policy = localize
[client]
# Match the bucket type and name exactly as defined in your external Ceph CRUSH map
crush_location = datacenter=dcB
 
I understand it might be more Ceph-specific, but how would one set that config in a PVE cluster?

Afaik, according to the docs you can have a custom ceph.conf at /etc/pve/priv/ceph/<STORAGE_ID>.conf to change the client configuration, which is then merged with the storage configuration.

And since /etc/pve is synced across nodes, setting crush_location there won't work, because we need a different CRUSH location depending on whether the host is in datacenter A or B. E.g.:

crush_location = datacenter=A
Code:
datacenter A
  host-1
  host-2
  host-3

crush_location = datacenter=B
Code:
datacenter B
  host-4
  host-5
  host-6


But setting this in PVE doesn't seem possible?
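
To make the requirement concrete, here is a minimal Python sketch of the per-host value that would be needed. The host-to-datacenter mapping is taken from the example layout above; the function name is hypothetical, and where such a fragment would have to live on each node is exactly the open question.

```python
# Host-to-datacenter mapping from the example layout (illustrative only).
SITE_OF_HOST = {
    "host-1": "A", "host-2": "A", "host-3": "A",
    "host-4": "B", "host-5": "B", "host-6": "B",
}


def client_fragment(hostname: str) -> str:
    """Render the ceph.conf fragment a given host would need."""
    site = SITE_OF_HOST[hostname]  # KeyError for unknown hosts is intentional
    return (
        "[global]\n"
        "rbd_read_from_replica_policy = localize\n"
        "[client]\n"
        f"crush_location = datacenter={site}\n"
    )


print(client_fragment("host-4"))
```

The point of the sketch is only that the fragment differs between host-1..3 and host-4..6, so a single file shared by all six nodes cannot carry it.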
 
Again, PVE doesn't need to be involved. This is a matter of Ceph configuration.

You ARE running into the limitation of using a centrally managed ceph.conf, which adds a technical wrinkle. You would need to decouple each node's ceph.conf into separate files PER NODE, at least for one of the sites, and you'd have to be careful to replicate any further changes you make to the central configuration (/etc/pve/ceph.conf) individually to the remote site.

--edit- This can actually still be deployed as a cluster resource. Create a copy of /etc/pve/ceph.conf (e.g. /etc/pve/ceph-s2.conf) and change the symlinks on the remote nodes to point to it instead. That still means keeping two configurations, but that's better than 10.

--edit2- to make sure qemu respects your split configuration, you would want to add

config /etc/ceph/ceph.conf

to your pvesm.conf rbd stanza.
 