Proxmox / Ceph / Backups & Replica Policy

fstrankowski · Mar 4, 2025

Hello everyone!

We've recently upgraded our backbone to 50G and are seeing some interesting results in our 3-node cluster. We're running the latest Proxmox 8.3 with Ceph 18.2.
The Ceph VM pool is configured with 3x replication across all 3 nodes (so one copy resides on each node).

When we're running backups (both LXC and KVM), Ceph reads the VM image from whichever OSD is primary for each placement group. That primary may reside on the local server or on one of the other two.

To prevent this behaviour, we've now set rbd_read_from_replica_policy to localize. The default behaviour always reads from the primary; the localize setting prefers the replica closest to the server the VM is running on.

For a 3-node cluster with 3x replication this eliminates network usage during backups (all reads are served locally); our bigger clusters (20-50 nodes) show noticeably lower network usage during backups.
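
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes at most one replica per node, primaries and replicas spread evenly across the nodes, and a client running on one of the cluster nodes. The function name and the figures are made up for illustration; they are not measurements from our clusters.

```shell
#!/bin/sh
# Estimate how much of a backup read has to cross the network.
# Assumptions (not measured): even placement, at most one replica per node,
# and the client runs on one of the cluster nodes.
# usage: remote_read_gib TOTAL_GIB NODES REPLICAS POLICY
remote_read_gib() {
    total=$1 nodes=$2 replicas=$3 policy=$4
    case "$policy" in
        default)  local_copies=1 ;;           # only the primary is read
        localize) local_copies=$replicas ;;   # any replica may serve the read
        *) echo "unknown policy: $policy" >&2; return 1 ;;
    esac
    # There can never be more usable local copies than nodes.
    if [ "$local_copies" -gt "$nodes" ]; then
        local_copies=$nodes
    fi
    # The chance that a usable copy sits on the local node is local_copies/nodes,
    # so the remote share is (nodes - local_copies)/nodes.
    echo $(( total * (nodes - local_copies) / nodes ))
}
```

Under these assumptions, `remote_read_gib 900 3 3 default` gives 600 and `remote_read_gib 900 3 3 localize` gives 0, while on a 20-node cluster the same calls give 855 and 765. The saving shrinks as the cluster grows, but it does not disappear.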

Question: why is this setting left at "default" rather than "localize" out of the box? @fabian (sorry for tagging you directly here but we're doing awesome playing ping-pong together) ;-)

Cheerio

Florian

fabian · Mar 4, 2025

the default one is probably better at distributing the load across disks, but I am not a ceph expert. @aaron ?

aaron · Mar 4, 2025

Please file a feature request at https://bugzilla.proxmox.com, ideally with some numbers that you have seen in your cluster(s).
We can then think about either making this the default or easy to enable/disable from the Proxmox VE tooling.

sensei_pv · Jul 17, 2025

Does the change affect (increase/decrease) the performance of the VM (bandwidth/throughput)?

fstrankowski · Aug 5, 2025

sensei_pv said:
Does the change affect (increase/decrease) the performance of the VM (bandwidth/throughput)?

I'll be able to supply a post-mortem here soon. I've been testing this quite extensively in our test lab (development center).

spirit · Nov 14, 2025

@fstrankowski I'm looking to add the option to the Proxmox GUI. To be sure, how do you set the value?

"ceph config set client.admin rbd_read_from_replica_policy localize"

?

fstrankowski · Nov 17, 2025

spirit said:
@fstrankowski I'm looking to add the option in the proxmox gui, to be sure, how do you set the value ?

"ceph config set client.admin rbd_read_from_replica_policy localize"

?

Code:

rbd config pool set POOLNAME rbd_read_from_replica_policy localize
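
A minimal sketch of setting the policy on a pool and reading it back afterwards. The function name and the RBD override are my own additions so the sketch can be dry-run on a machine without Ceph; the pool name is whatever your pool is called.

```shell
#!/bin/sh
# Set the read policy on one pool, then read it back to confirm.
# RBD can be overridden (e.g. RBD=echo) to print the commands instead of running them.
RBD=${RBD:-rbd}

# usage: set_pool_read_policy POOLNAME [POLICY]
set_pool_read_policy() {
    pool=$1 policy=${2:-localize}
    "$RBD" config pool set "$pool" rbd_read_from_replica_policy "$policy" &&
    "$RBD" config pool get "$pool" rbd_read_from_replica_policy
}
```

To go back to the cluster default, the pool-level override can be dropped again with `rbd config pool remove POOLNAME rbd_read_from_replica_policy`.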

Regarding the post-mortem: I had to delay my work on that because I have to deal with lots of other stuff with higher in-house priority at the moment. Hopefully I'll be able to prepare something within Q4/2025. I didn't forget you guys.

spirit · Nov 17, 2025

ah you can do it also on the pool, great

fstrankowski · Nov 17, 2025

spirit said:
ah you can do it also on the pool, great

If you have a pull request for Proxmox, please be so kind as to link it here so I can review/improve it before Proxmox merges it.
I'd add the option to the Ceph pool configuration UI, because it is set on a per-pool basis and not globally.

Ceph -> Pool -> <Poolname> -> Advanced Config

That's where I would put it.

EllerholdAG · Jun 5, 2026

There is a patch here: https://lore.proxmox.com/all/20260325035104.2264118-1-k.chai@proxmox.com/T/

Is it safe to just run "ceph config set client rbd_read_from_replica_policy localize" ? The patch implies that some other script must be run for it to work properly? Is this patch still on track for an upcoming release?

fstrankowski · Jun 5, 2026

EllerholdAG said:
There is a patch here: https://lore.proxmox.com/all/20260325035104.2264118-1-k.chai@proxmox.com/T/

Is it safe to just run "ceph config set client rbd_read_from_replica_policy localize" ? The patch implies that some other script must be run for it to work properly? Is this patch still on track for an upcoming release?

Wait for 20.2.1, which ships a fix. And yes, you can use that command.

Sad that no credits were given for my work here.

gurubert · Jun 6, 2026

Isn't the issue here that the client process requesting the data from the OSDs needs to know where in the CRUSH topology it runs?
In a hyperconverged cluster with only three nodes this may be a very good optimization, because there all data is stored on the local node.
In larger clusters this is not the case any more.
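
A toy model of that point, to make it concrete. This is NOT Ceph's implementation: it simply assumes the client prefers the replica with the fewest hierarchy levels between itself and the OSD, and keeps the primary when nothing is closer. The "osd:datacenter:host" notation and all names are hypothetical.

```shell
#!/bin/sh
# Toy model of locality-aware replica selection (an assumption, not Ceph code).
# Each replica is written as "osd:datacenter:host"; the first one is the primary.
# Distance: 0 = same host, 1 = same datacenter, 2 = anywhere else.
# usage: pick_replica CLIENT_DC CLIENT_HOST REPLICA...
pick_replica() {
    client_dc=$1 client_host=$2
    shift 2
    best='' best_dist=3
    for replica in "$@"; do
        osd=${replica%%:*}
        rest=${replica#*:}
        dc=${rest%%:*}
        host=${rest#*:}
        if [ "$host" = "$client_host" ]; then
            dist=0
        elif [ "$dc" = "$client_dc" ]; then
            dist=1
        else
            dist=2
        fi
        # Only a strictly closer replica replaces the current pick,
        # so on a tie the earlier entry (ultimately the primary) stays.
        if [ "$dist" -lt "$best_dist" ]; then
            best=$osd best_dist=$dist
        fi
    done
    echo "$best"
}
```

With a known location, `pick_replica dcA host-1 osd.4:dcB:host-4 osd.0:dcA:host-1 osd.2:dcA:host-2` prints osd.0. With an empty location, `pick_replica "" "" osd.4:dcB:host-4 osd.0:dcA:host-1` prints osd.4, the primary: in this model a client that does not know where it sits gains nothing from localize.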

fstrankowski · Jun 8, 2026

gurubert said:
Isn't the issue here that the client process requesting the data from the OSDs needs to know where in the CRUSH topology it runs?
In a hyperconverged cluster with only three nodes this may be a very good optimization, because there all data is stored on the local node.
In larger clusters this is not the case any more.

That is exactly why I proposed the change back in 2025.

EllerholdAG · Jun 8, 2026

So just using the command above would do nothing, because the clients don't know which OSD to contact?

AntonJ · Jun 11, 2026

Hijacking the thread with almost the same question as above.

We have an external Ceph cluster in a stretched setup, and PVE is also stretched across the same two datacenters. We were looking into setting rbd_read_from_replica_policy (which is no issue on the Ceph side). But if I understand the Ceph documentation and this thread correctly, the client needs to specify its location, which afaik is not currently possible in PVE?

Is there any workaround as of now, or is it implemented in the patch mentioned above? Will the patch work both for HCI clusters and for those running external Ceph?

alexskysilk · Jun 11, 2026

This isn't a PVE question. PVE doesn't do anything outside of what Ceph can do.

As long as you build your CRUSH rule to do what you want, the client will respect it too. Assuming you have the following hierarchy:

DC
  Node
    OSD

you just need to stipulate

SITE A ceph.conf:

[global]
rbd_read_from_replica_policy = localize
[client]
# Match the bucket type and name exactly as defined in your external Ceph CRUSH map
crush_location = datacenter=dcA

SITE B ceph.conf:

[global]
rbd_read_from_replica_policy = localize
[client]
# Match the bucket type and name exactly as defined in your external Ceph CRUSH map
crush_location = datacenter=dcB
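
Since the two files differ only in the bucket name, the fragment can be generated instead of hand-edited. A minimal sketch, with a hypothetical function name; the bucket type and name are whatever your CRUSH map actually uses.

```shell
#!/bin/sh
# Print the ceph.conf fragment shown above for one site.
# usage: render_site_conf BUCKET_TYPE BUCKET_NAME
render_site_conf() {
    bucket_type=$1 bucket_name=$2
    if [ -z "$bucket_type" ] || [ -z "$bucket_name" ]; then
        echo "usage: render_site_conf BUCKET_TYPE BUCKET_NAME" >&2
        return 1
    fi
    # Same layout as the hand-written example: policy in [global],
    # the client's own position in the CRUSH hierarchy in [client].
    printf '[global]\nrbd_read_from_replica_policy = localize\n[client]\ncrush_location = %s=%s\n' \
        "$bucket_type" "$bucket_name"
}
```

`render_site_conf datacenter dcA` prints the SITE A fragment and `render_site_conf datacenter dcB` the SITE B one. The function only prints; where the output ends up is a separate decision.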

AntonJ · Jun 12, 2026

I understand it might be more Ceph-specific, but how would one set that config in a PVE cluster?

Afaik, according to the docs, you can have a custom ceph.conf at /etc/pve/priv/ceph/<STORAGE_ID>.conf to change the client configuration, which is then merged with the storage configuration.

But since /etc/pve is synced across all nodes, setting crush_location there won't work: we need a different CRUSH location depending on whether the host is in datacenter A or B. E.g.:

crush_location = datacenter=A

Code:

datacenter A
  host-1
  host-2
  host-3

crush_location = datacenter=B

Code:

datacenter B
  host-4
  host-5
  host-6

But setting this in PVE doesn't seem possible?

alexskysilk · Jun 12, 2026

Again, PVE doesn't need to be involved. This is a matter of Ceph configuration.

You ARE running into the limitation of using a centrally managed ceph.conf, which adds a technical wrinkle. You would need to decouple the nodes' individual ceph.conf into separate files PER NODE, at least for one of the sites, and you'd have to be careful to replicate any further changes you make to the central configuration (/etc/pve/ceph.conf) individually to the remote site.

--edit- this can actually still be deployed as a cluster resource. Create a copy of /etc/pve/ceph.conf (e.g. /etc/pve/ceph-s2.conf) and change the symlinks on the remote nodes to point to it instead. That still means keeping two configurations, but that's better than 10.

--edit2- to make sure qemu respects your split configuration, you would want to add

config /etc/ceph/ceph.conf

to the rbd stanza of your PVE storage configuration (/etc/pve/storage.cfg).
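
A rough sketch of the symlink switch, assuming /etc/ceph/ceph.conf is a symlink on every node. The host names are borrowed from the example earlier in the thread and the host-to-site mapping is hypothetical; adapt both. By default the function only prints what it would do.

```shell
#!/bin/sh
# Point /etc/ceph/ceph.conf at the per-site copy for a node.
# The host names and the site mapping below are placeholders.
site_conf_for_host() {
    case "$1" in
        host-1|host-2|host-3) echo /etc/pve/ceph.conf ;;     # site A keeps the central file
        host-4|host-5|host-6) echo /etc/pve/ceph-s2.conf ;;  # site B uses the copy
        *) echo "no site known for host: $1" >&2; return 1 ;;
    esac
}

# usage: relink_ceph_conf [HOSTNAME]   (defaults to the local short hostname)
relink_ceph_conf() {
    host=${1:-$(hostname -s)}
    target=$(site_conf_for_host "$host") || return 1
    # DRY_RUN=1 (the default here) prints the command instead of running it.
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "ln -sfn $target /etc/ceph/ceph.conf"
    else
        ln -sfn "$target" /etc/ceph/ceph.conf
    fi
}
```

Run `relink_ceph_conf` on each node to see the planned change, then again with `DRY_RUN=0` once the output looks right. The two files still have to be kept in sync by hand apart from crush_location.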
