Slow IOPS while changing OSDs to up & in

Sep 18, 2022
Hi,

I have a cluster with 6 nodes and 12 NVMe disks.
I have 10 active OSDs and 2 down.
When I change the OSDs to up & in, all VMs get very low IOPS and start to hang.
The Ceph recovery is also very slow, at about 96 MB/s; it says 6 hours to finish.
What can I do to bring these OSDs up without the performance impact?

Thank you!
 
Since Ceph Quincy there is automatic QoS management (the mClock scheduler), which prioritizes client IOPS vs. replication.

Before Quincy, the only way was to reduce the number of parallel PG recoveries and also add some sleep.

Personally, I'm using:

ceph config set global osd_recovery_sleep_ssd 0.01
ceph config set global osd_recovery_max_active_ssd 3
ceph config set global osd_recovery_op_priority 1
ceph config set global osd_scrub_during_recovery false
 
Since Quincy I get a lot of performance issues. I see that "osd_recovery_max_active" and "osd_max_backfills" have a default of "1000" - it is crazy!
- Do you know what the defaults were before Quincy?
- If I change "osd_op_queue" to wpq I can change these values, so I just need the default values from before Quincy (see the sketch below).
- Why does Proxmox not make "wpq" the default and set lower values? It is a big performance impact!
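A minimal sketch of switching back to wpq, assuming the commonly cited pre-Quincy defaults of osd_max_backfills = 1 and osd_recovery_max_active = 0 (0 means the _hdd/_ssd variants of 3/10 are used); note that osd_op_queue only takes effect after the OSDs are restarted:

ceph config set osd osd_op_queue wpq
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 0
# restart the OSDs on each node so the scheduler change takes effect
systemctl restart ceph-osd.target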
 

We are struggling with the same issue here... very high I/O across the cluster during the rebalance when an OSD is added/removed. We're on Ceph 17.2.5.

Do I just run the above 4 commands as root on any of the nodes? Is there a way to see if the new settings have been applied?
 
Do I just run the above 4 commands as root on any of the nodes?
yep
Is there a way to see if the new setting has been applied?
ceph config dump
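For example, you can narrow the dump down to the recovery-related options, or query a single option for one daemon (the grep pattern and osd.0 are just illustrations):

ceph config dump | grep -E 'osd_recovery|osd_max_backfills|osd_scrub_during_recovery'
ceph config get osd.0 osd_recovery_sleep_ssd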


I don't know if it's working with the new QoS scheduler, but there are also knobs that are 100% working:
https://docs.ceph.com/en/quincy/dev..._cmp_study/#non-default-ceph-recovery-options


Those should slow recovery down to the minimum speed.
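Side note: if you stay on the default mClock scheduler in Quincy instead of switching to wpq, a minimal sketch would be to pick the profile that favors client I/O over recovery (osd_mclock_profile is the relevant option; high_client_ops, balanced and high_recovery_ops are the built-in profiles):

# prefer client I/O over recovery/backfill with the built-in mClock profile
ceph config set osd osd_mclock_profile high_client_ops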