Slow IOPS while changing OSDs to up & in

Sep 18, 2022
Hi,

I have a cluster with 6 nodes and 12 NVMe disks.
I have 10 active OSDs and 2 down.
When I bring the OSDs up & in, all VMs have very low IOPS and start getting stuck.
Ceph recovery runs at about 96 MB/s, which is also very slow; it says 6 hours to finish.
What can I do to bring these OSDs up without the performance impact?

Thank you!
 
Since Ceph Quincy there is automatic QoS management, which prioritizes client IOPS vs. replication.

Before Quincy, the only way was to reduce the number of parallel PG recoveries and also add some sleep.

Personally, I'm using:

ceph config set global osd_recovery_sleep_ssd 0.01
ceph config set global osd_recovery_max_active_ssd 3
ceph config set global osd_recovery_op_priority 1
ceph config set global osd_scrub_during_recovery false
 
Since Quincy I get a lot of performance issues. I see that "osd_recovery_max_active" and "osd_max_backfills" have a default of "1000" - it is crazy!
- Do you know what the defaults were before Quincy?
- If I change "osd_op_queue" to wpq, I can change these values, so I just need the default values from before Quincy (see the sketch below).
- Why doesn't Proxmox make "wpq" the default and set lower values? It is a big performance impact!
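For reference, a minimal sketch of the wpq route, assuming the commonly cited pre-Quincy defaults of osd_max_backfills = 1 and osd_recovery_max_active = 3 (double-check them against the release notes for your version):

ceph config set global osd_op_queue wpq
ceph config set global osd_max_backfills 1
ceph config set global osd_recovery_max_active 3

Note that osd_op_queue is not a runtime option; the change only takes effect after the OSDs are restarted.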
 
We are struggling with the same issue here... very high IO across the cluster during the rebalance when an OSD is added/removed. We're on Ceph 17.2.5

Do I just run the above 4 commands as root on any of the nodes? Is there a way to see if the new setting has been applied?
 
Do I just run the above 4 commands as root on any of the nodes?
yep

Is there a way to see if the new setting has been applied?
ceph config dump
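For example, to compare the stored value with what a running OSD actually uses (osd.0 is just a placeholder here; pick one of your own OSD IDs and run the second command on the node hosting it):

ceph config dump | grep recovery
ceph daemon osd.0 config get osd_recovery_sleep_ssd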


I don't know if it's working with the new QoS scheduler, but there are also knobs that are 100% working:
https://docs.ceph.com/en/quincy/dev..._cmp_study/#non-default-ceph-recovery-options

These should slow recovery down to the minimum speed.
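As a rough sketch of the kind of throttling that page describes (the exact recommended values are in the link; the option names below are standard Ceph settings, the values only examples):

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep 0.1

A higher osd_recovery_sleep means a longer pause between recovery ops, i.e. slower recovery and more headroom for client I/O.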
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!