Slow IOPS while changing OSDs to up & in

Sep 18, 2022
Hi,

I have a cluster with 6 nodes and 12 NVMe disks.
I have 10 active OSDs and 2 down.
When I bring the OSDs up & in, all VMs have very low IOPS and start getting stuck.
Ceph recovery runs at about 96 MB/s, which is also very slow; it says 6 hours to finish.
What can I do to bring these OSDs up without the performance impact?

Thank you!
 
Since Ceph Quincy there is automatic QoS management, which prioritizes client IOPS vs. replication.

Before Quincy, the only way was to reduce the number of parallel PG recoveries and also add some sleep.

Personally, I'm using:

ceph config set global osd_recovery_sleep_ssd 0.01
ceph config set global osd_recovery_max_active_ssd 3
ceph config set global osd_recovery_op_priority 1
ceph config set global osd_scrub_during_recovery false
 
Since Quincy I get a lot of performance issues. I see that "osd_recovery_max_active" and "osd_max_backfills" have a default of "1000" - it is crazy!
- Do you know what the defaults were before Quincy?
- If I change "osd_op_queue" to wpq, I can change these values, so I just need the default values from before Quincy (see the sketch below).
- Why doesn't Proxmox make "wpq" the default and set lower values? It is a big performance impact!
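For reference, a minimal sketch of the wpq route, assuming the commonly cited pre-Quincy defaults of osd_max_backfills = 1 and osd_recovery_max_active = 3 (double-check them against the release notes for your version):

ceph config set global osd_op_queue wpq
ceph config set global osd_max_backfills 1
ceph config set global osd_recovery_max_active 3

Note that osd_op_queue is not a runtime option; the change only takes effect after the OSDs are restarted.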
 
We are struggling with the same issue here... very high IO across the cluster during the rebalance when an OSD is added/removed. We're on Ceph 17.2.5

Do I just run the above 4 commands as root on any of the nodes? Is there a way to see if the new setting has been applied?
 
Do I just run the above 4 commands as root on any of the nodes?
yep

Is there a way to see if the new setting has been applied?
ceph config dump
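For example, to compare the stored value with what a running OSD actually uses (osd.0 is just a placeholder here; pick one of your own OSD IDs and run the second command on the node hosting it):

ceph config dump | grep recovery
ceph daemon osd.0 config get osd_recovery_sleep_ssd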


I don't know if it's working with the new QoS scheduler, but there are also knobs that are 100% working:
https://docs.ceph.com/en/quincy/dev..._cmp_study/#non-default-ceph-recovery-options

These should slow recovery down to the minimum speed.
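As a rough sketch of the kind of throttling that page describes (the exact recommended values are in the link; the option names below are standard Ceph settings, the values only examples):

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep 0.1

A higher osd_recovery_sleep means a longer pause between recovery ops, i.e. slower recovery and more headroom for client I/O.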
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!