Slow CEPH performance

deepcloud

Hi,

I have Proxmox 8.1 running and will be upgrading to Proxmox 8.2.2 soon.

We had one of the nodes crash and we are seeing very slow rebuild speeds.

Each node has 2x AMD EPYC 7002-series 64-core CPUs, 2 TB RAM, and 4x 15.36 TB WD enterprise-grade SN650 NVMe SSDs. We have 10G for inter-VM communication and a dedicated dual-redundant active/passive 100G network for the Ceph cluster.

So there is enough horsepower, I am sure; I would like to understand how to speed this up.

Thanks in advance

(attached screenshot: 1716403338721.png)
 
This always happens toward the end of a rebalance, as there are fewer and fewer OSD targets left. The default tuning is meant to ensure that rebalance doesn't clobber client IO, so toward the end it ends up being very conservative.

In your case, there are two things you can do.

1. Increase your in-flight rebalance IOs. The tunables are osd_max_backfills and osd_recovery_max_active. If your devices are set up with proper device classes, they should already be a bit higher than default, but you can keep raising the values until your guest IO begins to suffer (see the example commands after this list).

https://www.thomas-krenn.com/en/wiki/Ceph_-_increase_maximum_recovery_&_backfilling_speed

2. Redeploy your OSDs with multiple OSDs per drive. NVMe drives can handle a lot of IO, but as long as we're still limited to BlueStore as the storage backend for OSDs, each drive gets a single logical queue. You can benefit by splitting each drive into 4-8 OSDs, which makes your "last OSD" that much smaller and lets you more fully utilize your NVMe's throughput.
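
As a rough sketch of how those tunables can be raised at runtime (the numbers below are placeholder starting points, not recommendations for your cluster; on newer Ceph releases that use the mClock scheduler, the override flag may also be needed before these values take effect):

ceph config set osd osd_mclock_override_recovery_settings true   # only if on an mClock-based release
ceph config set osd osd_max_backfills 4                          # placeholder value; raise gradually
ceph config set osd osd_recovery_max_active 8                    # placeholder value; watch guest IO
ceph config show osd.0 | grep -E 'osd_max_backfills|osd_recovery_max_active'   # verify on a sample OSD

Back the values off (or reset them to defaults with ceph config rm) once the rebalance finishes.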
 
Hi Alex,
Thanks for the above info. How can we redeploy with multiple OSDs per drive? Any inputs on this?
 
The RIGHT way is to remove OSDs one by one, but in your case that is not practical (your pool is too full for that).

So what remains is either you remove about 15TB of raw data (~5.5TB used) before you start, or you change WHOLE NODES at a time, which would have you operating degraded until the rebalance is complete. If it's an option, wiping the whole storage and starting from scratch / restoring from backup would be the safest and probably quickest route to full health, but it would require some downtime.
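
If it helps in judging which route is viable, per-OSD and per-pool utilization can be checked with the standard tools before deciding:

ceph osd df tree    # %USE per OSD, grouped by host
ceph df detail      # stored vs. raw usage per pool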

I'll draw up the basic process here:
1. Mark an OSD out. Wait for the rebalance to complete.
2. Stop (down) and destroy the OSD once it is empty.
3. ceph-volume lvm batch --osds-per-device 4 /dev/nvmeXn1
4. If necessary, ceph-volume lvm activate --all
5. Wait until the rebalance is complete.

Repeat for all drives; a sketch of the full cycle for one drive follows.
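
Roughly, assuming the drive being redone currently hosts osd.12 (a placeholder ID; substitute your own, plus the real device path), and using ceph osd purge as one way to remove the emptied OSD:

ceph osd out osd.12                              # mark it out so data drains off
# wait until `ceph -s` shows all PGs active+clean again
systemctl stop ceph-osd@12                       # now take the daemon down
ceph osd purge osd.12 --yes-i-really-mean-it     # remove it from CRUSH, auth, and the OSD map
ceph-volume lvm zap /dev/nvmeXn1 --destroy       # wipe the old LVM/BlueStore metadata
ceph-volume lvm batch --osds-per-device 4 /dev/nvmeXn1
ceph-volume lvm activate --all                   # if the new OSDs don't come up on their own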

Edit: whoops, flipped down and out. Cue George Carlin's football and baseball routine.
 
