Slow CEPH Rebuild

deepcloud

Hi,

I may sound very demanding, as I am cribbing about a 500+ MBps Ceph rebuild speed, but considering the hardware we have, it seems too slow. Am I wrong, or am I doing something wrong?

The hardware we have is 6x 6.4 TB WD SN630 enterprise NVMe SSDs: 2 per node x 3 nodes = 6 SSDs. Each of these SSDs can do over 2000 MBps of writes (that's bytes, not bits, and I know that's the sequential throughput).

The network is 100G Ethernet, with Mellanox CX455 NICs and an Arista 7060CX2 switch, so it's not a network bottleneck in any way.

So my question is: how do I get better throughput? I am adding 2 disks (on the 3rd node) to my existing 4 disks (on 2 nodes).

Any suggestions?
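A quick way to watch the actual rebuild rate while experimenting (a minimal sketch assuming a standard Ceph CLI; nothing here is specific to this cluster):

# the "recovery:" line in the status output reports the rate in MiB/s and objects/s
ceph -s
# refresh every few seconds to see how the rate reacts to tuning
watch -n 5 ceph -s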

 
Ceph ensures that while a recovery operation is happening, it does not choke the cluster network with recovery data. This throttling is controlled by the following flags:

  • osd max backfills: This is the maximum number of backfill operations allowed to or from a single OSD. The higher the number, the quicker the recovery, which might impact overall cluster performance until recovery finishes.
  • osd recovery max active: This is the maximum number of active recovery requests per OSD at one time. The higher the number, the quicker the recovery, which might impact overall cluster performance until recovery finishes.
  • osd recovery op priority: This is the priority set for recovery operations relative to client operations. Raising it makes recovery more aggressive, which might cause performance degradation for clients until recovery completes.
Keep in mind that changing these values can impact the performance of the cluster; clients may see slower responses.

These are the default values:
osd_max_backfills = 1
osd_recovery_max_active = 3
osd_recovery_op_priority = 3
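To confirm what a running OSD is actually using, you can query it directly (a minimal sketch; osd.0 is just a placeholder id, and "ceph config show" needs Mimic or later):

ceph config show osd.0 | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'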

Changing these values can result in faster recovery.
 
The following commands appear to be sufficient to speed up backfilling/recovery. On the admin node, run:
ceph tell 'osd.*' injectargs --osd-max-backfills=2 --osd-recovery-max-active=6
or
ceph tell 'osd.*' injectargs --osd-max-backfills=3 --osd-recovery-max-active=9

To set back to default, run:
ceph tell 'osd.*' injectargs --osd-max-backfills=1 --osd-recovery-max-active=3
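To verify that the OSDs actually picked up the injected values, you can query one directly (osd.0 is a placeholder; "ceph tell ... config get" is available on Nautilus and later):

ceph tell osd.0 config get osd_max_backfills
ceph tell osd.0 config get osd_recovery_max_active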

"ceph config set" also works with SES 6:
ceph config set osd osd_max_backfills 2
ceph config set osd osd_recovery_max_active 3

To set back to default run:
ceph config rm osd osd_recovery_max_active
ceph config rm osd osd_max_backfills
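To check what is currently stored in the centralized configuration database (anything removed with "ceph config rm" falls back to the compiled-in default):

ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active
# or list every option that has been set explicitly:
ceph config dump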

Setting the values too high can cause OSDs to restart, causing the cluster to become unstable.

Monitor with "ceph -s".
If OSDs start restarting, reduce the values.
If clients are impacted by the recovery, reduce the values.
To slow down recovery, reduce the values to the defaults.
When the cluster is healthy again, set the values back to the defaults.
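One way to watch for flapping OSDs while recovery runs (a sketch, not the only option):

# follow the cluster log live; OSDs being marked down/up will show here
ceph -w
# list any OSDs currently down
ceph osd tree | grep -w down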
 
