CEPH rebalancing soooo slooooowwwww

proxwolfe

Hi,

I am running a three-node hyper-converged PVE cluster where all three nodes are also Ceph nodes. Each node has an NVMe OSD, and together they back a Ceph pool for VM storage. There is a dedicated 10 GbE network for Ceph (and one for the PVE cluster).

For reasons still to be found, two of my OSDs went down and out. Since I couldn't get them back online, I destroyed them and recreated them. That went well.

Now, Ceph is rebalancing at around 10 MiB/s. Given that the OSDs are NVMes and the network is 10 GbE, I find this surprisingly slow. What might be the reason, and what could I do to speed things up?

Thanks!
 
Hi,

since you completely destroyed the old OSDs and added them again, the operation currently going on is called 'backfilling'.
Backfills are rate-limited so that they do not disturb the normal operation of the cluster. To increase the speed, you can modify the osd_max_backfills and osd_recovery_max_active parameters. Take a look at this article [1] for more information.

[1] https://www.thomas-krenn.com/en/wiki/Ceph_-_increase_maximum_recovery_&_backfilling_speed
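
For example, a rough sketch of how these could be changed (the values are only placeholders, pick something sensible for your cluster and set them back once recovery is done):

Code:
# persist higher limits in the cluster configuration database
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 8

# or inject them into the running OSD daemons directly
ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'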
 
Thanks - I wasn't aware of that limitation.

In my case, nothing is working at the moment anyway. So I guess there is no harm in trying.

Unfortunately, however, changing the values didn't work for me: for both parameters the setting was 1000. As per the article you linked, I tried setting them to 2000, but although no error was reported, when I checked the values again they were back at 1000.

Is 1000 the maximum already? Or what is going on?
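
For reference, the values can be checked e.g. like this (osd.0 is just an example ID; the tell variant shows what the running daemon is actually using, not just what is stored in the config database):

Code:
# value stored in the cluster configuration database
ceph config get osd osd_max_backfills

# value a specific running OSD daemon is actually using
ceph tell osd.0 config get osd_max_backfills
ceph tell osd.0 config get osd_recovery_max_active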
 
Oh great! Now the rebalancing is stuck at 92% and won't complete.

Code:
data:
    volumes: 1/1 healthy
    pools:   5 pools, 417 pgs
    objects: 525.47k objects, 2.0 TiB
    usage:   5.5 TiB used, 4.0 TiB / 9.6 TiB avail
    pgs:     1.199% pgs not active
             117735/1576395 objects degraded (7.469%)
             289 active+clean
             123 active+undersized+degraded
             5   undersized+degraded+remapped+backfill_toofull+peered

So when two of the three OSDs in the pool went down, Ceph probably started backfilling onto the one OSD that was still healthy, until that one became too full as well.

And now?
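
(A rough sketch of how one can see which OSDs are running into the full ratios; nothing here is specific to my cluster:)

Code:
# per-OSD utilization, shows which OSDs are close to the backfillfull/full ratios
ceph osd df tree

# lists the affected PGs/OSDs and the reason for the warning
ceph health detail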
 
Ah, found the error:

I have two classes of disks: nvme (for VM storage) and hdd (for data), each backing one pool. When recreating one of the failed nvme OSDs, I missed that it was not detected as nvme but classed as ssd. So it was not allocated to the nvme pool, and the nvme pool consisted of only two OSDs: one that was backfill_toofull and one that had filled up completely.

I destroyed the wrongly classed ssd OSD and recreated it as an nvme OSD. It was then automatically allocated to the nvme pool, and the rebalancing has started again. (Still slow.)
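
(Side note: if I read the Ceph docs correctly, destroying the OSD should not even be necessary for this; the device class can also be changed in place, e.g. for an OSD with ID 2:)

Code:
# drop the auto-detected class and assign the intended one
ceph osd crush rm-device-class osd.2
ceph osd crush set-device-class nvme osd.2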
 
...actually, I think recovery has become slower.

Do you happen to know what these parameters mean and how one would tweak them to speed up recovery?

Thanks!
 
I would check the network with e.g. iperf3, to rule that out.

But I cannot shake the suspicion that the problem is the physical disks.
What is the exact model number of those?
It might be an idea to test the performance of the disks in question with another (non-shared) filesystem on them (I would suggest ZFS, since its hardware requirements regarding disks are similar to Ceph's) and some benchmark tool(s), to verify that they do not have a performance problem in general.

Unfortunately, I have no experience with Ceph, so I do not know how to get any meaningful performance metrics to diagnose this.
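
Something along these lines (hostname/IP and device name are placeholders; the fio run is read-only, so it does not touch the data on the disk):

Code:
# on one Ceph node
iperf3 -s

# on another node, against the first node's IP on the Ceph network
iperf3 -c 10.10.10.1 -t 30

# raw sequential read from the NVMe, bypassing any filesystem
fio --name=seqread --filename=/dev/nvme0n1 --readonly --rw=read --bs=4M \
    --direct=1 --ioengine=libaio --runtime=30 --time_based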
 
It might be an idea to test the performance of the disks in question with another (non-shared) filesystem on them (I would suggest ZFS, since its hardware requirements regarding disks are similar to Ceph's) and some benchmark tool(s), to verify that they do not have a performance problem in general.
In order to do that, I would need to destroy one of them again instead of letting the pool rebalance, and this has been going on for too long already. By tomorrow morning, I need the cluster operational again. But I was going to replace the NVMes with larger ones in the coming weeks anyway. After that, I could test them under ZFS.

However, while this is no hard evidence: when I migrate disks between nodes, I reach between 200 and 400 MiB/s. So I don't think it is the network or the performance of those NVMes as such.
 
Edited.

I opened a new thread, as what I have now is essentially a new topic.
 
There is a new clue.

I added another disk to one of the nodes and it is currently rebalancing at speeds between 90 and 120 MiB/s. While still not as fast as migration, this is at least acceptable. My guess is that these speeds are achieved between the two disks within one node.

Now I am wondering why it is so much slower over the network. Could it be that rebalancing does not use the same network as migration does? Not all of the networks are 10 GbE; only the PVE cluster network and the Ceph network are. Could it be that rebalancing happens over another network (which would then be 1 GbE)?
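
(A quick way to check which networks Ceph actually uses; on PVE the configuration usually lives in /etc/pve/ceph.conf, and if I understand it correctly, recovery/replication traffic between OSDs goes over cluster_network while client traffic uses public_network:)

Code:
# which subnets Ceph is configured to use
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf

# which addresses the OSDs actually bind to (public and cluster address per OSD)
ceph osd dump | grep 'osd\.'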
 
I added a new NVMe OSD to each node (as the existing OSDs were too full and the pool was failing). When the rebalancing started, it briefly peaked at 2.7 GiB/s and then continued at between 400 MiB/s and 700 MiB/s. That was probably within a node, between two NVMes. But it shows that higher speeds are possible even in my cluster.

The closer the rebalancing gets to completion (now at 99.xy%), the slower it gets (currently approx. 15 MiB/s), but it had also dropped back into single-digit MiB/s territory before.

So maybe the rebalancing speed has to do with what is being rebalanced, and the closer it gets to completion, the slower it becomes (for whatever reason).
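
(For anyone else watching a recovery crawl towards the finish, the remaining work can be followed with something like this; nothing cluster-specific is assumed:)

Code:
# recovery/backfill throughput shows up in the "io:" section
watch -n 5 ceph -s

# PGs that are still not active+clean, together with their acting OSDs
ceph pg dump_stuck unclean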
 
