CEPH rebalancing soooo slooooowwwww

proxwolfe · Jan 26, 2023

Hi,

I am running a three-node hyper-converged PVE cluster where all three nodes are also CEPH nodes. Each node has an NVMe OSD organized in a CEPH pool for VM storage. There is a dedicated 10GbE network for CEPH (and one for the PVE cluster).

For reasons still to be found, two of my OSDs went down and out. Since I couldn't get them back online, I destroyed them and recreated them. That went well.

Now, CEPH is rebalancing at around 10 MiB/s. Given that the OSDs are NVMes and the network is 10GbE, I find this surprisingly slow. What might be the reason, and what could I do to speed things up?

Thanks!

Lukas Wagner · Jan 26, 2023

Hi,

since you completely destroyed the old OSDs and added them again, the operation currently going on is called 'backfilling'.
Backfills are rate-limited to not disturb the normal operation of a cluster. To increase the speed, you can modify the osd-max-backfills and osd-recovery-max-active parameters. Take a look at this article [1] for more information about this.

[1] https://www.thomas-krenn.com/en/wiki/Ceph_-_increase_maximum_recovery_&_backfilling_speed
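
For reference, a minimal sketch of how those two options could be raised through Ceph's central config database. It is written in Python only so that the commands can be assembled and inspected before anything is run. The values 4 and 8 are illustrative assumptions, not recommendations from this thread, and `build_recovery_commands` is a hypothetical helper name.

```python
import shlex


def build_recovery_commands(max_backfills: int, recovery_max_active: int) -> list:
    """Return `ceph config set` argv lists for the two options named above."""
    settings = {
        "osd_max_backfills": max_backfills,
        "osd_recovery_max_active": recovery_max_active,
    }
    # "osd" as the target applies the setting to all OSD daemons.
    return [
        ["ceph", "config", "set", "osd", name, str(value)]
        for name, value in settings.items()
    ]


if __name__ == "__main__":
    # Dry run: only print the commands (values are assumed, see above).
    for argv in build_recovery_commands(4, 8):
        print(shlex.join(argv))
```

Reading a value back afterwards with `ceph config get osd osd_max_backfills` shows whether the change was actually kept.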

proxwolfe · Jan 26, 2023

Thanks - I wasn't aware of that limitation.

In my case, nothing is working at the moment anyway. So I guess there is no harm in trying.

Unfortunately, changing the values didn't work for me. Both parameters were set to 1000. Following the article you linked, I tried setting them to 2000; no error was reported, but when I checked the values again, they were back at 1000.

Is 1000 the maximum already? Or what is going on?

Neobin · Jan 26, 2023

Maybe this helps:
https://forum.proxmox.com/threads/ceph-osd_max_backfills-being-overridden-after-changing-it.120608

proxwolfe said:
For reasons still to be found, two of my OSDs went down and out. Since I couldn't get them back online, I destroyed them and recreated them.

proxwolfe said:
Now, CEPH is rebalancing at around 10 MiB/s.

Are you sure that the physical disks are okay?

proxwolfe · Jan 26, 2023

Neobin said:
Are you sure that the physical disks are okay?

The SMART test is shown as passed. And I am not getting read or write errors (it's just taking very long). So I am guessing they are.

proxwolfe · Jan 26, 2023

Oh great! Now the rebalancing is stuck at 92% and won't complete.

Code:

data:
    volumes: 1/1 healthy
    pools:   5 pools, 417 pgs
    objects: 525.47k objects, 2.0 TiB
    usage:   5.5 TiB used, 4.0 TiB / 9.6 TiB avail
    pgs:     1.199% pgs not active
             117735/1576395 objects degraded (7.469%)
             289 active+clean
             123 active+undersized+degraded
             5   undersized+degraded+remapped+backfill_toofull+peered

So when two of the three OSDs in one pool failed, Ceph probably started backfilling onto the one OSD that was still healthy, until that one became full.

And now?
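
The percentages in that status output can be reproduced from the raw counts, which helps when judging how much of the cluster is actually blocked. A small Python sketch follows; the variable names are illustrative, the numbers are copied from the output above, and the reading of the object total as "objects times three replicas" is an inference from the figures, not something the output states.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to three decimals."""
    return round(100.0 * part / whole, 3)


# Figures copied from the status output above.
degraded_copies = 117735
total_copies = 1576395   # likely objects x replicas (see note below)
inactive_pgs = 5         # the backfill_toofull+peered PGs
pg_states = {"active+clean": 289, "active+undersized+degraded": 123,
             "backfill_toofull+peered": inactive_pgs}
total_pgs = sum(pg_states.values())

print(total_pgs)                               # 417
print(percent(degraded_copies, total_copies))  # 7.469
print(percent(inactive_pgs, total_pgs))        # 1.199
print(total_copies // 3)                       # 525465, i.e. about 525.47k
```

Only the 5 PGs that are peered but not active refuse I/O. The 123 undersized and degraded PGs are still active, just with fewer copies than configured.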

proxwolfe · Jan 26, 2023

Ah, found the error:

I have two classes of disks: nvme (for VM storage) and hdd (for data), each forming one pool. When recreating one of the failed nvme OSDs, I missed that it was not detected as nvme but classed as ssd. So it was not reallocated to the nvme pool. As a result, the nvme pool consisted of only two OSDs: one that was toofull and one that had filled up completely.

I destroyed the wrongly classed ssd osd and recreated it as nvme osd. Then it was automatically reallocated to the nvme pool and now the rebalancing has started again. (Still slow).

alexskysilk · Jan 26, 2023

proxwolfe said:
I destroyed the wrongly classed ssd osd and recreated it as nvme osd. Then it was automatically reallocated to the nvme pool and now the rebalancing has started again. (Still slow).

You don't need to do that. You can always set it after the fact:

ceph osd crush set-device-class [type] osd.[number]
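
One detail worth knowing: an OSD that already has a device class assigned generally needs the old class removed first, because `set-device-class` does not overwrite an existing one. A minimal Python sketch that assembles both commands in order follows. The OSD id 2 and the helper name `build_reclass_commands` are hypothetical, chosen only for illustration.

```python
import shlex


def build_reclass_commands(osd_id: int, new_class: str) -> list:
    """Return the argv lists to move one OSD to a different device class."""
    osd = f"osd.{osd_id}"
    return [
        # Step 1: drop the class that was auto-detected (e.g. ssd).
        ["ceph", "osd", "crush", "rm-device-class", osd],
        # Step 2: assign the intended class (e.g. nvme).
        ["ceph", "osd", "crush", "set-device-class", new_class, osd],
    ]


if __name__ == "__main__":
    # Dry run: only print the commands for a hypothetical osd.2.
    for argv in build_reclass_commands(2, "nvme"):
        print(shlex.join(argv))
```

Running `ceph osd tree` afterwards shows the class in its CLASS column, which is a quick way to confirm the OSD ended up where intended.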

proxwolfe · Jan 26, 2023

Even better - thanks for the tip! Will keep that in mind.

proxwolfe · Jan 26, 2023

Neobin said:
Maybe this helps:
https://forum.proxmox.com/threads/ceph-osd_max_backfills-being-overridden-after-changing-it.120608

I hadn't seen that part of your post before and now I gave it a try.

Using the parameters from that other thread

osd_mclock_scheduler_background_recovery_res=1
osd_mclock_scheduler_background_recovery_lim=100

didn't change anything for me, unfortunately. Still recovering at approx. 10 MiB/s.

proxwolfe · Jan 26, 2023

...actually, I think recovery has become slower.

Do you happen to know what these parameters mean and how one would tweak them to speed up recovery?

Thanks!

Neobin · Jan 26, 2023

I would check the network with, e.g., iperf3 to rule that out.

But I cannot shake the suspicion that the problem is the physical disks.
What is the exact model number of those?
It might be an idea to test the performance of the disks in question with another (non-shared) filesystem on them (I would suggest ZFS, since it has similar hardware requirements for disks as Ceph) and some benchmark tool(s), to verify that they do not have a performance problem in general.

Unfortunately, I have no experience with Ceph, so I do not know how to get any meaningful performance metrics to diagnose this.

proxwolfe · Jan 26, 2023

Neobin said:
It might be an idea to test the performance of the disks in question with another (non-shared) filesystem on them (I would suggest ZFS, since it has similar hardware requirements for disks as Ceph) and some benchmark tool(s), to verify that they do not have a performance problem in general.

In order to do that, I would need to destroy one of them again, instead of having the pool rebalanced. This has been going on for too long. By tomorrow morning, I need the cluster operational again. But I was going to replace the nvmes with larger ones in the coming weeks anyway. After that, I could test them under ZFS.

However, while this is no hard evidence, when I migrate disks between nodes, I reach between 200 and 400 MiB/s. So I don't think it is the network or the performance of those nvmes as such.

proxwolfe · Jan 27, 2023

Edited.

I opened a new thread, as what I have now is essentially a new topic.

proxwolfe · Jan 27, 2023

There is a new clue.

I added another disk in one of the nodes and it is currently rebalancing at speeds between 90 and 120 MiB/s. While still not as fast as migration, this is at least acceptable. My guess is that these speeds are achieved between the two disks in one node.

Now I am wondering why it is so much slower over the network. Could it be that rebalancing does not use the same network as migration does? Not all of the networks are 10GbE; only the PVE cluster network and the Ceph network are. Could it be that rebalancing happens over another network (which would then be 1GbE)?
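
To put the observed rates next to what each link type can carry at most, here is a small Python sketch of the arithmetic. It computes theoretical ceilings only and ignores protocol overhead, so real throughput will be somewhat lower. The function name is illustrative.

```python
MIB = 1024 ** 2  # bytes per MiB


def line_rate_mib_per_s(gigabits_per_second: float) -> float:
    """Theoretical ceiling of a link in MiB/s, ignoring protocol overhead."""
    bytes_per_second = gigabits_per_second * 1e9 / 8
    return bytes_per_second / MIB


for name, gbit in (("1GbE", 1), ("10GbE", 10)):
    print(f"{name}: {line_rate_mib_per_s(gbit):.0f} MiB/s")
# 1GbE: 119 MiB/s
# 10GbE: 1192 MiB/s
```

The 90 to 120 MiB/s seen above sits close to the 1GbE ceiling, which is consistent with, but does not prove, traffic over a 1GbE link. The earlier 10 MiB/s is far below either ceiling, so link speed alone would not explain it. Ceph sends replication and backfill traffic between OSDs over the `cluster_network` if one is defined in ceph.conf and over the `public_network` otherwise, so those two settings show which link is in use.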

proxwolfe · Jan 29, 2023

I added a new NVMe OSD to each node (as the existing OSDs were too full and the pool was failing). When the rebalancing started, it briefly peaked at 2.7 GiB/s and continued at between 400 MiB/s and 700 MiB/s. That was probably within a node and between two NVMes. But it shows that higher speeds are possible even in my cluster.

The closer the rebalancing gets to completion (now at 99.xy%), the slower it gets (currently approx. 15 MiB/s), and it has already dropped into single-digit MiB/s territory again at times.

So maybe the rebalancing speed has to do with what is being rebalanced, and the closer it is to completion, the slower it gets (for whatever reason).
