Increasing Ceph PG count / Experience report on performance impact and duration

Apr 22, 2025
Dear all,

I am about to increase the number of placement groups (PGs) on our 6-node cluster (soon to be expanded by a seventh node). Since I know that rebalancing is very demanding on overall cluster performance, I would like to ask about your experiences regarding how long this procedure "usually" takes. Our setup is as follows:

  • 6 nodes with 7 OSDs each (totalling 42 OSDs, a mix of Micron 7400 and Micron 7450 NVMe drives)
    • all NVMe drives are of the same size (3.2 TB)
  • Ceph is set up with a single pool for VM disks, size/min_size is 3/2
    • Usage of the pool is at 10% (12.7 TiB of 122 TiB)
  • Currently 5 monitors and 6 managers
  • All nodes are redundantly connected to a 25 GBit/s network / VLAN (dedicated network for Ceph only)
    • no bottlenecks so far, iperf3 is sitting happily at ~ 22 GBit/s
  • Server hardware (ASUS): 2x AMD EPYC 7453 28-core processors per node (112 threads) and 512 GB of ECC RAM
  • Ceph is reef (18.4.2), PVE is 8.4.1, all packages coming from the enterprise repos.

Disclaimer:

  1. I guess that, since I forgot to specify a target ratio when I installed the cluster, the autoscaler wouldn't do anything. We currently have 128 PGs (I know, not optimal), giving somewhere between 8 and 12 or so PGs per OSD (again, not optimal). The autoscaler is recommending 256 PGs as optimal (for now). I have read that for around ~50 OSDs you should plan for 2048 PGs (see the commands sketched after this list).
  2. We don't expect "radical" storage growth in the future, i.e. this is really just VM storage (for operating system disks, no huge file storage etc.)
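
For reference, here is roughly what I plan to run to check what the autoscaler sees and to give it a target ratio this time around ("vm-pool" below is just a placeholder for our actual pool name):

  # show the autoscaler's view of each pool (current PG_NUM, suggested NEW PG_NUM, AUTOSCALE mode)
  ceph osd pool autoscale-status

  # optionally tell the autoscaler how much of the raw capacity this pool is expected to use
  ceph osd pool set vm-pool target_size_ratio 1.0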

So, tl;dr:

  1. Increasing the PG count from 128 -> 256, I am thinking of doing this on a Friday evening or some upcoming bank holiday: what is the estimated time for rebalancing? More like in the "a few hours" or in the "up to 2 days" range?
  2. Is it recommended to do the increments in smaller steps (reaching a "power of 2" number of PGs) or is it better to do the leap at once, so let's say:
    1. 128 -> 256 -> 512 -> 1024 -> 2048 or
    2. 128 -> 2048


Thank you very much in advance and best regards,


Alex
 
  1. Is it recommended to do the increments in smaller steps (reaching a "power of 2" number of PGs) or is it better to do the leap at once, so let's say:
    1. 128 -> 256 -> 512 -> 1024 -> 2048 or
    2. 128 -> 2048
I've recently done the same on a 6-node Ceph cluster (6x 3.84 TB NVMe per node). I did

128 -> 256 -> 512 -> 1024; the rebalancing took around 30-45 minutes each time.
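
Roughly what I ran for each step (pool name is just a placeholder, adjust it to yours):

  # raise the PG count one power of two at a time
  ceph osd pool set vm-pool pg_num 256
  # watch until rebalancing is done, then repeat with 512 and 1024
  watch ceph -s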
 
Increasing the PG count from 128 -> 256, I am thinking of doing this on a Friday evening or some upcoming bank holiday: what is the estimated time for rebalancing? More like in the "a few hours" or in the "up to 2 days" range?
With your current utilization, I estimate around 12 hours, assuming you have a decent node-interconnect (25G / 40G / 100G)

Is it recommended to do the increments in smaller steps (reaching a "power of 2" number of PGs) or is it better to do the leap at once, so let's say:
  1. 128 -> 256 -> 512 -> 1024 -> 2048 or
  2. 128 -> 2048
That's actually a tricky question. Since Ceph Nautilus, if you have default settings, Option 1) or 2) will not really make a difference.

This is because changing the pg_num or pgp_num will not immediately change the actual value as it did before.
(In case you wonder: pg_num is the number of placement groups, but data won't actually move / rebalance until pgp_num is adjusted)

How it was before Nautilus: You change pg_num and it's applied immediately. Then you change pgp_num and the data rebalancing process begins. This could lead to a huge percentage of misplaced data for large jumps (e.g. 128 -> 256). In prod, you would do smaller increments to not affect prod workloads.
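
As a rough sketch, that pre-Nautilus two-step looked something like this (pool name is a placeholder):

  # old behaviour: both values were applied immediately
  ceph osd pool set vm-pool pg_num 256    # PGs were split right away
  ceph osd pool set vm-pool pgp_num 256   # data started rebalancing immediately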


How it is now:
If you set the number of placement groups via the Proxmox UI, it will adjust pg_num, which changes the internal pg_num_target / pgp_num_target values.
The same happens if you adjust pg_num via the CLI.
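
You can watch that convergence yourself, roughly like this (pool name is a placeholder):

  # request the new value - internally this only raises pg_num_target / pgp_num_target
  ceph osd pool set vm-pool pg_num 1024
  # the actual pg_num / pgp_num then creep towards the *_target values over time
  ceph osd pool ls detail | grep vm-pool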

There is a nice blog article from Ceph:
https://ceph.io/en/news/blog/2019/new-in-nautilus-pg-merging-and-autotuning/

Money quotes:
an internal pg_num_target value is set indicating the desired value for pg_num, and the cluster works to slowly converge on that value
More importantly, the adjustment of pgp_num to migrate data and (eventually) converge to pg_num is done gradually to limit the data migration load on the system based on the new target_max_misplaced_ratio config option (which defaults to .05, or 5%). That is, by default, Ceph will try to have no more than 5% of the data in a "misplaced" state and queued for migration, limiting the impact on client workloads.

So, with default settings, it will not really make a difference whether you do option 1) (i.e. step manually 128 -> 256 -> 512, etc.) or option 2) (directly to 2048).

Ceph will do small increments on itself (e.g. 128 -> 135 -> 142 -> ...), making sure that the percentage of misplaced data does not exceed the target_max_misplaced_ratio config option, which is 5% by default.
That means that data rebalancing will potentially run for a long time (a week or longer), because Ceph tries to limit the impact and many pieces of data are moved more than once.

This is good if your cluster has constant workload, but bad if you want to get it done or you're planning large jumps!

So, my recommendation is the following.
Option 1: Directly go to 1024. Ceph will slowly converge towards that value, but it can take a long time and is less efficient (the same data will be moved multiple times). Impact on production would be smaller.

Option 2: Set target_max_misplaced_ratio to 100% and directly go to 1024. This will mean a massive impact on your production workload, but if you can have a maintenance window, this is fine.
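
A rough sketch of what option 2 could look like during a maintenance window (pool name and target pg_num are placeholders; remember to restore the ratio afterwards):

  # check the current limit (default is 0.05, i.e. 5%)
  ceph config get mgr target_max_misplaced_ratio
  # allow up to 100% misplaced data, then jump straight to the target
  ceph config set mgr target_max_misplaced_ratio 1
  ceph osd pool set vm-pool pg_num 1024
  # once the cluster is back to HEALTH_OK, restore the default
  ceph config set mgr target_max_misplaced_ratio 0.05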

To speed up the rebalancing, you can also set the mClock profile to high_recovery_ops, or use a custom mClock profile in which you increase osd_max_backfills (this can massively speed up the operation).
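
Roughly like this on Reef with the mClock scheduler (the values are just examples; revert them after the maintenance window):

  # prioritise recovery/backfill over client I/O
  ceph config set osd osd_mclock_profile high_recovery_ops
  # to raise osd_max_backfills as well, mClock first has to be allowed
  # to accept manual recovery settings
  ceph config set osd osd_mclock_override_recovery_settings true
  ceph config set osd osd_max_backfills 4
  # afterwards, drop the overrides to return to the defaults
  ceph config rm osd osd_max_backfills
  ceph config rm osd osd_mclock_override_recovery_settings
  ceph config rm osd osd_mclock_profile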
 
@FrankyT - thank you very much for sharing. I just did the leap from 128 to 256 pgs - that took around 110 minutes.


@Robin C. - thanks a lot for the in-depth explanation, that makes sense. I noticed the ~5% ratio of misplaced objects while monitoring with "ceph -s" - I didn't know about target_max_misplaced_ratio until now, that's good to know. That was really helpful, thanks again!

Best regards,

Alex
 
Can I ask, why not use the autoscaler’s recommendation? What’s the advantage of going higher?
As far as I understand, the autoscaler's initial recommendation is based on the amount of data currently stored.

If the amount of data continues to grow, the autoscaler could continually adjust the number of PGs, thus affecting performance in productive operation.
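
If that is a concern, one option (just a sketch, pool name is a placeholder) is to switch the autoscaler to warn-only mode, so it reports a recommendation but never changes pg_num on its own:

  # autoscaler only warns instead of adjusting pg_num itself
  ceph osd pool set vm-pool pg_autoscale_mode warn
  # review its recommendation whenever it suits you
  ceph osd pool autoscale-status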

When I first started with one of my clusters, we had around 36 OSDs and only 32 PGs on the pool with the autoscaler on (I didn't set a target ratio and so on).