Ceph PG quantity - calculator vs autoscaler vs docs

Feb 6, 2025
I'm a bit confused about the autoscaler and PGs.

This cluster has Ceph 19.2.1, 18 OSDs, default 3/2 replicas and default target 100 PGs per OSD. BULK is false. Capacity is just under 18000G. A while back we set a target size of 1500G and we've been gradually approaching that, currently around 1300G SIZE (4.3 TB total Usage in GUI for all pools).

The autoscaler still has the pool set to 128 PG. Multiple calculators show a recommendation of 512 PG for the pool, using 100 PG/OSD, or 1024 with 200/OSD. Ceph says the default target is 100 PG per OSD but also immediately follows that sentence with, "For all but the very smallest deployments a value of 200 is recommended." What is a "very small deployment"?
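If I understand the docs correctly, the option behind that default is mon_target_pg_per_osd; I assume checking or raising it would look something like this:

Code:
ceph config get mon mon_target_pg_per_osd
ceph config set global mon_target_pg_per_osd 200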

The number of PGs per OSD varies with drive size (weight), of course, but most drives are about 1 TB and hold 30-35 PGs each, spread basically equally across nodes once disk size is factored in.
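Back-of-the-envelope for the current state (counting the cephfs pools too):

Code:
# (128 + 32 + 32 + 1) PGs x 3 replicas = 579 PG replicas
# 579 PG replicas / 18 OSDs ≈ 32 PGs per OSD on average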

By default the autoscaler only triggers if there is a 3x difference from its recommendation. Is that 384 or 512 in this case? Setting
Code:
ceph osd pool set threshold 2.0
as mentioned in the docs doesn't immediately change the recommendation. Does it take a while to recalculate? The NEW PG_NUM column remains blank.
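(For reference, I've been checking the current recommendation with:)

Code:
ceph osd pool autoscale-status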

On one hand, I understand that "auto" means one might just walk away :), but I am confused why the calculators give a much higher number. Is it just that the pool is only ~25% full and it hasn't gotten around to creating more PGs yet?

Thanks in advance.
 
Did you set a "target_ratio" for the pool? How many pools do you have?

Could you also please provide the output of the following two commands?

Code:
pveceph pool ls --noborder
ceph osd df tree

For the first command, make sure the terminal is rather wide, as any output that doesn't fit will be cut off.
 
Hi Aaron,

One pool, plus cephfs. No target ratio is set, only target size.

Code:
# pveceph pool ls --noborder
Name            Size Min Size PG Num min. PG Num Optimal PG Num PG Autoscale Mode PG Autoscale Target Size PG Autoscale Target Ratio Crush Rule Name               %-Used Used
.mgr               3        2      1           1              1 on                                                                   replicated_rule  2.3950073227752e-05 299642880
ceph_cluster2      3        2    128                        128 on                           1610612736000                           replicated_rule     0.24056188762188 3962973087090
cephfs_data        3        2     32                         32 on                                                                   replicated_rule  0.00210255035199225 26360107008
cephfs_metadata    3        2     32          16             16 on                                                                   replicated_rule 4.26416081609204e-05 533505361

[removed]

Thanks.
 
Okay, a few things I notice:

Different Device Classes: If I read it correctly, you have both HDDs and SSDs as OSDs. Currently all pools store their data on all OSDs, ignoring that difference. You could create device-class specific rules and assign them to the pools to control which pools end up on the slow HDDs and which on the fast SSDs.

For example:
Code:
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd crush rule create-replicated replicated_hdd default host hdd
Make sure that all pools (even .mgr) are assigned to a device-class rule and that the default "replicated_rule" is not used at all anymore.
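For example, assigning a pool to one of the new rules would look something like this (using the pool names from your output):

Code:
ceph osd pool set ceph_cluster2 crush_rule replicated_ssd
ceph osd pool set .mgr crush_rule replicated_ssd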

More details are in the Ceph docs: https://docs.ceph.com/en/latest/rados/operations/crush-map/#device-classes

BUT: this would require each device class to be available on at least 3 nodes, and the HDDs are only present on 2 nodes if I see that correctly. So unless you change that, ignore the above ;)

Use Target Ratios:
They are more flexible than target_size when telling the autoscaler how much space you expect the pools to consume. I would assume, given that you split the pools into HDD and SSD rules, that the ceph_cluster2 pool will end up on the SSDs? If it is the only pool in that device class, you can give it any ratio: the autoscaler will then assume it is expected to consume all the space in that class and will calculate the pg_num accordingly.

For the other pools:
  • .mgr doesn't need any, it will stay with its 1 PG
  • cephfs_data will most likely grow, depending on what you store on the CephFS, and should get a ratio.
  • cephfs_metadata will most likely stay tiny, so there's no need to give it a target_ratio.
If you have multiple pools in the same device class that will get a target_ratio, keep in mind that it is a ratio. To make it easier for us humans, I usually recommend assigning values that add up to 100 (or to 1.0) in total and splitting accordingly. This way, the ratios map directly to percentages.
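For example, something along these lines (the actual values are up to you):

Code:
ceph osd pool set ceph_cluster2 target_size_ratio 95
ceph osd pool set cephfs_data target_size_ratio 5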

Once done, the autoscaler should show higher "Optimal PG Num" values that will result in higher PGs / OSD.

Differently sized OSDs: the OSDs vary quite a bit in size, from 1700 GiB down to 200 GiB within the same host. Such large discrepancies lead to a rather uneven allocation of data: larger OSDs store proportionally more data, which in turn means they see more read and write load than the smaller OSDs. This could make them a bottleneck.

You also need to consider how bad it will be if one of the large OSDs fails. Is there enough space in the remaining cluster to recover the data without running out of space?
 
Thanks! Setting a target ratio of 98 on the RBD (with threshold still set to 2.0) immediately changed the PG recommendation from 128 to 512. I then set cephfs_data to 2 so it also had one set...it is ISO storage so will mostly be unused.

So even though the target ratio "takes precedence" over a target size, if it's not set it still seems to affect the autoscaler calculation. The docs just say either can be set ("Setting the Target Size or Target Ratio advanced parameters helps the PG-Autoscaler to make better decisions").
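I assume the old target size can be cleared by setting it back to zero, e.g.:

Code:
ceph osd pool set ceph_cluster2 target_size_bytes 0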

Still, 512 PG is a far cry from Ceph's recommended 200 * 18 = 3600...which would mean either 2048 or more likely 4096 PG.

I'm aware of the drive size and type differences and the space needs. Some of the storage is repurposed from our previous Virtuozzo cluster storage. The lone 200 GB was a leftover cache drive; we used the others for DB/WAL on the HDDs. It was a throw-in on the last server we added...seemed better to use it than remove it (I actually asked here about it). The HDDs will be replaced eventually. It's probably easier to leave them, otherwise I'd want more storage in those nodes anyway. The HDDs have primary affinity set off/zero. Unless wear is dramatically higher in Ceph than Virtuozzo I'm not worried about that based on the last 10 years of usage and the wear amounts shown on those drives in PVE, but will keep an eye on it.
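(For reference, primary affinity was set per OSD with something like the following; the OSD IDs here are just placeholders:)

Code:
ceph osd primary-affinity osd.10 0
ceph osd primary-affinity osd.11 0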

Re: space/recovery, yes, though at 24% usage it's not a concern currently, and the expectation would be to add nodes and OSDs if usage grows. We can do that relatively easily.
 
I eventually realized the counting differs according to the replication setting: 512 PGs total would be replicated by 3, thus 1536 PGs "on disk", which is much closer to what it is now. 200 * 18 = 3600 divided by 3 is 1200 PGs total, which is still more than 512, but closer. 1800 divided by 3 is 600 PGs total.
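Rough arithmetic, assuming the autoscaler rounds to the nearest power of two:

Code:
# 100 PG/OSD x 18 OSDs / 3 replicas = 600  -> nearest power of two: 512
# 200 PG/OSD x 18 OSDs / 3 replicas = 1200 -> nearest power of two: 1024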

Overall it seems the problem was using target size instead of target ratio.
 
ceph osd df tree
You're concerned with optimal PG count while your cluster is lopsided. You have two nodes with HDDs, two nodes with a lot of SSD capacity and two nodes with too little. Any HDD device-class rule would not be able to satisfy a replication:3 rule, and an SSD device-class rule would end up not being able to sustain the loss of a "large" node. Until you have AT LEAST 3 nodes for EACH device class with EQUAL total OSD capacity, there's really not much point in obsessing about optimal PG count.
 
Not really obsessing, just trying to understand the reasoning behind ~30, 100, and 200 PG per OSD. Whether or not the above “solution” is a bug or should be clarified in the Proxmox documentation is maybe an open question.

The issues you mention are a result of separating the device classes. Since the HDDs will be replaced at some point, that effort seems suboptimal. But that's not my question here.
 
Any solution is use-case dependent, which is why this is left to you (the operator) to define, and you can find multiple documents making what seem to be contradictory recommendations.

More PGs per OSD means more granularity, meaning better seek latency at the cost of more system RAM and CPU load, as well as more or less write amplification depending on the actual asset size. Fewer PGs per OSD means the reverse, but in practice the actual impact is very use-case dependent, more so than just block-size write granularity: databases have a different impact than video files.

I would suggest not spending any more effort on this until you have your final OSD layout, as well as an understanding of what you are using the storage for. Remember, you can have multiple pools with different rules using the same OSDs. Oh, and you REALLY don't want to mix SSD and HDD devices in the same crush rule - it'll make the whole thing crawl.
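For example, a hypothetical sketch (made-up pool names; both pools share the same SSD OSDs via a device-class rule like the one mentioned earlier, just with different settings):

Code:
ceph osd pool create vm_fast 128 128 replicated replicated_ssd
ceph osd pool create scratch 32 32 replicated replicated_ssd
ceph osd pool set scratch size 2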