Ceph Questions and Thoughts

troycarpenter

Renowned Member
Feb 28, 2012
Central Texas
Recently I combined two separate Proxmox clusters into one. Each cluster previously had its own three-node Ceph cluster with 10 OSDs per node. Earlier this week I finally added the three new nodes and their OSDs to the converged cluster. All nodes are running Proxmox 8.1.11 (I see 8.2 is now available) with Ceph Reef.

What that means for Ceph is that the cluster doubled in capacity, from 3 nodes/30 OSDs to 6 nodes/60 OSDs. The rebalancing from that operation is still in progress, but the percentage of misplaced objects is now below 10%.
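(The misplaced percentage I'm quoting is just the one from the normal status output, e.g.:)
Code:
# recovery/backfill progress, including the misplaced object percentage
ceph -s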

In the meantime, I'm trying to better understand PGs. After a lot of reading and recommendations, I have a slightly better grasp of the basics, but some questions remain.

After initial combining, I had this from the autoscale-status:
Code:
root@amazon:~# ceph osd pool autoscale-status
POOL                   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK   
lab-vm-pool          14442G                3.0        235.5T  0.1797                                  1.0    1024              warn       False 
lab-cephfs_data      80902M                3.0        235.5T  0.0010                                  1.0     128              off        False 
lab-cephfs_metadata  209.3M                3.0        235.5T  0.0000                                  1.0      32              warn       False 
.mgr                 217.3M                3.0        235.5T  0.0000                                  1.0       1              on         False
The warning I got in this state was that my PG count of 1024 was too high and should be 256. However, I've read enough to know that the autoscaler bases that recommendation on current usage and on the lack of "guidance" about how big the pools are expected to get (by the way, all pools are 3/2).
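(The 3/2 replication and the current PG count can be double-checked per pool with something like this:)
Code:
# sanity-check replication and PG count on a pool
ceph osd pool get lab-vm-pool size       # number of replicas (3)
ceph osd pool get lab-vm-pool min_size   # minimum replicas for I/O (2)
ceph osd pool get lab-vm-pool pg_num     # current PG count (1024)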

The two main pools here are lab-vm-pool, where VM images are stored and which should take up the majority of the space, and lab-cephfs_data, where mainly ISOs and other ancillary data are stored. As far as I can tell, the other two pools are auto-generated and are in the "storage noise", so to speak.

I then decided to set the ratios/sizes of the two main pools. Ideally I want lab-vm-pool to get 90% of the available space, with everything else in the other 10%. To start, though, I set the target size of the lab-cephfs_data pool to 1000G and tried to set the lab-vm-pool target ratio to 90%. I couldn't work out whether I was supposed to enter 90.0 or 0.9 in that field; I never found any concrete guidance, and the few disparate sources I did find used both scales. In the end, I edited the lab-vm-pool pool from the GUI and used the value 0.9. I then got this:

Code:
root@amazon:~# ceph osd pool autoscale-status
POOL                   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK   
lab-vm-pool          14467G                3.0        235.5T  0.9876        0.9000           0.9876   1.0    1024              warn       False 
lab-cephfs_data      80902M        1000G   3.0        235.5T  0.0124                                  1.0     128              off        False 
lab-cephfs_metadata  209.3M                3.0        235.5T  0.0000                                  1.0      32              warn       False 
.mgr                 217.3M                3.0        235.5T  0.0000                                  1.0       1              on         False
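For reference, I believe the CLI equivalent of what I set would be along these lines (the absolute size via target_size_bytes and the ratio via target_size_ratio):
Code:
# presumably equivalent to the GUI edits above
ceph osd pool set lab-cephfs_data target_size_bytes 1000G
ceph osd pool set lab-vm-pool target_size_ratio 0.9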
Instantly, the autoscaler switched its warning from too many PGs (1024 vs 256) to too few PGs (1024 vs 2048). I know that's based on the projected future capacity, and I don't really feel the need to adjust it now.
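If I understand the autoscaler's math correctly, both recommendations fall out of a target of 100 PGs per OSD (this is an assumption on my part, i.e. mon_target_pg_per_osd left at its default):
Code:
ceph config get mon mon_target_pg_per_osd   # per-OSD PG target, default 100
# before guidance: 0.1797 * (60 OSDs * 100) / 3 replicas ~= 359  -> nearest power of two = 256
# after guidance:  0.9876 * (60 OSDs * 100) / 3 replicas ~= 1975 -> nearest power of two = 2048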

In the end, should I just remove all the ratio guidance and go back to leaving it blank? We've reached a somewhat static state for this storage cluster, in that we haven't really been increasing disk usage (when we do, it's usually because a VM needs a disk image expanded). Our problem in the past has been performance, not capacity (part of the reason to add more OSDs was to spread operations across more devices). On average, the number of PGs per OSD has dropped to around 20 (for completeness, all storage nodes use two 100Gb links lagged together).
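(The ~20 PGs per OSD figure is from the PGS column of the OSD utilization output, e.g.:)
Code:
# the PGS column shows how many PGs each OSD carries
ceph osd df tree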

So, the two questions I'm asking, in case you got lost:
  1. For the target ratio, should it be expressed as the actual number (like 0.9) or as the percentage number (like 90)?
  2. Should I just not worry about any of this ratio mess, turn off the auto-scaler warnings, and go about my business?
Thanks!
 
1. The target ratio is expressed as 0.9 (for 90%). You did it right.
2. I'd recommend setting the autoscaler to on and letting it do its magic (commands sketched below) - except when
  • you want to put lots of data on your pools in the next few days, or
  • your storage usage varies heavily within short time frames.
That leads to up- or downscaling of PGs, and since you've suffered from performance issues in the past, this may be something you want to do over a weekend, so that the distribution/backfilling/recovery happens when your storage isn't too busy.
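If you do turn it on, something along these lines should work (pool names taken from your output; .mgr is already on):
Code:
ceph osd pool set lab-vm-pool pg_autoscale_mode on
ceph osd pool set lab-cephfs_data pg_autoscale_mode on
ceph osd pool set lab-cephfs_metadata pg_autoscale_mode on
# or set the mode to "off" per pool if you'd rather just silence the warnings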
 
