Ceph Octopus upgrade notes - Think twice before enabling auto scale

Jun 8, 2016
334
61
48
45
Johannesburg, South Africa
Hi,

We've been working through upgrading individual nodes to PVE 6.3 and extended this to now include hyper converged Ceph cluster nodes. The upgrades themselves went very smoothly but the recommendation around setting all storage pools to autoscale can cause some headaches.

The last paragraph in the Ceph Nautilus to Octopus upgrade notes (https://pve.proxmox.com/wiki/Ceph_Nautilus_to_Octopus) references the following command:
ceph osd pool set POOLNAME pg_autoscale_mode on

or, to apply to all pools:
Code:
for f in `ceph osd pool ls`; do ceph osd pool set $f pg_autoscale_mode on; done


The algorithm holds back from flipping placement group numbers by only actioning a change when the current value is more than a factor of 3 away from what it recommends. Whilst this sounds great, it does mean that the resulting changes are relatively massive (typically by a factor of 4). The unexpected consequence of large changes is that the cluster will be in a degraded state for a considerable amount of time and subsequently not trim the monitor database stores.
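The factor-3 threshold can be illustrated with a rough sketch (this is only an illustration of the behaviour described above, not Ceph's actual implementation):

```python
# Rough sketch of the autoscaler's factor-3 threshold (illustrative,
# not Ceph's actual code): a pg_num change is only actioned when the
# current value is more than 3x away from the recommended target, and
# new values are powers of two, hence the typical jump by a factor of 4.

def should_adjust(current_pg, target_pg, threshold=3.0):
    """Return True when current and target differ by more than the threshold."""
    ratio = current_pg / target_pg
    return ratio > threshold or ratio < 1.0 / threshold

# 256 PGs with a recommended 64: 256/64 = 4 > 3, so the pool is resized.
print(should_adjust(256, 64))   # True
# 128 PGs with a recommended 64: 128/64 = 2 < 3, so it is left alone.
print(should_adjust(128, 64))   # False
```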

This may lead to the /var/lib/ceph/mon/*/store.db directory growing to consume all available space and subsequently freeze all I/O. Attempting to compact the monitors and simply restarting them didn't help. Really annoying that Ceph doesn't trim the monitor databases after each incremental change...


Is anyone aware of a command to force truncating unnecessarily old revisions in the monitor databases in these situations? We resorted to temporarily attaching SSDs in USB 3 casings, stopping the monitor, copying the content of the 'store.db' directory over and then mounting the drive on that directory (don't forget to change ownership of /var/lib/ceph/mon/*/store.db after mounting).

The Ceph monitor directory is typically 1 GiB but grew to over 65 GiB whilst reducing placement groups from 256 to 64 on an 18 OSD cluster.


Whilst I've read that Ceph won't trim the monitor store osdmap revisions whilst unhealthy, the store.db directory shrank from 65 GiB to 3 GiB once we'd transferred it to a larger volume and restarted the monitor service. We had tried restarting the monitor process before moving the content and this hadn't changed anything; it's almost as if running out of space crashed the monitors in a state where they needed a relatively tiny amount of additional space to subsequently trim old data. Running manual monitor compacts (ceph tell mon.[ID] compact) typically rewrites the database files, temporarily requiring twice the space they did initially, but this did not happen when starting the monitors after transferring the 'store.db' directory to a larger volume following an out-of-space crash.

We subsequently immediately stopped the monitor process and transferred the directory back to its original volume, as the NVMe volume was substantially faster. We ended up repeating this cycle three times, but no matter how many times we tried to compact the database or restart the monitors (yes, we even added the 'mon compact on start = true' switch to ceph.conf), the storage utilisation would grow to nearly double before reducing to virtually the same starting size. If, however, we waited for a monitor to run out of space and then repeated the above, the databases shrank when subsequently started with available space.


Regards
David Herselman
 

Alwin

Proxmox Staff Member
Staff member
Aug 1, 2017
4,617
441
88
Thanks for the feedback. I changed the autoscale section to be more explicit about the implication.
 

RokaKen

Active Member
Oct 11, 2018
170
42
33
USA
I experimented with the autoscale feature after my Luminous to Nautilus upgrade. While my experience was far less traumatic than that of @David Herselman, I found that it completely under-calculated the proper number of PGs for data durability and object distribution. It seemed to be entirely skewed toward minimizing resources (which is not the primary goal of Ceph).

Looking at the current documentation, choosing the number of Placement Groups now targets 50-100 PGs/OSD but retains the old target of 100 PGs/OSD for calculation purposes on clusters of 50+ OSDs. For fewer than 50 OSDs, the documentation suggests preselection, but then claims the "autoscaler" will achieve the proper ratio for you. IMO, that is a doc bug or an implementation bug in the pg_autoscale plugin, because the result is a dramatically undersized PG count.

Also, I'm going to make a bold assumption that the majority of PVE users are running clusters of less than 50 OSDs. ;-)

Considering the documented PG tradeoffs, we see that data durability, object distribution and resource usage are the essential elements in choosing the number of PGs. To summarize:

Data durability: "In a nutshell, more OSDs mean faster recovery and a lower risk of cascading failures leading to the permanent loss of a Placement Group. Having 512 or 4096 Placement Groups is roughly equivalent in a cluster with less than 50 OSDs as far as data durability is concerned."

Object distribution: "As long as there are one or two orders of magnitude more Placement Groups than OSDs, the distribution should be even. For instance, 256 placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs etc."

Resource Usage: Minimizing placement groups saves resources (memory, cpu, network).

So, according to the documentation, the OP's 256 PGs should have been a "minimum" (for 18 OSDs) with a preference for 512+ PGs. Instead, the autoscaler set 64! I would not reduce any pool to 64 PGs if it contained data that I cared about. YMMV.
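To put those numbers in per-OSD terms, a back-of-the-envelope sketch (assuming a single replicated pool with size 3, as in the OP's cluster):

```python
# Back-of-the-envelope PGs-per-OSD calculation (assumes one replicated
# pool with size 3; with multiple pools, the per-pool PGs add up).

def pgs_per_osd(pg_num, replicas, osds):
    """Each PG is stored on `replicas` OSDs, so the average per-OSD
    load is pg_num * replicas / osds."""
    return pg_num * replicas / osds

# 256 PGs on 18 OSDs: ~42.7 PGs per OSD, near the 50-100 guideline.
print(round(pgs_per_osd(256, 3, 18), 1))  # 42.7
# The autoscaler's 64 PGs: ~10.7 PGs per OSD, far below the target.
print(round(pgs_per_osd(64, 3, 18), 1))   # 10.7
```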

@Alwin , please consider a stronger warning against using the autoscaler on clusters of less than 50 OSDs.
 

Alwin

Proxmox Staff Member
@Alwin , please consider a stronger warning against using the autoscaler on clusters of less than 50 OSDs.
We didn't observe any big impact on our production and test clusters, and they have fewer than 50 OSDs. ;) But I can see your point. The description might make it too easy to just fire and forget. I updated the upgrade guide with some more cautious wording.
 

Klug

Member
Jul 24, 2019
69
5
13
49
Hi all.

I might be hitting what RokaKen describes about the difference between the suggested number of PGs (I have 40 OSDs and previously set up 1024 PGs with Nautilus) and what pg_autoscale thinks/wants (it says there are too many placement groups after the Octopus upgrade).

I don't intend to enable pg_autoscale (based on what was said in this thread) and want to keep my 1024 PGs.

Is there a way to disable the "HEALTH_WARN" (because of "1 pools have too many placement groups")?
 

aaron

Proxmox Staff Member
Staff member
Jun 3, 2019
2,456
345
88
Is there a way to disable the "HEALTH_WARN" (because of "1 pools have too many placement groups")?
It should go away if you set pg_autoscale_mode to off for that pool.
 

Klug

Member
Code:
root@pve01:~# ceph osd pool autoscale-status
POOL                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
SSD                     6005G                3.0        71535G  0.2519                                  1.0    1024         256  warn
device_health_metrics   2529k                3.0        71535G  0.0000 

root@pve01:~# ceph osd pool set --pool=SSD --var=pg_autoscale_mode --val=off
set pool 1 pg_autoscale_mode to off

root@pve01:~# ceph osd pool autoscale-status
POOL                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
SSD                     6005G                3.0        71535G  0.2519                                  1.0    1024         256  off
device_health_metrics   2529k                3.0        71535G  0.0000                                  1.0       1              on

Thank you.
 
David Herselman
The fundamental problem is that the osdmaps are not trimmed by the monitors at each step where reducing placement groups ends with everything in a clean state. Ceph only prunes the osdmaps once all monitors have been restarted and all placement groups are clean at that moment in time.

i.e.: Large changes to the number of placement groups will result in very sudden growth of the monitor databases until the process is complete. This can very quickly lead to unavailability of the cluster.

Found and updated a Ceph bug report:
https://tracker.ceph.com/issues/48212
 

Alwin

Proxmox Staff Member
I would recommend Proxmox consider advising users to reduce the 'mon_osdmap_full_prune_min' from 10,000 to 1,000 to reduce space utilisation:
I wouldn't suggest this in general, but I have linked the thread to the upgrade guide. I hope others will share their experience as well.
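For anyone who does want to experiment despite the caveat above, the quoted setting can be placed in ceph.conf. A sketch only; verify behaviour on a test cluster before applying it to production:

```ini
# ceph.conf sketch: lower the monitor's osdmap full-prune threshold
# from its default of 10000 to 1000, per the suggestion quoted above.
# Experimental - test on a non-production cluster first.
[mon]
mon_osdmap_full_prune_min = 1000
```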
 

Zombie

Member
Feb 12, 2018
37
2
13
39
So I am a little confused and maybe need some clarification. Prior to my upgrade I was running with the PGs set to 256 (3 node cluster, 7 OSDs per node) and now that I have upgraded, the autoscaler is recommending the PGs be set to 32. I have not enabled it yet because that seems like a big change to go down that many. Can someone help me clarify whether it would be safe to use the recommended value or stay at the 256 that I have it set at?

Thanks for any hints.
 

Alwin

Proxmox Staff Member
So I am confused a little bit and maybe need some clarification. Prior to my upgrade I was running with the PGs set to 256 (3 node cluster 7 OSD's per node) and now that I have upgraded the auto scaler is recommending the PGs set to 32.
The autoscaler provides suggestions as to what the optimal PG number for a pool might be. It needs some tuning to provide more than just the defaults (off by a factor of 3). Best have a look at the scaling recommendations.
https://docs.ceph.com/en/octopus/ra...nt-groups/#viewing-pg-scaling-recommendations

Can someone help me clarify if it would be safe to use the recommended or stay at the 256 that I have it set at?
Have a look at the autoscale size and ratio, as well as the pg_num_min option (see link above).
 

Zombie

Member
OK, so let me make sure I have this correct. Based on the link above, the number of placement groups is (OSDs * 100) / replicas. For me, (21 * 100) / 3 = 700. So should I set it to 1024, since that is the next one up, or go to 512? Right now mine is set to 256, but if it recommends higher I can do that; I'm just not the smartest when choosing what is best for PG numbers.
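That arithmetic, with power-of-two rounding, can be sketched as follows (the nearest-power-of-two rounding here is my illustration; pgcalc's exact rounding rules may differ, and rounding up to 1024 is reasonable if growth is expected):

```python
# Sketch of the classic PG sizing rule of thumb:
#   target_pgs = (osds * 100) / replicas, rounded to a power of two.

def target_pg_count(osds, replicas, pgs_per_osd=100):
    raw = osds * pgs_per_osd / replicas
    # Next power of two at or above the raw value...
    up = 1
    while up < raw:
        up *= 2
    down = up // 2
    # ...and whichever neighbouring power of two is closer.
    nearest = up if (up - raw) <= (raw - down) else down
    return raw, down, up, nearest

raw, down, up, nearest = target_pg_count(21, 3)
print(raw)       # 700.0
print(down, up)  # 512 1024
print(nearest)   # 512
```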

Thanks for any help.
 

Alwin

Proxmox Staff Member
The autoscaler tries to optimize for minimizing resource usage. This may lead to under- or over-estimation of the PG count with the default settings. If you want to make an informed decision about the PG count for your cluster, I suggest using pgcalc [0] in combination with the autoscaler.

[0] https://ceph.io/pgcalc/
 

willybong

New Member
Apr 22, 2020
21
3
3
Hi Everyone,
I'm trying to use autoscale in a new "mini" cluster built from scratch.
I've noticed that when the default pool was created (device_health_metrics), the PG number was 1 (what??) and autoscale = on.
I have gradually restored some VMs to this pool and the PG number has increased:
  1. Restore VM n.1 (Windows 10) --> 8 PG
  2. Restore VM n.2 (Windows 10) --> 16 PG
  3. Restore VM n.3 (Windows 10) --> 32 PG
  4. Restore VM n.4 (Windows 10) --> 32 PG
  5. Restore VM n.5 (Windows 10) --> 64 PG
I think the number of PGs increases depending on the pool's used space (or perhaps on the used resources as well, e.g. RAM/CPU?).

Has anybody noticed that?

Many thanks
 

aaron

Proxmox Staff Member
The "device_health_metrics" only has one PG. That is okay. It is used by Ceph to store, as the name suggests, health metrics of the disks used for the OSDs.
You should not use it to store VMs but rather create a new pool for them. If you already have an idea how many pools you will have and how much space they will use up in the end, you can tell the autoscaler what you expect. For example, if you plan to have only one pool for the VMs, you can set its target ratio to 1 and call it a day. This tells the autoscaler that this pool will use all the space in the cluster, so it can calculate the pg_num to fit that right away, instead of waiting for the pool to fill slowly and increasing the pg_num every once in a while whenever the better pg_num is off by a factor of 3 from the current one.

If you use multiple pools, the target_ratio is weighted to all the other target_ratios set. Therefore, you should set it accordingly for all the pools (the device_health_metrics pool can be ignored).

For example, if you have 2 pools which both will be expected to use up about 50% of the space available in the cluster, you can set both to 1, or 0.5. As long as they have the same value, they get weighted equally.
If one pool is expected to use 75% and the other 25%, you could set the ratios to 0.75 and 0.25.
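The weighting described above amounts to a simple normalisation, sketched here for illustration (not Ceph's internal code; pool names are made up):

```python
# Sketch of how target ratios are weighted against each other:
# each pool's share of cluster capacity is its ratio divided by the
# sum of all ratios, so only the relative values matter.

def effective_shares(target_ratios):
    total = sum(target_ratios.values())
    return {pool: ratio / total for pool, ratio in target_ratios.items()}

# Two pools at 1 and 1 (or 0.5 and 0.5) split the cluster evenly.
print(effective_shares({"vm-pool-a": 1.0, "vm-pool-b": 1.0}))
# A 0.75 / 0.25 split gives 75% and 25%.
print(effective_shares({"vm-pool-a": 0.75, "vm-pool-b": 0.25}))
```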
 
