Advice on increasing ceph replicas

brad.lanham

Hi,

I am after some advice on the best way to expand our ceph pool. Some steps have already been undertaken, but I need to pause until I understand what to do next.

Initially we had a Proxmox Ceph cluster with 4 nodes, each with 4 x 1TB SSD OSDs. I have since added a 5th node with 6 x 1TB SSD OSDs and have now added 2 extra OSDs to the initial 4 nodes.

So now there are 5 nodes, each with 6 OSDs.

The autoscaler is enabled, but to me it looks like the number of PGs is too low?

>ceph osd pool autoscale-status
Code:
POOL                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
device_health_metrics  44139k                2.0        27787G  0.0000                                  1.0       1              on       
ceph-vm                 6303G                2.0        27787G  0.4537                                  1.0     512              on       
cephfs_data            18235M                2.0        27787G  0.0013                                  1.0      32              on       
cephfs_metadata        193.7M                2.0        27787G  0.0000                                  4.0      32              on

>ceph osd df tree
Code:
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME   
 -1         27.13626         -   27 TiB  8.7 TiB  8.7 TiB  1.3 GiB   44 GiB   18 TiB  32.12  1.00    -          root default
 -3          5.45517         -  5.5 TiB  1.8 TiB  1.8 TiB  257 MiB  9.1 GiB  3.6 TiB  33.47  1.04    -              host vhs0
  0    ssd   0.90919   1.00000  931 GiB  365 GiB  363 GiB   51 MiB  1.8 GiB  566 GiB  39.16  1.22   48      up          osd.0
  1    ssd   0.90919   1.00000  931 GiB  287 GiB  286 GiB   44 MiB  1.5 GiB  644 GiB  30.86  0.96   46      up          osd.1
  2    ssd   0.90919   1.00000  931 GiB  248 GiB  247 GiB   40 MiB  1.4 GiB  683 GiB  26.66  0.83   34      up          osd.2
  3    ssd   0.90919   1.00000  931 GiB  257 GiB  255 GiB   40 MiB  1.4 GiB  674 GiB  27.57  0.86   40      up          osd.3
 21    ssd   0.90919   1.00000  931 GiB  341 GiB  340 GiB   40 MiB  1.3 GiB  590 GiB  36.67  1.14   37      up          osd.21
 26    ssd   0.90919   1.00000  931 GiB  372 GiB  370 GiB   42 MiB  1.6 GiB  559 GiB  39.91  1.24   45      up          osd.26
 -5          5.45517         -  5.5 TiB  1.7 TiB  1.7 TiB  244 MiB  8.8 GiB  3.7 TiB  31.49  0.98    -              host vhs1
  4    ssd   0.90919   1.00000  931 GiB  234 GiB  233 GiB   36 MiB  1.5 GiB  697 GiB  25.14  0.78   40      up          osd.4
  5    ssd   0.90919   1.00000  931 GiB  290 GiB  288 GiB   44 MiB  1.6 GiB  641 GiB  31.14  0.97   43      up          osd.5
  6    ssd   0.90919   1.00000  931 GiB  213 GiB  212 GiB   34 MiB  1.3 GiB  718 GiB  22.92  0.71   30      up          osd.6
  7    ssd   0.90919   1.00000  931 GiB  317 GiB  315 GiB   48 MiB  1.6 GiB  614 GiB  34.00  1.06   45      up          osd.7
 20    ssd   0.90919   1.00000  931 GiB  331 GiB  329 GiB   39 MiB  1.4 GiB  600 GiB  35.54  1.11   37      up          osd.20
 25    ssd   0.90919   1.00000  931 GiB  374 GiB  373 GiB   42 MiB  1.4 GiB  557 GiB  40.20  1.25   42      up          osd.25
-11          5.45819         -  5.5 TiB  1.8 TiB  1.8 TiB  285 MiB  8.2 GiB  3.6 TiB  33.18  1.03    -              host vhs11
 16    ssd   0.90970   1.00000  932 GiB  310 GiB  309 GiB   40 MiB  1.3 GiB  622 GiB  33.27  1.04   32      up          osd.16
 17    ssd   0.90970   1.00000  932 GiB  271 GiB  270 GiB   44 MiB  1.4 GiB  660 GiB  29.14  0.91   33      up          osd.17
 18    ssd   0.90970   1.00000  932 GiB  320 GiB  318 GiB   59 MiB  1.4 GiB  612 GiB  34.33  1.07   38      up          osd.18
 19    ssd   0.90970   1.00000  932 GiB  338 GiB  337 GiB   53 MiB  1.5 GiB  593 GiB  36.29  1.13   38      up          osd.19
 23    ssd   0.90970   1.00000  932 GiB  314 GiB  312 GiB   47 MiB  1.3 GiB  618 GiB  33.69  1.05   37      up          osd.23
 28    ssd   0.90970   1.00000  932 GiB  302 GiB  300 GiB   41 MiB  1.2 GiB  630 GiB  32.37  1.01   39      up          osd.28
 -7          5.45517         -  5.5 TiB  1.7 TiB  1.7 TiB  262 MiB  9.0 GiB  3.7 TiB  31.48  0.98    -              host vhs2
  8    ssd   0.90919   1.00000  931 GiB  272 GiB  271 GiB   41 MiB  1.6 GiB  659 GiB  29.27  0.91   37      up          osd.8
  9    ssd   0.90919   1.00000  931 GiB  266 GiB  265 GiB   41 MiB  1.7 GiB  665 GiB  28.62  0.89   42      up          osd.9
 10    ssd   0.90919   1.00000  931 GiB  236 GiB  235 GiB   39 MiB  1.5 GiB  695 GiB  25.37  0.79   34      up          osd.10
 11    ssd   0.90919   1.00000  931 GiB  237 GiB  235 GiB   36 MiB  1.3 GiB  694 GiB  25.43  0.79   34      up          osd.11
 22    ssd   0.90919   1.00000  931 GiB  423 GiB  421 GiB   51 MiB  1.6 GiB  508 GiB  45.44  1.41   47      up          osd.22
 27    ssd   0.90919   1.00000  931 GiB  324 GiB  322 GiB   55 MiB  1.3 GiB  607 GiB  34.76  1.08   37      up          osd.27
 -9          5.31256         -  5.3 TiB  1.6 TiB  1.6 TiB  286 MiB  9.1 GiB  3.7 TiB  30.96  0.96    -              host vhs8
 12    ssd   0.87329   1.00000  894 GiB  278 GiB  276 GiB   52 MiB  2.0 GiB  616 GiB  31.12  0.97   37      up          osd.12
 13    ssd   0.87329   1.00000  894 GiB  266 GiB  264 GiB   66 MiB  1.5 GiB  629 GiB  29.69  0.92   35      up          osd.13
 14    ssd   0.87329   1.00000  894 GiB  266 GiB  265 GiB   54 MiB  1.7 GiB  628 GiB  29.78  0.93   37      up          osd.14
 15    ssd   0.87329   1.00000  894 GiB  254 GiB  253 GiB   34 MiB  1.4 GiB  640 GiB  28.43  0.89   32      up          osd.15
 24    ssd   0.90970   1.00000  932 GiB  291 GiB  290 GiB   36 MiB  1.3 GiB  640 GiB  31.26  0.97   37      up          osd.24
 29    ssd   0.90970   1.00000  932 GiB  328 GiB  327 GiB   44 MiB  1.3 GiB  603 GiB  35.26  1.10   41      up          osd.29
                         TOTAL   27 TiB  8.7 TiB  8.7 TiB  1.3 GiB   44 GiB   18 TiB  32.12                                 
MIN/MAX VAR: 0.71/1.41  STDDEV: 5.05

The %USE varies too much, which I guess is due to the low PG count per OSD?

NOTE: The cluster is running Ceph Octopus; an upgrade is planned for a later date.

What I am aiming for is to increase the replica count to 3 (the extra OSDs and node were put in place to accommodate the extra headroom needed for the higher replica count). Should I do that first, or adjust the number of PGs beforehand to something higher?

I know Octopus doesn't support the bulk flag. Should I adjust the target ratio of, say, the 'ceph-vm' pool to something like 0.8 before I do anything, in the hope that the autoscaler corrects the PG count? That pool does indeed host the majority of our data; the other pools are relatively unused.

If the PG count is indeed too low at the moment, I understand that manually increasing it to something like 1024 will trigger an intensive process of splitting existing PGs into smaller ones. Is it better to do this before there are 3 copies of each PG? I am trying to minimise the amount of time that IO will be stressed.

If I do not touch the PG count and instead increase the replica count to 3, do I run the risk of the process increasing the PG count anyway while at the same time trying to create the additional copy of each PG? I worry that this would reduce IO performance for clients over a prolonged period of time.

Any advice offered would be appreciated.

Cheers,
Brad
 
What is the output of pveceph pool ls? Either redirect the output to a file, or run it in a large (wide) terminal to fully capture the output.
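For example (plain shell output redirection; the file path is just an arbitrary choice):
Code:
pveceph pool ls > /tmp/pveceph-pool-ls.txt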

Definitely set the replica count of all pools to 3 ASAP if your cluster can support the extra data: size of 3 and min_size of 2.
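A minimal sketch of how that can be done with the standard Ceph CLI (pool name taken from your autoscale-status output; repeat for each pool you want at 3 replicas):
Code:
# raise the replica count and keep the write quorum at 2
ceph osd pool set ceph-vm size 3
ceph osd pool set ceph-vm min_size 2
# verify the change
ceph osd pool get ceph-vm size
ceph osd pool get ceph-vm min_size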

Then you can use the target_ratio and target_size parameters to let the autoscaler know how much space you estimate each pool to consume in the end. You can ignore the device_health_metrics pool as it will never consume a lot. The cephfs_metadata pool will also most likely not consume much. If you only store some ISO files in the cephfs, then you could use the target_size parameter for it and assign a target_ratio to the ceph-vm pool. This way, the autoscaler will divide the ~1024 PGs recommended for this cluster accordingly. Since the ceph-vm pool will be the only one with a ratio, the autoscaler calculates the remaining space for it and will assign it PGs to match.
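As a rough illustration of how those parameters can be set via the Ceph CLI (the 0.9 ratio and 50G value are placeholder figures, not recommendations):
Code:
# ceph-vm is expected to end up using most of the raw capacity
ceph osd pool set ceph-vm target_size_ratio 0.9
# the small cephfs data pool gets a fixed expected size instead of a ratio
ceph osd pool set cephfs_data target_size_bytes 50G
# see what the autoscaler now recommends
ceph osd pool autoscale-status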

If you increase the size to 3, you will see an increase of PGs per OSD as the additional replicas need to be stored somewhere.

Since you are still on Octopus, you will have to manually enable the balancer: https://docs.ceph.com/en/octopus/rados/operations/balancer/
The balancer helps to "manually" move PGs to other OSDs to even out usage in cases where the CRUSH algorithm doesn't do a good enough job.
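The relevant commands from that page boil down to roughly the following (upmap mode requires that all clients speak at least the Luminous protocol):
Code:
# may be needed before upmap mode can be selected
ceph osd set-require-min-compat-client luminous
# switch the balancer to upmap mode and turn it on
ceph balancer mode upmap
ceph balancer on
# check what it is doing
ceph balancer status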

If the PG count is indeed too low at the moment, I understand that manually increasing it to something like 1024 will trigger an intensive process of splitting existing PGs into smaller ones. Is it better to do this before there are 3 copies of each PG? I am trying to minimise the amount of time that IO will be stressed.
How utilized is your cluster? It should be able to handle that without too many issues; after all, you could always run into the situation where you lose an OSD or a full node and the cluster needs to recover ASAP.

Keep in mind that if the autoscaler is working, it will usually scale up the pg_num slowly to avoid too much load.

Anyway, even if the autoscaler is enabled and configured (with target_ratio or target_size), it will most likely not change anything by itself, as the current pg_num for the ceph-vm pool is 512 and the optimal is most likely 1024. That difference is only a factor of 2. The autoscaler only becomes active once the difference is a factor of 3 or more. It will warn you, but you will have to set the pg_num yourself.
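If you do end up raising it by hand, it is a single command (1024 is just the value discussed above; since Nautilus, Ceph applies the pg_num/pgp_num increase gradually in the background):
Code:
ceph osd pool set ceph-vm pg_num 1024
# watch the splitting and backfill progress
ceph -s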
 
What is the output of pveceph pool ls? Either redirect the output to a file, or run it in a large (wide) terminal to fully capture the output.

Code:
┌───────────────────────┬──────┬──────────┬────────┬─────────────┬────────────────┬───────────────────┬──────────────────────────┬───────────────────────────┬─────────────────┬──────────────────────┬────────────────┐
│ Name                  │ Size │ Min Size │ PG Num │ min. PG Num │ Optimal PG Num │ PG Autoscale Mode │ PG Autoscale Target Size │ PG Autoscale Target Ratio │ Crush Rule Name │               %-Used │           Used │
╞═══════════════════════╪══════╪══════════╪════════╪═════════════╪════════════════╪═══════════════════╪══════════════════════════╪═══════════════════════════╪═════════════════╪══════════════════════╪════════════════╡
│ ceph-vm               │    3 │        2 │    512 │         512 │            512 │ on                │                          │                           │ replicated_rule │    0.569648325443268 │ 14603810586648 │
├───────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ cephfs_data           │    2 │        2 │     32 │             │             32 │ on                │                          │                           │ replicated_rule │  0.00367486895993352 │    40693370662 │
├───────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ cephfs_metadata       │    2 │        2 │     32 │          16 │             16 │ on                │                          │                           │ replicated_rule │ 3.47073219018057e-05 │      382929705 │
├───────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼─────────────────┼──────────────────────┼────────────────┤
│ device_health_metrics │    2 │        2 │      1 │           1 │              1 │ on                │                          │                           │ replicated_rule │ 3.72714066543267e-06 │       41120681 │
└───────────────────────┴──────┴──────────┴────────┴─────────────┴────────────────┴───────────────────┴──────────────────────────┴───────────────────────────┴─────────────────┴──────────────────────┴────────────────┘
Definitely set the replica count of all pools to 3 ASAP if your cluster can support the extra data: size of 3 and min_size of 2.
I have now completed this change, as I gathered from the context of your reply that this step (even if it does trigger the autoscaler to increase the PG count) won't induce the IO bottleneck I was fearing. Indeed, it completed rather quickly with no real noticeable impact on cluster performance.
Then you can use the target_ratio and target_size parameters to let the autoscaler know how much space you estimate each pool to consume in the end. You can ignore the device_health_metrics pool as it will never consume a lot. The cephfs_metadata pool will also most likely not consume much. If you only store some ISO files in the cephfs, then you could use the target_size parameter for it and assign a target_ratio to the ceph-vm pool. This way, the autoscaler will divide the ~1024 PGs recommended for this cluster accordingly. Since the ceph-vm pool will be the only one with a ratio, the autoscaler calculates the remaining space for it and will assign it PGs to match.

If you increase the size to 3, you will see an increase of PGs per OSD as the additional replicas need to be stored somewhere.

Since you are still on Octopus, you will have to manually enable the balancer: https://docs.ceph.com/en/octopus/rados/operations/balancer/
The balancer helps to "manually" move PGs to other OSDs to even out usage in cases where the CRUSH algorithm doesn't do a good enough job.
Will look at implementing this soon. Alternatively, I may just wait until we can upgrade Ceph to the latest version and revisit it then.

How utilized is your cluster? It should be able to handle that without too many issues; after all, you could always run into the situation where you lose an OSD or a full node and the cluster needs to recover ASAP.

Keep in mind that if the autoscaler is working, it will usually scale up the pg_num slowly to avoid too much load.

Anyway, even if the autoscaler is enabled and configured (with target_ratio or target_size), it will most likely not change anything by itself, as the current pg_num for the ceph-vm pool is 512 and the optimal is most likely 1024. That difference is only a factor of 2. The autoscaler only becomes active once the difference is a factor of 3 or more. It will warn you, but you will have to set the pg_num yourself.

Thank you for your advice. Apologies for the delayed reply.
Cheers, Brad
 