OSD reweight

Hello,

this has probably been discussed often, but here is the same question from me as well:

ever since we set up our Ceph cluster we have seen very uneven usage across the OSDs.

4 nodes with 7x 1 TB SSDs (1U, no drive bays left)
3 nodes with 8x 1 TB SSDs (2U, some bays left)

= 52 SSDs
PVE 7.2-11


all Ceph nodes show us the same picture:
the OSDs are used somewhere between 60% and 80% - but why?
So my pool with the SSD class is nearfull (88%) - but the OSDs are not!
Sometimes I reduce the reweight of a heavily used OSD (from 1.00 to 0.70) and set it back again a few days later, roughly as shown below.
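(For reference, this is roughly what I run - the OSD id and the value are only examples:)

Bash:
# temporarily lower the reweight of an overly full OSD
ceph osd reweight osd.38 0.70
# and set it back to the default later
ceph osd reweight osd.38 1.00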

But is this the normal way to handle it?

Do we have too few OSDs in our Ceph cluster?

What do you think I should do?
I am thinking about:
- upgrading the big nodes with three 2 TB SSDs (is this possible, considering the weighting?)
- expanding the cluster with another node that also has 8 SSDs


thanks a lot
Ronny

[Screenshot: per-OSD usage as shown in the Proxmox web UI]
 

Maybe your nodes are weighted unevenly (regarding disk capacity); you also have HDDs as OSDs on your nodes.
What does "ceph osd df tree" say?
 
The HDD OSDs are in a separate device class / pool.

Here is the output of "ceph osd df tree":

Bash:
root@pve-hp-01:~# ceph osd df tree
ID   CLASS   WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1          51.27962         -   51 TiB   34 TiB   34 TiB  5.1 GiB   82 GiB   17 TiB  66.94  1.00    -          root default
-13           6.11121         -  6.1 TiB  4.5 TiB  4.5 TiB  394 MiB   10 GiB  1.6 TiB  73.75  1.10    -              host pve-dell-01
 36     ssd   0.87299   1.00000  894 GiB  582 GiB  581 GiB   47 MiB  1.1 GiB  312 GiB  65.05  0.97   52      up          osd.36
 37     ssd   0.87299   0.85004  894 GiB  684 GiB  682 GiB   89 MiB  1.2 GiB  210 GiB  76.46  1.14   61      up          osd.37
 38     ssd   0.87299   1.00000  894 GiB  736 GiB  734 GiB   58 MiB  1.5 GiB  159 GiB  82.27  1.23   66      up          osd.38
 40     ssd   0.87299   0.95001  894 GiB  627 GiB  626 GiB   66 MiB  1.2 GiB  267 GiB  70.13  1.05   56      up          osd.40
 41     ssd   0.87299   1.00000  894 GiB  738 GiB  736 GiB   85 MiB  1.5 GiB  156 GiB  82.54  1.23   66      up          osd.41
 42     ssd   0.87299   1.00000  894 GiB  581 GiB  580 GiB   48 MiB  1.0 GiB  313 GiB  65.00  0.97   52      up          osd.42
 52     ssd   0.87329   0.95001  894 GiB  669 GiB  667 GiB  329 KiB  2.5 GiB  225 GiB  74.82  1.12   60      up          osd.52
-16           6.11200         -  6.1 TiB  4.5 TiB  4.5 TiB  781 MiB  8.6 GiB  1.6 TiB  74.16  1.11    -              host pve-dell-02
 44     ssd   0.87299   1.00000  894 GiB  728 GiB  726 GiB  237 MiB  1.3 GiB  167 GiB  81.36  1.22   66      up          osd.44
 45     ssd   0.87299   0.95001  894 GiB  593 GiB  592 GiB   80 MiB  1.1 GiB  301 GiB  66.34  0.99   53      up          osd.45
 46     ssd   0.87299   1.00000  894 GiB  748 GiB  746 GiB   93 MiB  1.4 GiB  146 GiB  83.65  1.25   67      up          osd.46
 47     ssd   0.87299   1.00000  894 GiB  759 GiB  758 GiB   67 MiB  1.3 GiB  135 GiB  84.90  1.27   68      up          osd.47
 48     ssd   0.87299   1.00000  894 GiB  583 GiB  582 GiB   93 MiB  1.1 GiB  311 GiB  65.23  0.97   52      up          osd.48
 49     ssd   0.87299   0.90002  894 GiB  595 GiB  593 GiB  121 MiB  1.2 GiB  299 GiB  66.51  0.99   53      up          osd.49
 50     ssd   0.87299   1.00000  894 GiB  636 GiB  635 GiB   90 MiB  1.2 GiB  258 GiB  71.15  1.06   57      up          osd.50
-29           6.11240         -  6.1 TiB  4.3 TiB  4.3 TiB  760 MiB  8.2 GiB  1.8 TiB  70.52  1.05    -              host pve-dell-03
 74     ssd   0.87320   1.00000  894 GiB  501 GiB  500 GiB  113 MiB  943 MiB  393 GiB  56.05  0.84   45      up          osd.74
 75     ssd   0.87320   0.95001  894 GiB  661 GiB  660 GiB   49 MiB  1.2 GiB  233 GiB  73.91  1.10   59      up          osd.75
 76     ssd   0.87320   1.00000  894 GiB  559 GiB  558 GiB   66 MiB  1.1 GiB  335 GiB  62.56  0.93   50      up          osd.76
 77     ssd   0.87320   0.79999  894 GiB  616 GiB  615 GiB  125 MiB  1.1 GiB  278 GiB  68.89  1.03   55      up          osd.77
 78     ssd   0.87320   1.00000  894 GiB  751 GiB  749 GiB  110 MiB  1.4 GiB  144 GiB  83.95  1.25   67      up          osd.78
 79     ssd   0.87320   0.89999  894 GiB  607 GiB  606 GiB   52 MiB  1.2 GiB  287 GiB  67.90  1.01   54      up          osd.79
 80     ssd   0.87320   1.00000  894 GiB  719 GiB  718 GiB  244 MiB  1.3 GiB  175 GiB  80.43  1.20   65      up          osd.80
-19           6.11304         -  6.1 TiB  4.4 TiB  4.4 TiB  2.3 MiB   17 GiB  1.7 TiB  71.70  1.07    -              host pve-dell-04
 58     ssd   0.87329   0.95001  894 GiB  561 GiB  558 GiB  302 KiB  2.2 GiB  334 GiB  62.69  0.94   50      up          osd.58
 59     ssd   0.87329   0.95001  894 GiB  736 GiB  734 GiB  373 KiB  2.3 GiB  158 GiB  82.29  1.23   66      up          osd.59
 60     ssd   0.87329   0.79999  894 GiB  651 GiB  649 GiB  315 KiB  2.4 GiB  243 GiB  72.83  1.09   58      up          osd.60
 61     ssd   0.87329   0.85001  894 GiB  606 GiB  604 GiB  327 KiB  2.4 GiB  288 GiB  67.78  1.01   54      up          osd.61
 62     ssd   0.87329   1.00000  894 GiB  713 GiB  710 GiB  392 KiB  2.7 GiB  181 GiB  79.72  1.19   64      up          osd.62
 64     ssd   0.87329   1.00000  894 GiB  627 GiB  624 GiB  305 KiB  2.4 GiB  268 GiB  70.07  1.05   56      up          osd.64
 65     ssd   0.87329   1.00000  894 GiB  595 GiB  593 GiB  313 KiB  2.1 GiB  299 GiB  66.51  0.99   53      up          osd.65
 -3           8.80385         -  8.8 TiB  5.1 TiB  5.0 TiB  1.3 GiB   11 GiB  3.7 TiB  57.41  0.86    -              host pve-hp-01
 25  hdd300   0.27299   1.00000  279 GiB   34 GiB   34 GiB   66 MiB  238 MiB  245 GiB  12.15  0.18   11      up          osd.25
 26  hdd300   0.18199   1.00000  186 GiB   38 GiB   37 GiB   66 MiB  250 MiB  149 GiB  20.22  0.30   13      up          osd.26
 27  hdd300   0.27299   1.00000  279 GiB   57 GiB   57 GiB   95 MiB  204 MiB  222 GiB  20.35  0.30   19      up          osd.27
 28  hdd300   0.27299   1.00000  279 GiB   33 GiB   33 GiB   51 MiB   89 MiB  246 GiB  11.83  0.18   11      up          osd.28
 34  hdd300   0.27299   1.00000  279 GiB   48 GiB   48 GiB   87 MiB  135 MiB  231 GiB  17.19  0.26   16      up          osd.34
 35  hdd300   0.27299   1.00000  279 GiB   51 GiB   51 GiB   98 MiB  251 MiB  228 GiB  18.37  0.27   17      up          osd.35
 55  hdd300   0.27299   1.00000  279 GiB   34 GiB   34 GiB   47 MiB  293 MiB  245 GiB  12.25  0.18   11      up          osd.55
  0     ssd   0.87299   1.00000  894 GiB  483 GiB  482 GiB  125 MiB  922 MiB  411 GiB  54.02  0.81   43      up          osd.0
  1     ssd   0.87299   1.00000  894 GiB  669 GiB  668 GiB   74 MiB  1.3 GiB  225 GiB  74.83  1.12   60      up          osd.1
  2     ssd   0.87299   1.00000  894 GiB  660 GiB  659 GiB  108 MiB  1.3 GiB  234 GiB  73.81  1.10   59      up          osd.2
  3     ssd   0.87299   1.00000  894 GiB  580 GiB  579 GiB   79 MiB  1.1 GiB  314 GiB  64.90  0.97   52      up          osd.3
  5     ssd   0.87299   1.00000  894 GiB  697 GiB  695 GiB   98 MiB  1.4 GiB  197 GiB  77.93  1.16   62      up          osd.5
  7     ssd   0.87299   1.00000  894 GiB  683 GiB  682 GiB  106 MiB  1.3 GiB  211 GiB  76.44  1.14   61      up          osd.7
  9     ssd   0.87299   1.00000  894 GiB  604 GiB  603 GiB  125 MiB  1.0 GiB  290 GiB  67.52  1.01   54      up          osd.9
 51     ssd   0.87299   0.80000  894 GiB  504 GiB  503 GiB  105 MiB  1.0 GiB  390 GiB  56.41  0.84   45      up          osd.51
 -5           9.40527         -  9.4 TiB  6.1 TiB  6.1 TiB  706 MiB   17 GiB  3.3 TiB  64.65  0.97    -              host pve-hp-02
 29  hdd300   0.18199   1.00000  186 GiB   20 GiB   20 GiB   28 MiB  285 MiB  166 GiB  10.98  0.16    7      up          osd.29
 31  hdd300   0.27299   1.00000  279 GiB   34 GiB   34 GiB   55 MiB  325 MiB  245 GiB  12.29  0.18   11      up          osd.31
 32  hdd300   0.27299   1.00000  279 GiB   28 GiB   28 GiB   43 MiB  226 MiB  251 GiB   9.96  0.15    9      up          osd.32
 33  hdd300   0.27299   1.00000  279 GiB   36 GiB   36 GiB   71 MiB  239 MiB  243 GiB  12.96  0.19   12      up          osd.33
 56  hdd300   0.27299   0.95001  279 GiB   54 GiB   54 GiB   84 MiB  116 MiB  225 GiB  19.31  0.29   18      up          osd.56
 57  hdd300   0.27299   1.00000  279 GiB   57 GiB   57 GiB   88 MiB  210 MiB  222 GiB  20.35  0.30   19      up          osd.57
  4     ssd   0.87329   1.00000  894 GiB  740 GiB  737 GiB  360 KiB  2.6 GiB  154 GiB  82.75  1.24   66      up          osd.4
  6     ssd   0.87329   1.00000  894 GiB  589 GiB  587 GiB  304 KiB  2.0 GiB  305 GiB  65.89  0.98   53      up          osd.6
  8     ssd   0.87329   0.89999  894 GiB  672 GiB  669 GiB  403 KiB  2.4 GiB  222 GiB  75.12  1.12   60      up          osd.8
 10     ssd   0.87299   1.00000  894 GiB  727 GiB  726 GiB   65 MiB  1.4 GiB  167 GiB  81.31  1.21   65      up          osd.10
 11     ssd   0.87299   1.00000  894 GiB  626 GiB  624 GiB   78 MiB  1.1 GiB  269 GiB  69.96  1.05   56      up          osd.11
 12     ssd   0.87299   1.00000  894 GiB  650 GiB  649 GiB   56 MiB  1.2 GiB  244 GiB  72.74  1.09   58      up          osd.12
 13     ssd   0.87299   0.95001  894 GiB  558 GiB  557 GiB   52 MiB  1.1 GiB  336 GiB  62.43  0.93   50      up          osd.13
 30     ssd   0.87320   1.00000  894 GiB  751 GiB  750 GiB   86 MiB  1.4 GiB  143 GiB  84.02  1.26   67      up          osd.30
 39     ssd   0.87329   0.95001  894 GiB  683 GiB  681 GiB  355 KiB  2.0 GiB  211 GiB  76.35  1.14   61      up          osd.39
 -7           8.62186         -  8.6 TiB  5.5 TiB  5.4 TiB  1.2 GiB   11 GiB  3.2 TiB  63.31  0.95    -              host pve-hp-03
 21  hdd300   0.27299   1.00000  279 GiB   39 GiB   39 GiB   65 MiB  206 MiB  240 GiB  14.14  0.21   13      up          osd.21
 22  hdd300   0.27299   1.00000  279 GiB   56 GiB   56 GiB   90 MiB  335 MiB  223 GiB  20.17  0.30   19      up          osd.22
 23  hdd300   0.27299   1.00000  279 GiB   36 GiB   36 GiB   89 MiB   86 MiB  243 GiB  13.06  0.20   12      up          osd.23
 24  hdd300   0.27299   1.00000  279 GiB   33 GiB   33 GiB   34 MiB   79 MiB  246 GiB  11.90  0.18   11      up          osd.24
 43  hdd300   0.27299   1.00000  279 GiB   48 GiB   48 GiB   90 MiB  136 MiB  231 GiB  17.29  0.26   16      up          osd.43
 63  hdd300   0.27299   1.00000  279 GiB   34 GiB   33 GiB   33 MiB   85 MiB  246 GiB  12.02  0.18   11      up          osd.63
 14     ssd   0.87299   1.00000  894 GiB  736 GiB  735 GiB   67 MiB  1.4 GiB  158 GiB  82.33  1.23   66      up          osd.14
 15     ssd   0.87299   1.00000  894 GiB  672 GiB  670 GiB   71 MiB  1.4 GiB  222 GiB  75.12  1.12   60      up          osd.15
 16     ssd   0.87299   0.85001  894 GiB  693 GiB  692 GiB   76 MiB  1.4 GiB  201 GiB  77.53  1.16   62      up          osd.16
 17     ssd   0.87299   1.00000  894 GiB  662 GiB  661 GiB  109 MiB  1.3 GiB  232 GiB  74.05  1.11   59      up          osd.17
 19     ssd   0.87299   0.90002  894 GiB  696 GiB  695 GiB   89 MiB  1.3 GiB  198 GiB  77.83  1.16   62      up          osd.19
 20     ssd   0.87299   1.00000  894 GiB  606 GiB  604 GiB  289 MiB  1.3 GiB  289 GiB  67.73  1.01   55      up          osd.20
 53     ssd   0.87299   1.00000  894 GiB  638 GiB  637 GiB  110 MiB  1.1 GiB  256 GiB  71.34  1.07   57      up          osd.53
 54     ssd   0.87299   0.95001  894 GiB  640 GiB  638 GiB   50 MiB  1.1 GiB  255 GiB  71.52  1.07   57      up          osd.54
                          TOTAL   51 TiB   34 TiB   34 TiB  5.1 GiB   82 GiB   17 TiB  66.94
MIN/MAX VAR: 0.15/1.27  STDDEV: 28.25
 
Hey, please use code-tags around the output.

I can't see any big mistake. Your pool simply seems to be nearly full.
In Ceph you can't (at least not the normal way) equalize all OSDs. That's because of PG placement: data is distributed per PG, not at the byte level or anything similarly fine-grained. But it doesn't hurt.
You could try to increase the PG count, I think, but I'm not sure it will help (a rough sketch is below).
In my experience, once you start playing with manual reweights, you really need more resources.
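(Something along these lines - the pool name is only a placeholder, and increasing pg_num will trigger data movement:)

Bash:
# current PG count of the pool
ceph osd pool get <your-ssd-pool> pg_num
# increase it, e.g. to the next power of two
ceph osd pool set <your-ssd-pool> pg_num 2048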
 
hi wigor,
Thanks for your answer. Next time I will think of the code-tags :)

At the moment the SSD pool has 1024 PGs, and PVE also reports 1024 as the "Optimal PGs":
[Screenshot: pool overview with PG counts in the Proxmox web UI]
The thing is, every time I decrease one OSD's weight I get more space in this pool - but a few hours later the "new" space is gone...


Any suggestions on how to add more SSDs to my cluster?

I am thinking about 3x 2 TB SSDs in my 3 big nodes (which have room for more drives).
Do I need any config changes regarding the weighting or something like that - or do I simply add them and everything is OK?


Or I add a new cluster node... but that would be the long way...


thanks a lot :)
 
Hello, could you please post the output of pveceph pool ls?

The thing is, every time I decrease one OSD's weight I get more space in this pool - but a few hours later the "new" space is gone...

This is the result of the data being moved around. The pool usage % essentially reflects the usage of the fullest OSD: if a single OSD is around 88% full then, for all intents and purposes, the entire pool is at least 88% full.


Additionally, note that 88% is dangerously full for Ceph. At around that level you will start seeing Ceph warnings, and at around 95% all I/O on the entire pool will be blocked to ensure data integrity; you can check the thresholds and the current usage as sketched below.

On a different note, I would suggest not using size/min_size = 2/1 on the hdd_level_2 pool; that's a recipe for losing data. Please use the default values of 3/2.
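(To double-check the thresholds and the current usage, something like this should do:)

Bash:
# cluster-wide and per-pool usage, including MAX AVAIL per pool
ceph df
# the configured nearfull/backfillfull/full ratios
ceph osd dump | grep ratio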
 
I am thinking about 3x 2 TB SSDs in my 3 big nodes (which have room for more drives).
Do I need any config changes regarding the weighting or something like that - or do I simply add them and everything is OK?
I think that would make it worse. In Ceph it is best to use nearly identical capacities on the nodes. If you grow your "big nodes" even further, Ceph cannot distribute the data the way it wants to.
I think the best approach is to exchange the small SSDs one by one with 2 TB SSDs, beginning with the node that has the lowest capacity (a rough outline is below). But maybe somebody with more experience has other ideas.
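(Roughly like this per disk - OSD id and device path are only examples, wait for the cluster to be healthy again after every step, and double-check against the docs before doing it in production:)

Bash:
# take the old OSD out and wait until rebalancing is finished (ceph -s)
ceph osd out osd.36
# optional sanity check before removing it
ceph osd safe-to-destroy osd.36
# stop the OSD service and remove the OSD (on the node that hosts it)
systemctl stop ceph-osd@36
pveceph osd destroy 36
# swap the physical disk, then create the new OSD on the new device
pveceph osd create /dev/sdX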
 
It's very hard to tell from the output of ceph osd df tree since it's not inside a code block, but yes: it seems there are nodes with a different number of OSDs, which could explain why Ceph is having a hard time distributing the usage evenly across the OSDs.
 
hi there,

Thanks for your hint - I edited my post above with the ceph osd df tree output and added the code tags - nice feature :)

The pool with the small HDDs is for testing ONLY, and I know about 2/1 and the risk of data loss on a crash.


And here is the output of pveceph pool ls:

Bash:
root@pve-hp-01:~# pveceph pool ls
┌───────────────────────┬──────┬──────────┬────────┬─────────────┬────────────────┬───────────────────┬──────────────────────────┬───────────────────────────┬─────────────────────┬──────────────────────┬────────────────┐
│ Name                  │ Size │ Min Size │ PG Num │ min. PG Num │ Optimal PG Num │ PG Autoscale Mode │ PG Autoscale Target Size │ PG Autoscale Target Ratio │ Crush Rule Name     │               %-Used │           Used │
╞═══════════════════════╪══════╪══════════╪════════╪═════════════╪════════════════╪═══════════════════╪══════════════════════════╪═══════════════════════════╪═════════════════════╪══════════════════════╪════════════════╡
│ device_health_metrics │    3 │        2 │      1 │           1 │              1 │ on                │                          │                           │ replicated_rule     │ 9.64546561590396e-05 │      489739233 │
├───────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼─────────────────────┼──────────────────────┼────────────────┤
│ hdd_level_2           │    2 │        1 │    128 │             │            128 │ on                │                          │                           │ replicated_rule_300 │    0.164854913949966 │   805588378013 │
├───────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼─────────────────────┼──────────────────────┼────────────────┤
│ pro_level_3           │    3 │        2 │   1024 │             │           1024 │ warn              │                          │                           │ replicated_rule     │    0.880345165729523 │ 37352750540128 │
└───────────────────────┴──────┴──────────┴────────┴─────────────┴────────────────┴───────────────────┴──────────────────────────┴───────────────────────────┴─────────────────────┴──────────────────────┴────────────────┘
 
The number of PGs you are running per OSD is somewhere at the lower end of what's recommended. Could you please set *a* value (1, for example) as the PG Autoscale Target Ratio of the `pro_level_3` pool? You can do this in the web UI at Datacenter->{Node}->Ceph->Pools->{Pool}->Target Ratio.

This will enable the autoscaler to determine the optimal number of PGs for the pool. Since you have the autoscaler mode set to warn, you will have to set the number of PGs (# of PGs in the web UI) manually. Having more PGs should hopefully help spread the data more evenly across all your OSDs. A rough CLI equivalent is sketched below.
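(If you prefer the CLI, roughly the equivalent would be - the pg_num value is only an example, take the one the autoscaler suggests:)

Bash:
# set a target ratio so the autoscaler can compute an optimal PG count
ceph osd pool set pro_level_3 target_size_ratio 1
# see what the autoscaler now suggests (NEW PG_NUM column)
ceph osd pool autoscale-status
# with autoscale mode "warn" you then apply it manually, e.g.:
ceph osd pool set pro_level_3 pg_num 2048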
 
hi,

sorry for my late answer - long work day.

At the moment my pool has this config:
What will happen to my Ceph storage after changing the target ratio to 1? Is there some performance impact?

[Screenshot: current pool configuration in the Proxmox web UI]
 
When you set the Target Ratio to *a* value, the autoscaler will determine the optimal number of PGs for that pool. If the optimal number of PGs is more than 3 times bigger (or smaller) than the current # of PGs, the # of PGs will be changed to the optimal value - but only if you have the "PG Autoscale Mode" set to on. At the moment you have it set to warn, which will only warn you, so you won't experience any performance impact until you manually set the # of PGs.

Changing the number of PGs will require a lot of data to be shuffled, which should happen in the background as a low-priority task, but it could still be a good idea to do it in a period where the Ceph cluster is expected to experience low I/O demand.
 
Hi Maximiliano,

thank you for your advice.

I have tested this on our testing cluster (very old hardware, no SSDs...) and... the cluster is gone :(

The only thing I changed was the Target Ratio, to "1.0".

It struggled hard with I/O, and after 4 days I turned all the nodes off.
It was not the best experience for me, so I am quite afraid to do this on our prod cluster.

I have found some VMs whose disks I can move to our NFS share, so that I have more space available (~1 TB).
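(In case it helps anybody, I do the move roughly like this - VM id, disk and storage name are only examples:)

Bash:
# move a VM disk to the NFS storage and drop the old copy afterwards
qm move_disk 101 scsi0 nfs-share --delete 1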
 
Hello,

When setting the target ratio to *a* value, the autoscaler will be turned on. If it detects that the optimal number of PGs is way off (as described in a previous response), it will set the number of PGs to a new value, causing a possibly big change in Ceph's topology, and that might cause more I/O than certain hardware can manage.

In principle this is OK, but one could first set the autoscaler mode to `warn` to prevent automatic changes, and then further limit the I/O used by Ceph for rebalancing before setting the number of PGs manually, for example along these lines:
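(A rough sketch, assuming the OSD scheduler honours these options - on newer Ceph releases with the mClock scheduler the relevant knobs differ, so treat this only as a starting point:)

Bash:
# throttle recovery/backfill before changing pg_num
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
# ...change pg_num and wait for the rebalance to finish...
# then remove the overrides again
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active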
 
