[SOLVED] CEPH: Understanding of utilization

grefabu

Hi,

at one site I took over the administration of a PVE cluster with five nodes.

The Ceph pool is at 86% usage; the pool is a 3/2 pool.
Am I right in thinking that if one node fails because of a hardware problem, I will run into a problem with the utilisation, because the PGs from this node must be rebuilt on the other nodes to restore the quorum?

Besides, I think 86% is already in a warning state anyway, for utilisation reasons?

Bye

Gregor
 
You need to post the output of `ceph status` on one of the good nodes and from one of the bad nodes.
 
Hi,

maybe we misunderstood each other.

At the moment I don't have a problem, even though the utilisation is at 86% of the available space.
I'm worried about a real problem, like a server going down because of a hardware fault or something else.

When one node goes down, 20% of the available space is gone. Then I think I will run into a critical problem with the utilisation?

Bye

Gregor
 
To get the terms right :)

The pool consists of many (millions of) objects. These objects are grouped into placement groups (PGs) to make the accounting less resource intensive. How many replicas (copies) need to exist on different nodes in the cluster, and where they are placed, is decided at the PG level.

So, in your case, if one of the 5 nodes goes down, the replicas that were on that node will be rebuilt on other nodes in the cluster that do not already have a replica of said PG.

This means that you need enough free space in the cluster to hold the rebuilt replicas. Also, the fullest OSD will limit the estimated free space.
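
As a rough way to see why the free space matters: if the data is spread evenly over the hosts and one host fails, its share has to fit on the remaining hosts. A minimal sketch, assuming equally sized hosts, perfectly even distribution, and that all lost replicas are re-created. The function name and the 60% example value are made up for illustration and are not taken from this cluster:

```python
def raw_use_after_host_loss(raw_use_fraction, hosts, failed_hosts=1):
    """Estimate raw utilisation of the surviving hosts after recovery.

    Assumes equally sized hosts, evenly distributed data, and that every
    replica from the failed host(s) is rebuilt on the survivors.
    """
    remaining = hosts - failed_hosts
    if remaining <= 0:
        raise ValueError("at least one host must survive")
    return raw_use_fraction * hosts / remaining


# Hypothetical example: 5 equal hosts, 60% of raw space in use.
print(f"{raw_use_after_host_loss(0.60, 5):.0%}")  # 75%
```

In a real cluster the result will be worse for some OSDs, because the data is never spread perfectly evenly.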

Can you post the output of the following commands within [CODE][/CODE] tags? Then we can take a quick look and see how things are :)

Code:
ceph -s
ceph osd df tree
 
Hi,

thank you, here is the output.
And yes, I only noticed today that one OSD is down. That's the next problem...
At the moment I don't know whether the hardware is still under warranty...

ceph -s
Code:
  cluster:
    id:     65b65640-8d90-4f0a-811d-d77b7cc6c771
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum prod-pve01,prod-pve02,prod-pve03,prod-pve04,prod-pve05 (age 5w)
    mgr: prod-pve01(active, since 2M), standbys: prod-pve04, prod-pve02, prod-pve05, prod-pve03
    osd: 30 osds: 29 up (since 5w), 29 in (since 5w)
 
  data:
    pools:   2 pools, 1025 pgs
    objects: 1.63M objects, 6.1 TiB
    usage:   18 TiB used, 7.7 TiB / 25 TiB avail
    pgs:     1024 active+clean
             1    active+clean+scrubbing+deep
 
  io:
    client:   435 KiB/s rd, 6.4 MiB/s wr, 70 op/s rd, 251 op/s wr

ceph osd df tree
Code:
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME          
 -1         26.19873         -   25 TiB   18 TiB   18 TiB  760 MiB   44 GiB  7.7 TiB  69.45  1.00    -          root default       
 -3          5.23975         -  5.2 TiB  3.5 TiB  3.5 TiB   19 MiB  8.8 GiB  1.7 TiB  66.64  0.96    -              host prod-pve01
  0    ssd   0.87329   1.00000  894 GiB  581 GiB  580 GiB  3.1 MiB  1.3 GiB  313 GiB  64.95  0.94   99      up          osd.0      
  1    ssd   0.87329   1.00000  894 GiB  630 GiB  629 GiB  3.4 MiB  1.6 GiB  264 GiB  70.50  1.02  107      up          osd.1      
  2    ssd   0.87329   1.00000  894 GiB  570 GiB  569 GiB  2.9 MiB  1.3 GiB  324 GiB  63.75  0.92   97      up          osd.2      
  3    ssd   0.87329   1.00000  894 GiB  538 GiB  537 GiB  2.9 MiB  1.5 GiB  356 GiB  60.20  0.87   92      up          osd.3      
  4    ssd   0.87329   1.00000  894 GiB  630 GiB  628 GiB  3.4 MiB  1.5 GiB  265 GiB  70.41  1.01  107      up          osd.4      
  5    ssd   0.87329   1.00000  894 GiB  626 GiB  624 GiB  3.3 MiB  1.7 GiB  268 GiB  70.02  1.01  107      up          osd.5      
 -5          5.23975         -  4.4 TiB  3.3 TiB  3.3 TiB  239 MiB  8.1 GiB  1.1 TiB  74.86  1.08    -              host prod-pve02
  6    ssd   0.87329   1.00000  894 GiB  757 GiB  755 GiB  4.0 MiB  1.9 GiB  138 GiB  84.62  1.22  129      up          osd.6      
  7    ssd   0.87329   1.00000  894 GiB  726 GiB  724 GiB  3.9 MiB  1.9 GiB  168 GiB  81.18  1.17  124      up          osd.7      
  8    ssd   0.87329         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.8      
  9    ssd   0.87329   1.00000  894 GiB  656 GiB  655 GiB  3.6 MiB  1.5 GiB  238 GiB  73.37  1.06  112      up          osd.9      
 10    ssd   0.87329   1.00000  894 GiB  733 GiB  731 GiB  225 MiB  1.6 GiB  161 GiB  81.95  1.18  126      up          osd.10     
 11    ssd   0.87329   1.00000  894 GiB  475 GiB  474 GiB  2.5 MiB  1.1 GiB  419 GiB  53.16  0.77   81      up          osd.11     
 -7          5.23975         -  5.2 TiB  3.8 TiB  3.8 TiB   21 MiB  9.8 GiB  1.4 TiB  72.57  1.04    -              host prod-pve03
 12    ssd   0.87329   1.00000  894 GiB  650 GiB  649 GiB  3.4 MiB  1.5 GiB  244 GiB  72.73  1.05  111      up          osd.12     
 13    ssd   0.87329   1.00000  894 GiB  641 GiB  639 GiB  3.4 MiB  1.6 GiB  253 GiB  71.66  1.03  109      up          osd.13     
 14    ssd   0.87329   1.00000  894 GiB  627 GiB  625 GiB  3.4 MiB  1.7 GiB  267 GiB  70.11  1.01  107      up          osd.14     
 15    ssd   0.87329   1.00000  894 GiB  569 GiB  568 GiB  3.0 MiB  1.5 GiB  325 GiB  63.64  0.92   97      up          osd.15     
 16    ssd   0.87329   1.00000  894 GiB  685 GiB  683 GiB  3.6 MiB  1.8 GiB  210 GiB  76.56  1.10  117      up          osd.16     
 17    ssd   0.87329   1.00000  894 GiB  722 GiB  720 GiB  3.9 MiB  1.8 GiB  173 GiB  80.71  1.16  123      up          osd.17     
 -9          5.23975         -  5.2 TiB  3.4 TiB  3.4 TiB  240 MiB  8.7 GiB  1.8 TiB  65.32  0.94    -              host prod-pve04
 18    ssd   0.87329   1.00000  894 GiB  538 GiB  537 GiB  224 MiB  1.3 GiB  356 GiB  60.20  0.87   93      up          osd.18     
 19    ssd   0.87329   1.00000  894 GiB  579 GiB  577 GiB  3.1 MiB  1.5 GiB  316 GiB  64.69  0.93   99      up          osd.19     
 20    ssd   0.87329   1.00000  894 GiB  586 GiB  585 GiB  3.1 MiB  1.4 GiB  308 GiB  65.58  0.94  100      up          osd.20     
 21    ssd   0.87329   1.00000  894 GiB  575 GiB  574 GiB  3.1 MiB  1.4 GiB  319 GiB  64.35  0.93   98      up          osd.21     
 22    ssd   0.87329   1.00000  894 GiB  604 GiB  602 GiB  3.3 MiB  1.4 GiB  290 GiB  67.53  0.97  103      up          osd.22     
 23    ssd   0.87329   1.00000  894 GiB  622 GiB  621 GiB  3.3 MiB  1.6 GiB  272 GiB  69.57  1.00  106      up          osd.23     
-11          5.23975         -  5.2 TiB  3.6 TiB  3.6 TiB  241 MiB  8.8 GiB  1.6 TiB  68.78  0.99    -              host prod-pve05
 24    ssd   0.87329   1.00000  894 GiB  623 GiB  621 GiB  3.3 MiB  1.6 GiB  271 GiB  69.64  1.00  106      up          osd.24     
 25    ssd   0.87329   1.00000  894 GiB  661 GiB  659 GiB  3.5 MiB  1.4 GiB  233 GiB  73.90  1.06  113      up          osd.25     
 26    ssd   0.87329   1.00000  894 GiB  551 GiB  550 GiB  3.0 MiB  1.3 GiB  343 GiB  61.62  0.89   94      up          osd.26     
 27    ssd   0.87329   1.00000  894 GiB  573 GiB  571 GiB  224 MiB  1.4 GiB  322 GiB  64.03  0.92   99      up          osd.27     
 28    ssd   0.87329   1.00000  894 GiB  709 GiB  707 GiB  3.8 MiB  1.6 GiB  185 GiB  79.28  1.14  121      up          osd.28     
 29    ssd   0.87329   1.00000  894 GiB  574 GiB  572 GiB  3.1 MiB  1.5 GiB  320 GiB  64.18  0.92   98      up          osd.29     
                         TOTAL   25 TiB   18 TiB   18 TiB  760 MiB   44 GiB  7.7 TiB  69.45                                        
MIN/MAX VAR: 0.77/1.22  STDDEV: 7.33
 
I think (though I am new to this, so if I get this wrong, I apologize):

You have (7 TB divided by 5) free space (assuming any new file/object you place in one of the two pools is replicated to all 5 nodes).

(I am assuming these are replicated pools, not erasure-coded ones, as I don't know how to tell the difference in those two views, lol)
 
"You have (7 TB divided by 5) free space (assuming any new file/object you place in one of the two pools is replicated to all 5 nodes)."
It is not that simple. The pool(s) use 3/2. So 3 replicas will be spread over the 5 nodes.
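
To illustrate the 3/2 setting: every object in a size=3 pool is stored three times, so raw usage is roughly three times the stored data. A small sketch (the helper name is made up for illustration, and it ignores metadata overhead), checked against the `ceph -s` output above, where 6.1 TiB of objects show up as about 18 TiB used:

```python
def raw_usage(stored_tib, pool_size=3):
    """Raw space consumed by `stored_tib` of data in a replicated pool."""
    return stored_tib * pool_size


stored = 6.1  # TiB of objects, from `ceph -s`
print(f"{raw_usage(stored):.1f} TiB raw")  # close to the reported 18 TiB used
```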

One reason why the pool shows as quite full is that some OSDs are used more than others, and a very full OSD limits the estimated free space.

Which Ceph version is this?
What is the output of ceph balancer status?

You have 894 GiB * 6 OSDs * 5 nodes = 26820 GiB raw capacity. Dividing by the 3 replicas and taking 90% of that (Ceph safety limits) gives about 8046 GiB of usable space, if the cluster is okay and the data is perfectly balanced across the OSDs. If we subtract the one node that might fail from the calculation, we are at 6436.8 GiB.
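
The same arithmetic as a short sketch, using only the numbers from this post. The constant and function names are made up for illustration, and the result assumes a perfectly balanced cluster:

```python
OSD_SIZE_GIB = 894      # size of one OSD, from `ceph osd df tree`
OSDS_PER_NODE = 6
NODES = 5
REPLICAS = 3            # pool size
FILL_LIMIT = 0.90       # safety margin used in the calculation above


def usable_gib(nodes):
    """Usable capacity for a perfectly balanced cluster of `nodes` hosts."""
    raw = OSD_SIZE_GIB * OSDS_PER_NODE * nodes
    return raw / REPLICAS * FILL_LIMIT


print(OSD_SIZE_GIB * OSDS_PER_NODE * NODES)  # raw capacity in GiB: 26820
print(round(usable_gib(NODES), 1))           # all five nodes healthy: 8046.0
print(round(usable_gib(NODES - 1), 1))       # one node lost: 6436.8
```

For comparison, the `ceph -s` output reports 6.1 TiB (about 6246 GiB) of stored objects, which is already close to the one-node-lost figure.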
 
"It is not that simple."
I was pretty certain it wasn't (due to the two pools) and was hoping it would poke someone to correct me :)

Also, isn't the free space the space for, say, a file and all its replicas? So if one writes a 1 MiB file, that will use 1 MiB x the number of replicas out of that 6436.8 GiB? I was unclear whether they were asking about 'effective usable space' or not...
 
The raw capacity of the cluster is what is used to store the replicas. If all your pools use size=3, you can divide the raw capacity by 3 and take 85 to 90% of that; that is roughly the maximum you can effectively use. If you have pools with a size of 4, you have to divide by 4, of course. The percentage is below 100 because of the limits Ceph places on the OSDs: the nearfull, backfillfull and full ratios.
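
The three limits can be sketched as a simple classification. The ratios below are the usual Ceph defaults (nearfull 0.85, backfillfull 0.90, full 0.95); treat them as an assumption, since a cluster can be configured differently (`ceph osd dump | grep ratio` shows the values in effect). The function name is made up for illustration:

```python
# Assumed default ratios; the values on a given cluster may differ.
NEARFULL = 0.85
BACKFILLFULL = 0.90
FULL = 0.95


def osd_state(use_percent):
    """Classify one OSD by its %USE value from `ceph osd df`."""
    used = use_percent / 100
    if used >= FULL:
        return "full: writes are blocked"
    if used >= BACKFILLFULL:
        return "backfillfull: no backfill onto this OSD"
    if used >= NEARFULL:
        return "nearfull: health warning"
    return "ok"


print(osd_state(84.62))  # osd.6, the fullest OSD in the output above: ok
```

With these assumed defaults, the fullest OSD sits just below the nearfull limit, which is consistent with the HEALTH_OK in the posted `ceph -s` output.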

And you will most likely want to be able to endure the loss of at least one node, so that needs to be subtracted from the raw capacity calculation.

In reality, other factors also play a role. In most situations it is an uneven balance of the data across the OSDs and nodes, because the fullest OSD limits the effective free space. With good pg_nums on the pools (the autoscaler with target_size or target_ratio can help), the PGs should be very evenly sized. The Ceph balancer then helps to even out any imbalance in OSD usage caused by PGs that are considerably larger than others, by moving them to OSDs with lower usage. So in the end, if things work as expected, all OSDs should be within a +- 5% range regarding space used.
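
As an illustration of the imbalance, the %USE values from the `ceph osd df tree` output above can be compared with the cluster average. This sketch uses only a handful of the OSDs and assumes that "+- 5%" means five percentage points around the average; both the selection and that interpretation are assumptions, and the function name is made up:

```python
AVERAGE_USE = 69.45  # %USE of "root default" in the posted output

# A few %USE values copied from the posted output.
osd_use = {
    "osd.6": 84.62,
    "osd.7": 81.18,
    "osd.11": 53.16,
    "osd.23": 69.57,
}


def outliers(use_by_osd, average, tolerance=5.0):
    """Return OSDs whose %USE deviates more than `tolerance` points."""
    return {
        name: round(use - average, 2)
        for name, use in use_by_osd.items()
        if abs(use - average) > tolerance
    }


for name, delta in outliers(osd_use, AVERAGE_USE).items():
    print(f"{name}: {delta:+.2f} points from average")
```

Of the four sample OSDs, only osd.23 stays within the tolerance; osd.6, osd.7 and osd.11 are well outside it.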

I hope this explains it well enough :)
 
Yes, that's great and matches my rough mental model
(and all my pools are size=3 and I have 3 nodes, which is a nice coincidence).
 
Hi,

sorry for the late answer.
"Which Ceph version is this?
What is the output of ceph balancer status?"

It is ceph: 17.2.6-pve1+3

ceph balancer status
Code:
{
    "active": true,
    "last_optimize_duration": "0:00:00.004928",
    "last_optimize_started": "Wed Oct 18 15:08:58 2023",
    "mode": "upmap",
    "no_optimization_needed": false,
    "optimize_result": "Optimization plan created successfully",
    "plans": []
}

I think this question is solved now.
 
