Ceph OSD_BACKFILLFULL and POOL_BACKFILLFULL situation

René Pfeiffer

Hello!
We use a Ceph cluster running on four nodes with 32 OSD daemons. The cluster has 16 slow (HDD) and 16 fast (SSD) disks. The health information indicates the OSD_BACKFILLFULL and POOL_BACKFILLFULL status flags. "ceph health detail" shows:

Code:
HEALTH_WARN 1 failed cephadm daemon(s); 1 backfillfull osd(s); 2 pool(s) backfillfull; 4 slow ops, oldest one blocked for 189006 sec, mon.hat-ceph-01 has slow ops
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon osd.32 on hat-ceph-01 is in error state
[WRN] OSD_BACKFILLFULL: 1 backfillfull osd(s)
    osd.8 is backfill full
[WRN] POOL_BACKFILLFULL: 2 pool(s) backfillfull
    pool 'device_health_metrics' is backfillfull
    pool 'fastpool' is backfillfull
[WRN] SLOW_OPS: 4 slow ops, oldest one blocked for 189006 sec, mon.hat-ceph-01 has slow ops

"ceph osd status" shows:

Code:
ID  HOST          USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE                  
 1  hat-ceph-01   590G   303G    122     2743k      0     24.7k  exists,up              
 2  hat-ceph-01   514G   379G      0        0       0        0   exists,up              
 3  hat-ceph-01   442G   451G      8     1026k      4      320k  exists,up              
 4  hat-ceph-01   238G  10.6T      1     19.1k      0        0   exists,up              
 5  hat-ceph-01   238G  10.6T      1     13.5k      0        0   exists,up              
 6  hat-ceph-01   203G  10.7T      0     5733       0        0   exists,up              
 7  hat-ceph-01   170G  10.7T      0     4914       0        0   exists,up              
 8  hat-ceph-02   811G  82.3G    186     11.0M      1     9829   backfillfull,exists,up 
 9  hat-ceph-02   219G   674G      0        0       0        0   exists,up              
10  hat-ceph-02   442G   452G     55     2961k      0     51.2k  exists,up              
11  hat-ceph-02   519G   375G    103     4789k      5      284k  exists,up              
12  hat-ceph-03   440G   453G    132     4302k      2     52.7k  exists,up              
13  hat-ceph-02   169G  10.7T      0     4914       0        0   exists,up              
14  hat-ceph-03   442G   452G     63     2762k      2     68.8k  exists,up              
15  hat-ceph-02   205G  10.7T      1     15.9k      0        0   exists,up              
16  hat-ceph-03   149G   745G      0        0       0        0   exists,up              
17  hat-ceph-02   204G  10.7T      1     10.3k      0        0   exists,up              
18  hat-ceph-03   738G   156G      0        0       2     90.4k  exists,up              
19  hat-ceph-02   273G  10.6T      0     1638       0        0   exists,up              
20  hat-ceph-03   272G  10.6T      1     9829       0        0   exists,up              
21  hat-ceph-03   171G  10.7T      0     4095       0        0   exists,up              
22  hat-ceph-03   136G  10.7T      0        0       0        0   exists,up              
23  hat-ceph-03   271G  10.6T      0     5836       0        0   exists,up              
24  hat-ceph-04   148G   745G     46     1171k      0        0   exists,up              
25  hat-ceph-04   590G   304G     97     7709k      1     69.6k  exists,up              
26  hat-ceph-04   222G   671G      1     2474k      0     15.1k  exists,up              
27  hat-ceph-04   442G   451G      0        0       0     6552   exists,up              
28  hat-ceph-04  68.7G  10.8T      0        0       0        0   exists,up              
29  hat-ceph-04   271G  10.6T      1     8191       0        0   exists,up              
30  hat-ceph-04   170G  10.7T      0     1945       0        0   exists,up              
31  hat-ceph-04   204G  10.7T      0     1638       0        0   exists,up              
33  hat-ceph-01   368G   525G      0        0       0     28.0k  exists,up

"ceph df" shows:

Code:
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    175 TiB  171 TiB  3.2 TiB   3.2 TiB       1.83
ssd     14 TiB  7.1 TiB  6.9 TiB   6.9 TiB      49.51
TOTAL  189 TiB  178 TiB   10 TiB    10 TiB       5.36
 
--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1   91 MiB       32  274 MiB      0    2.6 TiB
fastpool                2   32  2.3 TiB  604.72k  6.9 TiB  92.13    201 GiB
slowpool                3   32  1.1 TiB  298.55k  3.2 TiB   1.93     54 TiB

How can the "backfillfull" status be cleared?

Best,
René.
 
Can you please post the outputs in [code][/code] blocks? Otherwise it is almost impossible to read them.
 
Code:
ID  HOST          USED  AVAIL
[...]
 8  hat-ceph-02   811G  82.3G
For some reason, that single OSD is almost full, which is what triggers the backfillfull warning, while the others are rather empty.
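
As a side note, the thresholds that trigger the nearfull/backfillfull/full warnings can be checked in the OSD map, for example with:
Code:
ceph osd dump | grep ratio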

Can you post the output of ceph osd df tree?
Also pveceph pool ls --noborder?

Which version of Ceph is this?
 
"ceph osd df tree" shows:
Code:
ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP      META      AVAIL    %USE   VAR    PGS  STATUS  TYPE NAME           
 -1         188.59813         -  189 TiB   10 TiB   10 TiB   343 MiB    38 GiB  178 TiB   5.36   1.00    -          root default        
 -3          47.14952         -   47 TiB  2.7 TiB  2.7 TiB   113 MiB   9.7 GiB   44 TiB   5.74   1.07    -              host hat-ceph-01
  4    hdd   10.91408   1.00000   11 TiB  239 GiB  238 GiB   502 KiB  1024 MiB   11 TiB   2.14   0.40    7      up          osd.4       
  5    hdd   10.91409   1.00000   11 TiB  238 GiB  237 GiB  1014 KiB  1023 MiB   11 TiB   2.13   0.40    7      up          osd.5       
  6    hdd   10.91409   1.00000   11 TiB  204 GiB  203 GiB   1.5 MiB  1023 MiB   11 TiB   1.82   0.34    6      up          osd.6       
  7    hdd   10.91409   1.00000   11 TiB  171 GiB  170 GiB    92 MiB   932 MiB   11 TiB   1.53   0.28    6      up          osd.7       
  1    ssd    0.87329   1.00000  894 GiB  590 GiB  589 GiB   121 KiB   1.6 GiB  304 GiB  66.02  12.31    8      up          osd.1       
  2    ssd    0.87329   1.00000  894 GiB  515 GiB  513 GiB   9.0 MiB   1.5 GiB  379 GiB  57.59  10.74    7      up          osd.2       
  3    ssd    0.87329   1.00000  894 GiB  443 GiB  441 GiB   6.6 MiB   1.4 GiB  452 GiB  49.50   9.23    6      up          osd.3       
 33    ssd    0.87329   1.00000  894 GiB  369 GiB  368 GiB   2.7 MiB   1.2 GiB  525 GiB  41.25   7.69    5      up          osd.33      
 -7          47.14954         -   47 TiB  2.8 TiB  2.8 TiB   107 MiB    10 GiB   44 TiB   5.89   1.10    -              host hat-ceph-02
 13    hdd   10.91409   1.00000   11 TiB  169 GiB  168 GiB   1.6 MiB  1022 MiB   11 TiB   1.51   0.28    5      up          osd.13      
 15    hdd   10.91409   1.00000   11 TiB  206 GiB  205 GiB   759 KiB   1.0 GiB   11 TiB   1.84   0.34    6      up          osd.15      
 17    hdd   10.91409   1.00000   11 TiB  204 GiB  203 GiB   843 KiB  1023 MiB   11 TiB   1.83   0.34    6      up          osd.17      
 19    hdd   10.91409   1.00000   11 TiB  274 GiB  273 GiB    92 MiB   1.0 GiB   11 TiB   2.45   0.46    9      up          osd.19      
  8    ssd    0.87329   1.00000  894 GiB  812 GiB  810 GiB   3.7 MiB   2.0 GiB   83 GiB  90.75  16.93   11      up          osd.8       
  9    ssd    0.87329   1.00000  894 GiB  220 GiB  219 GiB   2.0 MiB   1.1 GiB  675 GiB  24.57   4.58    3      up          osd.9       
 10    ssd    0.87329   1.00000  894 GiB  442 GiB  441 GiB   3.1 MiB   1.4 GiB  452 GiB  49.44   9.22    6      up          osd.10      
 11    ssd    0.87329   1.00000  894 GiB  519 GiB  518 GiB   3.6 MiB   1.4 GiB  375 GiB  58.04  10.83    7      up          osd.11      
-10          47.14954         -   47 TiB  2.6 TiB  2.6 TiB   106 MiB   9.4 GiB   45 TiB   5.43   1.01    -              host hat-ceph-03
 20    hdd   10.91409   1.00000   11 TiB  273 GiB  272 GiB    92 MiB   932 MiB   11 TiB   2.44   0.45    9      up          osd.20      
 21    hdd   10.91409   1.00000   11 TiB  171 GiB  170 GiB   1.3 MiB  1023 MiB   11 TiB   1.53   0.29    5      up          osd.21      
 22    hdd   10.91409   1.00000   11 TiB  137 GiB  136 GiB   1.5 MiB  1022 MiB   11 TiB   1.23   0.23    4      up          osd.22      
 23    hdd   10.91409   1.00000   11 TiB  272 GiB  271 GiB   1.2 MiB  1023 MiB   11 TiB   2.43   0.45    8      up          osd.23      
 12    ssd    0.87329   1.00000  894 GiB  440 GiB  439 GiB   4.1 MiB   1.3 GiB  454 GiB  49.23   9.18    6      up          osd.12      
 14    ssd    0.87329   1.00000  894 GiB  442 GiB  441 GiB   611 KiB   1.4 GiB  452 GiB  49.43   9.22    6      up          osd.14      
 16    ssd    0.87329   1.00000  894 GiB  149 GiB  148 GiB   3.2 MiB  1021 MiB  745 GiB  16.66   3.11    2      up          osd.16      
 18    ssd    0.87329   1.00000  894 GiB  738 GiB  736 GiB   2.4 MiB   1.8 GiB  157 GiB  82.47  15.38   10      up          osd.18      
-13          47.14954         -   47 TiB  2.1 TiB  2.1 TiB    16 MiB   9.4 GiB   45 TiB   4.39   0.82    -              host hat-ceph-04
 28    hdd   10.91409   1.00000   11 TiB   69 GiB   68 GiB   958 KiB  1023 MiB   11 TiB   0.62   0.11    2      up          osd.28      
 29    hdd   10.91409   1.00000   11 TiB  271 GiB  270 GiB   622 KiB  1023 MiB   11 TiB   2.43   0.45    8      up          osd.29      
 30    hdd   10.91409   1.00000   11 TiB  170 GiB  169 GiB   1.8 MiB  1022 MiB   11 TiB   1.52   0.28    5      up          osd.30      
 31    hdd   10.91409   1.00000   11 TiB  205 GiB  204 GiB   1.1 MiB  1023 MiB   11 TiB   1.83   0.34    6      up          osd.31      
 24    ssd    0.87329   1.00000  894 GiB  149 GiB  148 GiB   4.0 MiB  1020 MiB  746 GiB  16.63   3.10    2      up          osd.24      
 25    ssd    0.87329   1.00000  894 GiB  590 GiB  588 GiB   3.6 MiB   1.6 GiB  304 GiB  65.98  12.31    8      up          osd.25      
 26    ssd    0.87329   1.00000  894 GiB  223 GiB  222 GiB   2.4 MiB   1.1 GiB  671 GiB  24.92   4.65    3      up          osd.26      
 27    ssd    0.87329   1.00000  894 GiB  443 GiB  441 GiB   1.4 MiB   1.6 GiB  452 GiB  49.50   9.23    6      up          osd.27      
                          TOTAL  189 TiB   10 TiB   10 TiB   343 MiB    38 GiB  178 TiB   5.36                                          
MIN/MAX VAR: 0.11/16.93  STDDEV: 34.59
The pveceph tool is not available, because the Ceph cluster is on separate servers. It runs this version:
Code:
ceph version 15.2.9 (357616cbf726abb779ca75a551e8d02568e15b17) octopus (stable)
 
Here it is:
Code:
POOL                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  
device_health_metrics  93057k                3.0        188.5T  0.0000                                  1.0       1              on         
fastpool                2353G                3.0        188.5T  0.0366                                  1.0      32              on         
slowpool                1078G                3.0        188.5T  0.0168                                  1.0      32              on
 
Okay, there are a few things that I see.

The first one is that you have OSDs of two different device classes, but both pools (slow & fast) show the same capacity. This means that you most likely did not create CRUSH rules that specify which device class to use, and/or did not configure the pools to use them.

The Ceph documentation explains this quite well (maybe also take a look at the chapter before the linked one). The TL;DR is that you need to create separate rules for SSD and HDD OSDs.

For example:
Code:
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd crush rule create-replicated replicated_hdd default host hdd

You can then assign the rule in the Proxmox VE GUI (in the advanced options when editing the pool) or with
Code:
ceph osd pool set <pool-name> crush_rule <rule-name>
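
Applied to the pools in this thread, that could look like this (assuming the rule names from the example above):
Code:
ceph osd pool set fastpool crush_rule replicated_ssd
ceph osd pool set slowpool crush_rule replicated_hdd
ceph osd pool set device_health_metrics crush_rule replicated_ssd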

The other thing is that the pools are really empty, and the autoscaler does not have any hint of how much data each pool is expected to hold. Therefore, it will only use the current fill size of the pool to determine the number of PGs.

This is, I suspect, why that one OSD is almost full while the others are very empty.

In the ceph osd df tree output, you have the column "PGS". The rule of thumb is that each OSD should have around 100 PGs as the best compromise between how many PGs need to be tracked, how fast recovery can be, and how well data can be spread across the OSDs.

But the autoscaler needs to know where the journey is going. Therefore, after you assign the device class rules to the pools, set the target_ratio for the slow and the fast pool. It can be any value, as it is just a weight, and since each pool will use its device class exclusively, its effective ratio will be 100%.
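
On the CLI, that should be possible via the target_size_ratio pool property, for example:
Code:
ceph osd pool set fastpool target_size_ratio 1
ceph osd pool set slowpool target_size_ratio 1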

Also assign the "device_health_metrics" pool to one of the new rules. You don't need to set a target ratio for it, but if it keeps using the default "replicated_rule", the autoscaler will have a hard time estimating how many PGs the other pools should get, because this pool then overlaps both device classes.

According to the PG calculator, each pool should then be getting 512 PGs (16 OSDs per device class).
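(Rule of thumb applied here: 16 OSDs × 100 PGs per OSD / 3 replicas ≈ 533, rounded to the nearest power of two gives 512.)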

And since 512 is 16 times the current pg_num of 32, the autoscaler will change that: it only becomes active, instead of just warning, if the ideal pg_num is off from the current one by a factor of 3 or more.


In the end, with the higher pg_num, the data that currently seems to sit in one or very few very large PGs will be split up and spread better across the OSDs. The result should be that all OSDs come down to a similar usage.
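
Once the rules and target ratios are set, you can watch what the autoscaler intends to do, for example with:
Code:
ceph osd pool autoscale-status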
 
I checked the CRUSH map. Apparently there are entries in it, and they are matched to the device class. Just to be sure, the device classes hdd and ssd are just labels, right? The speed weight must be configured by the CRUSH rules. Is this correct?
 
Just to be sure, the device classes hdd and ssd are just labels, right? The speed weight must be configured by the CRUSH rules. Is this correct?
Yes. When you create an OSD and don’t provide a custom class (you can just enter it in the Proxmox VE GUI for example), Ceph will try to determine the type automatically.

The rules are what decide how and where data is stored, mainly regarding redundancy (on the host level by default). When you create them as explained earlier, you can also define which device class a rule should use.
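
To double-check which device classes exist and which OSDs are assigned to them, something like this should work:
Code:
ceph osd crush class ls
ceph osd crush class ls-osd ssd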
 
I solved the problem by setting a replication policy for the hdd and ssd classes. The balancer works now (in upmap mode), and the placement groups are on autoscale and increasing (I am not sure whether autoscaling was already enabled or was activated by the balancer). The backfillfull status is gone, too. Looks good.
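
For reference, a minimal sketch of how the balancer can be switched to upmap mode and checked (not taken from my actual session):
Code:
ceph balancer mode upmap
ceph balancer on
ceph balancer status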
 
