Proxmox / Ceph problem after SSD Drive was full

dennis86

Hello,
I'm having a problem with my 5-node Proxmox/Ceph cluster.
The SSDs were full or nearly full and some OSDs went into lockdown. I've added two new SSDs, but the cluster wanted to backfill and now has 6 of 22 OSDs offline and full.
I've stopped the backfilling to keep more SSDs from becoming full/offline.
Can anyone help with this? The new SSDs are empty. I read that we somehow need to move data off the full SSDs/OSDs, but I don't know how.
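For reference, this is roughly how I paused it, using the cluster-wide OSD flags (they match the noout/nobackfill flags shown in the health output below):

Code:
# pause backfill/rebalancing cluster-wide and keep down OSDs from being marked out
ceph osd set nobackfill
ceph osd set noout

# undo once there is enough free space again
ceph osd unset nobackfill
ceph osd unset noout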

Thank you.
ceph health detail gives:

Code:
pg 4.55 is stuck undersized for 95m, current state undersized+degraded+remapped+backfill_wait+peered, last acting [12]
pg 4.56 is stuck undersized for 114m, current state undersized+degraded+peered, last acting [13]
pg 4.5e is stuck undersized for 114m, current state stale+undersized+remapped+peered, last acting [7]
pg 4.5f is stuck undersized for 95m, current state active+undersized+degraded+remapped+backfill_wait, last acting [14,9]
pg 4.64 is stuck undersized for 2m, current state active+undersized+degraded+remapped+backfill_wait, last acting [14,1]
pg 4.67 is stuck undersized for 2m, current state active+undersized+degraded, last acting [16,3]
pg 4.6b is stuck undersized for 2m, current state active+undersized+remapped, last acting [5,0]
pg 4.6c is stuck undersized for 2m, current state active+undersized+remapped, last acting [16,12]
pg 4.6d is stuck undersized for 95m, current state active+undersized+remapped, last acting [13,0]
pg 4.6e is stuck undersized for 95m, current state active+undersized+degraded, last acting [2,0]
pg 4.6f is stuck undersized for 2m, current state active+undersized+degraded+remapped+backfill_wait, last acting [9,16]
pg 4.71 is stuck undersized for 43m, current state active+undersized+remapped, last acting [7,13]
pg 4.72 is stuck undersized for 2m, current state active+undersized+degraded+remapped+backfill_wait, last acting [14,8]
pg 4.75 is stuck undersized for 2m, current state active+undersized+degraded, last acting [0,16]
pg 4.76 is stuck undersized for 2m, current state active+undersized+degraded, last acting [5,9]
pg 4.7e is stuck undersized for 43m, current state active+undersized+remapped, last acting [1,7]
pg 4.7f is stuck undersized for 95m, current state undersized+degraded+peered, last acting [13]
[WRN] POOL_BACKFILLFULL: 6 pool(s) backfillfull

ceph health gives:

Code:
HEALTH_WARN noout,nobackfill flag(s) set; 5 backfillfull osd(s); 6 osds down; Reduced data availability: 49 pgs inactive, 4 pgs down, 1 pg stale; Low space hindering backfill (add storage if this doesn't resolve itself): 3 pgs backfill_toofull; Degraded data redundancy: 297286/1775595 objects degraded (16.743%), 198 pgs degraded, 247 pgs undersized; 6 pool(s) backfillfull; 102 daemons have recently crashed
 

Attachments

  • Bildschirmfoto vom 2022-11-09 13-12-10.png
Can you please post the output of ceph osd df tree into [code][/code] blocks?

Running out of space is the one thing you should avoid at all costs with Ceph; everything else is usually not too bad (besides too many failed OSDs) :-/
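As a quick sketch (generic commands, not specific to your cluster), this is how usage and the fullness thresholds can be checked:

Code:
# overall and per-pool usage
ceph df
# per-OSD usage and distribution
ceph osd df tree
# the nearfull / backfillfull / full thresholds currently in effect
ceph osd dump | grep ratio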
 
ceph osd df tree output:

Code:
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME      
 -1         9.60413         -  1.7 TiB  1.5 TiB  1.5 TiB   11 MiB  3.2 GiB  233 GiB      0     0    -          root default   
 -9         1.74619         -      0 B      0 B      0 B      0 B      0 B      0 B      0     0    -              host cl1nbg
  6    ssd  0.43649   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.6  
  7    ssd  0.43649   1.00000  447 GiB  390 GiB  389 GiB  4.5 MiB  751 MiB   57 GiB  87.18  1.14  101      up          osd.7  
 10    ssd  0.43660   1.00000  447 GiB  372 GiB  371 GiB  3.4 MiB  722 MiB   75 GiB  83.14  1.09   87      up          osd.10 
 11    ssd  0.43660   0.84999      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.11 
-11         1.74619         -  1.7 TiB  1.5 TiB  1.5 TiB   11 MiB  3.2 GiB  233 GiB  86.98  1.14    -              host nbgcl2
  8    ssd  0.43649   0.39999  447 GiB  408 GiB  407 GiB  2.8 MiB  866 MiB   39 GiB  91.21  1.19   94      up          osd.8  
  9    ssd  0.43649   0.39999  447 GiB  418 GiB  417 GiB  1.8 MiB  899 MiB   29 GiB  93.46  1.22   90      up          osd.9  
 12    ssd  0.43660   0.39999  447 GiB  367 GiB  366 GiB  3.9 MiB  794 MiB   81 GiB  81.99  1.07   83      up          osd.12 
 13    ssd  0.43660   0.39999  447 GiB  363 GiB  363 GiB  2.8 MiB  730 MiB   84 GiB  81.25  1.06   82      up          osd.13 
 -3         2.18279         -  447 GiB  3.6 GiB  3.5 GiB      0 B  127 MiB  443 GiB      0     0    -              host nbgcl3
  0    ssd  0.43649   0.39999  447 GiB  412 GiB  411 GiB  4.4 MiB  810 MiB   35 GiB  92.18  1.21   88      up          osd.0  
  1    ssd  0.43649   0.39999  447 GiB  423 GiB  422 GiB  3.1 MiB  813 MiB   24 GiB  94.69  1.24   93      up          osd.1  
 18    ssd  0.43660   0.39999  447 GiB  409 GiB  409 GiB  3.0 MiB  807 MiB   38 GiB  91.56  1.20    0    down          osd.18 
 19    ssd  0.43660   0.39999      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.19 
 20    ssd  0.43660   0.39999  447 GiB  3.6 GiB  3.5 GiB      0 B  127 MiB  443 GiB   0.81  0.01   10      up          osd.20 
 -5         1.74619         -  1.3 TiB  1.1 TiB  1.0 TiB  6.6 MiB  2.2 GiB  266 GiB  80.19  1.05    -              host nbgcl4
  2    ssd  0.43649   0.39999  447 GiB  355 GiB  355 GiB  1.9 MiB  736 MiB   92 GiB  79.48  1.04   78      up          osd.2  
  3    ssd  0.43649   0.39999  447 GiB  367 GiB  367 GiB  2.5 MiB  759 MiB   80 GiB  82.18  1.08   86      up          osd.3  
 14    ssd  0.43660   0.39999  447 GiB  353 GiB  352 GiB  2.2 MiB  721 MiB   94 GiB  78.91  1.03   73      up          osd.14 
 15    ssd  0.43660   0.39999      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.15 
 -7         2.18279         -  2.2 TiB  1.5 TiB  1.5 TiB   12 MiB  3.0 GiB  731 GiB  67.32  0.88    -              host nbgcl5
  4    ssd  0.43649   0.39999  447 GiB  417 GiB  416 GiB  1.3 MiB  789 MiB   30 GiB  93.25  1.22    0    down          osd.4  
  5    ssd  0.43649   0.39999  447 GiB  420 GiB  419 GiB  4.2 MiB  751 MiB   27 GiB  93.90  1.23   95      up          osd.5  
 16    ssd  0.43660   0.43660  447 GiB  305 GiB  305 GiB  3.6 MiB  643 MiB  142 GiB  68.31  0.89   68      up          osd.16 
 17    ssd  0.43660   0.43660  447 GiB  356 GiB  356 GiB  3.1 MiB  776 MiB   91 GiB  79.73  1.04   79      up          osd.17 
 21    ssd  0.43660   0.43660  447 GiB  6.3 GiB  6.2 GiB    2 KiB  119 MiB  441 GiB   1.42  0.02    1      up          osd.21 
                        TOTAL  7.9 TiB  6.0 TiB  6.0 TiB   48 MiB   12 GiB  1.9 TiB  76.37

Thanks
 
Okay, AFAICT nodes 5 and 3 got the two new OSDs? Ideally, all nodes would get one for additional space.
It looks like you were lucky and have at least one copy of each PG still on a working OSD AFAICT from the screenshot.

Right now, all you can do is to make sure that Ceph gets enough space.
- If possible, add more OSDs to all nodes.
- Do you have data that you can delete?
- Try to get the OSDs that are down back up and running.
It is possible that they are in "failed" state now. Check their logs (/var/log/ceph/ceph-osd.<ID>.log) to see why they won't start.
To get them out of failed state, you might need to run systemctl reset-failed ceph-osd@<ID>.service. Then you can try to start them again.
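As an example, for an OSD with ID 6 (just a placeholder, use your actual IDs) that would look roughly like this:

Code:
# check why the OSD did not start
less /var/log/ceph/ceph-osd.6.log
systemctl status ceph-osd@6.service

# clear the failed state and try starting it again
systemctl reset-failed ceph-osd@6.service
systemctl start ceph-osd@6.service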


If the OSDs don't come back up and the logs don't give a clear indication, or if it is a lot of work to fix the reason, you could also consider destroying and recreating them. Since there is at least one working replica available, it would be okayish to do it.
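If you do go that route, the rough sequence on a Proxmox VE node would look something like this (osd.6 and /dev/sdX are placeholders, double-check the OSD ID and device before destroying anything):

Code:
# make sure the OSD is marked out and stopped
ceph osd out 6
systemctl stop ceph-osd@6.service

# destroy the OSD and clean up the disk so it can be reused
pveceph osd destroy 6 --cleanup

# recreate an OSD on the now-empty disk
pveceph osd create /dev/sdX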

There is, according to the screenshot, one PG that is not in "active" state, but still "activating". If the pool is IO blocked because not all PGs are "active", you could also consider reducing the "size" of the pool.
But before you do that, there are a few things to consider. Never set the "min_size" to anything smaller than 2! If the pool is currently using a size=3, you could set it to 2. This will drastically reduce the space needed in the cluster, but you will not have "operational" redundancy. Meaning, if an OSD fails, and you only have 1 replica available, the pool will be IO blocked until Ceph can recover from it.
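A sketch of how that would look, assuming the pool is named vm-pool (substitute your actual pool name) and currently runs with size=3/min_size=2:

Code:
# check the current replication settings
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size

# temporarily reduce to 2 replicas; never set min_size below 2!
ceph osd pool set vm-pool size 2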
 
Hello Aaron,
thank you for your reply.
We added 5 new 1 TB SSDs yesterday and the cluster has been rebuilding.
There should be enough space now.
But I still have 4 inactive/down PGs which are preventing VMs from coming up.

Code:
Reduced data availability: 4 pgs inactive, 4 pgs down

pg 1.43 is down, acting [25,24,23]
pg 1.f9 is down, acting [26,24,22]
pg 4.50 is down, acting [24,22,8]
pg 4.74 is down, acting [20,26,22]

I tried to restart the failed OSDs, but they're not starting. The log output is:
Code:
    -9> 2022-11-10T06:36:24.055+0100 7fe85e767080  4 rocksdb: [version_set.cc:4558] Recovered from manifest file:db/MANIFEST-284778 succeeded,manifest_file_number is 284778, next_file_number is 284780, last_sequence is 26051415834, log_number is 284775,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0

    -8> 2022-11-10T06:36:24.055+0100 7fe85e767080  4 rocksdb: [version_set.cc:4574] Column family [default] (ID 0), log number is 284775

    -7> 2022-11-10T06:36:24.055+0100 7fe85e767080  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1668058584060404, "job": 1, "event": "recovery_started", "log_files": [284779]}
    -6> 2022-11-10T06:36:24.055+0100 7fe85e767080  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #284779 mode 2
    -5> 2022-11-10T06:36:27.431+0100 7fe85e767080  3 rocksdb: [le/block_based/filter_policy.cc:579] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_version>=5.
    -4> 2022-11-10T06:36:27.499+0100 7fe85e767080  1 bluefs _allocate unable to allocate 0x80000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x6fc0000000, block size 0x4000, free 0x735990000, fragmentation 0.586671, allocated 0x0
    -3> 2022-11-10T06:36:27.499+0100 7fe85e767080 -1 bluefs _allocate allocation failed, needed 0x72457
    -2> 2022-11-10T06:36:27.499+0100 7fe85e767080 -1 bluefs _flush_range allocated: 0x110000 offset: 0x101581 length: 0x80ed6
    -1> 2022-11-10T06:36:27.527+0100 7fe85e767080 -1 ./src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fe85e767080 time 2022-11-10T06:36:27.501805+0100
./src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc")

The command
Code:
systemctl reset-failed ceph-osd@<ID>.service
did not work.

Do you have any other ideas?

Output from ceph osd df tree:
Code:
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE   DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME     
 -1         14.15208         -   12 TiB   6.6 TiB  6.6 TiB   60 MiB   18 GiB  4.9 TiB  57.17  1.00    -          root default   
 -9          2.65578         -  1.8 TiB  1000 GiB  998 GiB  8.7 MiB  2.8 GiB  825 GiB  54.80  0.96    -              host cl1nbg
  6    ssd   0.43649         0      0 B       0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.6 
  7    ssd   0.43649   1.00000  447 GiB   322 GiB  322 GiB  989 KiB  690 MiB  125 GiB  72.14  1.26   79      up          osd.7 
 10    ssd   0.43660   1.00000  447 GiB   308 GiB  307 GiB  7.7 MiB  832 MiB  139 GiB  68.86  1.20   70      up          osd.10
 11    ssd   0.43660         0      0 B       0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.11
 22    ssd   0.90959   1.00000  931 GiB   370 GiB  369 GiB   23 KiB  1.3 GiB  561 GiB  39.73  0.69   90      up          osd.22
-11          2.65578         -  2.7 TiB   1.6 TiB  1.6 TiB   15 MiB  4.3 GiB  1.0 TiB  60.50  1.06    -              host nbgcl2
  8    ssd   0.43649   0.39999  447 GiB   271 GiB  270 GiB  1.9 MiB  734 MiB  176 GiB  60.54  1.06   54      up          osd.8 
  9    ssd   0.43649   0.39999  447 GiB   288 GiB  287 GiB  1.9 MiB  957 MiB  159 GiB  64.37  1.13   59      up          osd.9 
 12    ssd   0.43660   0.39999  447 GiB   258 GiB  257 GiB  1.8 MiB  742 MiB  189 GiB  57.65  1.01   55      up          osd.12
 13    ssd   0.43660   0.39999  447 GiB   251 GiB  250 GiB  9.2 MiB  759 MiB  196 GiB  56.08  0.98   53      up          osd.13
 23    ssd   0.90959   1.00000  931 GiB   579 GiB  577 GiB   25 KiB  1.2 GiB  353 GiB  62.12  1.09  133      up          osd.23
 -3          3.09238         -  2.2 TiB   1.3 TiB  1.3 TiB   14 MiB  3.4 GiB  981 GiB  56.83  0.99    -              host nbgcl3
  0    ssd   0.43649   0.39999  447 GiB   278 GiB  277 GiB  6.9 MiB  908 MiB  169 GiB  62.13  1.09   61      up          osd.0 
  1    ssd   0.43649   0.39999  447 GiB   297 GiB  296 GiB  6.7 MiB  905 MiB  150 GiB  66.42  1.16   63      up          osd.1 
 18    ssd   0.43660         0      0 B       0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.18
 19    ssd   0.43660         0      0 B       0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.19
 20    ssd   0.43660   0.39999  447 GiB    62 GiB   62 GiB      0 B  294 MiB  385 GiB  13.88  0.24   17      up          osd.20
 24    ssd   0.90959   1.00000  931 GiB   655 GiB  654 GiB   14 KiB  1.3 GiB  277 GiB  70.30  1.23  146      up          osd.24
 -5          2.65578         -  2.2 TiB   1.4 TiB  1.4 TiB   17 MiB  3.7 GiB  818 GiB  64.00  1.12    -              host nbgcl4
  2    ssd   0.43649   0.39999  447 GiB   304 GiB  303 GiB  4.7 MiB  800 MiB  143 GiB  67.94  1.19   64      up          osd.2 
  3    ssd   0.43649   0.39999  447 GiB   242 GiB  242 GiB  6.0 MiB  797 MiB  205 GiB  54.20  0.95   55      up          osd.3 
 14    ssd   0.43660   0.39999  447 GiB   233 GiB  232 GiB  6.4 MiB  814 MiB  214 GiB  52.17  0.91   47      up          osd.14
 15    ssd   0.43660         0      0 B       0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.15
 25    ssd   0.90959   1.00000  931 GiB   675 GiB  674 GiB   24 KiB  1.4 GiB  256 GiB  72.50  1.27  157      up          osd.25
 -7          3.09238         -  2.7 TiB   1.3 TiB  1.3 TiB  5.7 MiB  3.8 GiB  1.3 TiB  49.99  0.87    -              host nbgcl5
  4    ssd   0.43649         0      0 B       0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.4 
  5    ssd   0.43649   0.39999  447 GiB   271 GiB  270 GiB  1.6 MiB  634 MiB  176 GiB  60.56  1.06   61      up          osd.5 
 16    ssd   0.43660   0.43660  447 GiB   212 GiB  211 GiB  2.8 MiB  688 MiB  235 GiB  47.36  0.83   45      up          osd.16
 17    ssd   0.43660   0.43660  447 GiB   225 GiB  225 GiB  1.2 MiB  701 MiB  222 GiB  50.42  0.88   49      up          osd.17
 21    ssd   0.43660   0.43660  447 GiB    70 GiB   70 GiB    2 KiB  279 MiB  377 GiB  15.74  0.28   17      up          osd.21
 26    ssd   0.90959   1.00000  931 GiB   581 GiB  580 GiB    7 KiB  1.6 GiB  350 GiB  62.41  1.09  132      up          osd.26
                         TOTAL   12 TiB   6.6 TiB  6.6 TiB   60 MiB   18 GiB  4.9 TiB  57.17

Thank you for any help.
 

Attachments

  • Bildschirmfoto vom 2022-11-10 06-31-42.png
  • Bildschirmfoto vom 2022-11-10 06-31-55.png
The numbers in brackets behind the PGs are the OSDs they are stored on. AFAICT, all those OSDs are up.

What does ceph pg dump_stuck show?
What about ceph pg <ID> query?
The IDs are 1.43, 1.f9, 4.50, 4.74.
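For example (the query output can be long, so please also put it into [code][/code] blocks):

Code:
ceph pg dump_stuck

ceph pg 1.43 query
ceph pg 1.f9 query
ceph pg 4.50 query
ceph pg 4.74 query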

https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg is a good guide for what to check in a troubleshooting situation.

The OSD logs look very similar to what I saw in an enterprise support case last week. It seems the OSDs are too full and the internal DB (RocksDB) cannot allocate new space. Last week we destroyed and recreated those OSDs. But before you do that, let's try to get those 4 PGs up and running. There is enough space for now, so those OSDs are not needed right away.
 
