PVE CEPH issues (full and recovery)

VincentdeWit

Jul 27, 2023
Hi all,

For the past couple of days we have been experiencing an outage of our Ceph cluster.
We are running a 3/2 config (size 3, min_size 2), with 3 nodes and 3 OSDs per node (all the same NVMe in both type and size).
At first our VMs became unreachable, probably because the disks were full, which blocked IO.
After that we restarted the machines, noticed that 3 OSDs were down, and tried to restart them.

But they failed to start, with BlueStore throwing an ENOSPC (no space left) error:

Code:
-1038> 2023-07-26T14:50:58.886+0200 7fbc8af6e540 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
    -3> 2023-07-26T14:51:23.565+0200 7fbc8af6e540 -1 bluefs _allocate allocation failed, needed 0x27b6
    -2> 2023-07-26T14:51:23.565+0200 7fbc8af6e540 -1 bluefs _flush_range_F allocated: 0x0 offset: 0x0 length: 0x27b6
    -1> 2023-07-26T14:51:23.577+0200 7fbc8af6e540 -1 ./src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range_F(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fbc8af6e540 time 2023-07-26T14:51:23.570570+0200
./src/os/bluestore/BlueFS.cc: 3137: ceph_abort_msg("bluefs enospc")

We removed those OSDs and added extra storage, but now the cluster is stuck and we cannot get it up and running.
We changed some of the full ratios to try to regain access to the disks, but that did not help.
Letting Ceph reweight by utilization and changing the ratios to shift data away from the full disks did clear some of the warnings, but not all of them.

What is the best way forward? See the Ceph health detail below; I removed some of it because of the character limit.

Code:
HEALTH_ERR noout flag(s) set; 1 backfillfull osd(s); 1 full osd(s); 2 nearfull osd(s); Reduced data availability: 37 pgs inactive; Low space hindering backfill (add storage if this doesn't resolve itself): 36 pgs backfill_toofull; Degraded data redundancy: 196392/1449408 objects degraded (13.550%), 79 pgs degraded, 79 pgs undersized; 23 pgs not deep-scrubbed in time; 2 pool(s) full; 88 daemons have recently crashed
[WRN] OSDMAP_FLAGS: noout flag(s) set
[WRN] OSD_BACKFILLFULL: 1 backfillfull osd(s)
    osd.3 is backfill full
[ERR] OSD_FULL: 1 full osd(s)
    osd.5 is full
[WRN] OSD_NEARFULL: 2 nearfull osd(s)
    osd.4 is near full
    osd.8 is near full
[WRN] PG_AVAILABILITY: Reduced data availability: 37 pgs inactive
    pg 2.0 is stuck inactive for 2d, current state undersized+degraded+remapped+backfilling+peered, last acting [0]
    pg 2.1 is stuck inactive for 2d, current state undersized+degraded+remapped+backfilling+peered, last acting [2]
    pg 2.6 is stuck inactive for 2d, current state undersized+degraded+remapped+backfill_toofull+peered, last acting [0]
    pg 2.a is stuck inactive for 2d, current state undersized+degraded+remapped+backfilling+peered, last acting [1]
    pg 2.d is stuck inactive for 2d, current state undersized+degraded+remapped+backfill_toofull+peered, last acting [2]
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 36 pgs backfill_toofull
    pg 2.5 is active+undersized+degraded+remapped+backfill_toofull, acting [8,0]
    pg 2.6 is undersized+degraded+remapped+backfill_toofull+peered, acting [0]
    pg 2.7 is active+undersized+degraded+remapped+backfill_toofull, acting [2,7]
    pg 2.8 is active+undersized+degraded+remapped+backfill_toofull, acting [0,7]
    pg 2.d is undersized+degraded+remapped+backfill_toofull+peered, acting [2]
    pg 2.f is active+undersized+degraded+remapped+backfill_toofull, acting [8,0]
[WRN] PG_DEGRADED: Degraded data redundancy: 196392/1449408 objects degraded (13.550%), 79 pgs degraded, 79 pgs undersized
    pg 2.0 is stuck undersized for 10h, current state undersized+degraded+remapped+backfilling+peered, last acting [0]
    pg 2.1 is stuck undersized for 10h, current state undersized+degraded+remapped+backfilling+peered, last acting [2]
    pg 2.3 is stuck undersized for 67m, current state active+undersized+degraded+remapped+backfilling, last acting [2,8]
    pg 2.5 is stuck undersized for 10h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [8,0]
    pg 2.6 is stuck undersized for 10h, current state undersized+degraded+remapped+backfill_toofull+peered, last acting [0]
    pg 2.7 is stuck undersized for 10h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [2,7]
    pg 2.8 is stuck undersized for 10h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [0,7]
    pg 2.a is stuck undersized for 10h, current state undersized+degraded+remapped+backfilling+peered, last acting [1]
    pg 2.c is stuck undersized for 67m, current state active+undersized+degraded+remapped+backfilling, last acting [0,7]
    pg 2.d is stuck undersized for 10h, current state undersized+degraded+remapped+backfill_toofull+peered, last acting [2]
    pg 2.e is stuck undersized for 67m, current state active+undersized+degraded+remapped+backfilling, last acting [2,8]
[WRN] PG_NOT_DEEP_SCRUBBED: 23 pgs not deep-scrubbed in time
    pg 2.7e not deep-scrubbed since 2023-07-12T04:39:51.750223+0200
    pg 2.35 not deep-scrubbed since 2023-07-12T16:25:19.553858+0200
    pg 2.32 not deep-scrubbed since 2023-07-14T08:42:02.184452+0200
    pg 2.2b not deep-scrubbed since 2023-07-11T15:31:56.792350+0200
    pg 2.22 not deep-scrubbed since 2023-07-14T07:14:43.041368+0200
    pg 2.a not deep-scrubbed since 2023-07-13T23:43:06.970018+0200
    pg 2.5 not deep-scrubbed since 2023-07-13T22:52:05.079369+0200
    pg 2.7d not deep-scrubbed since 2023-07-13T04:31:01.394803+0200
    pg 2.1 not deep-scrubbed since 2023-07-13T03:46:17.360905+0200
    pg 2.f not deep-scrubbed since 2023-07-13T11:58:44.528170+0200
    pg 2.11 not deep-scrubbed since 2023-07-13T07:40:55.838036+0200
    pg 2.13 not deep-scrubbed since 2023-07-14T13:55:49.448741+0200
    pg 2.15 not deep-scrubbed since 2023-07-12T04:59:45.326263+0200
    pg 2.3f not deep-scrubbed since 2023-07-13T22:33:55.154729+0200
    pg 2.46 not deep-scrubbed since 2023-07-12T10:45:21.814722+0200
    pg 2.47 not deep-scrubbed since 2023-07-12T18:03:00.237776+0200
    pg 2.4f not deep-scrubbed since 2023-07-12T02:36:05.006218+0200
    pg 2.55 not deep-scrubbed since 2023-07-12T10:40:12.190681+0200
    pg 2.56 not deep-scrubbed since 2023-07-14T15:34:16.399247+0200
    pg 2.59 not deep-scrubbed since 2023-07-13T10:49:19.802743+0200
    pg 2.5d not deep-scrubbed since 2023-07-14T06:46:56.523599+0200
    pg 2.6b not deep-scrubbed since 2023-07-14T09:45:44.006297+0200
    pg 2.71 not deep-scrubbed since 2023-07-14T08:44:26.257269+0200
[WRN] POOL_FULL: 2 pool(s) full
    pool '.mgr' is full (no space)
    pool 'ceph-pool' is full (no space)
[WRN] RECENT_CRASH: 88 daemons have recently crashed
    osd.4 crashed on host pve02 at 2023-07-26T20:38:27.868239Z

OSDs:
Code:
root@pve01:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         5.67580  root default
-3         2.18300      host pve01
 0    ssd  0.43660          osd.0       up   1.00000  1.00000
 1    ssd  0.43660          osd.1       up   0.90002  1.00000
 2    ssd  0.43660          osd.2       up   0.90002  1.00000
10    ssd  0.43660          osd.10      up   0.79999  1.00000
11    ssd  0.43660          osd.11      up   0.79999  1.00000
-5         1.30980      host pve02
 3    ssd  0.43660          osd.3       up   0.79999  1.00000
 4    ssd  0.43660          osd.4       up   0.79999  1.00000
 5    ssd  0.43660          osd.5       up   0.79999  1.00000
-7         2.18300      host pve03
 6    ssd  0.43660          osd.6       up   1.00000  1.00000
 7    ssd  0.43660          osd.7       up   0.90002  1.00000
 8    ssd  0.43660          osd.8       up   0.90002  1.00000
 9    ssd  0.43660          osd.9       up   0.79999  1.00000
12    ssd  0.43660          osd.12      up   0.79999  1.00000

Thanks!
 
Running out of space is one of the more painful experiences with Ceph. What is the output of ceph osd df tree?
Please also post the output of ceph -s.

As a stopgap measure to clear up space right away, you could set the size of all the pools to 2. Redundancy will not be great, but it should free up about 1/3 of the raw space that is currently in use.
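To make the 1/3 figure concrete, here is a small back-of-the-envelope sketch. It assumes plain replicated pools, where every object is stored size times; the 1 TiB of data is a made-up number purely for illustration and is not taken from this cluster.

```python
TIB = 2**40


def raw_usage(logical_bytes: float, size: int) -> float:
    """Raw space used by a replicated pool: each object is stored `size` times."""
    return logical_bytes * size


# Hypothetical amount of stored data, for illustration only.
logical = 1.0 * TIB

raw_before = raw_usage(logical, size=3)
raw_after = raw_usage(logical, size=2)
freed_fraction = (raw_before - raw_after) / raw_before

print(f"raw usage with size 3: {raw_before / TIB:.2f} TiB")
print(f"raw usage with size 2: {raw_after / TIB:.2f} TiB")
print(f"freed: {freed_fraction:.1%} of the previously used raw space")
```

The freed space only shows up once the surplus replicas have actually been removed, so the numbers in ceph df will not change instantly.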

Then either clean up or add more disks.
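Where the disks are added matters as well. The sketch below uses the host weights from the ceph osd tree output above and assumes (this is not confirmed in the thread) the default replicated CRUSH rule with host as the failure domain, CRUSH weights that equal capacity in TiB, perfectly even balancing, and the default full ratio of 0.95. With size 3 on exactly three hosts, each host has to hold one full copy of the data, so the smallest host sets the limit.

```python
# Host weights as shown in the `ceph osd tree` output of this thread.
# Assumption: weights are the default ones, i.e. capacity in TiB.
host_weights_tib = {"pve01": 2.18300, "pve02": 1.30980, "pve03": 2.18300}


def usable_size3_on_three_hosts(host_caps: dict, full_ratio: float = 0.95) -> float:
    """Rough upper bound on stored data when every host holds one full copy.

    Assumes size=3, exactly three hosts, failure domain 'host' and
    perfectly even data distribution (real clusters are less even).
    """
    if len(host_caps) != 3:
        raise ValueError("this sketch only covers the three-host case")
    return min(host_caps.values()) * full_ratio


limiting_host = min(host_weights_tib, key=host_weights_tib.get)
usable = usable_size3_on_three_hosts(host_weights_tib)

print(f"limiting host: {limiting_host}")
print(f"approx. usable capacity at size 3: {usable:.2f} TiB")
```

Under these assumptions, extra OSDs in pve01 and pve03 add little usable space for size 3 as long as pve02 stays at three OSDs.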
 
Thanks, that seems to have helped!
All data is now available and space has been freed, but the cluster is still extremely slow and seems stuck continuously backfilling the last 60 or so PGs.
Does it just take time for that to recover?
 
If the logs or the status output don't indicate any further details, you have to wait. Restarting an OSD with a size/min_size of 2/2 is not recommended, as that would get you below min_size and IO would be blocked again. Overall, you should add more disks or clean up unneeded data as quickly as possible, so that you can set the size back to 3 for all pools without running out of space.
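The min_size point can be illustrated with a deliberately simplified model: a PG serves IO only while at least min_size of its replicas are available. The function name below is hypothetical and only mirrors that rule; it is not part of any Ceph API.

```python
def pg_accepts_io(replicas_up: int, min_size: int) -> bool:
    """Simplified rule: a PG serves IO only with at least min_size replicas up."""
    return replicas_up >= min_size


# Compare the normal 3/2 setting with the temporary 2/2 setting
# when one OSD holding a replica goes down (for example, a restart).
results = {}
for size, min_size in [(3, 2), (2, 2)]:
    replicas_up = size - 1
    results[(size, min_size)] = pg_accepts_io(replicas_up, min_size)
    state = "IO continues" if results[(size, min_size)] else "IO blocked"
    print(f"size/min_size {size}/{min_size}, one replica down: {state}")
```

With 3/2 a single OSD restart leaves two replicas and IO continues; with 2/2 the affected PGs drop to one replica and stay inactive until the OSD is back.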
 
Be careful when you add more disks.
From my past experience adding disks to Ceph storage, here are some considerations:
  • After adding a disk and including the OSD, we had to wait for the PG realignment. The graph was mostly yellow, but there was no impact on customer service.
  • Since it doesn't make much sense to add PGs to the pool with fewer than 50 OSDs, I didn't modify the number of PGs in the pool, even though the formula would recommend 1024.
  • Take note of the OSD weights in the figure: disks with more storage capacity also have more weight. This means that disks with more space will be utilized more frequently according to the CRUSH algorithm.
1690467388434.png
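As a rough illustration of the weight point: by default the CRUSH weight of an OSD is its capacity in TiB, and the expected share of data is proportional to that weight. The sketch below assumes default weights, a REWEIGHT of 1.0 and no failure-domain constraints; the OSD names and disk sizes are hypothetical.

```python
TIB = 2**40


def default_crush_weight(device_bytes: int) -> float:
    """Default CRUSH weight convention: device capacity expressed in TiB."""
    return device_bytes / TIB


def expected_share(weights: dict) -> dict:
    """Expected fraction of data per OSD, proportional to its CRUSH weight."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical mixed cluster: one 480 GB and one 960 GB device.
devices = {"osd.a": 480 * 10**9, "osd.b": 960 * 10**9}
weights = {name: default_crush_weight(size) for name, size in devices.items()}
shares = expected_share(weights)

for name in devices:
    print(f"{name}: weight {weights[name]:.5f}, expected share {shares[name]:.1%}")
```

A 480 GB device comes out at a weight of roughly 0.4366, which would be consistent with the 0.43660 shown for the OSDs earlier in this thread, although the exact drive model is not stated there. CRUSH placement is pseudo-random, so real utilization only approximates these shares.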

MM
 