Partition on Ceph corrupted

Orionis

Member
Sep 2, 2021
Hello everyone,
I have a big problem.
For three years I have had a Ceph cluster for my data.
For this I have a pool of 9 TB, and I created a VM disk of 8.5 TB.
A few days ago one OSD became full. I followed a small procedure: I marked the OSD out so I could keep writing to the VM disk, cleaned up a lot of unused files on the VM disk to free space,
and cleaned the out OSD before putting it back in.
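
In Ceph commands, that procedure was roughly the following (a sketch; <id> stands for the OSD that ran full):

Code:
ceph osd out <id>    # take the full OSD out so its data rebalances to the other OSDs
# free space inside the VM (delete unused files); note that the space is typically only
# given back to the pool if discard/trim is enabled on the virtual disk
ceph osd in <id>     # put the cleaned-up OSD back in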

Normally this works fine, but this time it didn't.

When I rebooted the VM, the partition on the VM disk could no longer be mounted.
I thought it was because the rebalance/recovery had not completed, but it finished today and still nothing.
When I try to recover the partition I get this:

Code:
 Bad magic number in super-block
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/sdb

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

Found a gpt partition table in /dev/sdb

I tried e2fsck -b 8193 /dev/sdb but it did not help; it starts and then seems to freeze.
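
For reference: the fsck output above reports a GPT partition table on /dev/sdb, so the ext4 filesystem may actually live on a partition such as /dev/sdb1 rather than on the whole device. The backup superblock locations can also be listed without writing anything; a rough sketch (the device name and default mkfs options are assumptions):

Code:
# dry run: prints where the backup superblocks would be, writes nothing
# (locations only match if the filesystem was created with default options)
mke2fs -n /dev/sdb1

# then retry the check against one of the reported backup superblocks, e.g.
e2fsck -b 32768 /dev/sdb1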

Please, how can I fix this? There is a lot of data I don't care about, but there are also photos and documents I can't replace.


Thanks
 
I'm not surprised that the VM is no longer running. Before you try to fix anything in the VM, you should first make sure that your CEPH is healthy again.

Please post the output of ceph osd df tree and ceph osd pool ls detail.
(Please post in code tags (not inline!) and not as a screenshot. This is otherwise difficult to use on a smartphone.)
 
Code:
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         18.19342         -   18 TiB   14 TiB   14 TiB  206 MiB   24 GiB  4.2 TiB  76.67  1.00    -          root default
-3          8.18707         -  8.2 TiB  7.0 TiB  7.0 TiB   58 MiB   12 GiB  1.2 TiB  85.43  1.11    -              host bsg-galactica
 0    hdd   3.63869   1.00000  3.6 TiB  3.3 TiB  3.3 TiB   58 MiB  5.7 GiB  374 GiB  89.96  1.17  255      up          osd.0
 1    hdd   2.72899   1.00000  2.7 TiB  2.2 TiB  2.2 TiB    2 KiB  3.7 GiB  581 GiB  79.20  1.03  144      up          osd.1
 2    hdd   1.81940   1.00000  1.8 TiB  1.6 TiB  1.6 TiB    2 KiB  2.7 GiB  266 GiB  85.71  1.12  108      up          osd.2
-5         10.00635         -   10 TiB  7.0 TiB  6.9 TiB  148 MiB   12 GiB  3.1 TiB  69.51  0.91    -              host bsg-pegasus
 3    hdd   0.90970   1.00000  932 GiB  736 GiB  735 GiB    1 KiB  1.3 GiB  195 GiB  79.03  1.03   48      up          osd.3
 4    hdd   3.63869   1.00000  3.6 TiB  2.5 TiB  2.5 TiB    5 KiB  4.3 GiB  1.1 TiB  68.46  0.89  170      up          osd.4
 5    hdd   3.63869   1.00000  3.6 TiB  2.3 TiB  2.3 TiB    2 KiB  4.1 GiB  1.4 TiB  62.68  0.82  180      up          osd.5
 6    hdd   0.90970   1.00000  932 GiB  740 GiB  738 GiB   96 MiB  1.3 GiB  192 GiB  79.42  1.04   53      up          osd.6
 7    hdd   0.45479   1.00000  466 GiB  341 GiB  340 GiB    1 KiB  553 MiB  125 GiB  73.17  0.95   22      up          osd.7
 8    hdd   0.45479   1.00000  466 GiB  419 GiB  419 GiB   52 MiB  650 MiB   47 GiB  90.01  1.17   32      up          osd.8
                        TOTAL   18 TiB   14 TiB   14 TiB  206 MiB   24 GiB  4.2

Code:
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 6922 flags hashpspool,backfillfull stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 3 'OMV_data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 128 pgp_num_target 512 autoscale_mode on last_change 6922 lfor 0/503/4911 flags hashpspool,backfillfull,selfmanaged_snaps stripe_width 0 application rbd



This is the result.

I think there is no problem with the pool itself, just with the VM disk or partition.

When I boot with Parted Magic and run TestDisk, I can see my file structure and copy files, but not all of them; it freezes.

I also tried fsck.ext4 -p -b 32768 -B 4096 /dev/sdb.

It starts for a few seconds and then freezes.
A TestDisk deep scan also starts and then freezes.

I think there is a big problem with the partition, but I don't know how to get around it, repair it, or at least copy as much data as possible.
 
replicated size 2 min_size 1
With replica 2 your data is in great danger. If it is that important to you, you should use replica 3 or create regular backups. Replica 2 and important data in one sentence can only go wrong.
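
For reference, once a third node with enough capacity is in place, switching the pool to 3/2 is just two settings (a sketch, using the pool name from your output):

Code:
ceph osd pool set OMV_data size 3      # three copies of every object
ceph osd pool set OMV_data min_size 2  # keep serving I/O with two copies, block writes at one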

I think there is no problem with the pool itself, just with the VM disk or partition.
I can only recommend that you stop trying to repair the VM and focus on the cause rather than the symptoms. The overall condition of your cluster is critical; you should first make sure it gets back to a clean state.

Please post the output of ceph health detail and ceph osd crush rule dump.

I marked the OSD out so I could keep writing to the VM disk, cleaned up a lot of unused files on the VM disk to free space,
and cleaned the out OSD before putting it back in.
Which OSD was that?

How many OSDs should your nodes have? Was the above representation with a total of 9 OSDs the initial situation?
 
As @sb-jw already mentioned, the way this looks, the pool is currently not operational.

Another output that would be interesting, to see whether this can be recovered: ceph health detail

I see that you run a 2-node Ceph cluster with the OMV_data pool having a 2/1 size/min_size. This configuration is absolutely not recommended, as it means data can be lost or corrupted quite easily. Three nodes with a size/min_size of 3/2 is what should be used!

Given that 40 PGs are in an incomplete state, I think this is what happened here: one OSD went down -> only one replica left. Writes occurred with only one replica, since min_size=1 allows it. Then the old OSD came back and was trying to catch up, and/or Ceph was trying to recreate the replicas to get back to two copies. But in the meantime, OSDs with the newer copies went out, so only older copies of these replicas are still around. Or it might be due to the backfill not working at the moment.
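
To see what exactly is blocking one of those PGs, you could query it directly (pg 3.e is only an example ID; use one of the incomplete PGs):

Code:
ceph pg 3.e query
# the "recovery_state" section at the end of the output usually shows what the PG
# is waiting for (e.g. which OSDs it still wants to probe)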

Did you increase the max PGs per OSD limit? I see one OSD that has 255 PGs and the default max limit IIRC is 250.

Also, galactica doesn't have as much capacity as pegasus (understandable, given it is the older battlestar ^^). Some OSDs (0, 8) are running into the backfillfull limit (90%). You could increase it temporarily by running ceph osd set-backfillfull-ratio 0.92. This might help to get out of the situation, but in the end, adding more storage or reducing the used capacity in the pool is what is needed. Only once all the PGs of the pool get back to the "active" state can you expect the VM disk to be usable again.
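
For reference, checking and adjusting the ratio looks like this (a sketch; 0.92 is only a temporary value, set it back to the default 0.90 once backfill has finished):

Code:
ceph osd dump | grep -i ratio          # show the current full/backfillfull/nearfull ratios
ceph osd set-backfillfull-ratio 0.92   # temporarily allow backfill onto fuller OSDs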
 
ceph health detail

Code:
HEALTH_WARN noout flag(s) set; 3 nearfull osd(s); Reduced data availability: 40 pgs inactive, 40 pgs incomplete; Low space hindering backfill (add storage if this doesn't resolve itself): 41 pgs backfill_toofull; Degraded data redundancy: 57275/3726142 objects degraded (1.537%), 15 pgs degraded, 15 pgs undersized; 83 pgs not deep-scrubbed in time; 83 pgs not scrubbed in time; 2 pool(s) nearfull
[WRN] OSDMAP_FLAGS: noout flag(s) set
[WRN] OSD_NEARFULL: 3 nearfull osd(s)
    osd.0 is near full
    osd.2 is near full
    osd.8 is near full
[WRN] PG_AVAILABILITY: Reduced data availability: 40 pgs inactive, 40 pgs incomplete
    pg 3.e is incomplete, acting [5,0]
    pg 3.12 is incomplete, acting [0,8]
    pg 3.2c is incomplete, acting [5,0]
    pg 3.41 is incomplete, acting [6,0]
    pg 3.48 is incomplete, acting [0,5]
    pg 3.57 is incomplete, acting [0,5]
    pg 3.65 is incomplete, acting [4,2]
    pg 3.68 is incomplete, acting [0,5]
    pg 3.70 is incomplete, acting [5,0]
    pg 3.75 is incomplete, acting [5,0]
    pg 3.8e is incomplete, acting [5,0]
    pg 3.92 is incomplete, acting [0,8]
    pg 3.ac is incomplete, acting [5,0]
    pg 3.c1 is incomplete, acting [6,0]
    pg 3.c8 is incomplete, acting [0,5]
    pg 3.d7 is incomplete, acting [0,5]
    pg 3.e5 is incomplete, acting [4,2]
    pg 3.e8 is incomplete, acting [0,5]
    pg 3.f0 is incomplete, acting [5,0]
    pg 3.f5 is incomplete, acting [5,0]
    pg 3.10e is incomplete, acting [5,0]
    pg 3.112 is incomplete, acting [0,8]
    pg 3.12c is incomplete, acting [5,0]
    pg 3.141 is incomplete, acting [6,0]
    pg 3.148 is incomplete, acting [0,5]
    pg 3.157 is incomplete, acting [0,5]
    pg 3.165 is incomplete, acting [4,2]
    pg 3.168 is incomplete, acting [0,5]
    pg 3.170 is incomplete, acting [5,0]
    pg 3.175 is incomplete, acting [5,0]
    pg 3.18e is incomplete, acting [5,0]
    pg 3.192 is incomplete, acting [0,8]
    pg 3.1ac is incomplete, acting [5,0]
    pg 3.1c1 is incomplete, acting [6,0]
    pg 3.1c8 is incomplete, acting [0,5]
    pg 3.1d7 is incomplete, acting [0,5]
    pg 3.1e5 is incomplete, acting [4,2]
    pg 3.1e8 is incomplete, acting [0,5]
    pg 3.1f0 is incomplete, acting [5,0]
    pg 3.1f5 is incomplete, acting [5,0]
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 41 pgs backfill_toofull
    pg 3.14 is active+remapped+backfill_wait+backfill_toofull, acting [4,2]
    pg 3.18 is active+remapped+backfill_wait+backfill_toofull, acting [3,2]
    pg 3.22 is active+remapped+backfill_wait+backfill_toofull, acting [3,2]
    pg 3.23 is active+remapped+backfill_wait+backfill_toofull, acting [2,4]
    pg 3.2b is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [4]
    pg 3.37 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [4]
    pg 3.49 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [2]
    pg 3.54 is active+remapped+backfill_toofull, acting [0,4]
    pg 3.6c is active+remapped+backfill_wait+backfill_toofull, acting [4,2]
    pg 3.77 is active+remapped+backfill_wait+backfill_toofull, acting [2,4]
    pg 3.94 is active+remapped+backfill_wait+backfill_toofull, acting [4,2]
    pg 3.98 is active+remapped+backfill_wait+backfill_toofull, acting [2,3]
    pg 3.a2 is active+remapped+backfill_wait+backfill_toofull, acting [3,2]
    pg 3.a3 is active+remapped+backfill_wait+backfill_toofull, acting [2,4]
    pg 3.ab is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [4]
    pg 3.c5 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [2]
    pg 3.c9 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [2]
    pg 3.d4 is active+remapped+backfill_toofull, acting [0,4]
    pg 3.ec is active+remapped+backfill_wait+backfill_toofull, acting [4,2]
    pg 3.f7 is active+remapped+backfill_wait+backfill_toofull, acting [2,4]
    pg 3.114 is active+remapped+backfill_wait+backfill_toofull, acting [4,2]
    pg 3.118 is active+remapped+backfill_wait+backfill_toofull, acting [2,3]
    pg 3.123 is active+remapped+backfill_wait+backfill_toofull, acting [2,4]
    pg 3.12b is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [4]
    pg 3.137 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [4]
    pg 3.149 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [2]
    pg 3.154 is active+remapped+backfill_toofull, acting [0,4]
    pg 3.16c is active+remapped+backfill_wait+backfill_toofull, acting [4,2]
    pg 3.17a is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [2]
    pg 3.194 is active+remapped+backfill_wait+backfill_toofull, acting [4,2]
    pg 3.198 is active+remapped+backfill_wait+backfill_toofull, acting [2,3]
    pg 3.1a2 is active+remapped+backfill_wait+backfill_toofull, acting [3,2]
    pg 3.1a3 is active+remapped+backfill_wait+backfill_toofull, acting [2,4]
    pg 3.1ab is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [4]
    pg 3.1c5 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [2]
    pg 3.1d4 is active+remapped+backfill_toofull, acting [0,4]
    pg 3.1de is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [2]
    pg 3.1e7 is active+remapped+backfill_toofull, acting [0,4]
    pg 3.1ec is active+remapped+backfill_wait+backfill_toofull, acting [4,2]
    pg 3.1f7 is active+remapped+backfill_wait+backfill_toofull, acting [2,4]
    pg 3.1fa is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [2]
[WRN] PG_DEGRADED: Degraded data redundancy: 57275/3726142 objects degraded (1.537%), 15 pgs degraded, 15 pgs undersized
    pg 3.2b is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [4]
    pg 3.37 is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [4]
    pg 3.49 is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [2]
    pg 3.ab is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [4]
    pg 3.c5 is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [2]
    pg 3.c9 is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [2]
    pg 3.fa is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfilling, last acting [2]
    pg 3.12b is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [4]
    pg 3.137 is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [4]
    pg 3.149 is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [2]
    pg 3.17a is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [2]
    pg 3.1ab is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [4]
    pg 3.1c5 is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [2]
    pg 3.1de is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [2]
    pg 3.1fa is stuck undersized for 8h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [2]
[WRN] PG_NOT_DEEP_SCRUBBED: 83 pgs not deep-scrubbed in time
    pg 3.1fa not deep-scrubbed since 2022-07-26T02:14:39.950601+0200
    pg 3.1f7 not deep-scrubbed since 2022-08-03T02:07:20.244484+0200
    pg 3.1f5 not deep-scrubbed since 2022-11-23T21:14:08.772646+0100
    pg 3.1f0 not deep-scrubbed since 2022-11-20T21:16:11.613272+0100
    pg 3.1ec not deep-scrubbed since 2022-07-31T08:35:36.276683+0200
    pg 3.1e8 not deep-scrubbed since 2022-07-26T23:13:26.879814+0200
    pg 3.1e7 not deep-scrubbed since 2023-10-18T22:13:09.489686+0200
    pg 3.1e5 not deep-scrubbed since 2022-11-20T04:38:11.107513+0100
    pg 3.1de not deep-scrubbed since 2022-11-22T03:55:14.252760+0100
    pg 3.1d7 not deep-scrubbed since 2022-08-02T22:00:30.906588+0200
    pg 3.1d4 not deep-scrubbed since 2022-12-06T07:44:44.250972+0100
    pg 3.1c8 not deep-scrubbed since 2022-08-02T08:37:35.502820+0200
    pg 3.1c5 not deep-scrubbed since 2022-07-29T01:55:59.775234+0200
    pg 3.1c1 not deep-scrubbed since 2022-07-31T12:50:22.933559+0200
    pg 3.1ac not deep-scrubbed since 2022-07-28T13:36:48.842499+0200
    pg 3.1ab not deep-scrubbed since 2022-07-28T15:14:52.520932+0200
    pg 3.1a3 not deep-scrubbed since 2022-07-29T18:47:33.869713+0200
    pg 3.1a2 not deep-scrubbed since 2023-10-24T04:05:36.561406+0200
    pg 3.198 not deep-scrubbed since 2023-10-24T05:18:46.815927+0200
    pg 3.194 not deep-scrubbed since 2022-08-01T09:20:44.945099+0200
    pg 3.192 not deep-scrubbed since 2022-08-01T18:31:44.084229+0200
    pg 3.18e not deep-scrubbed since 2022-11-22T15:19:51.327080+0100
    pg 3.17a not deep-scrubbed since 2022-07-26T02:14:39.950601+0200
    pg 3.177 not deep-scrubbed since 2022-08-03T02:07:20.244484+0200
    pg 3.175 not deep-scrubbed since 2022-11-23T21:14:08.772646+0100
    pg 3.170 not deep-scrubbed since 2022-11-20T21:16:11.613272+0100
    pg 3.16c not deep-scrubbed since 2022-07-31T08:35:36.276683+0200
    pg 3.168 not deep-scrubbed since 2022-07-26T23:13:26.879814+0200
    pg 3.165 not deep-scrubbed since 2022-11-20T04:38:11.107513+0100
    pg 3.157 not deep-scrubbed since 2022-08-02T22:00:30.906588+0200
    pg 3.154 not deep-scrubbed since 2022-12-06T07:44:44.250972+0100
    pg 3.149 not deep-scrubbed since 2022-11-22T14:59:47.293759+0100
    pg 3.148 not deep-scrubbed since 2022-08-02T08:37:35.502820+0200
    pg 3.141 not deep-scrubbed since 2022-07-31T12:50:22.933559+0200
    pg 3.137 not deep-scrubbed since 2022-08-03T15:11:30.200915+0200
    pg 3.12c not deep-scrubbed since 2022-07-28T13:36:48.842499+0200
    pg 3.12b not deep-scrubbed since 2022-07-28T15:14:52.520932+0200
    pg 3.123 not deep-scrubbed since 2022-07-29T18:47:33.869713+0200
    pg 3.118 not deep-scrubbed since 2023-10-24T05:18:46.815927+0200
    pg 3.114 not deep-scrubbed since 2022-08-01T09:20:44.945099+0200
    pg 3.112 not deep-scrubbed since 2022-08-01T18:31:44.084229+0200
    pg 3.10e not deep-scrubbed since 2022-11-22T15:19:51.327080+0100
    pg 3.77 not deep-scrubbed since 2022-08-03T02:07:20.244484+0200
    pg 3.75 not deep-scrubbed since 2022-11-23T21:14:08.772646+0100
    pg 3.70 not deep-scrubbed since 2022-11-20T21:16:11.613272+0100
    pg 3.6c not deep-scrubbed since 2022-07-31T08:35:36.276683+0200
    pg 3.68 not deep-scrubbed since 2022-07-26T23:13:26.879814+0200
    pg 3.65 not deep-scrubbed since 2022-11-20T04:38:11.107513+0100
    pg 3.57 not deep-scrubbed since 2022-08-02T22:00:30.906588+0200
    pg 3.54 not deep-scrubbed since 2022-12-06T07:44:44.250972+0100
    33 more pgs...
[WRN] PG_NOT_SCRUBBED: 83 pgs not scrubbed in time
    pg 3.1fa not scrubbed since 2022-08-01T00:16:07.453383+0200
    pg 3.1f7 not scrubbed since 2022-08-06T08:33:17.002533+0200
    pg 3.1f5 not scrubbed since 2022-11-26T05:17:33.120909+0100
    pg 3.1f0 not scrubbed since 2022-11-27T00:42:13.122933+0100
    pg 3.1ec not scrubbed since 2022-08-06T02:27:36.869029+0200
    pg 3.1e8 not scrubbed since 2022-07-31T22:23:46.225507+0200
    pg 3.1e7 not scrubbed since 2023-10-25T03:54:22.266340+0200
    pg 3.1e5 not scrubbed since 2022-11-26T09:58:17.790782+0100
    pg 3.1de not scrubbed since 2022-11-26T12:14:18.780277+0100
    pg 3.1d7 not scrubbed since 2022-08-06T04:22:46.075679+0200
    pg 3.1d4 not scrubbed since 2022-12-08T15:15:24.365346+0100
    pg 3.1c8 not scrubbed since 2022-08-06T00:19:41.137858+0200
    pg 3.1c5 not scrubbed since 2022-08-01T15:25:53.819763+0200
    pg 3.1c1 not scrubbed since 2022-08-06T14:04:48.123228+0200
    pg 3.1ac not scrubbed since 2022-08-01T09:53:35.617819+0200
    pg 3.1ab not scrubbed since 2022-07-31T21:38:21.309805+0200
    pg 3.1a3 not scrubbed since 2022-08-01T09:43:54.346831+0200
    pg 3.1a2 not scrubbed since 2023-10-25T13:41:31.144357+0200
    pg 3.198 not scrubbed since 2023-10-25T13:17:03.721935+0200
    pg 3.194 not scrubbed since 2022-08-06T11:46:43.752656+0200
    pg 3.192 not scrubbed since 2022-08-05T17:27:23.540156+0200
    pg 3.18e not scrubbed since 2022-11-26T12:26:19.057741+0100
    pg 3.17a not scrubbed since 2022-08-01T00:16:07.453383+0200
    pg 3.177 not scrubbed since 2022-08-06T08:33:17.002533+0200
    pg 3.175 not scrubbed since 2022-11-26T05:17:33.120909+0100
    pg 3.170 not scrubbed since 2022-11-27T00:42:13.122933+0100
    pg 3.16c not scrubbed since 2022-08-06T02:27:36.869029+0200
    pg 3.168 not scrubbed since 2022-07-31T22:23:46.225507+0200
    pg 3.165 not scrubbed since 2022-11-26T09:58:17.790782+0100
    pg 3.157 not scrubbed since 2022-08-06T04:22:46.075679+0200
    pg 3.154 not scrubbed since 2022-12-08T15:15:24.365346+0100
    pg 3.149 not scrubbed since 2022-11-26T09:02:34.646313+0100
    pg 3.148 not scrubbed since 2022-08-06T00:19:41.137858+0200
    pg 3.141 not scrubbed since 2022-08-06T14:04:48.123228+0200
    pg 3.137 not scrubbed since 2022-08-05T23:59:51.276845+0200
    pg 3.12c not scrubbed since 2022-08-01T09:53:35.617819+0200
    pg 3.12b not scrubbed since 2022-07-31T21:38:21.309805+0200
    pg 3.123 not scrubbed since 2022-08-01T09:43:54.346831+0200
    pg 3.118 not scrubbed since 2023-10-25T13:17:03.721935+0200
    pg 3.114 not scrubbed since 2022-08-06T11:46:43.752656+0200
    pg 3.112 not scrubbed since 2022-08-05T17:27:23.540156+0200
    pg 3.10e not scrubbed since 2022-11-26T12:26:19.057741+0100
    pg 3.77 not scrubbed since 2022-08-06T08:33:17.002533+0200
    pg 3.75 not scrubbed since 2022-11-26T05:17:33.120909+0100
    pg 3.70 not scrubbed since 2022-11-27T00:42:13.122933+0100
    pg 3.6c not scrubbed since 2022-08-06T02:27:36.869029+0200
    pg 3.68 not scrubbed since 2022-07-31T22:23:46.225507+0200
    pg 3.65 not scrubbed since 2022-11-26T09:58:17.790782+0100
    pg 3.57 not scrubbed since 2022-08-06T04:22:46.075679+0200
    pg 3.54 not scrubbed since 2022-12-08T15:15:24.365346+0100
    33 more pgs...
[WRN] POOL_NEARFULL: 2 pool(s) nearfull
    pool 'device_health_metrics' is nearfull
    pool 'OMV_data' is nearfull

ceph osd crush rule dump

Code:
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]


I know my configuration is not good; it is my first attempt at learning how this works, using my personal hardware, and it was low cost.
I changed the ratio (thanks Aaron) and Ceph is now rebalancing again; I will wait for it to finish and see.

It was OSD 1 (the 3 TB one).
 
Did you increase the max PGs per OSD limit? I see one OSD that has 255 PGs and the default max limit IIRC is 250.
With Nautilus it was even reduced to 100 [1]. But there are a few peculiarities that are covered in more detail in the Ceph documentation [2] [3].

[1] https://ceph.io/en/news/blog/2019/new-in-nautilus-pg-merging-and-autotuning/
[2] https://docs.ceph.com/en/latest/rados/operations/placement-groups/#preselection
[3] https://docs.ceph.com/en/latest/rados/operations/placement-groups/#choosing-the-number-of-pgs
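
A quick way to see what limit your cluster is currently configured with (a sketch, assuming the centralized config database of recent Ceph releases):

Code:
ceph config get mon mon_max_pg_per_osd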

I changed the ratio (thanks Aaron) and Ceph is now rebalancing again; I will wait for it to finish and see.
That means the output from above is from before the change?

However, you have to make sure that your OSDs do not reach 95%, otherwise the pool will stop again. This can happen during recovery, where an OSD's utilization temporarily rises before it reaches its final state. Therefore, it is always important to keep enough headroom.

But you should then think about how you can tidy up your Ceph cluster. Your fill level is currently so high that the failure of a single OSD can lead to disaster again.
It is best to only use OSDs of the same size and always make sure that at least one OSD can fail without reaching the nearfull ratio of 85%. Otherwise, I can really only advise you not to work with replica 2. It is basically like a RAID 1: as long as both servers are there, everything is fine. If one of the two fails (assuming the data is actually distributed 50/50), everything is still fine. But if even one OSD goes out after that, things will stall again and in the worst case you will have data loss. Especially with such a large disk image, there is a high probability that you will have problems with the VM.
 
However, you have to make sure that your OSDs do not reach 95% ...
As the pool is set up, this is not actually possible. Even ignoring the obvious problems of having a size=2 (two-replica) pool, the OSD distribution is completely lopsided. Although pegasus has 10 TB of OSD space, only 8.2 TB of it is usable (limited by galactica's capacity); the remaining capacity will never be used. Consequently, even though the pool utilization is "only" 70%, the individual OSDs hosted on galactica have nowhere to rebalance to. The pool is not able to function properly.

Takeaways:
1. Have 3 nodes, not two. Ideally, more than that.
2. Balance your OSDs across your nodes. With the number of hosts at the minimum for the CRUSH distribution rule, there is no way to use capacity on one host that isn't matched on the others.
3. Don't manually play with the PG count per individual OSD. Ceph will do a better job managing it than you, and there are all manner of unintended consequences if you do.
 
With Nautilus it was even reduced to 100 [1]. But there are a few peculiarities that are covered in more detail in the Ceph documentation [2] [3].

[1] https://ceph.io/en/news/blog/2019/new-in-nautilus-pg-merging-and-autotuning/
[2] https://docs.ceph.com/en/latest/rados/operations/placement-groups/#preselection
[3] https://docs.ceph.com/en/latest/rados/operations/placement-groups/#choosing-the-number-of-pgs


That means the output from above is from before the change?

However, you have to make sure that your OSDs do not reach 95%, otherwise the pool will stop again. This can happen during recovery, where an OSD's utilization temporarily rises before it reaches its final state. Therefore, it is always important to keep enough headroom.

But you should then think about how you can tidy up your Ceph cluster. Your fill level is currently so high that the failure of a single OSD can lead to disaster again.
It is best to only use OSDs of the same size and always make sure that at least one OSD can fail without reaching the nearfull ratio of 85%. Otherwise, I can really only advise you not to work with replica 2. It is basically like a RAID 1: as long as both servers are there, everything is fine. If one of the two fails (assuming the data is actually distributed 50/50), everything is still fine. But if even one OSD goes out after that, things will stall again and in the worst case you will have data loss. Especially with such a large disk image, there is a high probability that you will have problems with the VM.
Yes, see:


Code:
  cluster:
    id:     2c042659-77b4-4303-8ecb-3f6a88cd7d54
    health: HEALTH_WARN
            noout flag(s) set
            3 nearfull osd(s)
            Reduced data availability: 40 pgs inactive, 40 pgs incomplete
            Degraded data redundancy: 54344/3726046 objects degraded (1.458%), 15 pgs degraded, 15 pgs undersized
            83 pgs not deep-scrubbed in time
            83 pgs not scrubbed in time
            2 pool(s) nearfull

  services:
    mon: 2 daemons, quorum bsg-galactica,bsg-pegasus (age 39m)
    mgr: bsg-galactica(active, since 5d)
    osd: 9 osds: 9 up (since 35m), 9 in (since 3h); 44 remapped pgs
         flags noout

  data:
    pools:   2 pools, 513 pgs
    objects: 1.86M objects, 7.1 TiB
    usage:   14 TiB used, 4.2 TiB / 18 TiB avail
    pgs:     7.797% pgs not active
             54344/3726046 objects degraded (1.458%)
             168321/3726046 objects misplaced (4.517%)
             429 active+clean
             40  incomplete
             28  active+remapped+backfill_wait
             14  active+undersized+degraded+remapped+backfill_wait
             1   active+clean+remapped
             1   active+undersized+degraded+remapped+backfilling

  io:
    recovery: 341 KiB/s, 0 objects/s

  progress:
    Global Recovery Event (5d)
      [=======================.....] (remaining: 23h)
 
Just a question:

Why can I copy some files and not others?

Why does a recovery operation with fsck or other tools freeze or hang?
 
As the pool is set up, this is not actually possible.
Absolutely right, this shouldn't happen through backfilling alone. My comment was more general in nature, as a few OSDs are already very heavily loaded. If he gets the idea of adding more OSDs or replacing existing ones, then it could happen; that is what my comment was meant to address.

Why can I copy some files and not others?
Your virtual hard drive is distributed across the entire cluster. You have problems with individual OSDs and PGs, so the data that lives in those PGs has a problem, while you can still access all other data. However, the overall condition of your file system is not good either, which is why I would recommend avoiding write and recovery operations as long as Ceph is not green again.
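
If you want to see that distribution for yourself, you can map individual RBD objects of the image to their PG and OSDs (a sketch; the image name is only a placeholder, take it from rbd ls OMV_data):

Code:
rbd info OMV_data/<image-name>                              # shows the image's block_name_prefix
ceph osd map OMV_data rbd_data.<prefix>.0000000000000000    # PG and OSDs serving that object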
 
Your virtual hard drive is distributed across the entire cluster
in more practical terms:
[screenshot: list of the inactive/incomplete PGs from ceph health detail]
Any data block that lives on any of the above inactive PGs cannot process any I/O. Any I/O sent to those blocks remains queued until it can be serviced; to the requestor it simply waits indefinitely, and what's worse, since it does not time out for the host OS, the parent process can't even be killed. At this point I'd not waste any effort on reviving the guest; bring your Ceph cluster back to healthy operation first.
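
To get the same list from the CLI, the stuck/inactive PGs can be dumped directly:

Code:
ceph pg dump_stuck inactive   # lists the inactive PGs with their state and acting OSDs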

You now have a real-life example of why you shouldn't run a size=2 (two-replica) storage policy.

Edit: I should have explained the above a bit better. Since you now have PGs with only a single surviving copy on one OSD, the data can be permanently corrupted, because there is no second copy to compare it against.
 
OK, I see.

Now I have added one OSD to help the system, but I have another OSD that won't start.

The OSD log:

Code:
64s, timeout is 5.000000s
2024-02-05T21:11:57.201+0100 7f70b659a080 -1 bdev(0x561f78ff9c00 /var/lib/ceph/osd/ceph-0/block) read stalled read  0x5b99ef5000~f000 (buffered) since 1311.526627s, timeout is 5.000000s
2024-02-05T21:12:02.325+0100 7f70b659a080 -1 bdev(0x561f78ff9c00 /var/lib/ceph/osd/ceph-0/block) read stalled read  0x5b99f04000~20000 (buffered) since 1316.645994s, timeout is 5.000000s
2024-02-05T21:12:02.329+0100 7f70b659a080  0 bluestore(/var/lib/ceph/osd/ceph-0) log_latency_fn slow operation observed for _collection_list, latency = 67.273674011s, lat = 67s cid =3.d9_head start #3:9b0222ea:::rbd_data.3d6522923536.000000000016cd0e:head# end GHMAX max 64
2024-02-05T21:12:07.482+0100 7f70b659a080 -1 bdev(0x561f78ff9c00 /var/lib/ceph/osd/ceph-0/block) read_random stalled read  0x4d57bc2772~f97 (buffered) since 1321.748714s, timeout is 5.000000s
2024-02-05T21:12:12.598+0100 7f70b659a080 -1 bdev(0x561f78ff9c00 /var/lib/ceph/osd/ceph-0/block) read_random stalled read  0x649c8f96b1~ec3 (buffered) since 1326.900638s, timeout is 5.000000s
2024-02-05T21:12:17.982+0100 7f70b659a080 -1 bdev(0x561f78ff9c00 /var/lib/ceph/osd/ceph-0/block) read_random stalled read  0x5abee5b506~4afa (buffered) since 1332.290234s, timeout is 5.000000s
2024-02-05T21:12:28.211+0100 7f70b659a080 -1 bdev(0x561f78ff9c00 /var/lib/ceph/osd/ceph-0/block) read_random stalled read  0x5ac0130000~16ddab (buffered) since 1337.400713s, timeout is 5.000000s
2024-02-05T21:12:33.375+0100 7f70b659a080 -1 bdev(0x561f78ff9c00 /var/lib/ceph/osd/ceph-0/block) read_random stalled read  0x1519584ec~f48 (buffered) since 1347.749675s, timeout is 5.000000s
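
Those stalled buffered reads usually point at the disk behind the OSD itself; a quick SMART check of the drive backing osd.0 might be worth it (a sketch; the device name is an assumption):

Code:
smartctl -a /dev/sdX | grep -i -E 'reallocated|pending|uncorrect'   # reallocated/pending sectors hint at a dying disk
dmesg | grep -i -E 'ata|i/o error'                                  # kernel-level read errors or resets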
 
Code:
Feb 05 21:27:39 bsg-galactica systemd[1]: Starting Ceph object storage daemon osd.0...
Feb 05 21:27:39 bsg-galactica systemd[1]: Started Ceph object storage daemon osd.0.
root@bsg-galactica:~# tail -f /var/log/ceph/ceph-osd.0.log
2024-02-05T21:27:45.353+0100 7f2b7677a080  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1707164865358167, "job": 1, "event": "recovery_finished"}
2024-02-05T21:27:45.353+0100 7f2b7677a080  1 bluestore(/var/lib/ceph/osd/ceph-0) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824
2024-02-05T21:27:45.353+0100 7f2b7677a080  1 bluestore(/var/lib/ceph/osd/ceph-0) _open_super_meta old nid_max 1316479
2024-02-05T21:27:45.353+0100 7f2b7677a080  1 bluestore(/var/lib/ceph/osd/ceph-0) _open_super_meta old blobid_max 245760
2024-02-05T21:27:45.509+0100 7f2b7677a080  1 bluestore(/var/lib/ceph/osd/ceph-0) _open_super_meta freelist_type bitmap
2024-02-05T21:27:45.509+0100 7f2b7677a080  1 bluestore(/var/lib/ceph/osd/ceph-0) _open_super_meta ondisk_format 4 compat_ondisk_format 3
2024-02-05T21:27:45.509+0100 7f2b7677a080  1 bluestore(/var/lib/ceph/osd/ceph-0) _open_super_meta min_alloc_size 0x1000
2024-02-05T21:27:45.525+0100 7f2b7677a080  1 freelist init
2024-02-05T21:27:45.525+0100 7f2b7677a080  1 freelist _read_cfg
2024-02-05T21:27:45.529+0100 7f2b7677a080  1 bluestore(/var/lib/ceph/osd/ceph-0) _init_alloc opening allocation metadata
2024-02-05T21:28:09.279+0100 7f2b7677a080  1 bluestore(/var/lib/ceph/osd/ceph-0) _init_alloc loaded 321 GiB in 601655 extents, allocator type hybrid, capacity 0x3a381400000, block size 0x1000, free 0x5021048000, fragmentation 0.00716073
2024-02-05T21:28:09.279+0100 7f2b7677a080  4 rocksdb: [db_impl/db_impl.cc:396] Shutdown: canceling all background work
2024-02-05T21:28:09.279+0100 7f2b7677a080  4 rocksdb: [db_impl/db_impl.cc:573] Shutdown complete
2024-02-05T21:28:09.383+0100 7f2b7677a080  1 bluefs umount
2024-02-05T21:28:09.383+0100 7f2b7677a080  1 bdev(0x5621dec9cc00 /var/lib/ceph/osd/ceph-0/block) close
2024-02-05T21:28:09.567+0100 7f2b7677a080  1 bdev(0x5621dec9cc00 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2024-02-05T21:28:09.567+0100 7f2b7677a080  1 bdev(0x5621dec9cc00 /var/lib/ceph/osd/ceph-0/block) open size 4000783007744 (0x3a381400000, 3.6 TiB) block_size 4096 (4 KiB) rotational device, discard not supported
2024-02-05T21:28:09.567+0100 7f2b7677a080  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block size 3.6 TiB
2024-02-05T21:28:09.567+0100 7f2b7677a080  1 bluefs mount
2024-02-05T21:28:09.587+0100 7f2b7677a080  1 bluefs _init_alloc shared, id 1, capacity 0x3a381400000, block size 0x10000
2024-02-05T21:28:11.623+0100 7f2b7677a080  1 bluefs mount shared_bdev_used = 6243221504
2024-02-05T21:28:11.623+0100 7f2b7677a080  1 bluestore(/var/lib/ceph/osd/ceph-0) _prepare_db_environment set db_paths to db,3800743857356 db.slow,3800743857356
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: RocksDB version: 6.8.1

2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: Git sha rocksdb_build_git_sha:@0@
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: Compile date Aug 31 2023
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: DB SUMMARY

2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: CURRENT file:  CURRENT

2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: IDENTITY file:  IDENTITY

2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: MANIFEST file:  MANIFEST-002083 size: 21315 Bytes

2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: SST files in db dir, Total Num: 119, files: 001942.sst 001943.sst 001944.sst 001945.sst 001946.sst 001947.sst 001948.sst 001955.sst 001956.sst

2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: SST files in db.slow dir, Total Num: 0, files:

2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: Write Ahead Log file in db.wal: 002084.log size: 0 ;

2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                         Options.error_if_exists: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                       Options.create_if_missing: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                         Options.paranoid_checks: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                                     Options.env: 0x5621ddf802c0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                                      Options.fs: Legacy File System
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                                Options.info_log: 0x5621ef6cd3e0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                Options.max_file_opening_threads: 16
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                              Options.statistics: (nil)
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                               Options.use_fsync: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                       Options.max_log_file_size: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                  Options.max_manifest_file_size: 1073741824
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                   Options.log_file_time_to_roll: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                       Options.keep_log_file_num: 1000
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                    Options.recycle_log_file_num: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                         Options.allow_fallocate: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                        Options.allow_mmap_reads: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                       Options.allow_mmap_writes: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                        Options.use_direct_reads: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                        Options.use_direct_io_for_flush_and_compaction: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:          Options.create_missing_column_families: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                              Options.db_log_dir:
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                                 Options.wal_dir: db.wal
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                Options.table_cache_numshardbits: 6
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                      Options.max_subcompactions: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                  Options.max_background_flushes: -1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                         Options.WAL_ttl_seconds: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                       Options.WAL_size_limit_MB: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                        Options.max_write_batch_group_size_bytes: 1048576
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.manifest_preallocation_size: 4194304
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                     Options.is_fd_close_on_exec: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                   Options.advise_random_on_open: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                    Options.db_write_buffer_size: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                    Options.write_buffer_manager: 0x5621e63d2000
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:         Options.access_hint_on_compaction_start: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:  Options.new_table_reader_for_compaction_inputs: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:           Options.random_access_max_buffer_size: 1048576
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                      Options.use_adaptive_mutex: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                            Options.rate_limiter: (nil)
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:     Options.sst_file_manager.rate_bytes_per_sec: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                       Options.wal_recovery_mode: 2
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                  Options.enable_thread_tracking: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                  Options.enable_pipelined_write: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                  Options.unordered_write: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:         Options.allow_concurrent_memtable_write: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:      Options.enable_write_thread_adaptive_yield: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.write_thread_max_yield_usec: 100
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:            Options.write_thread_slow_yield_usec: 3
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                               Options.row_cache: None
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                              Options.wal_filter: None
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.avoid_flush_during_recovery: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.allow_ingest_behind: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.preserve_deletes: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.two_write_queues: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.manual_wal_flush: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.atomic_flush: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.avoid_unnecessary_blocking_io: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                 Options.persist_stats_to_disk: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                 Options.write_dbid_to_manifest: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                 Options.log_readahead_size: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                 Options.sst_file_checksum_func: Unknown
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.max_background_jobs: 2
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.max_background_compactions: 2
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.avoid_flush_during_shutdown: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:           Options.writable_file_max_buffer_size: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.delayed_write_rate : 16777216
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.max_total_wal_size: 1073741824
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:             Options.delete_obsolete_files_period_micros: 21600000000
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                   Options.stats_dump_period_sec: 600
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                 Options.stats_persist_period_sec: 600
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                 Options.stats_history_buffer_size: 1048576
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                          Options.max_open_files: -1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                          Options.bytes_per_sync: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                      Options.wal_bytes_per_sync: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:                   Options.strict_bytes_per_sync: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:       Options.compaction_readahead_size: 2097152
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: Compression algorithms supported:
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:   kZSTDNotFinalCompression supported: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:   kZSTD supported: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:   kXpressCompression supported: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:   kLZ4HCCompression supported: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:   kLZ4Compression supported: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:   kBZip2Compression supported: 0
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:   kZlibCompression supported: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb:   kSnappyCompression supported: 1
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: Fast CRC32 supported: Supported on x86
2024-02-05T21:28:11.623+0100 7f2b7677a080  4 rocksdb: [version_set.cc:4412] Recovering from manifest file: db/MANIFEST-002083
 
I think the problem is here:

Code:
2024-02-05T21:45:25.452+0100 7f3399cda080  0 _get_class not permitted to load kvs
2024-02-05T21:45:25.452+0100 7f3399cda080  0 _get_class not permitted to load lua
2024-02-05T21:45:25.456+0100 7f3399cda080  0 <cls> ./src/cls/hello/cls_hello.cc:316: loading cls_hello
2024-02-05T21:45:25.460+0100 7f3399cda080  0 <cls> ./src/cls/cephfs/cls_cephfs.cc:201: loading cephfs
2024-02-05T21:45:25.460+0100 7f3399cda080  0 _get_class not permitted to load sdk
2024-02-05T21:45:25.460+0100 7f3399cda080  0 osd.0 7136 crush map has features 288514051259236352, adjusting msgr requires for clients
2024-02-05T21:45:25.460+0100 7f3399cda080  0 osd.0 7136 crush map has features 288514051259236352 was 8705, adjusting msgr requires for mons
2024-02-05T21:45:25.460+0100 7f3399cda080  0 osd.0 7136 crush map has features 3314933000852226048, adjusting msgr requires for osds
2024-02-05T21:45:25.460+0100 7f3399cda080  1 osd.0 7136 check_osdmap_features require_osd_release unknown -> pacific
 
in more practical terms:
[screenshot: list of the inactive/incomplete PGs from ceph health detail]
Any data block that lives on any of the above inactive PGs cannot process any I/O. Any I/O sent to those blocks remains queued until it can be serviced; to the requestor it simply waits indefinitely, and what's worse, since it does not time out for the host OS, the parent process can't even be killed. At this point I'd not waste any effort on reviving the guest; bring your Ceph cluster back to healthy operation first.

You now have a real-life example of why you shouldn't run a size=2 (two-replica) storage policy.

Edit: I should have explained the above a bit better. Since you now have PGs with only a single surviving copy on one OSD, the data can be permanently corrupted, because there is no second copy to compare it against.
If I understand correctly, the 40 PGs are lost forever.
OK.

But can I try to recover whatever can still be recovered?

I have recovered a hundred files, but it is tedious: I copy them one by one and have to restart the process whenever I hit a damaged file.

Any ideas?