Bluestore erroring opening db

paulatz

Hello, I'm managing a cluster of 4 nodes running Proxmox VE 7.4-17 with Ceph. After a messy shutdown caused by a long power outage, a couple of VM images were corrupted, but we could restore them from backup. However, two of the OSD services refuse to come back, showing this kind of error:

Code:
Apr 18 11:25:58 pve03.xxx.fr ceph-osd[2976008]: 2024-04-18T11:25:58.854+0200 7f20e86d5080 -1 rocksdb: verify_sharding unable to list column families: Corruption: CURRENT file does not end with newline
Apr 18 11:25:58 pve03.xxx.fr ceph-osd[2976008]: 2024-04-18T11:25:58.854+0200 7f20e86d5080 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_db erroring opening db:
Apr 18 11:25:59 pve03.xxx.fr ceph-osd[2976008]: 2024-04-18T11:25:59.322+0200 7f20e86d5080 -1 osd.10 0 OSD:init: unable to mount object store
Apr 18 11:25:59 pve03.xxx.fr ceph-osd[2976008]: 2024-04-18T11:25:59.322+0200 7f20e86d5080 -1  ** ERROR: osd init failed: (5) Input/output error
Apr 18 11:25:59 pve03.xxx.fr systemd[1]: ceph-osd@10.service: Main process exited, code=exited, status=1/FAILURE

I have been searching the web for a solution, but everything I find is very technical and refers to bugs in old versions of Ceph that have supposedly been fixed. I had a try with ceph-bluestore-tool, but the obvious approach does not seem to be sufficient:

Code:
# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-10 fsck
2024-04-18T11:54:04.644+0200 7fcb4245a3c0 -1 rocksdb: verify_sharding unable to list column families: Corruption: CURRENT file does not end with newline

2024-04-18T11:54:04.644+0200 7fcb4245a3c0 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_db erroring opening db:

repair failed: (5) Input/output error
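
(For reference, two other ceph-bluestore-tool invocations that are sometimes worth trying before giving up on an OSD. The path and device below are just the ones from this post, and whether an automatic repair can cope with a truncated RocksDB CURRENT file is doubtful.)

Code:
# read the BlueStore labels straight off the block device (should work even if RocksDB is broken)
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-10/block
# attempt an automatic repair of the on-disk metadata
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-10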

Do you know of anything else that could be tried? Or, since all the data has been recovered, is there a way to just tell Ceph to forget about these OSDs and reclaim their storage space?
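
(For reference, a rough sketch of how one would tell Ceph to forget a dead OSD from the CLI, using osd.10 as the example; the Proxmox GUI wraps essentially the same steps.)

Code:
# mark the OSD out so its data is rebalanced onto the remaining OSDs
ceph osd out 10
# once `ceph -s` shows all PGs active+clean again, stop the daemon...
systemctl stop ceph-osd@10
# ...and remove it from the CRUSH map, auth database and OSD map in one step
ceph osd purge 10 --yes-i-really-mean-it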

thank you for your help
 
To be honest, I don't know. The person who set up the cluster left without leaving any documentation, but I think "probably not": on each node there are 2 SSDs for the system, 4 SSDs dedicated to Ceph, and no HDDs. Is there any way I can check to be sure, or any other information I can provide?
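
(For the record, a few read-only commands can answer this without any prior knowledge of the setup; osd.10 is used as the example id.)

Code:
# what Ceph itself recorded about the OSD: backing devices, rotational flag, bluefs layout
ceph osd metadata 10 | grep -E 'devices|rotational|bluefs'
# how ceph-volume created the OSDs on this node (shows block/db/wal devices and dmcrypt)
ceph-volume lvm list
# physical view of the disks; ROTA 1 = spinning disk, 0 = SSD
lsblk -o NAME,SIZE,TYPE,ROTA,MOUNTPOINT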
 
Update: I think the problem is that the OSD is on an encrypted partition ("block" points to /dev/dm-6) and what is failing is cryptsetup, as I see this in the log:
Code:
 [DM]: deactivating crypt device sddlpN-nfIt-2xs3-jexh-fvJc-lwcf-zAkfSM (dm-6)... skipping
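
(If the suspicion is that the dm-crypt mapping under the OSD never comes up, these read-only checks can confirm it; the mapping name below is simply the one from the log line above.)

Code:
# list active device-mapper targets of type crypt; the OSD's mapping should appear here
dmsetup ls --target crypt
# where does the OSD's block symlink actually point?
readlink -f /var/lib/ceph/osd/ceph-10/block
# status of the crypt mapping named in the log
cryptsetup status sddlpN-nfIt-2xs3-jexh-fvJc-lwcf-zAkfSM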

I'll keep looking on this front. Anyway, here are the outputs you asked for (and thank you so much for your assistance):
Code:
# ceph -s
  cluster:
    id:     4597b67f-2f30-44d2-bb9e-298f56e15119
    health: HEALTH_OK
 
  services:
    mon: 4 daemons, quorum pve01,pve02,pve03,pve04 (age 8d)
    mgr: pve02(active, since 9d), standbys: pve01, pve03, pve04
    osd: 16 osds: 14 up (since 8d), 14 in (since 9d)
 
  data:
    pools:   2 pools, 129 pgs
    objects: 588.63k objects, 2.2 TiB
    usage:   6.7 TiB used, 24 TiB / 31 TiB avail
    pgs:     129 active+clean
 
  io:
    client:   1.8 MiB/s rd, 3.6 MiB/s wr, 166 op/s rd, 205 op/s wr

and

Code:
# ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME     
-1         34.93115         -   31 TiB  6.7 TiB  6.6 TiB   80 MiB   37 GiB   24 TiB  21.76  1.00    -          root default 
-3          8.73279         -  6.5 TiB  1.6 TiB  1.6 TiB   26 MiB  7.7 GiB  4.9 TiB  25.12  1.15    -              host pve01
 0    hdd   2.18320   1.00000  2.2 TiB  461 GiB  460 GiB  457 KiB  1.2 GiB  1.7 TiB  20.63  0.95   26      up          osd.0
 4    hdd   2.18320   1.00000  2.2 TiB  655 GiB  652 GiB   25 MiB  3.1 GiB  1.5 TiB  29.30  1.35   38      up          osd.4
 8    hdd   2.18320         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.8
12    hdd   2.18320   1.00000  2.2 TiB  569 GiB  565 GiB  495 KiB  3.4 GiB  1.6 TiB  25.45  1.17   32      up          osd.12
-5          8.73279         -  8.7 TiB  1.7 TiB  1.7 TiB   26 MiB  8.7 GiB  7.1 TiB  19.04  0.87    -              host pve02
 1    hdd   2.18320   1.00000  2.2 TiB  495 GiB  494 GiB  504 KiB  1.2 GiB  1.7 TiB  22.16  1.02   28      up          osd.1
 5    hdd   2.18320   1.00000  2.2 TiB  427 GiB  423 GiB  395 KiB  3.4 GiB  1.8 TiB  19.09  0.88   24      up          osd.5
 9    hdd   2.18320   1.00000  2.2 TiB  372 GiB  370 GiB   25 MiB  1.5 GiB  1.8 TiB  16.64  0.76   22      up          osd.9
13    hdd   2.18320   1.00000  2.2 TiB  408 GiB  406 GiB  403 KiB  2.6 GiB  1.8 TiB  18.27  0.84   23      up          osd.13
-7          8.73279         -  6.5 TiB  1.6 TiB  1.6 TiB   26 MiB  9.3 GiB  5.0 TiB  24.31  1.12    -              host pve03
 2    hdd   2.18320   1.00000  2.2 TiB  585 GiB  582 GiB  498 KiB  3.4 GiB  1.6 TiB  26.17  1.20   33      up          osd.2
 6    hdd   2.18320   1.00000  2.2 TiB  601 GiB  598 GiB   25 MiB  2.8 GiB  1.6 TiB  26.89  1.24   35      up          osd.6
10    hdd   2.18320         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.10
14    hdd   2.18320   1.00000  2.2 TiB  444 GiB  441 GiB  437 KiB  3.1 GiB  1.7 TiB  19.88  0.91   25      up          osd.14
-9          8.73279         -  8.7 TiB  1.8 TiB  1.7 TiB  1.7 MiB   11 GiB  7.0 TiB  20.05  0.92    -              host pve04
 3    hdd   2.18320   1.00000  2.2 TiB  478 GiB  475 GiB  469 KiB  2.9 GiB  1.7 TiB  21.38  0.98   27      up          osd.3
 7    hdd   2.18320   1.00000  2.2 TiB  480 GiB  477 GiB  436 KiB  3.1 GiB  1.7 TiB  21.45  0.99   27      up          osd.7
11    hdd   2.18320   1.00000  2.2 TiB  513 GiB  511 GiB  460 KiB  2.6 GiB  1.7 TiB  22.96  1.06   29      up          osd.11
15    hdd   2.18320   1.00000  2.2 TiB  322 GiB  319 GiB  340 KiB  2.8 GiB  1.9 TiB  14.40  0.66   18      up          osd.15
                        TOTAL   31 TiB  6.7 TiB  6.6 TiB   80 MiB   37 GiB   24 TiB  21.76                                   
MIN/MAX VAR: 0.66/1.35  STDDEV: 3.99
 
Correction: they are not SSDs but 10k rpm HDDs. It had not occurred to me that they could be HDDs, since they are only 2.2 TB in size.
 
All your PGs are "active+clean": your data is as safe as it can be (probably 3 replicas if you are using the default configuration). I would simply destroy the damaged OSDs (be sure to tick "wipe disk") and recreate them. Do it one at a time and give Ceph time to rebalance between each deletion and recreation.
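
(For anyone doing this from the shell instead of the GUI, the Proxmox-native equivalent is roughly the following; osd.10 and /dev/sdX are placeholders.)

Code:
# mark the broken OSD out and stop its (already failing) service
ceph osd out 10
systemctl stop ceph-osd@10
# destroy it; --cleanup also wipes the LVM/partition metadata on the disk
pveceph osd destroy 10 --cleanup
# once Ceph is HEALTH_OK again, recreate an OSD on the wiped disk
pveceph osd create /dev/sdX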
 
Thank you Victor, that is more or less what I was planning to do. I'll see if I can do it from the web interface without breaking anything.
 
End of the story: I removed the two faulty OSDs using the Proxmox interface, then waited for the rebalancing to complete (a couple of hours). At that point I could directly add one of them back. The other disk was stuck in an undefined state where it still appeared to be in use; I got its id with lsblk and then tore down the encrypted volume with "dmsetup remove <id>", and it was good to go. (I also wiped the MBR and rebooted in the meanwhile, but I don't think that was necessary.)
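
(For completeness, the manual cleanup of a disk that is still held by a leftover mapping generally looks like this; the device and mapping names are placeholders.)

Code:
# see which device-mapper entries (crypt/LVM) still sit on top of the disk
lsblk /dev/sdX
# tear down the leftover mapping by the name lsblk / dmsetup ls reports
dmsetup remove <mapping-name>
# wipe the old Ceph/LVM signatures so the disk can be reused as a fresh OSD
ceph-volume lvm zap /dev/sdX --destroy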
 
