[SOLVED] Ceph OSD stopped working, unable to restart it

lama

New Member
Aug 28, 2022
Hello everyone.

I have a 3 node cluster that has been running well for a couple of years. Recently one of the OSDs has stopped working, and I'm unable to start it again.

Screenshot 2024-06-27 at 8.53.48 PM.png

The OSD doesn't start when I press the 'Start' button, and it behaves differently depending on whether the OSD is 'In' or 'Out' of the cluster.
  • Out: the Proxmox task log states 'SRV osd.0 - Start', but nothing seems to happen.
  • In: the following error appears (see the status-check sketch after the screenshot):
Screenshot 2024-06-27 at 9.00.24 PM.png
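
For completeness, the OSD's service status and journal can also be checked from the shell; a sketch below, assuming the standard systemd unit name for osd.0:

Code:
# check the systemd unit for osd.0 and its recent log output
systemctl status ceph-osd@0
journalctl -u ceph-osd@0 --since "1 hour ago"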

The ceph health status is below:

Screenshot 2024-06-27 at 9.05.50 PM.png
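
For reference, the same health information can be pulled on the CLI with the standard commands:

Code:
# cluster health and OSD overview from the shell
ceph -s
ceph health detail
ceph osd tree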

The last few lines of ceph-osd.0.log are below. Being inexperienced with Ceph, the only thing that stands out to me is the last line, "ERROR: osd init failed: (22) Invalid argument". However, I'm unsure what argument it's referring to, and I haven't made any changes to the Ceph configuration since the cluster was built (although I did upgrade from Proxmox 6.x to 8.2.2 a few weeks ago).

Code:
2024-06-27T20:56:30.102+1000 721e93c006c0  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1719485790103248, "job": 5, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [18806, 18794, 18779, 18761], "files_L1": [18742], "score": 1, "input_data_size": 44726858}
2024-06-27T20:56:30.107+1000 721ea5e8f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xa2190997, expected 0x19221daa, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2024-06-27T20:56:30.107+1000 721ea5e8f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xa2190997, expected 0x19221daa, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2024-06-27T20:56:30.108+1000 721ea5e8f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xa2190997, expected 0x19221daa, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2024-06-27T20:56:30.108+1000 721ea5e8f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xa2190997, expected 0x19221daa, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2024-06-27T20:56:30.108+1000 721ea5e8f3c0 -1 osd.0 0 OSD::init() : unable to read osd superblock
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  allocation stats probe 0: cnt: 0 frags: 0 size: 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  probe -1: 0,  0, 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  probe -2: 0,  0, 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  probe -4: 0,  0, 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  probe -8: 0,  0, 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  probe -16: 0,  0, 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0) ------------
2024-06-27T20:56:30.112+1000 721ea5e8f3c0  4 rocksdb: [db/db_impl/db_impl.cc:446] Shutdown: canceling all background work
2024-06-27T20:56:30.119+1000 721e93c006c0  4 rocksdb: (Original Log Time 2024/06/27-20:56:30.120256) [db/compaction/compaction_job.cc:812] [p-1] compacted to: files[4 1 0 0 0 0 0] max score 0.00, MB/sec: 2632.7 rd, 0.0 wr, level 1, files in(4, 1) out(0) MB in(7.6, 35.1) out(0.0), read-write-amplify(5.6) write-amplify(0.0) Shutdown in progress: Database shutdown, records in: 243557, records dropped: 243557 output_compression: NoCompression
2024-06-27T20:56:30.119+1000 721e93c006c0  4 rocksdb: (Original Log Time 2024/06/27-20:56:30.120274) EVENT_LOG_v1 {"time_micros": 1719485790120268, "job": 5, "event": "compaction_finished", "compaction_time_micros": 16989, "compaction_time_cpu_micros": 3893, "output_level": 1, "num_output_files": 0, "total_output_size": 0, "num_input_records": 243557, "num_output_records": 0, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 1, 0, 0, 0, 0, 0]}
2024-06-27T20:56:30.120+1000 721ea5e8f3c0  4 rocksdb: [db/db_impl/db_impl.cc:625] Shutdown complete
2024-06-27T20:56:30.203+1000 721ea5e8f3c0  1 bluefs umount
2024-06-27T20:56:30.203+1000 721ea5e8f3c0  1 bdev(0x59373c281000 /var/lib/ceph/osd/ceph-0/block) close
2024-06-27T20:56:30.468+1000 721ea5e8f3c0  1 freelist shutdown
2024-06-27T20:56:30.484+1000 721ea5e8f3c0  1 bdev(0x59373c159800 /var/lib/ceph/osd/ceph-0/block) close
2024-06-27T20:56:30.623+1000 721ea5e8f3c0 -1  ** ERROR: osd init failed: (22) Invalid argument


Any help would be much appreciated.
 
Maybe your storage drive is dying.

Replace it or recreate it.
Thank you.

I performed a health check on the disk using the following command, and it passed the SMART overall-health self-assessment test (ref. https://pve.proxmox.com/wiki/Disk_Health_Monitoring).

Code:
smartctl -a /dev/sdX

Based on this, the disk seems healthy, so perhaps it isn't hardware related? Maybe there's corruption in the Ceph data on the disk (e.g. due to an unexpected power loss)?

I tried running
Code:
 ceph-volume lvm activate --all
(from https://forum.proxmox.com/threads/ceph-osd-recovery.70338/), but that didn't seem to work.
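
For anyone retracing this, ceph-volume can also show whether the node still recognises the OSD's logical volume at all (the device path below is just an example):

Code:
# list all LVM-backed OSDs known on this node
ceph-volume lvm list
# or restrict the output to a single device
ceph-volume lvm list /dev/sdX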
 
Based on this, the disk seems healthy
Ummm, no. Dumping SMART data just shows whether any problems were logged in the past, not whether there are any faults present now.

Run a full test before you try to use the drive again:
smartctl --test=long /dev/sd[x]

Many SSDs don't even have SMART self-test functionality available, in which case I'd replace the disk on principle.
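
Once started, the long test runs in the drive's firmware in the background; progress and the final result can be checked afterwards, for example (device name is a placeholder):

Code:
# show the self-test log (result of the extended test once it completes)
smartctl -l selftest /dev/sdX
# full report, including self-test execution status / percent remaining
smartctl -a /dev/sdX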
 
Thanks for the advice. I've run a full test using the command above, and it passed (phew!).

Code:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%            15918  -
 
I've tried to destroy the OSD using the instructions from https://pve.proxmox.com/pve-docs/chapter-pveceph.html.

However, the following error message appears: binary not installed: /usr/bin/ceph-mon

Screenshot 2024-06-28 at 7.38.28 PM.png

Strangely, the file is there, and it has the same hash as the binaries on the nodes with working OSDs:

[host with broken OSD]
Code:
root@orcus:~# shasum /usr/bin/ceph-mon
d67c109877e5c90b072834108c13ecacddfc234a /usr/bin/ceph-mon

[hosts with working OSDs]
Code:
root@eris:~# shasum /usr/bin/ceph-mon
d67c109877e5c90b072834108c13ecacddfc234a /usr/bin/ceph-mon

root@ceres:~# shasum /usr/bin/ceph-mon
d67c109877e5c90b072834108c13ecacddfc234a /usr/bin/ceph-mon
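
Since the binary is clearly present, another way to double-check this (a hedged suggestion; the package name assumes the standard Debian/Proxmox Ceph packaging) is to verify the installed package and reinstall it if anything looks off:

Code:
# find which package owns the binary and verify its installed files
dpkg -S /usr/bin/ceph-mon
dpkg --verify ceph-mon
# reinstall if verification reports problems
apt install --reinstall ceph-mon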
 
Solved. For anyone encountering this issue in the future, these are the steps I took (a consolidated CLI-only sketch follows the list):
  1. Destroyed the OSD via the CLI:
    Code:
    pveceph osd destroy <ID>
    In hindsight I should have also destroyed the partition table using the cleanup option, i.e.
    Code:
    pveceph osd destroy <ID> -cleanup
    which should let you skip straight to the last step.
  2. Attempted to create an OSD via the GUI, but couldn't proceed because of 'No Disk unused'.
    Screenshot 2024-06-28 at 8.15.33 PM.png
  3. Attempted to delete the partition via the GUI, but encountered the following error.
    Screenshot 2024-06-28 at 8.22.18 PM.png
  4. Removed the lock on the partition using the following command:
    Code:
    dmsetup remove <ID beginning with "ceph">
    (thank you proxwolfe - https://forum.proxmox.com/threads/sda-has-a-holder.97771/post-513875)
  5. With the partition lock removed, deleted the partition using the GUI.
    Screenshot 2024-06-28 at 8.23.41 PM.png
  6. Recreated the OSD.
    Screenshot 2024-06-28 at 8.24.43 PM.png
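
For reference, here is roughly what the same recovery looks like done entirely from the shell. This is a sketch only: the OSD ID, device name and device-mapper name are placeholders, and the zap step is the CLI equivalent of deleting the partition in the GUI.

Code:
# destroy the failed OSD and clean up its partition table in one go
pveceph osd destroy 0 --cleanup

# if the disk is still held by a leftover device-mapper mapping, find and remove it
dmsetup ls | grep ceph
dmsetup remove <name reported by dmsetup ls>

# wipe the old LVM metadata and partitions on the disk
ceph-volume lvm zap /dev/sdX --destroy

# recreate the OSD on the now-unused disk
pveceph osd create /dev/sdX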
 
