[SOLVED] Ceph OSD stopped working, unable to restart it

lama

New Member
Aug 28, 2022
Hello everyone.

I have a 3 node cluster that has been running well for a couple of years. Recently one of the OSDs has stopped working, and I'm unable to start it again.

Screenshot 2024-06-27 at 8.53.48 PM.png

The OSD doesn't start when I press the 'Start' button, and it behaves differently depending on whether the OSD is 'In' or 'Out' of the cluster.
  • Out: the Proxmox task log states 'SRV osd.0 - Start', but nothing seems to happen.
  • In: the following error appears (see the status-check sketch after the screenshot):
Screenshot 2024-06-27 at 9.00.24 PM.png
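
For completeness, the OSD's service status and journal can also be checked from the shell; a sketch below, assuming the standard systemd unit name for osd.0:

Code:
# check the systemd unit for osd.0 and its recent log output
systemctl status ceph-osd@0
journalctl -u ceph-osd@0 --since "1 hour ago"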

The ceph health status is below:

Screenshot 2024-06-27 at 9.05.50 PM.png
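
For reference, the same health information can be pulled on the CLI with the standard commands:

Code:
# cluster health and OSD overview from the shell
ceph -s
ceph health detail
ceph osd tree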

The last few lines of ceph-osd.0.log are below. Being inexperienced with Ceph, the only thing that stands out to me is the last line, "ERROR: osd init failed: (22) Invalid argument". However, I'm unsure what argument it's referring to, and I haven't made any changes to the Ceph configuration since the cluster was built (although I did upgrade from Proxmox 6.x to 8.2.2 a few weeks ago).

Code:
2024-06-27T20:56:30.102+1000 721e93c006c0  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1719485790103248, "job": 5, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [18806, 18794, 18779, 18761], "files_L1": [18742], "score": 1, "input_data_size": 44726858}
2024-06-27T20:56:30.107+1000 721ea5e8f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xa2190997, expected 0x19221daa, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2024-06-27T20:56:30.107+1000 721ea5e8f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xa2190997, expected 0x19221daa, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2024-06-27T20:56:30.108+1000 721ea5e8f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xa2190997, expected 0x19221daa, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2024-06-27T20:56:30.108+1000 721ea5e8f3c0 -1 bluestore(/var/lib/ceph/osd/ceph-0) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xa2190997, expected 0x19221daa, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2024-06-27T20:56:30.108+1000 721ea5e8f3c0 -1 osd.0 0 OSD::init() : unable to read osd superblock
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  allocation stats probe 0: cnt: 0 frags: 0 size: 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  probe -1: 0,  0, 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  probe -2: 0,  0, 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  probe -4: 0,  0, 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  probe -8: 0,  0, 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0)  probe -16: 0,  0, 0
2024-06-27T20:56:30.108+1000 721e91e006c0  0 bluestore(/var/lib/ceph/osd/ceph-0) ------------
2024-06-27T20:56:30.112+1000 721ea5e8f3c0  4 rocksdb: [db/db_impl/db_impl.cc:446] Shutdown: canceling all background work
2024-06-27T20:56:30.119+1000 721e93c006c0  4 rocksdb: (Original Log Time 2024/06/27-20:56:30.120256) [db/compaction/compaction_job.cc:812] [p-1] compacted to: files[4 1 0 0 0 0 0] max score 0.00, MB/sec: 2632.7 rd, 0.0 wr, level 1, files in(4, 1) out(0) MB in(7.6, 35.1) out(0.0), read-write-amplify(5.6) write-amplify(0.0) Shutdown in progress: Database shutdown, records in: 243557, records dropped: 243557 output_compression: NoCompression
2024-06-27T20:56:30.119+1000 721e93c006c0  4 rocksdb: (Original Log Time 2024/06/27-20:56:30.120274) EVENT_LOG_v1 {"time_micros": 1719485790120268, "job": 5, "event": "compaction_finished", "compaction_time_micros": 16989, "compaction_time_cpu_micros": 3893, "output_level": 1, "num_output_files": 0, "total_output_size": 0, "num_input_records": 243557, "num_output_records": 0, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 1, 0, 0, 0, 0, 0]}
2024-06-27T20:56:30.120+1000 721ea5e8f3c0  4 rocksdb: [db/db_impl/db_impl.cc:625] Shutdown complete
2024-06-27T20:56:30.203+1000 721ea5e8f3c0  1 bluefs umount
2024-06-27T20:56:30.203+1000 721ea5e8f3c0  1 bdev(0x59373c281000 /var/lib/ceph/osd/ceph-0/block) close
2024-06-27T20:56:30.468+1000 721ea5e8f3c0  1 freelist shutdown
2024-06-27T20:56:30.484+1000 721ea5e8f3c0  1 bdev(0x59373c159800 /var/lib/ceph/osd/ceph-0/block) close
2024-06-27T20:56:30.623+1000 721ea5e8f3c0 -1  ** ERROR: osd init failed: (22) Invalid argument


Any help would be much appreciated.
 
Maybe your storage drive is dying.

Replace it or recreate it.
Thank you.

I performed a health check on the disk using the following command, and it passed the SMART overall-health self-assessment test (ref. https://pve.proxmox.com/wiki/Disk_Health_Monitoring).

Code:
smartctl -a /dev/sdX

Based on this, the disk seems healthy, so perhaps it isn't hardware related? Maybe there's corruption in the Ceph data on the disk (e.g. due to an unexpected power loss)?

I tried running
Code:
 ceph-volume lvm activate --all
(from https://forum.proxmox.com/threads/ceph-osd-recovery.70338/), but that didn't seem to work.
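
For anyone retracing this, ceph-volume can also show whether the node still recognises the OSD's logical volume at all (the device path below is just an example):

Code:
# list all LVM-backed OSDs known on this node
ceph-volume lvm list
# or restrict the output to a single device
ceph-volume lvm list /dev/sdX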
 
Based on this, the disk seems healthy
Ummm, no. Dumping SMART data just shows whether any problems were logged in the past, not whether there are any faults present now.

Run a full test before you try to use the drive again:
smartctl --test=long /dev/sd[x]

Many SSDs don't even have SMART self-test functionality available, in which case I'd replace the disk on principle.
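
Once started, the long test runs in the drive's firmware in the background; progress and the final result can be checked afterwards, for example (device name is a placeholder):

Code:
# show the self-test log (result of the extended test once it completes)
smartctl -l selftest /dev/sdX
# full report, including self-test execution status / percent remaining
smartctl -a /dev/sdX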
 
Thanks for the advice. I've run a full test using the command above, and it passed (phew!).

Code:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%            15918  -
 
I've tried to destroy the OSD using the instructions from https://pve.proxmox.com/pve-docs/chapter-pveceph.html.

However, the following error message appears: binary not installed: /usr/bin/ceph-mon

Screenshot 2024-06-28 at 7.38.28 PM.png

Strangely, the file is there, and it has the same hash as the binaries on the nodes with working OSDs:

[host with broken OSD]
Code:
root@orcus:~# shasum /usr/bin/ceph-mon
d67c109877e5c90b072834108c13ecacddfc234a /usr/bin/ceph-mon

[hosts with working OSDs]
Code:
root@eris:~# shasum /usr/bin/ceph-mon
d67c109877e5c90b072834108c13ecacddfc234a /usr/bin/ceph-mon

root@ceres:~# shasum /usr/bin/ceph-mon
d67c109877e5c90b072834108c13ecacddfc234a /usr/bin/ceph-mon
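
Since the binary is clearly present, another way to double-check this (a hedged suggestion; the package name assumes the standard Debian/Proxmox Ceph packaging) is to verify the installed package and reinstall it if anything looks off:

Code:
# find which package owns the binary and verify its installed files
dpkg -S /usr/bin/ceph-mon
dpkg --verify ceph-mon
# reinstall if verification reports problems
apt install --reinstall ceph-mon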
 
Solved. For anyone encountering this issue in the future, these are the steps I took (a consolidated CLI-only sketch follows the list):
  1. Destroyed the OSD via the CLI:
    Code:
    pveceph osd destroy <ID>
    In hindsight I should have also destroyed the partition table using the cleanup option, i.e.
    Code:
    pveceph osd destroy <ID> -cleanup
    which should let you skip straight to the last step.
  2. Attempted to create an OSD via the GUI, but couldn't proceed because of 'No Disk unused'.
    Screenshot 2024-06-28 at 8.15.33 PM.png
  3. Attempted to delete the partition via the GUI, but encountered the following error.
    Screenshot 2024-06-28 at 8.22.18 PM.png
  4. Removed the lock on the partition using the following command:
    Code:
    dmsetup remove <ID beginning with "ceph">
    (thank you proxwolfe - https://forum.proxmox.com/threads/sda-has-a-holder.97771/post-513875)
  5. With the partition lock removed, deleted the partition using the GUI.
    Screenshot 2024-06-28 at 8.23.41 PM.png
  6. Recreated the OSD.
    Screenshot 2024-06-28 at 8.24.43 PM.png
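
For reference, here is roughly what the same recovery looks like done entirely from the shell. This is a sketch only: the OSD ID, device name and device-mapper name are placeholders, and the zap step is the CLI equivalent of deleting the partition in the GUI.

Code:
# destroy the failed OSD and clean up its partition table in one go
pveceph osd destroy 0 --cleanup

# if the disk is still held by a leftover device-mapper mapping, find and remove it
dmsetup ls | grep ceph
dmsetup remove <name reported by dmsetup ls>

# wipe the old LVM metadata and partitions on the disk
ceph-volume lvm zap /dev/sdX --destroy

# recreate the OSD on the now-unused disk
pveceph osd create /dev/sdX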
 
