Ceph osd init failed

koniambo

New Member
Jul 4, 2025
Hi,

After running some fio benchmarks on my 3 nodes I noticed that 2 of my OSDs are down in the cluster. I can't restart the service, and I also tried rebooting the node.

tail /var/log/ceph/ceph-osd.0.log
2025-07-23T10:36:27.632+0200 72bfaf4ea880 1 bdev(0x5a80d2f02e00 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2025-07-23T10:36:27.632+0200 72bfaf4ea880 0 bdev(0x5a80d2f02e00 /var/lib/ceph/osd/ceph-0/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-0/block failed: (22) Invalid argument
2025-07-23T10:36:27.632+0200 72bfaf4ea880 1 bdev(0x5a80d2f02e00 /var/lib/ceph/osd/ceph-0/block) open size 1600319913984 (0x1749a800000, 1.5 TiB) block_size 4096 (4 KiB) non-rotational device, discard supported
2025-07-23T10:36:27.632+0200 72bfaf4ea880 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label unable to decode label /var/lib/ceph/osd/ceph-0/block at offset 66: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
2025-07-23T10:36:27.632+0200 72bfaf4ea880 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label unable to decode label /var/lib/ceph/osd/ceph-0/block at offset 4096: End of buffer [buffer:2]
2025-07-23T10:36:27.633+0200 72bfaf4ea880 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label unable to decode label /var/lib/ceph/osd/ceph-0/block at offset 4096: End of buffer [buffer:2]
2025-07-23T10:36:27.633+0200 72bfaf4ea880 -1 bluestore(/var/lib/ceph/osd/ceph-0) _check_main_bdev_label not all labels read properly
2025-07-23T10:36:27.633+0200 72bfaf4ea880 1 bdev(0x5a80d2f02e00 /var/lib/ceph/osd/ceph-0/block) close
2025-07-23T10:36:27.894+0200 72bfaf4ea880 -1 osd.0 0 OSD:init: unable to mount object store
2025-07-23T10:36:27.894+0200 72bfaf4ea880 -1 ** ERROR: osd init failed: (5) Input/output error



These are the commands I used for the benchmark:

fio --ioengine=libaio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio
fio --ioengine=libaio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio

Thanks for any help
 
Turns out that my /var/lib/ceph/osd/ceph-* directory is empty.


On the broken OSD

ls -l /var/lib/ceph/osd/ceph-2/
total 0

On a healthy OSD

ls -l /var/lib/ceph/osd/ceph-1/
total 28
lrwxrwxrwx 1 ceph ceph 93 4 juil. 10:22 block -> /dev/ceph-988edeec-dd68-450e-bbe7-b387e4101768/osd-block-bfd000b7-eadd-4c00-83f5-7bf92a64ee33
-rw------- 1 ceph ceph 37 4 juil. 10:22 ceph_fsid
-rw------- 1 ceph ceph 37 4 juil. 10:22 fsid
-rw------- 1 ceph ceph 55 4 juil. 10:22 keyring
-rw------- 1 ceph ceph 6 4 juil. 10:22 ready
-rw------- 1 ceph ceph 3 4 juil. 10:22 require_osd_release
-rw------- 1 ceph ceph 10 4 juil. 10:22 type
-rw------- 1 ceph ceph 2 4 juil. 10:22 whoami

What happened?
 
ERROR: osd init failed: (5) Input/output error
These are the commands I used for the benchmark:

fio --ioengine=libaio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio
fio --ioengine=libaio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio

Where did you run the FIO benchmark? In a VM or on the host?

If on the host: is /dev/sda by any chance the disk that is used by that OSD? lsblk should show the LVs and VGs and which physical disk they are located on.
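For reference, on a node where the OSD volume is still intact, the OSD disk should show an lvm entry underneath it; the device name, sizes and the <...> placeholders below are only illustrative:

Code:
lsblk -o NAME,SIZE,TYPE /dev/sda
NAME                                       SIZE TYPE
sda                                        1.5T disk
└─ceph--<vg uuid>-osd--block--<osd fsid>   1.5T lvm

If that lvm line is missing for the OSD disk, the bluestore volume is gone.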

Do you see any other I/O errors in dmesg or the journal?
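A quick way to check could be something like this (generic commands, not specific to this setup):

Code:
# kernel-level I/O errors on the disks
dmesg -T | grep -iE 'i/o error|blk_update_request'
# errors in the journal since the last boot
journalctl -k -b -p err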
 
Well, if I remember correctly:
Running fio directly against a block device will overwrite, and effectively wipe, that device.

Therefore the bluestore LVM volume might be gone, which fits with the following line of the log file:
Code:
2025-07-23T10:36:27.633+0200 72bfaf4ea880 -1 bluestore(/var/lib/ceph/osd/ceph-0) _check_main_bdev_label not all labels read properly

To check this, could you post the result of the lsblk command that aaron requested?

For testing I/O performance with fio, we usually run it from within a VM.
That way the test is end-to-end and the impact of bus, cache, network speed and all other side effects is included in the measurement.
When only testing a single OSD device, network performance is not taken into consideration, and that is very often the bottleneck.
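A rough sketch of such a test from inside a guest could look like this; the file name and size are just examples, and writing to a file (instead of a raw device) avoids destroying any data:

Code:
# run inside the VM; writes to a test file, not to a block device
fio --ioengine=libaio --filename=/root/fio-testfile --size=4G \
    --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --name=fio-vm
rm /root/fio-testfile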

BR, Lucas
 
Hi,

I ran the fio commands on the OSD disks of my Ceph cluster and it corrupted them. I followed these steps: https://www.proxmox.com/images/download/pve/docs/Proxmox-VE-Ceph-Benchmark-202312-rev0.pdf

But it seems that you need to do it before installing the OSDs, I guess.

The lsblk command showed nothing wrong

For testing I/O performance with fio, we usually run it from within a VM.
That way the test is end-to-end and the impact of bus, cache, network speed and all other side effects is included in the measurement.
When only testing a single OSD device, network performance is not taken into consideration, and that is very often the bottleneck.

I wanted to compare the I/O of rados bench and fio. Is testing it in a VM a good way to do this?

Thanks for your help. I just reinstalled my Ceph (it was a testing cluster).
 
But it seems that you need to do it before installing the OSDs, I guess.
Absolutely!
If you still had enough OSDs and at least one replica of all PGs (no unfound PGs), you could also have cleaned up the corrupted OSDs, wiped the disks, and set them up again from scratch.

So, very roughly, off the top of my head (a concrete example follows below the list):
  1. vgchange -an {affected VG}
  2. rm -rf /var/lib/ceph/osd/ceph-{OSD ID}
  3. systemctl disable ceph-osd@{OSD ID}.service
  4. ceph osd purge {OSD ID}
  5. Do a "Wipe Disk" in the Node → Disk Panel
  6. Recreate the OSD as usual
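For example, for the broken osd.0 from this thread it could look roughly like this; the VG name is a placeholder (check lsblk or vgs for the real one), and this is only safe while the remaining OSDs still hold all the data:

Code:
vgchange -an ceph-<vg uuid of the broken OSD>   # 1. deactivate the affected VG
rm -rf /var/lib/ceph/osd/ceph-0                 # 2. remove the leftover OSD directory
systemctl disable ceph-osd@0.service            # 3. stop systemd from starting it again
ceph osd purge 0 --yes-i-really-mean-it         # 4. remove it from the CRUSH map, OSD map and auth
# 5. + 6. then "Wipe Disk" in the Node → Disks panel and recreate the OSD as usual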
 
The lsblk command showed nothing wrong

The bluestore LVM volume was supposed to be in that output; it gets created when the OSD is created on the disk.
If no such volume was listed, that was the result of the fio run. :)

I wanted to compare the I/O of rados bench and fio. Is testing it in a VM a good way to do this?
That depends :)

If you want to compare rados with the native speed of the drive the OSD uses, it is suitable to run fio against the OSD drives (before the OSDs are created on them).

Since you most likely also want to know the speeds your VMs/guests are experiencing, we usually run the speed test from within the VMs/guests as well.
As the benchmark report also points out, the strongest bottleneck is usually the network setup.
So for meaningful testing you should choose a setup that incorporates (all of) the bottlenecks.
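For the cluster-level side of that comparison, rados bench can be run like this; the pool name and runtime are just examples:

Code:
# write test against a test pool, keep the objects for the read test
rados bench -p testpool 60 write --no-cleanup
# sequential read test using the objects written above
rados bench -p testpool 60 seq
# remove the benchmark objects afterwards
rados -p testpool cleanup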

BR, Lucas
 