Ceph OSD crashes on I/O errors under light load - SSD via PCIe adapter

mercifultape

Member
Dec 18, 2023
Hello,

I wanted to set up some VMs on Ceph RBD; however, when cloning a VM from a template, random Ceph OSDs crash with I/O errors.

Code:
May 07 22:52:29 host3 ceph-osd[4201]: 2026-05-07T22:52:29.798+0200 77f8cb7b16c0 -1 bdev(0x57611ba92400 /var/lib/ceph/osd/ceph-2/block) _aio_thread got r=-5 ((5) Input/output error)
May 07 22:52:29 host3 ceph-osd[4201]: ./src/blk/kernel/KernelDevice.cc: 687: ceph_abort_msg("Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!")
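The abort message says to check the kernel log; the matching entries around the crash can be pulled up with e.g.:

Code:
journalctl -k -b | grep -iE 'nvme|i/o error'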

HW: Samsung 990 Pro via a PCIe -> M.2 adapter

Versions (basically latest on no-subscription):
PVE 9.1.9, kernel 7.0.0-3-pve, Ceph 19.2.3 (Squid)

The drive is barely used and looks fine according to diagnostic tools:
SMART: 0 media errors, 100% spare, 0% wear
nvme error-log: 64 entries all zero
nvme device-self-test: passed
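For reference, that is roughly the following (assuming the drive is /dev/nvme0; adjust to your device):

Code:
smartctl -a /dev/nvme0                  # media errors, available spare, wear
nvme error-log /dev/nvme0               # controller error log (64 entries by default)
nvme device-self-test /dev/nvme0 -s 1   # start a short self-test
nvme self-test-log /dev/nvme0           # check the self-test result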

AER does not seem to be supported:
Code:
May 07 22:30:00 host3 kernel: acpi PNP0A08:00: _OSC: platform does not support [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
May 07 22:30:00 host3 kernel: acpi PNP0A08:01: _OSC: platform does not support [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
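Whether a device exposes the AER capability at all can be double-checked with e.g.:

Code:
lspci -vv | grep -i 'advanced error reporting'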

I tried nvme_core.default_ps_max_latency_us=0 on the kernel cmdline (this disables NVMe APST), but it doesn't help.
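In case it helps anyone reproduce: on a GRUB-booted system that means e.g.

Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0"

then running "update-grub" and rebooting (on a ZFS/systemd-boot install, edit /etc/kernel/cmdline and run "proxmox-boot-tool refresh" instead).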

Has anybody had a similar experience? Any ideas?

Thank you.
 


When even the kernel shows an I/O error (in dmesg), it is time to replace the device.
Are you suggesting that all three drives across the Ceph cluster, which are basically new and unused, are faulty? That doesn't seem to be the case here: the NVMe controller error log is empty and the self-test passes. My suspects for now are the PCIe-to-M.2 adapters and the kernel.
 
Try to update the firmware of your Samsung drives.
Try to disable PCIe ASPM in the BIOS, or any other power-saving feature.
Check if the drives overheat -> improve cooling.
Could also be an issue with the PCIe -> M.2 adapter... (a few commands to check these points are below)
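For example (assuming the drive shows up as /dev/nvme0; adjust to your device):

Code:
nvme id-ctrl /dev/nvme0 | grep '^fr'             # current firmware revision ("fr" field)
nvme smart-log /dev/nvme0 | grep -i temperature  # drive temperature
# if the BIOS has no ASPM switch, it can also be forced off
# with pcie_aspm=off on the kernel cmdline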
 
ASPM seems to be disabled already:

Code:
LnkCap: Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM Disabled; RCB 64 bytes
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
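(Output from something like the following, where 01:00.0 stands in for the NVMe controller's PCI address; find yours with "lspci | grep -i non-volatile":)

Code:
lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkCtl|L1SubCtl'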
 

I ran into almost the same issue with a consumer NVMe on a PCIe-to-M.2 adapter. The drive looked perfectly healthy in SMART and all the NVMe tests passed, but Ceph would still crash the OSD with I/O errors. In the end, the problem was the adapter or the PCIe slot, not the SSD itself. I moved the drive to another slot and the crashes stopped completely. Ceph is very sensitive and tends to expose hardware issues that don’t show up during normal use. Since your 990 Pro seems fine, I’d start by reseating the adapter, trying a different PCIe slot, and updating the motherboard BIOS if you haven’t already.
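If you try that, it's worth taking the OSD down cleanly first so Ceph doesn't start rebalancing while the drive is out. Something along these lines (OSD id 2 taken from your log, adjust as needed):

Code:
ceph osd set noout           # don't rebalance while the OSD is down
systemctl stop ceph-osd@2    # stop the affected OSD
# ... power off, reseat the adapter / move it to another slot ...
systemctl start ceph-osd@2
ceph osd unset noout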