Ceph OSD crashes on I/O errors under light load - SSD via PCIe adapter

mercifultape

Member
Dec 18, 2023
Hello,

I wanted to set up some VMs on Ceph RBD; however, when cloning a VM from a template, random Ceph OSDs crash on I/O errors.

Code:
May 07 22:52:29 host3 ceph-osd[4201]: 2026-05-07T22:52:29.798+0200 77f8cb7b16c0 -1 bdev(0x57611ba92400 /var/lib/ceph/osd/ceph-2/block) _aio_thread got r=-5 ((5) Input/output error)
May 07 22:52:29 host3 ceph-osd[4201]: ./src/blk/kernel/KernelDevice.cc: 687: ceph_abort_msg("Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!")
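The abort message points at the kernel log, so the matching kernel-side error should show what the NVMe driver saw at that moment. A minimal sketch for pulling it (the time window and the nvme0n1 device name are assumptions; substitute whatever backs /var/lib/ceph/osd/ceph-2/block):

```shell
# Kernel messages around the OSD abort; adjust the window to the crash time.
journalctl -k --since "22:52:00" --until "22:53:00" \
  | grep -Ei 'nvme|blk_update_request|i/o error'
```

If the drive briefly dropped off the bus, there would typically be controller reset or timeout lines from the nvme driver in that window.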

HW: Samsung 990 Pro via a PCIe -> M.2 adapter

Versions (basically latest on the no-subscription repo):
PVE 9.1.9, kernel 7.0.0-3-pve, Ceph 19.2.3 (squid)

The drive is barely used and looks healthy according to diagnostic tools:
SMART: 0 media errors, 100% spare, 0% wear
nvme error-log: 64 entries all zero
nvme device-self-test: passed

AER does not seem to be supported:
Code:
May 07 22:30:00 host3 kernel: acpi PNP0A08:00: _OSC: platform does not support [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
May 07 22:30:00 host3 kernel: acpi PNP0A08:01: _OSC: platform does not support [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
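With _OSC denying AER to the OS, the kernel can't report link errors itself, but lspci can still show whether the devices advertise the AER capability and what ASPM state the NVMe link is in. A sketch (the 01:00.0 bus address is an assumption; the real one can be found with lspci -t):

```shell
# Which devices expose the AER capability at all:
lspci -vv 2>/dev/null | grep -i 'Advanced Error Reporting'
# ASPM state on the NVMe link:
lspci -s 01:00.0 -vv 2>/dev/null | grep -i 'LnkCtl:'
```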

I tried nvme_core.default_ps_max_latency_us=0 on the kernel command line (disables NVMe APST), but it doesn't help.
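In case anyone asks whether that change actually applied after reboot, a quick way to verify (the /dev/nvme0 name is an assumption):

```shell
# Did the parameter reach the running kernel?
grep -o 'nvme_core.default_ps_max_latency_us=[0-9]*' /proc/cmdline
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us   # expect 0
# Does the drive still report the APST feature state?
nvme get-feature /dev/nvme0 -f 0x0c -H | head -n 1
```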

Does anybody have similar experience? Any ideas?

Thank you.