Ceph OSD crashes on I/O errors under light load with kernel 7.x - SSD via PCIe adapter - FIXED with 6.17.13-7-pve

mercifultape

Member
Dec 18, 2023
Hello,

I wanted to set up some VMs on Ceph RBD; however, when cloning a VM from a template, random Ceph OSDs crash on I/O errors.

Code:
May 07 22:52:29 host3 ceph-osd[4201]: 2026-05-07T22:52:29.798+0200 77f8cb7b16c0 -1 bdev(0x57611ba92400 /var/lib/ceph/osd/ceph-2/block) _aio_thread got r=-5 ((5) Input/output error)
May 07 22:52:29 host3 ceph-osd[4201]: ./src/blk/kernel/KernelDevice.cc: 687: ceph_abort_msg("Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!")
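As the abort message says, the matching kernel-side errors are worth pulling from the journal. A quick way to do that (timestamps and patterns are examples, adjust to your host):

```shell
# Kernel messages around the OSD crash; look for NVMe resets, timeouts
# or controller drops that line up with the ceph-osd abort.
journalctl -k --since "2026-05-07 22:50" --until "2026-05-07 22:55" | grep -Ei 'nvme|i/o error|timeout|reset'

# Or live, with human-readable timestamps:
dmesg -T | grep -Ei 'nvme|i/o error'
```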

HW: Samsung 990 Pro via PCIe -> M.2 adapter

Versions (latest on the no-subscription repo): PVE 9.1.9, kernel 7.0.0-3-pve, Ceph 19.2.3 (Squid)

The drive is barely used and looks healthy according to diagnostic tools:
SMART: 0 media errors, 100% spare, 0% wear
nvme error-log: 64 entries all zero
nvme device-self-test: passed

AER does not seem to be supported:
Code:
May 07 22:30:00 host3 kernel: acpi PNP0A08:00: _OSC: platform does not support [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
May 07 22:30:00 host3 kernel: acpi PNP0A08:01: _OSC: platform does not support [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]

I tried adding nvme_core.default_ps_max_latency_us=0 to the kernel cmdline (this disables NVMe APST), but it doesn't help.
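In case someone wants to try the same parameter, this is roughly how it is persisted on PVE (which file applies depends on your bootloader; these are the stock paths):

```shell
# GRUB-booted systems: add the parameter to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then regenerate the config
update-grub

# systemd-boot (e.g. ZFS root): append it to the single line in
# /etc/kernel/cmdline, then refresh the boot entries
proxmox-boot-tool refresh

# After a reboot, confirm it took effect
grep -o 'nvme_core.default_ps_max_latency_us=[0-9]*' /proc/cmdline
```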

Does anybody have a similar experience? Any ideas?

Thank you.
 


When even the kernel shows an I/O error (in dmesg), it is time to replace the device.
Are you suggesting that all three drives across the Ceph cluster, basically new and unused, are faulty? That doesn't seem to be the case here: the NVMe controller error log is empty and the self-test passes. My current suspects are the PCIe-to-M.2 adapters and the kernel.
 
Try to update the firmware of your Samsung drives.
Try to disable PCIe ASPM in BIOS or any other power saving feature.
Check if the drives overheat -> improve cooling.
Could also be an issue with the pcie -> m.2 adapter ...
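If you go the firmware route, the current revision can be checked with nvme-cli before and after flashing, and the same tool shows the drive temperature (device path is an example):

```shell
# Firmware revision currently running ('fr' field)
nvme id-ctrl /dev/nvme0 | grep -i '^fr '

# Firmware slot log: which slots hold which revision
nvme fw-log /dev/nvme0

# Drive temperature, to rule out overheating/thermal throttling
nvme smart-log /dev/nvme0 | grep -i temperature
```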
 
Try to update the firmware of your Samsung drives.
Try to disable PCIe ASPM in BIOS or any other power saving feature.
Check if the drives overheat -> improve cooling.
Could also be an issue with the pcie -> m.2 adapter ...
ASPM seems to be disabled already:

Code:
LnkCap: Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM Disabled; RCB 64 bytes
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
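For reference, that output comes from lspci run against the NVMe controller; to reproduce it and also check the system-wide ASPM policy (the PCI address 01:00.0 is an example):

```shell
# Find the PCI address of the NVMe controller
lspci | grep -i 'non-volatile memory'

# Link capabilities and current ASPM/L1 substate settings
lspci -vvv -s 01:00.0 | grep -E 'LnkCap:|LnkCtl:|L1SubCtl'

# Kernel-wide ASPM policy; 'performance' keeps ASPM off everywhere
cat /sys/module/pcie_aspm/parameters/policy
```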