Ceph OSD crashes on I/O errors under light load with kernel 7.x - SSD via PCIe adapter - FIXED with 6.17.13-7-pve

mercifultape

Member
Dec 18, 2023
Hello,

I wanted to set up some VMs on Ceph RBD; however, when cloning VMs from a template, random Ceph OSDs crash on I/O errors.

Code:
May 07 22:52:29 host3 ceph-osd[4201]: 2026-05-07T22:52:29.798+0200 77f8cb7b16c0 -1 bdev(0x57611ba92400 /var/lib/ceph/osd/ceph-2/block) _aio_thread got r=-5 ((5) Input/output error)
May 07 22:52:29 host3 ceph-osd[4201]: ./src/blk/kernel/KernelDevice.cc: 687: ceph_abort_msg("Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!")
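The abort message points at the kernel log; matching NVMe or block-layer errors should show up there with something along these lines (just a sketch, the time window is an example):
Code:
# kernel messages around the time of the OSD abort
journalctl -k --since "2026-05-07 22:50" | grep -iE "nvme|blk_update_request|I/O error"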

HW: Samsung 990 Pro via PCIe -> M.2 adapter

Versions (basically the latest on the no-subscription repo): PVE 9.1.9, kernel 7.0.0-3-pve, Ceph 19.2.3 (Squid)

The drive is barely used and looks fine according to diagnostic tools:
SMART: 0 media errors, 100% spare, 0% wear
nvme error-log: 64 entries, all zero
nvme device-self-test: passed
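For reference, roughly the commands behind those numbers (just a sketch; /dev/nvme0 stands in for the actual device):
Code:
smartctl -a /dev/nvme0                   # media errors, available spare, percentage used
nvme error-log /dev/nvme0                # controller error log entries
nvme device-self-test /dev/nvme0 -s 1    # kick off a short self-test
nvme self-test-log /dev/nvme0            # read the result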

AER does not seem to be supported:
Code:
May 07 22:30:00 host3 kernel: acpi PNP0A08:00: _OSC: platform does not support [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
May 07 22:30:00 host3 kernel: acpi PNP0A08:01: _OSC: platform does not support [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
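For reference, whether a given port/device actually exposes AER can also be checked along these lines (the PCIe address is a placeholder):
Code:
dmesg | grep -i aer
lspci -vvv -s <pcie-address-of-nvme> | grep -A3 "Advanced Error Reporting"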

I tried nvme_core.default_ps_max_latency_us=0 on the kernel cmdline (it disables NVMe APST); it doesn't help.
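For anyone wanting to reproduce that: the parameter goes onto the kernel command line roughly like this (GRUB variant shown; systemd-boot setups edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead):
Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0"
# then apply and verify after reboot
update-grub
reboot
cat /proc/cmdline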

Has anybody had a similar experience? Any ideas?

Thank you.

UPDATE: After downgrading to kernel 6.17.13-7-pve, the Ceph OSDs no longer crash.
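In case it helps others, pinning the older kernel on PVE goes roughly like this (check the exact version string with the list command first):
Code:
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.17.13-7-pve
reboot
uname -r    # should now report 6.17.13-7-pve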
 

When even the kernel shows an I/O error (in dmesg), it is time to replace the device.
Are you suggesting that all 3 drives across the Ceph cluster, which are basically new and unused, are faulty? That doesn't seem to be the case here: the NVMe controller error log is empty and the self-test passes. My suspects for now are the PCIe-to-M.2 adapters and the kernel.
 
Try to update the firmware of your Samsung drives.
Try to disable PCIe ASPM in BIOS or any other power saving feature.
Check if the drives overheat -> improve cooling.
Could also be an issue with the PCIe -> M.2 adapter ...
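If the BIOS has no ASPM toggle, it can also be forced off from the kernel side, roughly like this (a sketch, not a guaranteed fix):
Code:
# add to the kernel command line, then update-grub / proxmox-boot-tool refresh and reboot
pcie_aspm=off
# verify the active policy afterwards
cat /sys/module/pcie_aspm/parameters/policy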
 
ASPM seems to be disabled already:

Code:
LnkCap: Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM Disabled; RCB 64 bytes
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
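(The output above can be reproduced with something like the following, and the overheating suggestion can be checked the same way; the PCIe address and device path are placeholders.)
Code:
lspci -vvv -s <pcie-address-of-nvme> | grep -E "LnkCap|LnkCtl|L1SubCtl"
nvme smart-log /dev/nvme0 | grep -i temperature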