CEPH OSD Crash NVME to PCIE Adapter

mn234

Renowned Member
Mar 27, 2016
I am getting crashes on some NVMe OSDs that are running on NVMe-to-PCIe adapters.

Here's the hardware:

MINISFORUM MS-01-S1260 Mini PC with Intel Core i5-12600H
64 GB Memory
2x NVMe disks in RAID1 for the OS
2x NVMe disks for Ceph (one directly on the mobo, the other on an NVMe-to-PCIe adapter; the disk on the mobo doesn't have any problems, only the one on the PCIe adapter does)

NVMe to PCIe Adapter: ICY DOCK M.2 NVMe SSD to PCIe 3.0/4.0 x4
NVMe OSD: MZ1LB1T9HALS Samsung PM983 1.92TB NVMe PCIe M.2 22110 SSD MZ-1LB1T90 (one on PCIe adapter and one on mobo)

The Ceph backend runs on a 10-Gig link with a 1-Gig backup link.

My hunch is that this happens when the disks are being utilized at near-max capacity, or during some slight traffic hiccup. I also see the issue when the Ceph link fails over from the 10-Gig to the 1-Gig, e.g. during switch updates. Maybe the PCIe adapter has a limit on IOPS?

A reboot of the OS fixes the issue.
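
To test the adapter hypothesis, I've been meaning to check whether the adapter-mounted drive actually negotiates its full PCIe link. This is just a sketch of what I'd run; the PCI address is a placeholder and will differ per node:

Code:
# Find the NVMe controllers and note their PCI addresses
lspci | grep -i 'non-volatile'

# Check the negotiated link speed/width of the drive on the adapter
# (replace 01:00.0 with the address from the previous command)
lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'

If LnkSta reports a lower speed or width than LnkCap, the adapter or slot would be the bottleneck.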

Here's the latest crash report.


Code:
{
    "assert_condition": "abort",
    "assert_file": "./src/blk/kernel/KernelDevice.cc",
    "assert_func": "void KernelDevice::_aio_thread()",
    "assert_line": 687,
    "assert_msg": "./src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 79ce7ca646c0 time 2025-10-28T22:07:52.729287-0500\n./src/blk/kernel/KernelDevice.cc: 687: ceph_abort_msg(\"Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!\")\n",
    "assert_thread_name": "bstore_aio",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libc.so.6(+0x3fdf0) [0x79ce88e49df0]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x9495c) [0x79ce88e9e95c]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x182) [0x62671961db39]",
        "(KernelDevice::_aio_thread()+0xac2) [0x62671a3e1082]",
        "(KernelDevice::AioCompletionThread::entry()+0x11) [0x62671a3e86b1]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x92b7b) [0x79ce88e9cb7b]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x1107b8) [0x79ce88f1a7b8]"
    ],
    "ceph_version": "19.2.3",
    "crash_id": "2025-10-29T03:07:52.732265Z_b9449e2b-36f1-4c91-bd26-977cd4027c74",
    "entity_name": "osd.5",
    "io_error": true,
    "io_error_code": -5,
    "io_error_devname": "dm-1",
    "io_error_length": 24576,
    "io_error_offset": 602154524672,
    "io_error_optype": 8,
    "io_error_path": "/var/lib/ceph/osd/ceph-5/block",
    "os_id": "13",
    "os_name": "Debian GNU/Linux 13 (trixie)",
    "os_version": "13 (trixie)",
    "os_version_id": "13",
    "process_name": "ceph-osd",
    "stack_sig": "e0dac35f07f78a02f8d4ab554909b29d6cf03dd362259f5dba5cf65f38324228",
    "timestamp": "2025-10-29T03:07:52.732265Z",
    "utsname_hostname": "vmhost6",
    "utsname_machine": "x86_64",
    "utsname_release": "6.14.11-3-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.14.11-3 (2025-09-22T10:13Z)"
}
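
The abort message says to check the kernel log, and the report shows io_error_code -5 (EIO) on dm-1, so this is roughly what I've been collecting after each crash. The device names are placeholders for my setup and may enumerate differently:

Code:
# Kernel messages from the current boot, filtered for NVMe / block I/O errors
journalctl -k -b | grep -iE 'nvme|i/o error'

# Health of the drive on the adapter (nvme1 is a placeholder)
nvme smart-log /dev/nvme1
smartctl -a /dev/nvme1

# List recent Ceph crash reports and dump a specific one
ceph crash ls
ceph crash info <crash_id>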

Has anyone run into this issue, or does anyone have an idea what I could do to alleviate it? One hypothesis I had was to somehow limit the OSD's IOPS; a rough sketch of what I mean is below.
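
For the throttling idea, the only knobs I've found so far are the mClock capacity override and the BlueStore aio queue depth. This is only a sketch of what I was planning to experiment with; the values are guesses, not tested recommendations:

Code:
# Tell the mClock scheduler this OSD has less IOPS capacity than it measured
# (10000 is an arbitrary guess)
ceph config set osd.5 osd_mclock_max_capacity_iops_ssd 10000

# Reduce the number of in-flight aios BlueStore submits to the device
# (256 is an arbitrary guess; the default is much larger)
ceph config set osd.5 bdev_aio_max_queue_depth 256

# Restart the OSD so the bdev setting takes effect
systemctl restart ceph-osd@5

I'm not sure the mClock override actually caps what the device sees rather than just reweighting the scheduler, so I'd treat this as an experiment rather than a fix.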