I am getting crashes on some NVMe OSDs running on NVMe to PCIe adapters.
Here's the hardware:
MINISFORUM MS-01-S1260 Mini PC with Intel Core i5-12600H
64 GB Memory
2x NVMe disks in RAID1 for the OS
2x NVMe disks for Ceph (one directly on the motherboard, the other on an NVMe to PCIe adapter; the ones on the motherboard don't have any problems, only the ones on the PCIe adapter do)
NVMe to PCIe Adapter: ICY DOCK M.2 NVMe SSD to PCIe 3.0/4.0 x4
NVMe OSD: MZ1LB1T9HALS Samsung PM983 1.92TB NVMe PCIe M.2 22110 SSD MZ-1LB1T90 (one on PCIe adapter and one on mobo)
The Ceph backend runs on a 10-Gig link with a 1-Gig backup link.
My hunch is that this happens when the disks are being pushed near their maximum capacity, or on some slight traffic hiccup. I also see the issue when the Ceph link fails over from the 10-Gig to the 1-Gig, for example during switch updates. Maybe the PCIe adapter has an IOPS limit?
A reboot of the OS fixes the issue.
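For what it's worth, here's roughly what I plan to capture next time before rebooting, since the reboot clears the state. /dev/nvme1 is just a placeholder for the adapter-mounted disk, and the last three commands need the smartmontools and nvme-cli packages:
Code:
# Kernel messages around the time of the abort (the crash dump points here)
dmesg -T | grep -iE 'nvme|blk_update_request|i/o error'
journalctl -k --since "-2 hours"

# Drive health and the controller's own error log (placeholder device name)
smartctl -a /dev/nvme1
nvme smart-log /dev/nvme1
nvme error-log /dev/nvme1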
Here's the latest crash report:
Code:
{
"assert_condition": "abort",
"assert_file": "./src/blk/kernel/KernelDevice.cc",
"assert_func": "void KernelDevice::_aio_thread()",
"assert_line": 687,
"assert_msg": "./src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 79ce7ca646c0 time 2025-10-28T22:07:52.729287-0500\n./src/blk/kernel/KernelDevice.cc: 687: ceph_abort_msg(\"Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!\")\n",
"assert_thread_name": "bstore_aio",
"backtrace": [
"/lib/x86_64-linux-gnu/libc.so.6(+0x3fdf0) [0x79ce88e49df0]",
"/lib/x86_64-linux-gnu/libc.so.6(+0x9495c) [0x79ce88e9e95c]",
"gsignal()",
"abort()",
"(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x182) [0x62671961db39]",
"(KernelDevice::_aio_thread()+0xac2) [0x62671a3e1082]",
"(KernelDevice::AioCompletionThread::entry()+0x11) [0x62671a3e86b1]",
"/lib/x86_64-linux-gnu/libc.so.6(+0x92b7b) [0x79ce88e9cb7b]",
"/lib/x86_64-linux-gnu/libc.so.6(+0x1107b8) [0x79ce88f1a7b8]"
],
"ceph_version": "19.2.3",
"crash_id": "2025-10-29T03:07:52.732265Z_b9449e2b-36f1-4c91-bd26-977cd4027c74",
"entity_name": "osd.5",
"io_error": true,
"io_error_code": -5,
"io_error_devname": "dm-1",
"io_error_length": 24576,
"io_error_offset": 602154524672,
"io_error_optype": 8,
"io_error_path": "/var/lib/ceph/osd/ceph-5/block",
"os_id": "13",
"os_name": "Debian GNU/Linux 13 (trixie)",
"os_version": "13 (trixie)",
"os_version_id": "13",
"process_name": "ceph-osd",
"stack_sig": "e0dac35f07f78a02f8d4ab554909b29d6cf03dd362259f5dba5cf65f38324228",
"timestamp": "2025-10-29T03:07:52.732265Z",
"utsname_hostname": "vmhost6",
"utsname_machine": "x86_64",
"utsname_release": "6.14.11-3-pve",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.14.11-3 (2025-09-22T10:13Z)"
}
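The io_error_code of -5 is a plain EIO on dm-1 (the OSD's block device), and the abort message points at the kernel log, so it's probably also worth checking whether the card behind the adapter is dropping off the bus or downtraining its PCIe link. Roughly like this, where 01:00.0 is a placeholder address:
Code:
# Find the PCI address of the NVMe controller behind the adapter
lspci -nn | grep -i 'Non-Volatile'

# Negotiated link speed/width vs. what the slot and drive are capable of
lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'

# PCIe AER / bus errors in the kernel log
dmesg -T | grep -iE 'aer|pcieport'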
Has anyone run into this issue, or does anyone have an idea of what I could do to alleviate this problem? One hypothesis I had was to try to limit the OSD's IOPS somehow.
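On the throttling idea: since Squid uses the mClock scheduler by default, my rough thinking was to lower the IOPS capacity the scheduler assumes for just that OSD, something like the sketch below. The 10000 is a placeholder value, and I'm not sure this helps at all if the root cause is a hardware or link error rather than the drive being overdriven.
Code:
# Confirm the OSD is actually on the mClock scheduler
ceph config show osd.5 osd_op_queue

# Tell mClock to assume a lower IOPS capacity for this OSD only (placeholder value)
ceph config set osd.5 osd_mclock_max_capacity_iops_ssd 10000
ceph config get osd.5 osd_mclock_max_capacity_iops_ssd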