ceph-osd crashes with kernel 6.17.2-1-pve on Dell system

Herman1

New Member
Nov 24, 2025
Hey! I recently upgraded one of the three nodes in a cluster to the 6.17.2-1-pve kernel; the Ceph version stayed the same on all hosts (19.2.3).

When the server came back up, I noticed the ceph-osd processes immediately started crashing:

Code:
ceph-osd[10805]: ./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")

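For what it's worth, the same abort is easy to confirm from the journal. A minimal check, assuming systemd-managed OSDs (substitute your own OSD id for the 0):

Code:
# crashes recorded by Ceph itself
ceph crash ls
# journal of one OSD, filtered for the suicide timeout
journalctl -u ceph-osd@0 --since "1 hour ago" | grep -i "suicide timeout"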
And the kernel threw these stack traces:

Code:
kernel: sd 0:2:0:0: [sda] tag#616 page boundary ptr_sgl: 0x00000000df48bcb9
kernel: BUG: unable to handle page fault for address: ff685a6f8dd63000
kernel: #PF: supervisor write access in kernel mode
kernel: #PF: error_code(0x0002) - not-present page
kernel: PGD 100000067 P4D 100874067 PUD 100875067 PMD 108abd067 PTE 0
kernel: Oops: Oops: 0002 [#1] SMP NOPTI
kernel: CPU: 81 UID: 0 PID: 1012 Comm: kworker/81:1H Tainted: P S         OE       6.17.2-1-pve #1 PREEMPT(voluntary)
kernel: Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
kernel: Hardware name: Dell Inc. PowerEdge R660xs/00NDRY, BIOS 2.7.5 07/31/2025
kernel: Workqueue: kblockd blk_mq_run_work_fn
kernel: RIP: 0010:megasas_build_and_issue_cmd_fusion+0xeaa/0x1870 [megaraid_sas]
kernel: Code: 20 48 89 d1 48 83 e1 fc 83 e2 01 48 0f 45 d9 4c 8b 73 10 44 8b 6b 18 4c 89 f9 4c 8d 79 08 45 85 fa 0f 84 fd 03 00 00 45 29 cc <4c> 89 31 48 83
kernel: RSP: 0018:ff685a6fa0b0fb50 EFLAGS: 00010206
kernel: RAX: 00000000fe298000 RBX: ff42339b0e6b2cc0 RCX: ff685a6f8dd63000
kernel: RDX: ff685a6f8dd63008 RSI: ff42339b0e6b2b88 RDI: 0000000000000000
kernel: RBP: ff685a6fa0b0fc20 R08: 0000000000000200 R09: 0000000000001000
kernel: R10: 0000000000000fff R11: 0000000000001000 R12: 0000000000101000
kernel: R13: 0000000000102000 R14: 0000000009a00000 R15: ff685a6f8dd63008
kernel: FS:  0000000000000000(0000) GS:ff4233da0e986000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ff685a6f8dd63000 CR3: 0000000133351004 CR4: 0000000000f73ef0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <TASK>
kernel:  megasas_queue_command+0x122/0x1d0 [megaraid_sas]
kernel:  scsi_queue_rq+0x409/0xcc0
kernel:  blk_mq_dispatch_rq_list+0x121/0x740
kernel:  ? sbitmap_get+0x73/0x180
kernel:  __blk_mq_sched_dispatch_requests+0x408/0x600
kernel:  blk_mq_sched_dispatch_requests+0x2d/0x80
kernel:  blk_mq_run_work_fn+0x72/0x90
kernel:  process_one_work+0x188/0x370
kernel:  worker_thread+0x33a/0x480
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0x108/0x220
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x205/0x240
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1a/0x30
kernel:  </TASK>

(The same stack trace appears for all 5 ceph-osd disks in the system; only the drive letter changes.)

I tried restarting the host and recreating the OSDs; nothing helped. The only thing that works is booting into the older kernel still installed on the system, 6.14.11-3-pve. On that kernel everything runs like a charm.
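In case it helps anyone else hitting this: you can keep the node on the known-good kernel across reboots with proxmox-boot-tool (standard Proxmox tooling; the version string below is from my system):

Code:
# list installed kernels
proxmox-boot-tool kernel list
# make the known-good kernel the default
proxmox-boot-tool kernel pin 6.14.11-3-pve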

I ran memtest on the system to rule out a hardware issue, and also reseated the backplane cabling.
Here is some info about the hardware:
Dell Inc. PowerEdge R660xs
BIOS: 2.7.5 (newest)
RAID controller: PERC H755N (version 52.30.0-6115, newest) - disks are passed through to the system as NON-RAID disks.
ceph version 19.2.3 (2f03f1cd83e5d40cdf1393cb64a662a8e8bb07c6) squid (stable)
pve-manager/9.0.18/5cacb35d7ee87217 (running kernel: 6.14.11-3-pve)
While reading the forums I noticed some threads about people having issues with the newer kernel on Dell systems (maybe related?).
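For completeness, this is roughly how the versions above can be pulled (nothing exotic, just the usual tools; the last line assumes the in-kernel megaraid_sas driver used by the PERC):

Code:
pveversion
ceph --version
dmidecode -s system-product-name
dmidecode -s bios-version
# driver version of the megaraid_sas module
modinfo megaraid_sas | grep -i '^version'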
 
Hello,
we have the same problem on our HPE hosts.
Booting the older 6.14 kernel resolves the problem.
I currently have one node up on the new kernel (6.17) for debugging purposes, in case someone needs outputs or logs...
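If anyone wants specific output, something along these lines should capture the relevant kernel messages (the -b -1 assumes the crash happened in the previous boot; our HPE boxes use a different controller driver than the Dell above, so the grep pattern is only a starting point):

Code:
journalctl -k -b -1 | grep -Ei 'oops|page fault|scsi|blk_mq'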