ceph-osd crashes with kernel 6.17.2-1-pve on Dell system

Herman1

New Member
Nov 24, 2025
Hey! I recently upgraded one of the three nodes in a cluster to the 6.17.2-1-pve kernel; the Ceph version remains the same on all hosts (19.2.3).

When the server rebooted, I noticed that the ceph-osd processes immediately started crashing:

Code:
ceph-osd[10805]: ./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")

And the kernel threw these stack traces:

Code:
kernel: sd 0:2:0:0: [sda] tag#616 page boundary ptr_sgl: 0x00000000df48bcb9
kernel: BUG: unable to handle page fault for address: ff685a6f8dd63000
kernel: #PF: supervisor write access in kernel mode
kernel: #PF: error_code(0x0002) - not-present page
kernel: PGD 100000067 P4D 100874067 PUD 100875067 PMD 108abd067 PTE 0
kernel: Oops: Oops: 0002 [#1] SMP NOPTI
kernel: CPU: 81 UID: 0 PID: 1012 Comm: kworker/81:1H Tainted: P S         OE       6.17.2-1-pve #1 PREEMPT(voluntary)
kernel: Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
kernel: Hardware name: Dell Inc. PowerEdge R660xs/00NDRY, BIOS 2.7.5 07/31/2025
kernel: Workqueue: kblockd blk_mq_run_work_fn
kernel: RIP: 0010:megasas_build_and_issue_cmd_fusion+0xeaa/0x1870 [megaraid_sas]
kernel: Code: 20 48 89 d1 48 83 e1 fc 83 e2 01 48 0f 45 d9 4c 8b 73 10 44 8b 6b 18 4c 89 f9 4c 8d 79 08 45 85 fa 0f 84 fd 03 00 00 45 29 cc <4c> 89 31 48 83

kernel: RSP: 0018:ff685a6fa0b0fb50 EFLAGS: 00010206
kernel: RAX: 00000000fe298000 RBX: ff42339b0e6b2cc0 RCX: ff685a6f8dd63000
kernel: RDX: ff685a6f8dd63008 RSI: ff42339b0e6b2b88 RDI: 0000000000000000
kernel: RBP: ff685a6fa0b0fc20 R08: 0000000000000200 R09: 0000000000001000
kernel: R10: 0000000000000fff R11: 0000000000001000 R12: 0000000000101000
kernel: R13: 0000000000102000 R14: 0000000009a00000 R15: ff685a6f8dd63008
kernel: FS:  0000000000000000(0000) GS:ff4233da0e986000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ff685a6f8dd63000 CR3: 0000000133351004 CR4: 0000000000f73ef0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <TASK>
kernel:  megasas_queue_command+0x122/0x1d0 [megaraid_sas]
kernel:  scsi_queue_rq+0x409/0xcc0
kernel:  blk_mq_dispatch_rq_list+0x121/0x740
kernel:  ? sbitmap_get+0x73/0x180
kernel:  __blk_mq_sched_dispatch_requests+0x408/0x600
kernel:  blk_mq_sched_dispatch_requests+0x2d/0x80
kernel:  blk_mq_run_work_fn+0x72/0x90
kernel:  process_one_work+0x188/0x370
kernel:  worker_thread+0x33a/0x480
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0x108/0x220
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x205/0x240
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1a/0x30
kernel:  </TASK>

(Stack traces appear for all 5 ceph-osd disks in the system; the trace is identical, only the drive letter changes.)

I tried restarting the host and recreating the OSDs - nothing helped. The only thing that helps is booting into the older kernel still installed on the system, 6.14.11-3-pve; with that kernel everything works like a charm.
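Until a fixed kernel is out, pinning the working one keeps the host from booting back into 6.17 on the next reboot - a quick sketch, assuming proxmox-boot-tool manages the boot entries on this host and the version string matches what's installed:

Code:
# list the installed kernels proxmox-boot-tool knows about
proxmox-boot-tool kernel list
# pin the known-good kernel as the default boot entry
proxmox-boot-tool kernel pin 6.14.11-3-pve
# once a fixed 6.17 kernel is available, remove the pin again
proxmox-boot-tool kernel unpin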

I ran a memtest on the system just to make sure it's not a hardware issue, and also reseated the backplane cables.
Here is some info about the hardware:
Dell Inc. PowerEdge R660xs
BIOS: 2.7.5 (newest)
RAID controller: PERC H755N (version 52.30.0-6115 - newest) - disks are passed through to the system as NON-RAID disks.
ceph version 19.2.3 (2f03f1cd83e5d40cdf1393cb64a662a8e8bb07c6) squid (stable)
pve-manager/9.0.18/5cacb35d7ee87217 (running kernel: 6.14.11-3-pve)
While reading the forums I noticed some threads about people having issues with the newer kernel on Dell systems (maybe related?).
 
Hello,
we have the same problem on our HPE hosts.
Booting into the older 6.14 kernel resolves the problem.
I currently have one node up on the new kernel (6.17) for debugging purposes, if someone needs outputs or logs...
 
Is there anyone who can help with the problem? Or is downgrading the kernel back to 6.14 the solution?
 
For now I've only found downgrading the kernel as a workaround, but the newer kernels will have to include some sort of fix, otherwise all of us experiencing the issue will be stuck on the older kernel.
 
We're currently testing a new kernel with a larger set of changes. Nothing specific to megaraid_sas - but quite a few changes in the SCSI subsystem.
It's currently available in the pbs-test repository (and will soon be available for pve-test as well).

A quick search online did not turn up many hits for this particular stack trace - only something remotely related for a much older kernel on SLES:
https://stgsupport.stgscc.suse.com/...ontrollers-randomly-crash-on-boot?language=de

Sadly we could not yet reproduce the issue and don't have a matching system.

If you can trigger the issue reliably (in a non-critical environment) - trying the new kernel when it's available and/or setting
`smp_affinity_enable=0` for the module might help in getting this narrowed down and fixed.
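
For completeness, setting that module parameter persistently would look roughly like this (just a sketch - the file name is arbitrary, and it assumes megaraid_sas gets loaded from the initramfs, hence the rebuild):

Code:
# /etc/modprobe.d/megaraid_sas.conf (arbitrary file name)
options megaraid_sas smp_affinity_enable=0

# rebuild the initramfs so the option applies at boot, then reboot
update-initramfs -u -k all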

Thanks for the report in any case!

A similar trace was also reported in the general kernel 6.17 announcement thread:
 
I just updated the kernel to 6.17.2-2 (is this the kernel you mentioned?). But it's the same behaviour as with 6.17.2-1.
When I set the noin flag on the Ceph cluster and reboot the host, the OSDs show as UP/OUT. As soon as I set one to in, it goes down...
In the ceph-osd log I see some entries for "transitioning to primary", then "transitioning to stray", and then I get spammed with
7e377f26b6c0 1 heartbeat_map is_healthy 'OSD: osd_op_tp thread 0x7e3762cb56c0' had timed out after 15.000000954s
until I set it to out again. After that I can't restart the service and have to reboot the server.

After booting into 6.14.11-4 I can set the OSDs to in without problems...
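
For reference, the flag handling described above looks roughly like this (a sketch - osd.5 is just a placeholder ID):

Code:
# prevent rebooting OSDs from being marked "in" automatically
ceph osd set noin
# ... reboot the host, the OSDs come back as UP/OUT ...
# mark one OSD in manually - on 6.17 this is where it falls over
ceph osd in 5
# mark it out again to stop the heartbeat timeouts
ceph osd out 5
# clear the flag when done testing
ceph osd unset noin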
 
I just updated the kernel to 6.17.2-2 (is this the kernel you mentioned?).
No, that would be proxmox-kernel-6.17.4-1-pve - I'll post here when it's available in the public PVE repos as well (currently it's only on pbs-test).

But thanks for the test - at least it rules out that the regression was introduced between 6.17.2-1 and 6.17.2-2.

When I set the noin flag on the Ceph cluster and reboot the host, the OSDs show as UP/OUT. As soon as I set one to in, it goes down...
In the ceph-osd log I see some entries for "transitioning to primary", then "transitioning to stray", and then I get spammed with
7e377f26b6c0 1 heartbeat_map is_healthy 'OSD: osd_op_tp thread 0x7e3762cb56c0' had timed out after 15.000000954s
I don't think it's a Ceph-specific problem - the other reporter in the general thread ran into the kernel trace by running `proxmox-boot-tool refresh` (which doesn't do much I/O either).