ceph-osd crashes with kernel 6.17.2-1-pve on Dell system

Herman1

New Member
Nov 24, 2025
Hey! I recently upgraded one of the three nodes in a cluster to the 6.17.2-1-pve kernel; the Ceph version remains the same on all hosts (19.2.3).

When the server rebooted, I noticed that the ceph-osd processes immediately started crashing:

Code:
ceph-osd[10805]: ./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")

And the kernel threw these stack traces:

Code:
kernel: sd 0:2:0:0: [sda] tag#616 page boundary ptr_sgl: 0x00000000df48bcb9
kernel: BUG: unable to handle page fault for address: ff685a6f8dd63000
kernel: #PF: supervisor write access in kernel mode
kernel: #PF: error_code(0x0002) - not-present page
kernel: PGD 100000067 P4D 100874067 PUD 100875067 PMD 108abd067 PTE 0
kernel: Oops: Oops: 0002 [#1] SMP NOPTI
kernel: CPU: 81 UID: 0 PID: 1012 Comm: kworker/81:1H Tainted: P S         OE       6.17.2-1-pve #1 PREEMPT(voluntary)
kernel: Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
kernel: Hardware name: Dell Inc. PowerEdge R660xs/00NDRY, BIOS 2.7.5 07/31/2025
kernel: Workqueue: kblockd blk_mq_run_work_fn
kernel: RIP: 0010:megasas_build_and_issue_cmd_fusion+0xeaa/0x1870 [megaraid_sas]
kernel: Code: 20 48 89 d1 48 83 e1 fc 83 e2 01 48 0f 45 d9 4c 8b 73 10 44 8b 6b 18 4c 89 f9 4c 8d 79 08 45 85 fa 0f 84 fd 03 00 00 45 29 cc <4c> 89 31 48 83

kernel: RSP: 0018:ff685a6fa0b0fb50 EFLAGS: 00010206
kernel: RAX: 00000000fe298000 RBX: ff42339b0e6b2cc0 RCX: ff685a6f8dd63000
kernel: RDX: ff685a6f8dd63008 RSI: ff42339b0e6b2b88 RDI: 0000000000000000
kernel: RBP: ff685a6fa0b0fc20 R08: 0000000000000200 R09: 0000000000001000
kernel: R10: 0000000000000fff R11: 0000000000001000 R12: 0000000000101000
kernel: R13: 0000000000102000 R14: 0000000009a00000 R15: ff685a6f8dd63008
kernel: FS:  0000000000000000(0000) GS:ff4233da0e986000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ff685a6f8dd63000 CR3: 0000000133351004 CR4: 0000000000f73ef0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <TASK>
kernel:  megasas_queue_command+0x122/0x1d0 [megaraid_sas]
kernel:  scsi_queue_rq+0x409/0xcc0
kernel:  blk_mq_dispatch_rq_list+0x121/0x740
kernel:  ? sbitmap_get+0x73/0x180
kernel:  __blk_mq_sched_dispatch_requests+0x408/0x600
kernel:  blk_mq_sched_dispatch_requests+0x2d/0x80
kernel:  blk_mq_run_work_fn+0x72/0x90
kernel:  process_one_work+0x188/0x370
kernel:  worker_thread+0x33a/0x480
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0x108/0x220
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x205/0x240
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1a/0x30
kernel:  </TASK>

(Stack traces appear for all 5 ceph-osd disks in the system; the trace is identical, only the drive letter changes.)

I tried restarting the host and recreating the OSDs - nothing helped. The only thing that helps is booting into the older kernel still installed on the system, 6.14.11-3-pve; with that kernel everything works like a charm.
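Until a fixed kernel is out, pinning the working one keeps the host from booting back into 6.17 on the next reboot - a quick sketch, assuming proxmox-boot-tool manages the boot entries on this host and the version string matches what's installed:

Code:
# list the installed kernels proxmox-boot-tool knows about
proxmox-boot-tool kernel list
# pin the known-good kernel as the default boot entry
proxmox-boot-tool kernel pin 6.14.11-3-pve
# once a fixed 6.17 kernel is available, remove the pin again
proxmox-boot-tool kernel unpin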

I ran a memtest on the system just to make sure it's not a hardware issue, and also reseated the backplane cables.
Here is some info about the hardware:
Dell Inc. PowerEdge R660xs
BIOS: 2.7.5 (newest)
RAID controller: PERC H755N (version 52.30.0-6115 - newest) - disks are passed through to the system as NON-RAID disks.
ceph version 19.2.3 (2f03f1cd83e5d40cdf1393cb64a662a8e8bb07c6) squid (stable)
pve-manager/9.0.18/5cacb35d7ee87217 (running kernel: 6.14.11-3-pve)
While reading the forums I noticed some threads about people having issues with the newer kernel on Dell systems (maybe related?).
 
Hello,
we have the same problem on our HPE hosts.
Booting into the older 6.14 kernel resolves the problem.
I currently have one node up on the new kernel (6.17) for debugging purposes, if someone needs outputs or logs...
 
Is there anyone who can help with the problem? Or is downgrading the kernel back to 6.14 the solution?
 
For now I've only found downgrading the kernel as a workaround, but the newer kernels will have to include some sort of fix, otherwise all of us experiencing the issue will be stuck on the older kernel.
 
We're currently testing a new kernel with a larger set of changes. Nothing specific to megaraid_sas - but quite a few changes in the SCSI subsystem.
It's currently available in the pbs-test repository (and will soon be available for pve-test as well).

A quick search online did not turn up many hits for this particular stack trace - only something remotely related for a much older kernel on SLES:
https://stgsupport.stgscc.suse.com/...ontrollers-randomly-crash-on-boot?language=de

Sadly we could not yet reproduce the issue and don't have a matching system.

If you can trigger the issue reliably (in a non-critical environment) - trying the new kernel when it's available and/or setting
`smp_affinity_enable=0` for the module might help in getting this narrowed down and fixed.
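
For completeness, setting that module parameter persistently would look roughly like this (just a sketch - the file name is arbitrary, and it assumes megaraid_sas gets loaded from the initramfs, hence the rebuild):

Code:
# /etc/modprobe.d/megaraid_sas.conf (arbitrary file name)
options megaraid_sas smp_affinity_enable=0

# rebuild the initramfs so the option applies at boot, then reboot
update-initramfs -u -k all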

Thanks for the report in any case!

A similar trace was also reported in the general kernel 6.17 announcement thread:
 
I just updated the kernel to 6.17.2-2 (is this the kernel you mentioned?). But it's the same behaviour as with 6.17.2-1.
When I set the noin flag on the Ceph cluster and reboot the host, the OSDs show as UP/OUT. As soon as I set one to in, it goes down...
In the ceph-osd log I see some entries for "transitioning to primary", then "transitioning to stray", and then I get spammed with
7e377f26b6c0 1 heartbeat_map is_healthy 'OSD: osd_op_tp thread 0x7e3762cb56c0' had timed out after 15.000000954s
until I set it to out again. After that I can't restart the service and have to reboot the server.

After booting into 6.14.11-4 I can set the OSDs to in without problems...
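
For reference, the flag handling described above looks roughly like this (a sketch - osd.5 is just a placeholder ID):

Code:
# prevent rebooting OSDs from being marked "in" automatically
ceph osd set noin
# ... reboot the host, the OSDs come back as UP/OUT ...
# mark one OSD in manually - on 6.17 this is where it falls over
ceph osd in 5
# mark it out again to stop the heartbeat timeouts
ceph osd out 5
# clear the flag when done testing
ceph osd unset noin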
 
I just updated the kernel to 6.17.2-2 (is this the kernel you mentioned?).
No, that would be proxmox-kernel-6.17.4-1-pve - I'll post here when it's available in the public PVE repos as well (currently it's only on pbs-test).

But thanks for the test - at least it rules out that the regression was introduced between 6.17.2-1 and 6.17.2-2.

When I set the noin flag on the Ceph cluster and reboot the host, the OSDs show as UP/OUT. As soon as I set one to in, it goes down...
In the ceph-osd log I see some entries for "transitioning to primary", then "transitioning to stray", and then I get spammed with
7e377f26b6c0 1 heartbeat_map is_healthy 'OSD: osd_op_tp thread 0x7e3762cb56c0' had timed out after 15.000000954s
I don't think it's a Ceph-specific problem - the other reporter in the general thread ran into the kernel trace by running `proxmox-boot-tool refresh` (which doesn't do much I/O either).