Kernel panics

ch_turn · Apr 5, 2025

Hi,
We have a single Dell R7525 server running Proxmox that has been experiencing random freezes and kernel panics. The server is hosted in a data center, so I don't have physical access to it (but do have remote management using Dell's iDRAC). The freezes would lock the server solid, with no info on the console or in the logs, and would require a hard reboot. Since adding "pcie_port_pm = off libata.force = noncq" to the kernel command line, the freezes are gone but now we're seeing kernel panics instead. We started doing scheduled reboots at night in hopes that the issue would not arise before the next reboot, but even this wasn't enough and we've had panics after the server has only been up for a few hours, although it usually goes for more than a week before having an issue. I've been able to see some of the panic output on the console, but it never seems consistent (and it doesn't all fit on the screen and can't really be copied/pasted here). It really feels like memory corruption to me, which causes something to panic at a later time after the actual corruption happens, however:

What we've tried so far:

Added "pcie_port_pm = off libata.force = noncq" to the kernel command line
Installed 6.11.11-2 kernel
Replaced all RAM
Replaced the entire server (except disks) with identical hardware

I've also twice seen pve-root get remounted read-only due to an ext4 error, but there are no other disk-related issues reported anywhere in the journal, no SMART errors, and I've used smartctl to run short tests on the boot disks, with no issues. The main boot disk is a hardware RAID1 array using Dell's PERC H755. Dell's iDRAC reports no issues with these disks either. These are enterprise write-intensive SATA SSDs. We also have a raidz2 ZFS pool of 5 enterprise NVMe SSDs, and they all show as online and healthy as well. The server has panicked without pve-root going read-only as well, so it feels more like a symptom than part of the issue.

I've been a Linux admin for decades at this point, and I'm running out of ideas. The boss is getting more and more upset each time the server goes down, as he has angry clients calling him. I realize I haven't provided too many details about the server here, but I've been through just about everything I can think of from my years of experience, and nothing is helping and I don't have a lot left to try. I'm happy to provide more specific details on anything if requested.

We do not currently have a support subscription. We would be willing to purchase a standard subscription that offers remote ssh support, if this is something within scope that the Proxmox team could help with.

Thanks in advance for anything anyone has to offer.

C.

guruevi · Apr 5, 2025

What does the kernel panic say? Your iDRAC can record the kernel panic. https://www.dell.com/support/manual...ced8f4-162d-4329-ab95-169445213eae&lang=en-us

ch_turn · Apr 5, 2025

guruevi said:
What does the kernel panic say? Your iDRAC can record the kernel panic. https://www.dell.com/support/manual...ced8f4-162d-4329-ab95-169445213eae&lang=en-us

Thanks for this. Unfortunately, it only records the first couple minutes of booting up. Next time it happens I can do a screenshot of what is on the console, although not all the panic info fits on the screen. And from what I've seen go by so far, the panic doesn't always look the same.

ch_turn · Apr 7, 2025

So it happened once again over the weekend, after the server had only been up for about 6 hours. Attached is a screenshot of the console at the time, although I don't think there's enough on the screen to be all that useful. This time however, the following were the last journal entries immediately before the panic:

Code:

Apr 05 09:13:00 pve01 kernel: php-fpm[451522]: segfault at 0 ip 000060ccc806a206 sp 00007ffcdcbbff20 error 6 in php-fpm>
Apr 05 09:13:00 pve01 kernel: Code: 09 00 0f 85 aa 66 00 00 80 7e 08 0a 48 8b 06 75 04 48 8d 70 08 48 8b 06 8b 56 08 48>
Apr 05 09:13:00 pve01 kernel: php-fpm[9377]: segfault at 0 ip 0000000000000000 sp 00007ffcdcbbff20 error 14 likely on C>
Apr 05 09:13:00 pve01 kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
Apr 05 09:13:00 pve01 kernel: php-fpm[2885224]: segfault at 0 ip 0000000000000000 sp 00007ffcdcbbfc88 error 14 likely o>
Apr 05 09:13:00 pve01 kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
Apr 05 09:13:00 pve01 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000478
Apr 05 09:13:00 pve01 kernel: #PF: supervisor read access in kernel mode
Apr 05 09:13:00 pve01 kernel: #PF: error_code(0x0000) - not-present page

We have php-fpm running in a container. This still seems to be more of a symptom than the actual problem to me, since it shouldn't be possible for a user process like php-fpm to crash the kernel.

Still wondering if this is something within the scope of Proxmox support, as I'm running out of ideas to find the issue and resolve the panics.

Thanks for any thoughts anyone might have.

C.

guruevi · Apr 7, 2025

Given the errors you’ve seen with read only root, I’m leaning more towards a disk controller issue, potentially a memory issue. Have you tested with memtest86?

We really need the first page of the kernel panic, but you seem to have a MegaRAID controller linked into the kernel crash modules.

Why is PHP-FPM running on your host system? That should be in a VM or container.

ch_turn · Apr 7, 2025

Thanks for your reply. I'll try to get more of the panic, but it's difficult since it doesn't all show on the console and isn't recorded in the journal. I have an ssh session open monitoring the kernel log, so if it happens again maybe that will display something I can capture.

As mentioned in my original post, all hardware has been swapped out (except the SSDs), including the RAM and the entire server itself. While certainly possible, it seems unlikely that all new hardware would be faulty like the old hardware. Not saying it isn't a hardware issue, but it doesn't seem like faulty hardware to me since it's all been replaced.

The MegaRAID controller is the Dell PERC H755. This is where the boot SSDs are (in hardware RAID1). I agree this does seem like a possibility for where the issue lies.

As mentioned, php-fpm is indeed running in a container. It's not running on the host. But of course since processes in containers do run on the host kernel, the php-fpm crash shows in the host journal. The container is stored on the zfspool, not on the RAID1 (MegaRAID/PERC H755).

Thanks.

guruevi · Apr 7, 2025

Through the iDRAC you can update your firmware, that's where I would start, it's not uncommon for certain firmware to cause issues and then swapping the hardware with identical hardware which has the same or older firmware, may be an issue. Why weren't the SSD's swapped out if you have a new server? That's the next likely culprit since that is what did not change, it only needs 1 to be faulty and AMD systems don't have nearly as good error detection/recovery on the PCIe bus as Intel systems.

As far as the swap, did you do it yourself? Typically a Dell tech only swaps out parts (like motherboard etc), I've rarely seen it where they swap out the entire server and even then, they still move all the parts (memory etc) back into the new server.

Although the segfault issues point at memory, this could also mean anything on the PCIe bus with access to memory (eg. NVMe). There should be settings in your UEFI configuration that enable/disable memory and PCIe bus error detection. Make sure they're on, otherwise the hardware will never report up through the OS (eg. Memtest86) if an error did occur.

ch_turn · Apr 8, 2025

Thanks again for your response. The firmware is all up-to-date, we made sure that was done along with the swap. The only reason the SSDs weren't replaced was to avoid a reinstall, because at the time there was really no evidence to indicate it could be a disk issue (I hadn't seen pve-root go read-only at that time), and was hoping it would be something else in the server. I agree that this may be what to try next. Since pve-root is a pair of SSDs in RAID1, we should be able to swap one, let it resync, then swap the other, without needing to restore or reinstall anything (and we do have full backups, of course, just in case).

Thanks for the info about AMD systems not having as good error detection and recovery. All my AMD experience is with consumer-grade hardware (where I've had no issues), this is my first Dell AMD server. All the Dell servers I've worked with in the past have always been Intel.

I did not do the swap myself, as the server is in data center very far away. We had the data center do the swap. They did indeed replace the entire server with a spare, and moved the SSDs over. The RAM had already been completely replaced as the first step before replacing the entire server.

It's possible the php-fpm segfault is unrelated. We have a developer (out of my control) doing some shady things with PHP internals and it has caused segfaults. While this shouldn't be able to crash anything other than the php-fpm process itself, we did have core dumps enabled, so the segfault would have caused a core dump to be written to pve-root. So it's certainly possible that the php-fpm segfault wasn't related, but triggered a big write to the root SSDs, which could cause the kernel panic if there is indeed an issue with one or both of those SSDs.

Still wondering if anyone knows if this sort of issue falls within Proxmox support.

Thanks again.

C.

Search

Search

Kernel panics

ch_turn

New Member

guruevi

Well-Known Member

ch_turn

New Member

ch_turn

New Member

Attachments

guruevi

Well-Known Member

ch_turn

New Member

guruevi

Well-Known Member

ch_turn

New Member

We value your privacy