Kernel panics

ch_turn

New Member
Apr 5, 2025
Hi,
We have a single Dell R7525 server running Proxmox that has been experiencing random freezes and kernel panics. The server is hosted in a data center, so I don't have physical access to it (though I do have remote management via Dell's iDRAC). The freezes would lock the server solid, with nothing on the console or in the logs, and required a hard reboot. Since adding "pcie_port_pm=off libata.force=noncq" to the kernel command line, the freezes are gone, but now we're seeing kernel panics instead. We started doing scheduled reboots at night in the hope that the issue wouldn't recur before the next reboot, but even that wasn't enough, and we've had panics after the server had only been up for a few hours, although it usually runs for more than a week before having an issue. I've been able to see some of the panic output on the console, but it never seems consistent (and it doesn't all fit on the screen and can't really be copied/pasted here). It really feels like memory corruption to me, where the actual corruption happens earlier and something panics on it later, however:

What we've tried so far:
  • Added "pcie_port_pm=off libata.force=noncq" to the kernel command line (see the sketch after this list)
  • Installed 6.11.11-2 kernel
  • Replaced all RAM
  • Replaced the entire server (except disks) with identical hardware
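
In case it helps, a rough sketch of how such parameters are typically applied on a Proxmox host; the GRUB vs. systemd-boot paths below are generic, not necessarily our exact boot setup:

Code:
# GRUB-booted hosts: append the options to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
update-grub

# systemd-boot hosts (e.g. ZFS-on-root UEFI installs): edit /etc/kernel/cmdline instead, then:
proxmox-boot-tool refresh

# after a reboot, confirm the running kernel actually picked them up
cat /proc/cmdline
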
I've also twice seen pve-root get remounted read-only due to an ext4 error, but there are no other disk-related issues reported anywhere in the journal, no SMART errors, and I've used smartctl to run short tests on the boot disks with no issues. The main boot disk is a hardware RAID1 array using Dell's PERC H755, and Dell's iDRAC reports no issues with these disks either; they are enterprise write-intensive SATA SSDs. We also have a raidz2 ZFS pool of 5 enterprise NVMe SSDs, and they all show as online and healthy as well. The server has also panicked without pve-root going read-only, so the read-only remount feels more like a symptom than the root cause.
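
For reference, roughly what those checks look like; drives behind the PERC (MegaRAID) controller need smartctl pointed at the controller explicitly, and the device path and array index below are placeholders rather than our exact layout:

Code:
# SMART data for a drive behind the MegaRAID/PERC controller
smartctl -a -d megaraid,0 /dev/sda
# a long self-test instead of the short one
smartctl -t long -d megaraid,0 /dev/sda

# ZFS pool health, including per-device read/write/checksum error counters
zpool status -v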

I've been a Linux admin for decades at this point, and I'm running out of ideas. The boss is getting more and more upset each time the server goes down, as he has angry clients calling him. I realize I haven't provided many details about the server here, but I've been through just about everything I can think of from my years of experience, nothing has helped, and I don't have much left to try. I'm happy to provide more specific details on anything if requested.

We do not currently have a support subscription. We would be willing to purchase a standard subscription that offers remote ssh support, if this is something within scope that the Proxmox team could help with.

Thanks in advance for anything anyone has to offer.

C.
 
So it happened once again over the weekend, after the server had only been up for about 6 hours. Attached is a screenshot of the console at the time, although I don't think there's enough on the screen to be all that useful. This time, however, the following were the last journal entries immediately before the panic:

Code:
Apr 05 09:13:00 pve01 kernel: php-fpm[451522]: segfault at 0 ip 000060ccc806a206 sp 00007ffcdcbbff20 error 6 in php-fpm>
Apr 05 09:13:00 pve01 kernel: Code: 09 00 0f 85 aa 66 00 00 80 7e 08 0a 48 8b 06 75 04 48 8d 70 08 48 8b 06 8b 56 08 48>
Apr 05 09:13:00 pve01 kernel: php-fpm[9377]: segfault at 0 ip 0000000000000000 sp 00007ffcdcbbff20 error 14 likely on C>
Apr 05 09:13:00 pve01 kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
Apr 05 09:13:00 pve01 kernel: php-fpm[2885224]: segfault at 0 ip 0000000000000000 sp 00007ffcdcbbfc88 error 14 likely o>
Apr 05 09:13:00 pve01 kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
Apr 05 09:13:00 pve01 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000478
Apr 05 09:13:00 pve01 kernel: #PF: supervisor read access in kernel mode
Apr 05 09:13:00 pve01 kernel: #PF: error_code(0x0000) - not-present page

We have php-fpm running in a container. This still seems to be more of a symptom than the actual problem to me, since it shouldn't be possible for a user process like php-fpm to crash the kernel.

Still wondering if this is something within the scope of Proxmox support, as I'm running out of ideas to find the issue and resolve the panics.

Thanks for any thoughts anyone might have.

C.
 

Attachments

  • Screenshot 2025-04-05 105308.png (177.3 KB)

Given the errors you've seen with the read-only root, I'm leaning more towards a disk controller issue, or potentially a memory issue. Have you tested with memtest86?
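
If memtest86 hasn't been run yet, one way to do it without physical access is to install memtest86+ so it shows up as a boot entry and then pick it from the GRUB menu over the iDRAC virtual console; a minimal sketch, assuming a GRUB-booted Debian-based host:

Code:
apt install memtest86+
update-grub
# reboot and select the memtest86+ entry from the GRUB menu via the iDRAC console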

We really need the first page of the kernel panic, but from what is visible you seem to have the MegaRAID controller driver listed among the modules in the crash output.

Why is PHP-FPM running on your host system? That should be in a VM or container.
 
Thanks for your reply. I'll try to get more of the panic, but it's difficult since it doesn't all show on the console and isn't recorded in the journal. I have an ssh session open monitoring the kernel log, so if it happens again maybe that will display something I can capture.
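
I'm also thinking of setting up netconsole (or a crash kernel via kdump-tools) to catch the full trace, since it scrolls off the console and never reaches the journal. A rough sketch of the netconsole side, where the IPs, interface name and MAC address are just placeholders for this environment:

Code:
# send kernel messages from local interface eno1 to a log host at 192.0.2.10
modprobe netconsole netconsole=6665@192.0.2.1/eno1,6514@192.0.2.10/aa:bb:cc:dd:ee:ff

# on the receiving machine, listen for the UDP stream
nc -u -l 6514   # OpenBSD-style netcat; adjust flags for other variants

# alternatively, the kdump-tools package can save a full vmcore on panic for later analysis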

As mentioned in my original post, all hardware has been swapped out (except the SSDs), including the RAM and the entire server itself. While it's certainly possible, it seems unlikely that the replacement hardware would be faulty in the same way as the old hardware. I'm not saying it isn't a hardware issue, but since everything has been replaced, faulty hardware doesn't seem likely to me.

The MegaRAID controller is the Dell PERC H755. This is where the boot SSDs are (in hardware RAID1). I agree this does seem like a possibility for where the issue lies.
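
One thing I can check is whether the megaraid_sas driver logged anything around the resets or the read-only remounts; a rough sketch of what I have in mind (the perccli command is only an option if Dell's perccli64 utility happens to be installed):

Code:
# kernel messages from the current and previous boot, filtered for the controller and ext4
journalctl -k -b 0 | grep -i -e megaraid -e ext4
journalctl -k -b -1 | grep -i -e megaraid -e ext4

# Dell's perccli64 (if installed) can dump the controller's own event log
perccli64 /c0 show events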

As mentioned, php-fpm is indeed running in a container. It's not running on the host. But of course since processes in containers do run on the host kernel, the php-fpm crash shows in the host journal. The container is stored on the zfspool, not on the RAID1 (MegaRAID/PERC H755).
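
Since the PIDs in the journal are host-namespace PIDs, the cgroup path is a quick way to confirm which container a php-fpm worker belongs to (this only works for a worker that is still running, of course):

Code:
# the cgroup path of an LXC process includes the container's VMID, e.g. 0::/lxc/105/...
cat /proc/$(pgrep -o php-fpm)/cgroup

# map the VMID back to the container
pct list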

Thanks.
 
Through the iDRAC you can update your firmware, and that's where I would start. It's not uncommon for certain firmware revisions to cause issues, and swapping the hardware for identical hardware running the same (or older) firmware may just carry the problem over. Why weren't the SSDs swapped out if you have a new server? They're the next likely culprit, since they're the one thing that did not change; it only takes one faulty drive, and AMD systems don't have nearly as good error detection/recovery on the PCIe bus as Intel systems.
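
To compare firmware levels between the old and new chassis from the OS side (the full iDRAC/PERC/SSD firmware inventory is easier to read from the iDRAC web UI), something along these lines at least confirms the BIOS level:

Code:
# BIOS version and platform as reported by the SMBIOS tables
dmidecode -s bios-version
dmidecode -s system-product-name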

As for the swap, did you do it yourself? Typically a Dell tech only swaps out parts (like the motherboard), and I've rarely seen them swap out the entire server; even then, they usually move all the parts (memory, etc.) over into the new server.

Although the segfault issues point at memory, this could also mean anything on the PCIe bus with access to memory (e.g. NVMe). There should be settings in your UEFI configuration that enable/disable memory and PCIe bus error detection. Make sure they're on, otherwise the hardware will never report an error up through the OS (or through tools like memtest86) if one did occur.
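
Once those options are enabled, rasdaemon (or the raw EDAC counters) is one way to see from the OS whether corrected memory or PCIe/AER errors are actually being logged; a minimal sketch:

Code:
apt install rasdaemon
systemctl enable --now rasdaemon

# summary of memory/PCIe errors recorded so far
ras-mc-ctl --summary

# raw corrected/uncorrected error counters per memory controller
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count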
 