Kernel panics

ch_turn

Apr 5, 2025
Hi,
We have a single Dell R7525 server running Proxmox that has been experiencing random freezes and kernel panics. The server is hosted in a data center, so I don't have physical access to it, but I do have remote management through Dell's iDRAC. The freezes would lock the server solid, with no info on the console or in the logs, and would require a hard reboot. Since adding "pcie_port_pm=off libata.force=noncq" to the kernel command line, the freezes are gone, but now we're seeing kernel panics instead.

We started doing scheduled reboots at night in the hope that the issue would not arise before the next reboot, but even this wasn't enough: we've had panics after the server had been up for only a few hours, although it usually runs for more than a week before having an issue. I've been able to see some of the panic output on the console, but it is never consistent, it doesn't all fit on the screen, and it can't really be copied and pasted here. It really feels like memory corruption to me, where something panics well after the actual corruption happens. However, replacing the RAM and then the whole server (see the list below) has made no difference.

What we've tried so far:
  • Added "pcie_port_pm=off libata.force=noncq" to the kernel command line
  • Installed 6.11.11-2 kernel
  • Replaced all RAM
  • Replaced the entire server (except disks) with identical hardware
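
For reference, a minimal sketch of how one might confirm that both parameters actually reached the running kernel. Each parameter has to be a single token with no spaces around the "=". The helper name has_param is hypothetical, and the script only reads /proc/cmdline, so it changes nothing on the host.

```shell
#!/bin/sh
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole
# space-separated token in CMDLINE. (Hypothetical helper name.)
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Command line of the running kernel; empty if /proc is unavailable.
cmdline=$(cat /proc/cmdline 2>/dev/null || true)

# Report each expected parameter without aborting on a miss.
for p in pcie_port_pm=off libata.force=noncq; do
    if has_param "$cmdline" "$p"; then
        echo "present: $p"
    else
        echo "MISSING: $p"
    fi
done
```

A parameter written with spaces, such as "pcie_port_pm = off", would be reported as missing here, because it is split into three separate tokens.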
I've also twice seen pve-root get remounted read-only due to an ext4 error, but there are no other disk-related issues reported anywhere in the journal and no SMART errors, and short smartctl self-tests on the boot disks came back clean. The main boot disk is a hardware RAID1 array on Dell's PERC H755, and iDRAC reports no issues with these disks either. These are enterprise write-intensive SATA SSDs. We also have a raidz2 ZFS pool of 5 enterprise NVMe SSDs, all of which show as online and healthy. The server has also panicked without pve-root going read-only, so the remount feels more like a symptom than part of the cause.
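
As a rough sketch of how the journal could be searched for these ext4 events: the helper name ext4_events is hypothetical, the match patterns are an assumption about how the kernel words these messages, and the sample lines are invented for illustration, not taken from this server.

```shell
#!/bin/sh
# ext4_events: read kernel log lines on stdin and print those that look
# like ext4 errors, warnings, or read-only remounts.
# (Hypothetical helper; the patterns are assumed, not exhaustive.)
ext4_events() {
    grep -E 'EXT4-fs (error|warning)|Remounting filesystem read-only' || true
}

# Invented sample lines; device names and message details are placeholders.
sample='kernel: EXT4-fs error (device dm-1): example_function:123: example message
kernel: EXT4-fs (dm-1): Remounting filesystem read-only
kernel: usb 1-1: new high-speed USB device number 2'

# Prints the first two sample lines and drops the unrelated USB line.
printf '%s\n' "$sample" | ext4_events
```

On a real host the input would come from something like `journalctl -k -b -1 --no-pager | ext4_events`, which assumes the journal is persistent so that kernel messages from the previous boot are still available.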

I've been a Linux admin for decades at this point, and I'm running out of ideas. The boss is getting more and more upset each time the server goes down, as he has angry clients calling him. I realize I haven't provided too many details about the server here, but I've been through just about everything I can think of from my years of experience, and nothing is helping and I don't have a lot left to try. I'm happy to provide more specific details on anything if requested.

We do not currently have a support subscription. We would be willing to purchase a standard subscription that offers remote SSH support, if this is within the scope of what the Proxmox team could help with.

Thanks in advance for anything anyone has to offer.

C.