Hi,
We have a single Dell R7525 server running Proxmox that has been experiencing random freezes and kernel panics. The server is hosted in a data center, so I don't have physical access to it (but I do have remote management through Dell's iDRAC). The freezes would lock the server solid, with nothing on the console or in the logs, and required a hard reboot. Since adding "pcie_port_pm=off libata.force=noncq" to the kernel command line, the freezes are gone, but now we're seeing kernel panics instead.
We started doing scheduled nightly reboots in the hope that the issue would not arise before the next reboot, but even that wasn't enough: we've had panics after only a few hours of uptime, although the server usually runs for more than a week before having an issue. I've been able to see some of the panic output on the console, but it is never consistent (and it doesn't all fit on the screen and can't really be copied and pasted here). It really feels like memory corruption to me, where something panics some time after the actual corruption happens. However, replacing the RAM and then the whole server (see the list below) has made no difference.
What we've tried so far:
- Added "pcie_port_pm=off libata.force=noncq" to the kernel command line
- Installed 6.11.11-2 kernel
- Replaced all RAM
- Replaced the entire server (except disks) with identical hardware
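In case it helps anyone verify the first item, here is a minimal sketch for confirming that both parameters are actually active on the running kernel. It assumes the kernel command line is readable at /proc/cmdline (standard on Linux); the helper names are hypothetical, and the parsing is simplified (it does not handle quoted values containing spaces).

```python
# Minimal sketch: check that expected kernel parameters are active.
# Helper names are hypothetical; quoted values with spaces are not handled.

EXPECTED = {"pcie_port_pm": "off", "libata.force": "noncq"}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline: str, expected: dict = EXPECTED) -> dict:
    """Return the expected parameters that are absent or set to another value."""
    active = parse_cmdline(cmdline)
    return {k: v for k, v in expected.items() if active.get(k) != v}


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()
```

Calling `missing_params(running_cmdline())` on the host returns an empty dict when both settings took effect. Note that a parameter written with spaces around the "=" is split into separate tokens and is reported as missing by this check.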
I've been a Linux admin for decades at this point, and I'm running out of ideas. The boss gets more upset each time the server goes down, as he has angry clients calling him. I realize I haven't provided many details about the server here, but I've been through just about everything I can think of from my years of experience, nothing is helping, and I don't have much left to try. I'm happy to provide more specific details on anything if requested.
We do not currently have a support subscription. We would be willing to purchase a standard subscription that offers remote SSH support, if this is something within scope that the Proxmox team could help with.
Thanks in advance for anything anyone has to offer.
C.