[SOLVED] One of two identical servers has constant VM kernel panics

Diabolic487

New Member
Jun 10, 2025
Hello, I have a very frustrating issue that I have been unable to resolve. I have two Dell PowerEdge R840s each with 4x Xeon Gold 6136 CPUs (total 96 cores) and 1TB memory. Server2 was recently purchased. I installed Proxmox 8.4 on it, created a cluster with Server1, and then upgraded them to Proxmox 9.0. There were some issues with the upgrade process that are not relevant here, but the long and short of it is that Server1 is now back up and fully operational while Server2 is wholly unusable.

Server2 boots up and runs just fine by itself; however, when a VM is live migrated over to it, the VM kernel panics within minutes. Rebooting the VM just produces more kernel panics. Migrate it back to Server1 and the VM runs without issue. Creating a new VM has the same result: kernel panics on Server2, works great after migrating to Server1. I have tried switching between q35 and i440fx, I have gone through each CPU type (x86-64-v2-AES, etc.), toggled every toggle, untoggled every toggle, etc. Nothing changes the result.
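For anyone following along, this kind of machine-type and CPU-type toggling can also be done from the CLI with `qm` instead of the web UI (VM ID 100 is a placeholder):

```shell
# Switch between the two chipset emulations:
qm set 100 --machine q35
qm set 100 --machine pc          # "pc" is the i440fx machine type

# Walk through CPU types, from generic to host passthrough:
qm set 100 --cpu qemu64
qm set 100 --cpu x86-64-v2-AES
qm set 100 --cpu host

# Confirm what the VM is actually configured with:
qm config 100 | grep -E 'cpu|machine'
```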

Both servers are running the same up-to-date Proxmox version with apt fully updated, and the same microcode, BIOS, and firmware versions. BIOS settings on both are set to factory defaults to reduce variance. The point is, the servers should be running as close to identically as possible. The only current differences are that Server1 has a single vdev while Server2 has two vdevs, and that Server2 is running a freshly installed Proxmox 9.0 (I have wiped the server several times) while Server1 is running an OS upgraded from 8.4.

When I run stress tests on the host, the results are fine, no problems whatsoever. When I run the same stress tests in any VM, or even just run the VM at all, it kernel panics.
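For reference, a host-versus-guest stress run of the kind described might look like this (assuming stress-ng is installed; the duration is arbitrary):

```shell
# On the host: load all CPUs for 10 minutes while watching the kernel log
# for MCEs or other hardware errors.
stress-ng --cpu 0 --timeout 10m --metrics-brief &
journalctl -k -f

# Inside the guest: same test; a panic will show up on the VM's console.
stress-ng --cpu 0 --timeout 10m --metrics-brief
```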

Any assistance on this issue would be greatly appreciated. I am now completely at a loss as to what to try next. Tomorrow I plan to clone Server1's rpool to Server2 but aside from that, I have no idea what to do to debug this further.
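A recursive snapshot-and-send is one way to sketch that rpool clone (hostname and snapshot name are placeholders; note that a received rpool still needs its own bootloader, hostid, and network configuration fixed up, and the receiving side would normally be booted from a live environment rather than from the pool being overwritten):

```shell
# On Server1: take a recursive snapshot of the whole pool.
zfs snapshot -r rpool@clone

# Stream it to Server2, replicating all datasets and their properties.
zfs send -R rpool@clone | ssh server2 zfs receive -F rpool
```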
 
Still unable to get my server to run VMs. Any spitballing would also be appreciated, I'm really at the end of my rope here!
 
As you probably know, a cluster with only 2 nodes is not the way to go. At the least you should install a QDevice to maintain quorum.

Since you have re-installed PVE on the new Server2 anyway, how about installing PVE on that Server2 as a single non-clustered node. Then try running similar VMs on it and see if it panics. Once you get that settled, you should be able to re-cluster safely.
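For completeness, the QDevice setup is just a few commands once you have a third machine to host it (the IP is a placeholder for your quorum host):

```shell
# On the external quorum host (any small Debian box):
apt install corosync-qnetd

# On each cluster node:
apt install corosync-qdevice

# From one cluster node, point the cluster at the quorum host:
pvecm qdevice setup 192.0.2.10
```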
 
A few years ago, we ordered 8 servers for a new VMware cluster.
One of the servers showed instability: Memtest failed after a few hours.
It took me weeks with Dell support to rule out memory-stick and mainboard defects and finally discover that one CPU was defective.
 
I was able to perform bisecting tests with CPU affinity and determined that it is an issue with hyperthreading on a specific core of a specific CPU. Thanks for your help and suggestions.
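The affinity bisection described above amounts to a binary search over the core list. A minimal sketch of the search logic (the faulty core number and the `test_cores` stand-in are hypothetical; in practice each probe means pinning the VM with `qm set <vmid> --affinity <first>-<last>`, starting it, and stress-testing until it panics or survives):

```shell
#!/bin/bash
FAULTY=37   # simulated bad core, purely for demonstration

# Stand-in for "pin the VM to cores $1..$2 and check whether it panics".
test_cores() {
  [ "$FAULTY" -ge "$1" ] && [ "$FAULTY" -le "$2" ]
}

lo=0; hi=95   # 96 hardware threads on this box
while [ "$lo" -lt "$hi" ]; do
  mid=$(( (lo + hi) / 2 ))
  if test_cores "$lo" "$mid"; then
    hi=$mid             # panic reproduced: culprit is in the lower half
  else
    lo=$(( mid + 1 ))   # clean run: culprit is in the upper half
  fi
done
echo "faulty core: $lo"   # → faulty core: 37
```

Each halving costs one migrate-and-stress cycle, so 96 threads narrow down to a single core in about seven runs.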
 
Good that you discovered that. Maybe mark this thread as Solved. At the top of the thread, choose the Edit thread button, then from the (no prefix) dropdown choose Solved.
 