Unable to resolve sudden hard reboots

some3t1m3s

New Member
May 22, 2025
For weeks I've been trying to resolve the following issue: my Proxmox server randomly reboots for no apparent reason. It's a 'hard' reboot, as if someone had pressed the reset button. This happens at intervals ranging from 1 hour to 3 days.

The systemd journal shows nothing noteworthy before the reboot. Usually there aren't any entries directly before the reboot at all, just "-- Boot [..]". The reboots are completely unrelated to load, and I'm unable to provoke them, e.g. by loading a VM with Prime95. The BMC doesn't log anything either, in particular not the "Vcore 0.0 V" error that the MC12-LE0 produces with a Ryzen 5950X under certain circumstances.
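
To put numbers on the pattern, each boot's first and last journal timestamps can be compared. The following is only a rough sketch under stated assumptions: it assumes the installed systemd supports `journalctl --list-boots -o json` and that each record carries `first_entry` and `last_entry` as microseconds since the epoch. `load_boots` and `summarize` are hypothetical helper names, not part of any tool.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone


def load_boots():
    """Fetch boot records from journald.

    Assumption: this systemd version accepts `--list-boots -o json` and
    returns a JSON array of objects with `first_entry` / `last_entry`.
    """
    out = subprocess.run(
        ["journalctl", "--list-boots", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def summarize(boots):
    """Return (boot start, uptime, silent gap until next boot) per boot.

    The gap is the time between the last journal entry of one boot and the
    first entry of the following boot; it is None for the most recent boot.
    """
    boots = sorted(boots, key=lambda b: b["first_entry"])
    rows = []
    for cur, nxt in zip(boots, boots[1:] + [None]):
        uptime = timedelta(microseconds=cur["last_entry"] - cur["first_entry"])
        gap = None
        if nxt is not None:
            gap = timedelta(microseconds=nxt["first_entry"] - cur["last_entry"])
        start = datetime.fromtimestamp(cur["first_entry"] / 1_000_000, tz=timezone.utc)
        rows.append((start, uptime, gap))
    return rows
```

Calling `summarize(load_boots())` on the host and printing the rows would list the uptime before each reset and how long the journal was silent before the next boot, which makes it easier to check whether the resets cluster at particular uptimes or times of day.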

My initial configuration:
  • Gigabyte MC12-LE0
  • Ryzen 5950X
  • 4x Samsung 32GB DDR4-3200 CL-22-22-22
  • 3x Crucial MX500 250GB
  • Mellanox Connect-X 3 CX311A
  • Seasonic Gold Focus 450W
The case is well ventilated. When I push the CPU with normal workloads it eventually reaches 75°C, which seems to be the limit. Other temperatures from 'sensors' are well below that. I can touch the Connect-X3 heatsink with my hand and it doesn't even feel warm.

Things I tried so far:
  • Read basically every thread on the internet mentioning "Ryzen", "Proxmox" and "Reboot"
  • Removed all 'optional' hardware (not listed here as it is obviously unrelated to the problem)
  • Updated Proxmox to the current version
  • Updated the BIOS and BMC
  • Replaced the mainboard with a new ASRock X570D4U
  • Updated the BIOS and BMC again (on the replacement board)
  • Installed amd64-microcode
  • Replaced the PSU with a new be quiet! Pure Power 12 M 750W
  • Stress-tested the RAM with memtest86+
  • Disabled Core Watchdog in BIOS (and re-enabled it again after it didn't help, same with all other BIOS options mentioned)
  • Enabled Eco Mode in BIOS
  • Disabled PBO
  • Manually lowered the TDP and the boost limits even further
  • Disabled Global C-States in BIOS
  • Changed the Power Supply Idle Control setting
  • Set the CPU type of all Windows VMs to x86-64-v2-AES
  • Disabled all Windows VMs altogether
  • Underprovisioned the remaining VMs so vCPUs < actual CPU cores (16)
  • Installed the optional 6.11 kernel
  • Considered having a breakdown and applying for the job with the least amount of responsibility
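
The underprovisioning check from the list above (total vCPUs below the 16 physical cores) can be sketched roughly as follows. This is an illustration under assumptions, not the exact procedure used: it assumes each VM config is a Proxmox `qemu-server` style text file with optional `cores:` and `sockets:` lines (both treated as 1 when absent) and that snapshot sections start with a `[name]` header. The function names are hypothetical.

```python
HOST_CORES = 16  # physical cores of the Ryzen 5950X, as stated in the post


def vcpus_from_config(text):
    """Return cores * sockets from one VM config, ignoring snapshot sections."""
    cores, sockets = 1, 1  # assumed defaults when a key is missing
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # Assumption: a "[snapshot]" header ends the live configuration.
            break
        key, sep, value = line.partition(":")
        if not sep:
            continue
        if key == "cores":
            cores = int(value)
        elif key == "sockets":
            sockets = int(value)
    return cores * sockets


def total_vcpus(configs):
    """Sum the vCPUs over an iterable of config file contents."""
    return sum(vcpus_from_config(text) for text in configs)


def is_underprovisioned(configs, host_cores=HOST_CORES):
    """True if the combined vCPU count stays below the host's core count."""
    return total_vcpus(configs) < host_cores
```

On the host, the config texts could be read from the files under `/etc/pve/qemu-server/`. Note that this counts every defined VM, not only the running ones, so it gives an upper bound.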
Things I haven't tried so far:
  • Exchanging the CPU
  • Exchanging the RAM despite the fact that it passes memory tests
  • Replacing the Connect-X 3
  • Disabling all VMs altogether
I'd like to point out that I have a similar setup with TrueNAS Scale running on Gigabyte MC12-LE0, Ryzen 5600, Connect-X 3 CX311A, 4x 32GB Kingston Server ECC RAM that runs rock-solid to such a degree that I didn't even bother to update the BIOS.

Do you have any ideas or pointers on how I might narrow down the culprit? Do you need any additional information?
 