For weeks I've been trying to resolve the following issue: my Proxmox server randomly reboots for no apparent reason. It's a 'hard' reboot, as if someone had pressed the reset button. This happens anywhere from every hour to every 3 days.
The systemd journal shows nothing noteworthy before the reboot. Usually there are no entries directly before the reboot at all, just "-- Boot [..]". The reboots appear to be completely unrelated to load. I'm unable to provoke them, e.g. by loading a VM with Prime95. The BMC doesn't log anything either, in particular not the "Vcore 0.0 V" error that the MC12-LE0 produces with a Ryzen 5950X under certain circumstances.
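In case it helps with spotting a pattern in the reboot times, here is a minimal sketch of how the gaps between boots could be tabulated. It assumes the boot timestamps have been copied by hand (for example from `journalctl --list-boots`) into ISO-style strings. The function names and the sample values are hypothetical, not taken from my actual logs.

```python
from datetime import datetime, timedelta


def uptime_intervals(boot_times):
    """Return the time between consecutive boots, oldest first."""
    parsed = sorted(datetime.fromisoformat(t) for t in boot_times)
    return [later - earlier for earlier, later in zip(parsed, parsed[1:])]


def summarize(intervals):
    """Return (shortest, longest, mean) interval, or None if there are none."""
    if not intervals:
        return None
    total = sum(intervals, timedelta())
    return min(intervals), max(intervals), total / len(intervals)


# Hypothetical sample timestamps, for illustration only.
boots = [
    "2024-11-01 08:00:00",
    "2024-11-01 09:00:00",
    "2024-11-04 09:00:00",
    "2024-11-05 21:00:00",
]

shortest, longest, mean = summarize(uptime_intervals(boots))
print(f"shortest={shortest}, longest={longest}, mean={mean}")
```

The gap between two boot timestamps only approximates the uptime, since it also includes the time the machine needed to come back up.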
My initial configuration:
- Gigabyte MC12-LE0
- Ryzen 5950X
- 4x Samsung 32GB DDR4-3200 CL-22-22-22
- 3x Crucial MX500 250GB
- Mellanox ConnectX-3 CX311A
- Seasonic Gold Focus 450W
Things I tried so far:
- Read basically every thread on the internet mentioning "Ryzen", "Proxmox" and "Reboot"
- Removed all 'optional' hardware (not listed here as obviously unrelated to the problem)
- Updated Proxmox to the current version
- Updated the BIOS and BMC
- Replaced the mainboard with a new ASRock X570D4U
- Updated the BIOS and BMC of the new board
- Installed amd64-microcode
- Replaced the PSU with a new be quiet! Pure Power 12 M 750W
- Stress-tested the RAM with memtest86+
- Disabled Core Watchdog in BIOS (and re-enabled it after it didn't help, same with all other BIOS options mentioned)
- Enabled Eco Mode in BIOS
- Disabled PBO
- Manually lowered the TDP and the boost limits even further
- Disabled Global C-States in BIOS
- Changed the Power Supply Idle Control setting
- Set all Windows VMs CPU type to x86-64-v2-AES
- Disabled all Windows VMs altogether
- Underprovisioned the remaining VMs so vCPUs < actual CPU cores (16)
- Installed the optional 6.11 kernel
- Considered having a breakdown and applying for the job with the least amount of responsibility
Things I haven't tried yet:
- Exchanging the CPU
- Exchanging the RAM despite the fact that it passes memory tests
- Replacing the ConnectX-3
- Disabling all VMs altogether
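For the underprovisioning step mentioned above (vCPUs < actual CPU cores), this is a rough sketch of the check, assuming a VM's vCPU count is sockets × cores as read manually from each VM's configuration. The VM names and sizes below are hypothetical placeholders, not my real setup.

```python
HOST_CORES = 16  # physical cores of the Ryzen 5950X


def total_vcpus(vms):
    """Sum sockets * cores over all VMs; vms maps name -> (sockets, cores)."""
    return sum(sockets * cores for sockets, cores in vms.values())


def is_underprovisioned(vms, host_cores=HOST_CORES):
    """True if all VMs together use fewer vCPUs than the host has cores."""
    return total_vcpus(vms) < host_cores


# Hypothetical VM names and sizes, for illustration only.
vms = {"vm-a": (1, 4), "vm-b": (1, 4), "vm-c": (2, 2)}

print(total_vcpus(vms), is_underprovisioned(vms))
```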
Do you have any ideas or pointers on how I might narrow down the culprit? Do you need any additional information?