Unable to resolve sudden hard reboots

some3t1m3s

For weeks I've been trying to resolve the following issue: my Proxmox server randomly reboots for no apparent reason. It's a 'hard' reboot, as if someone pressed the reset button. This happens anywhere between every hour and every three days.

The systemd journal shows nothing noteworthy before the reboot; usually there aren't any entries directly before the reboot anyway, just "-- Boot [..]". The reboots are completely unrelated to load, and I'm unable to provoke one, e.g. by loading a VM with Prime95. The BMC also doesn't log anything, in particular not the "Vcore 0.0 V" error that the MC12-LE0 produces with a Ryzen 5950X under certain circumstances.
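In case it matters, this is roughly how I've been checking the previous boot (last journal entries, plus kernel messages filtered for hardware errors); nothing shows up:
Code:
# Last entries of the boot that ended in the reboot
journalctl -b -1 -n 50
# Kernel messages from that boot, filtered for MCE/panic/watchdog
journalctl -k -b -1 | grep -iE 'mce|panic|oops|watchdog'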

My initial configuration:
  • Gigabyte MC12-LE0
  • Ryzen 5950X
  • 4x Samsung 32GB DDR4-3200 CL-22-22-22
  • 3x Crucial MX500 250GB
  • Mellanox Connect-X 3 CX311A
  • Seasonic Gold Focus 450W
The case is well ventilated. When I push the CPU with normal workloads it eventually reaches 75°C, which seems to be the limit; all other temperatures reported by 'sensors' are well below that. I can touch the Connect-X3 heatsink with my hand and it doesn't even feel warm.

Things I tried so far:
  • Read basically every thread on the internet mentioning "Ryzen", "Proxmox" and "reboot"
  • Removed all 'optional' hardware (not listed above, as it's obviously unrelated to the problem)
  • Updated Proxmox to the current version
  • Updated the BIOS and BMC
  • Replaced the mainboard with a new ASRock X570D4U
  • Updated the BIOS and BMC
  • Installed amd64-microcode
  • Replaced the PSU with a new bequiet! Pure Power 12 M 750W
  • Stress-tested the RAM with memtest86+
  • Disabled Core Watchdog in the BIOS (and re-enabled it after it didn't help; the same goes for all other BIOS options mentioned)
  • Enabled Eco Mode in BIOS
  • Disabled PBO
  • Manually lowered the TDP and the boost limits even further
  • Disabled Global C-States in BIOS
  • Changed the Power Supply Idle Control setting
  • Set all Windows VMs' CPU type to x86-64-v2-AES (command shown after this list)
  • Disabled all Windows VMs altogether
  • Underprovisioned the remaining VMs so vCPUs < actual CPU cores (16)
  • Installed optional 6.11 kernel
  • Considered having a breakdown and applying for the job with the least amount of responsibility
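For reference, the CPU-type change above is a one-liner per VM (100 is an example VMID):
Code:
# Set a Windows VM's CPU type to x86-64-v2-AES
qm set 100 --cpu x86-64-v2-AES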
Things I haven't tried so far:
  • Exchanging the CPU
  • Exchanging the RAM despite the fact that it passes memory tests
  • Replacing the Connect-X 3
  • Disabling all VMs altogether
I'd like to point out that I have a similar setup running TrueNAS Scale (Gigabyte MC12-LE0, Ryzen 5600, Connect-X 3 CX311A, 4x 32GB Kingston Server ECC RAM) that runs so rock-solid I didn't even bother to update its BIOS.

Do you have any idea or direction for how I might narrow down the culprit? Do you need any additional information?
 
We were hit by similar problems in the past (April to June 2024); back then we solved them by switching to other mainboards.

But recently we had 2 unexplained hard reboots again.

My 42nd round of research turned up https://www.reddit.com/r/Proxmox/comments/1hanfrj/possible_fix_for_random_reboots_on_proxmox_83/ and https://arstechnica.com/civis/threads/why-did-my-proxmox-box-randomly-reboot-itself.1506147/ and, via those, https://forum.proxmox.com/threads/w...og-issue-on-asrock-x670e-motherboards.158814/ , all of which pretty much propose switching off the hardware watchdog.

Maybe this helps?

I'd be very much interested if it does help ;-)
 
Thank you for your response. As for the different watchdogs mentioned (are they separate things?):
I've already disabled the Core Watchdog on both mainboards; it did not solve the problem. However, I'll try disabling the watchdog mentioned in relation to IPMI in one of the threads. There are also some kernel parameters floating around in these threads; while I don't feel great about such workarounds, maybe they'll help narrow down the problem.
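In case it helps anyone following along: disabling a hardware watchdog driver usually comes down to blacklisting its kernel module (on these AMD boards I'd expect sp5100_tco, but I'll check with lsmod first), and the kernel parameters I keep seeing are along the lines of the GRUB line below. A sketch of the suggestions, not a confirmed fix:
Code:
# Which watchdog/IPMI modules are actually loaded? (sp5100_tco is my guess for an AMD FCH board)
lsmod | grep -Ei 'tco|wdt|ipmi'

# Blacklist the module so it no longer arms the hardware watchdog
echo "blacklist sp5100_tco" > /etc/modprobe.d/disable-sp5100-watchdog.conf
update-initramfs -u

# Kernel parameters often suggested for Ryzen idle issues (examples, not verified here),
# added to /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet idle=nomwait processor.max_cstate=5"
# then run update-grub and reboot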
I just got a new network card and will test it, and I've also opted to replace the CPU. Let's see if this leads anywhere.
 
Thanks, interesting.

FYI, 1 (and only 1) of our 2 recent reboots has this at the end of the previous boot's log:
Code:
Mai 21 02:13:10 pveipax02 watchdog-mux[2540416]: client watchdog expired - disable watchdog updates
Mai 21 02:13:13 pveipax02 CRON[184207]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Mai 21 02:13:13 pveipax02 CRON[184208]: (root) CMD (/usr/sbin/mt-pvecm-nodecount 2>&1 >/var/log/montools/mt-pvecm-nodecount.log)
Mai 21 02:13:13 pveipax02 watchdog-mux[2540416]: exit watchdog-mux with active connections
Mai 21 02:13:13 pveipax02 pve-ha-lrm[2540489]: loop take too long (60 seconds)
Mai 21 02:13:13 pveipax02 systemd-journald[887]: Received client request to sync journal.
Mai 21 02:13:14 pveipax02 kernel: watchdog: watchdog0: watchdog did not stop!

The other reboot happened during somewhat high network + disk I/O together with the virtio network driver (which we now avoid in favour of the e1000 driver).
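(Switching the model is just a change to the VM's network device, e.g. via qm set; the VMID and bridge below are examples.)
Code:
# Change a VM's first NIC from virtio to e1000 (104 and vmbr0 are examples)
qm set 104 --net0 e1000,bridge=vmbr0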
 
Hello,

> Mai 21 02:13:10 pveipax02 watchdog-mux[2540416]: client watchdog expired - disable watchdog updates

Looks like the software watchdog expired, which results in the node fencing itself (rebooting); please take a look at our documentation [1] for more information. This happens whenever a host loses Corosync quorum for over a minute (and only if there are HA resources configured). Corosync is extremely sensitive to latency spikes, so if its network is saturated, Corosync will deem it unusable.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
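To check whether fencing can apply to a node at all, and how Corosync currently sees its links, the standard tooling is enough, for example:
Code:
# Fencing only happens if HA resources are configured
ha-manager status
# Quorum and cluster membership
pvecm status
# Per-link status as seen by corosync/knet
corosync-cfgtool -s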
 
> Looks like ...

In this one case, yes.

I have looked at the documentation many times, and I have looked at the log entries from the minutes before the reboot ;-)

Still wondering why the watchdog hit, though. We monitor the load (i.e. /proc/loadavg) every 5 minutes; there was a bit of load, probably from the nightly backups, but a load of around 2.5 should not bother a machine with 16 cores (AMD Ryzen 9 7950X3D), 128 GB RAM (usage around 40 GB) and 2x 2 TB NVMe. PVE's own monitoring dashboard doesn't show any particular peaks either (except for a Gitlab explosion the next day).

It could have been a nightly network hiccup of course, sigh (I'd like configurable fencing timing).
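For next time I'll probably log Corosync's own view of its links alongside loadavg; a rough sketch (run from cron, the log path is arbitrary):
Code:
#!/bin/sh
# Record quorum state and per-link status every few minutes,
# so a nightly hiccup is still visible the morning after.
{
  date -Is
  pvecm status | grep -iE 'quorate|expected'
  corosync-cfgtool -s
} >> /var/log/corosync-linkcheck.log 2>&1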

Thanks for joining in anyway!
 
I've searched the logs and would like to point out that there is no "client watchdog expired" error message in my case.
 
Still wondering why the watchdog hit though

As I said, the network Corosync is configured on was unusable (from Corosync's point of view) for a whole minute. We recommend:

- Having a dedicated NIC for Corosync; this can be a 1G NIC
- Having at least one redundant link for Corosync
- Giving the NICs to Corosync directly, e.g. its first link should not sit on a Linux bond

Ceph, VM backups, or just VM traffic can easily saturate a 10G link and lead to node fencing if Corosync is listening on the same interface.
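As a rough illustration, a redundant setup ends up with two links per node in /etc/pve/corosync.conf; the addresses below are placeholders, please follow the referenced documentation for the exact procedure:
Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # dedicated Corosync network (1G is fine)
    ring1_addr: 10.10.20.1   # redundant second link
  }
  # ... further nodes accordingly
}

totem {
  # ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}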

We recommend keeping the latency below 5 ms at all times for Corosync [1], ideally below 1 ms.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_cluster_network_requirements