Unable to resolve sudden hard reboots

some3t1m3s

For weeks I've been trying to resolve the following issue: my Proxmox server randomly reboots for no apparent reason. It's a 'hard' reboot, as if someone pressed the reset button. This happens anywhere between every hour and every three days.

The systemd journal shows nothing noteworthy before the reboot; usually there aren't any entries directly before the reboot anyway, just "-- Boot [..]". The reboots are completely unrelated to load, and I'm unable to provoke one, e.g. by loading a VM with Prime95. The BMC also doesn't log anything, in particular not the "Vcore 0.0 V" error that the MC12-LE0 produces with a Ryzen 5950X under certain circumstances.
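In case it matters, this is roughly how I've been checking the previous boot (last journal entries, plus kernel messages filtered for hardware errors); nothing shows up:
Code:
# Last entries of the boot that ended in the reboot
journalctl -b -1 -n 50
# Kernel messages from that boot, filtered for MCE/panic/watchdog
journalctl -k -b -1 | grep -iE 'mce|panic|oops|watchdog'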

My initial configuration:
  • Gigabyte MC12-LE0
  • Ryzen 5950X
  • 4x Samsung 32GB DDR4-3200 CL-22-22-22
  • 3x Crucial MX500 250GB
  • Mellanox Connect-X 3 CX311A
  • Seasonic Gold Focus 450W
The case is well ventilated. When I push the CPU with normal workloads it eventually reaches 75°C, which seems to be the limit; all other temperatures reported by 'sensors' are well below that. I can touch the Connect-X3 heatsink with my hand and it doesn't even feel warm.

Things I tried so far:
  • Read basically every thread on the internet mentioning "Ryzen", "Proxmox" and "reboot"
  • Removed all 'optional' hardware (not listed above, as it's obviously unrelated to the problem)
  • Updated Proxmox to the current version
  • Updated the BIOS and BMC
  • Replaced the mainboard with a new ASRock X570D4U
  • Updated the BIOS and BMC
  • Installed amd64-microcode
  • Replaced the PSU with a new bequiet! Pure Power 12 M 750W
  • Stress-tested the RAM with memtest86+
  • Disabled Core Watchdog in the BIOS (and re-enabled it after it didn't help; the same goes for all other BIOS options mentioned)
  • Enabled Eco Mode in BIOS
  • Disabled PBO
  • Manually lowered the TDP and the boost limits even further
  • Disabled Global C-States in BIOS
  • Changed the Power Supply Idle Control setting
  • Set all Windows VMs' CPU type to x86-64-v2-AES (command shown after this list)
  • Disabled all Windows VMs altogether
  • Underprovisioned the remaining VMs so vCPUs < actual CPU cores (16)
  • Installed optional 6.11 kernel
  • Considered having a breakdown and applying for the job with the least amount of responsibility
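For reference, the CPU-type change above is a one-liner per VM (100 is an example VMID):
Code:
# Set a Windows VM's CPU type to x86-64-v2-AES
qm set 100 --cpu x86-64-v2-AES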
Things I haven't tried so far:
  • Exchanging the CPU
  • Exchanging the RAM despite the fact that it passes memory tests
  • Replacing the Connect-X 3
  • Disabling all VMs altogether
I'd like to point out that I have a similar setup running TrueNAS Scale (Gigabyte MC12-LE0, Ryzen 5600, Connect-X 3 CX311A, 4x 32GB Kingston Server ECC RAM) that runs so rock-solid I didn't even bother to update its BIOS.

Do you have any idea or direction for how I might narrow down the culprit? Do you need any additional information?
 
We were hit by similar problems in the past (April to June 2024); back then we solved them by switching to other mainboards.

But recently we had 2 unexplained hard reboots again.

My 42nd round of research turned up https://www.reddit.com/r/Proxmox/comments/1hanfrj/possible_fix_for_random_reboots_on_proxmox_83/ and https://arstechnica.com/civis/threads/why-did-my-proxmox-box-randomly-reboot-itself.1506147/ and, via those, https://forum.proxmox.com/threads/w...og-issue-on-asrock-x670e-motherboards.158814/ , all of which pretty much propose switching off the hardware watchdog.

Maybe this helps?

I'd be very much interested if it does help ;-)
 
Thank you for your response. As for the different watchdogs mentioned (are they separate things?):
I've already disabled the Core Watchdog on both mainboards; it did not solve the problem. However, I'll try disabling the watchdog mentioned in relation to IPMI in one of the threads. There are also some kernel parameters floating around in these threads; while I don't feel great about such workarounds, maybe they'll help narrow down the problem.
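In case it helps anyone following along: disabling a hardware watchdog driver usually comes down to blacklisting its kernel module (on these AMD boards I'd expect sp5100_tco, but I'll check with lsmod first), and the kernel parameters I keep seeing are along the lines of the GRUB line below. A sketch of the suggestions, not a confirmed fix:
Code:
# Which watchdog/IPMI modules are actually loaded? (sp5100_tco is my guess for an AMD FCH board)
lsmod | grep -Ei 'tco|wdt|ipmi'

# Blacklist the module so it no longer arms the hardware watchdog
echo "blacklist sp5100_tco" > /etc/modprobe.d/disable-sp5100-watchdog.conf
update-initramfs -u

# Kernel parameters often suggested for Ryzen idle issues (examples, not verified here),
# added to /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet idle=nomwait processor.max_cstate=5"
# then run update-grub and reboot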
I just got a new network card and will test it, and I've also opted to replace the CPU. Let's see if this leads anywhere.
 
Thanks, interesting.

FYI, 1 (and only 1) of our 2 recent reboots has this at the end of the previous boot's log:
Code:
Mai 21 02:13:10 pveipax02 watchdog-mux[2540416]: client watchdog expired - disable watchdog updates
Mai 21 02:13:13 pveipax02 CRON[184207]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Mai 21 02:13:13 pveipax02 CRON[184208]: (root) CMD (/usr/sbin/mt-pvecm-nodecount 2>&1 >/var/log/montools/mt-pvecm-nodecount.log)
Mai 21 02:13:13 pveipax02 watchdog-mux[2540416]: exit watchdog-mux with active connections
Mai 21 02:13:13 pveipax02 pve-ha-lrm[2540489]: loop take too long (60 seconds)
Mai 21 02:13:13 pveipax02 systemd-journald[887]: Received client request to sync journal.
Mai 21 02:13:14 pveipax02 kernel: watchdog: watchdog0: watchdog did not stop!

The other reboot happened during somewhat high network + disk I/O together with the virtio network driver (which we now avoid in favour of the e1000 driver).
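(Switching the model is just a change to the VM's network device, e.g. via qm set; the VMID and bridge below are examples.)
Code:
# Change a VM's first NIC from virtio to e1000 (104 and vmbr0 are examples)
qm set 104 --net0 e1000,bridge=vmbr0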
 
Hello,

> Mai 21 02:13:10 pveipax02 watchdog-mux[2540416]: client watchdog expired - disable watchdog updates

Looks like the software watchdog expired, which results in the node fencing itself (rebooting); please take a look at our documentation [1] for more information. This happens whenever a host loses Corosync quorum for over a minute (and only if there are HA resources configured). Corosync is extremely sensitive to latency spikes, so if its network is saturated, Corosync will deem it unusable.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
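To check whether fencing can apply to a node at all, and how Corosync currently sees its links, the standard tooling is enough, for example:
Code:
# Fencing only happens if HA resources are configured
ha-manager status
# Quorum and cluster membership
pvecm status
# Per-link status as seen by corosync/knet
corosync-cfgtool -s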
 
> Looks like ...

In this one case, yes.

I have looked at the documentation many times, and I have looked at the log entries from the minutes before the reboot ;-)

Still wondering why the watchdog hit, though. We monitor the load (i.e. /proc/loadavg) every 5 minutes; there was a bit of load, probably from the nightly backups, but a load of around 2.5 should not bother a machine with 16 cores (AMD Ryzen 9 7950X3D), 128 GB RAM (usage around 40 GB) and 2x 2 TB NVMe. PVE's own monitoring dashboard doesn't show any particular peaks either (except for a Gitlab explosion the next day).

It could have been a nightly network hiccup of course, sigh (I'd like configurable fencing timing).
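For next time I'll probably log Corosync's own view of its links alongside loadavg; a rough sketch (run from cron, the log path is arbitrary):
Code:
#!/bin/sh
# Record quorum state and per-link status every few minutes,
# so a nightly hiccup is still visible the morning after.
{
  date -Is
  pvecm status | grep -iE 'quorate|expected'
  corosync-cfgtool -s
} >> /var/log/corosync-linkcheck.log 2>&1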

Thanks for joining in anyway!
 
I've searched the logs and would like to point out that there is no "client watchdog expired" error message in my case.
 
Still wondering why the watchdog hit though

As I said, the network Corosync is configured on was unusable (from Corosync's point of view) for a whole minute. We recommend:

- Having a dedicated NIC for Corosync; this can be a 1G NIC
- Having at least one redundant link for Corosync
- Giving the NICs to Corosync directly, e.g. its first link should not sit on a Linux bond

Ceph, VM backups, or just VM traffic can easily saturate a 10G link and lead to node fencing if Corosync is listening on the same interface.
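As a rough illustration, a redundant setup ends up with two links per node in /etc/pve/corosync.conf; the addresses below are placeholders, please follow the referenced documentation for the exact procedure:
Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # dedicated Corosync network (1G is fine)
    ring1_addr: 10.10.20.1   # redundant second link
  }
  # ... further nodes accordingly
}

totem {
  # ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}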

We recommend keeping the latency below 5 ms at all times for Corosync [1], ideally below 1 ms.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_cluster_network_requirements