Proxmox Mystery Random Reboots

Is there a way to debug randomly rebooting machines, some tool that catches why a server reboots? Set and forget, but when the server randomly reboots, you know what happened?

Example

We have new machines from ASRock (1U2S-B650) that randomly reboot; the motherboard is a B650D4U, FW @ 10.15.
Kernel and IPMI logs show nothing interesting, the server just goes down.

/proc/cmdline

BOOT_IMAGE=/boot/vmlinuz-6.8.8-3-pve root=UUID=9342a4c5-b779-486d-b9c0-c42184f02c5b ro quiet pcie_port_pm=off pcie_aspm.policy=performance nvme_core.default_ps_max_latency_us=0

Tried memtest, no errors. Tried Hiren's BootCD with the Prime95 blend test, still running and nothing wrong so far.

I don't know, the dealer doesn't know, and ASRock tech support probably doesn't know either :) (they are horrible)

Do you have some examples/stories of what you use to debug these server fuckups?
Thanks

The hard part is that this is not a graceful shutdown by any stretch. The machine just dies.

- For software: Stop all VMs/LXCs, and remove the node from any clusters. Does it still happen? If not, reintroduce the cluster. Still good? Reintroduce one VM/LXC at a time.
- For hardware: Pull everything except the CPU, one drive, and one stick of RAM. Reduce the server's workload to fit. Does it still crash? On and on.

I started with software. Then I worked on hardware. After reducing my server load to almost nothing, I found my HBA was the "issue". Removed that, and it was fine, until it wasn't. Then I reduced server load again, and it never rebooted. Weird. Started pulling everything on the board, but nothing added up. Eventually found out it was my PSU. Ordered another one, it was also bad. Ordered another and it worked great after that.

Not sure why the computer could run as a gaming rig, but not as a hypervisor with a problematic PSU.
 
If you are troubleshooting these things, set panic=0 on the kernel cmdline for the affected hardware or sysctl -w kernel.panic="0" - that way a kernel panic will not automatically reboot and you'll be able to read the error.
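
As a quick check that the setting above is actually in effect, here is a minimal Python sketch. The helper names `parse_cmdline` and `panic_reboot_disabled` are made up for illustration, and the parsing is naive (it splits on whitespace, so quoted values containing spaces are not handled). It reads `/proc/cmdline` for the boot-time parameter and `/proc/sys/kernel/panic` for the value currently in effect.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {parameter: value}; bare flags map to None.

    Naive on purpose: splits on whitespace and on the first '=' of each token.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def panic_reboot_disabled(cmdline: str) -> bool:
    """True only if panic=0 is explicitly present on the command line."""
    return parse_cmdline(cmdline).get("panic") == "0"


def main():
    cmdline_file = Path("/proc/cmdline")
    sysctl_file = Path("/proc/sys/kernel/panic")
    if cmdline_file.exists():
        print("panic=0 on cmdline:", panic_reboot_disabled(cmdline_file.read_text()))
    if sysctl_file.exists():
        # The sysctl value is what the kernel uses right now, whatever the cmdline said.
        print("kernel.panic now:", sysctl_file.read_text().strip())


if __name__ == "__main__":
    main()
```

A value of 0 in `/proc/sys/kernel/panic` means the kernel will not reboot by itself after a panic, so the message stays on the console.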

With panic=0, a SOFTWARE fault will no longer reboot the system automatically. Proper servers with management modules (e.g. iDRAC, Supermicro IPMI, iLO) have options to screenshot or record video of the last few seconds of the console before the system reboots. If you don't have that feature, just write a script to take an IPMI screenshot every few seconds, or save the serial port output (if your kernel outputs to serial) to a log file.
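
The "save the console output to a log file" idea could look roughly like the Python sketch below. It assumes you already have some command that prints the console text (for example a serial-over-LAN session from the BMC, or a reader on a serial device) and that you pipe it into this script on a different machine, so the log survives the crash. The function name `log_stream` and the default file name are made up for illustration.

```python
import os
import sys
import time


def log_stream(source, log_path, clock=time.time):
    """Append each line from `source` to `log_path`, prefixed with a timestamp.

    Every line is flushed and fsync'ed so the last lines before a crash
    are on disk rather than sitting in a buffer. Returns the number of lines written.
    """
    count = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for line in source:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
            log.write(f"{stamp} {line.rstrip(chr(13) + chr(10))}\n")
            log.flush()
            os.fsync(log.fileno())
            count += 1
    return count


# Only read stdin when a log path is given explicitly, e.g.:
#   <your console command> | python3 console_logger.py console.log
if __name__ == "__main__" and len(sys.argv) == 2:
    log_stream(sys.stdin, sys.argv[1])
```

After an unexplained reboot, the tail of the log shows the last thing the kernel printed and when. If the log simply stops with no panic message, that points toward hardware or power rather than software.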

If it is a hardware fault, server hardware can, again, most likely self-diagnose memory, CPU and other errors. If you're on consumer hardware, you're going to be manually troubleshooting which part is causing it. Reboots in consumer systems are often caused by heat or power problems (too many things in a case that wasn't intended for server loads), so that's where I would look first. The next thing to go is typically the boot drive or its controller (esp. with SATA or USB drives), then memory, then CPU. Modern boards should be able to kick out PCIe cards that are acting up without crashing, but that may not hold on older hardware, with power/heat related issues, or if that PCIe card is your boot disk controller.
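
For the "look at heat first" step, one option is to record temperatures regularly so you can see what they were just before a reboot. The Python sketch below reads the Linux hwmon sysfs files (`temp*_input`, reported in millidegrees Celsius). It assumes your board's sensors are exposed through a hwmon driver at all, and sensors that are only visible to the BMC will not show up here. The names `read_temperatures` and `append_snapshot` are made up for illustration.

```python
import os
import sys
import time
from pathlib import Path


def read_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN:<chip>/<sensor>": degrees Celsius} for every readable temp*_input."""
    readings = {}
    for chip in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else "unknown"
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors refuse reads or return junk; skip them
            readings[f"{chip.name}:{chip_name}/{sensor.name}"] = millidegrees / 1000.0
    return readings


def append_snapshot(log_path, hwmon_root="/sys/class/hwmon"):
    """Append one timestamped line with all current readings and sync it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    temps = read_temperatures(hwmon_root)
    fields = " ".join(f"{key}={value:.1f}" for key, value in temps.items())
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} {fields}\n")
        log.flush()
        os.fsync(log.fileno())


if __name__ == "__main__":
    if len(sys.argv) == 2:
        append_snapshot(sys.argv[1])
    else:
        print(read_temperatures())
```

Run it from cron or a systemd timer with a log path as the only argument. A log that ends at normal temperatures makes heat a less likely suspect and shifts attention to power or the board itself.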
 
Wondering if these random reboots have anything to do with the memory not being on the QVL.

Has anyone experienced random reboots when using QVL memory for the motherboard?
 
From our experience, the cause of the "random reboots" is the following:
"We have new machines from asrock - 1U2S-B650 which randomly reboots, its mobo B650D4U FW @ 10.15"

Until now, 10 of our servers with ASRock B650D4U boards have been affected. Replacing the motherboard with a Supermicro model fixes the issue for good right away.
It has been said that some batches of these ASRock boards are faulty. No details are available, except that boards with serial numbers beginning with H5 or H6 should be fine. (All our impacted ones were H1-S0xexxx. They were replaced by ASRock Rack support, no questions asked, but we are not too confident about putting them back in production now...)

We used mostly Crucial Pro DDR5 and Corsair Vengeance DDR5 RAM, so not on the ASRock QVL, BUT we noticed that the issue mostly happens after a few months of operation, so it would be surprising if it was memory related. One of the dead boards had been running fine for 6 months.

Also interesting reads:
https://forum.level1techs.com/t/asrock-b650d4u-code-00-server-motherboard-failure/216110/21
https://forum.asrock.com/forum_posts.asp?TID=40795&PID=156897&title=post-code-00#156897

On our side we are replacing them with Supermicro and wondering what we are going to do with the B650D4U boards we received in exchange from RMA (which are new models).
 
Thank you for your helpful post!
Exactly my problem too: a B650D4U with serial number H4-S0R60000xx, BIOS version 20.05 and BMC version 5.03.00 is stuck at Dr. Debug code 00 after 6 months of random reboots in Proxmox.

I got this board as a replacement for another failed board with serial number: H1-S0XE0016xx

I would like to avoid ASRock Rack boards completely. Looking into the Supermicro H13SAE and other consumer boards with iKVM. Wondering what consumer board options are out there.
 
Supermicro is a nice consumer/prosumer brand. HP and Dell both have workstations/office-grade devices (so not jet engine servers) with iLO and iDRAC respectively. All of those can be had for cheap if you're looking second hand too. TYAN used to have great stuff, not sure if they're still around, it's been a while.
 
On our side, so far, great experience with the Supermicro H13SAE, and we just received a MicroCloud with 8*H13SRD-F, curious to see the power footprint compared to individual nodes. Anyway, it's a shame about this issue with ASRock. We had a good experience with the X570D4U. They should communicate about the issue and recall all the boards / offer some coupon on another model / do something.