Proxmox Mystery Random Reboots

Is there a way to debug randomly rebooting machines - some tool that catches why the server reboots? Something you can set and forget, so that when the server randomly reboots, you know what happened?

Example

We have new machines from ASRock - 1U2S-B650 - which randomly reboot. The motherboard is a B650D4U on firmware 10.15.
The kernel and IPMI logs have nothing interesting; the server just goes down.
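
For anyone wanting the exact checks, this is roughly what I looked at (a rough sketch; journalctl -b -1 only works with persistent journald logging, and ipmitool has to be installed for the SEL query):

# Kernel/system log from the previous boot, warnings and worse
journalctl -b -1 -p warning

# BMC / IPMI system event log
ipmitool sel elist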

/proc/cmdline

BOOT_IMAGE=/boot/vmlinuz-6.8.8-3-pve root=UUID=9342a4c5-b779-486d-b9c0-c42184f02c5b ro quiet pcie_port_pm=off pcie_aspm.policy=performance nvme_core.default_ps_max_latency_us=0

Tried memtest with no errors, and Hiren's BootCD with the Prime95 blend test - it's still running with nothing wrong.

I don't know, the dealer doesn't know, and ASRock tech support probably doesn't know either :) (they are horrible)

Do you have any examples/stories of what you use to debug these server fuckups?
Thanks
The hard part is that this is not a graceful shutdown by any stretch. It just dies.

- For software: Stop all VMs/LXCs and remove the node from any clusters (see the sketch after this list for a quick way to stop everything). Does it still happen? If not, reintroduce the cluster. Still good? Reintroduce one VM/LXC at a time.
- For hardware: Pull everything except the CPU, one drive, and one stick of RAM, and reduce the server's workload to match. Does it still crash? And so on.
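
For the software-isolation step, a rough sketch of stopping every guest from the shell (this assumes the standard qm/pct tools on a Proxmox node; sanity-check the ID parsing against your own output before running it):

# Stop all VMs (qm list prints the VMID in the first column, skip the header)
for vmid in $(qm list | awk 'NR>1 {print $1}'); do
    qm stop "$vmid"
done

# Stop all LXC containers
for ctid in $(pct list | awk 'NR>1 {print $1}'); do
    pct stop "$ctid"
done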

I started with software, then I worked on hardware. After reducing my server load to almost nothing, I found my HBA was the "issue". Removed that, and it was fine, until it wasn't. Then I reduced the server load again, and it never rebooted. Weird. I started pulling everything off the board, but nothing added up. Eventually I found out it was my PSU. Ordered another one; it was also bad. Ordered a third, and it worked great after that.

Not sure why the computer could run as a gaming rig, but not as a hypervisor with a problematic PSU.
 
If you are troubleshooting these things, set panic=0 on the kernel cmdline for the affected hardware, or run sysctl -w kernel.panic=0 - that way a kernel panic will not automatically reboot the machine and you'll be able to read the error on the console.
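
To make that stick across reboots, a quick sketch (assumes a GRUB-booted node; on systemd-boot/ZFS installs the cmdline lives in /etc/kernel/cmdline and is applied with proxmox-boot-tool refresh):

# Runtime only (lost on reboot)
sysctl -w kernel.panic=0

# Persistent via sysctl
echo 'kernel.panic = 0' > /etc/sysctl.d/90-no-panic-reboot.conf
sysctl --system

# Or add panic=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
update-grub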

This means that if there is a SOFTWARE fault, your system will not automatically reboot. Proper servers with management modules (e.g. iDRAC, Supermicro IPMI, iLO) have options to screenshot or record video of the last few seconds of the console before the system reboots. If you don't have that feature, just write a script that takes an IPMI screenshot every few seconds, or save the serial port output (if your kernel outputs to serial) to a log file.
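
For the serial-output route, a rough sketch that captures the Serial-over-LAN console (assumes a BMC reachable over the network with lanplus, a kernel cmdline that includes something like console=ttyS0,115200, and placeholder host/credentials; run it from a machine other than the one that crashes so the log survives):

# Log the SOL console with timestamps until the target dies
ipmitool -I lanplus -H BMC_IP -U ADMIN -P PASSWORD sol activate | \
    while IFS= read -r line; do
        echo "$(date '+%F %T') $line"
    done >> /var/log/sol-console.log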

If it is a hardware fault, server hardware can again most likely self-diagnose memory, CPU, and other errors. If you're on consumer hardware, you're going to be manually troubleshooting which part is causing it. Reboots in consumer systems are often caused by heat or power problems (too many things in a case that wasn't intended for server loads), so that's where I would look first. The next most likely culprit is a boot drive / boot drive controller problem (especially with SATA or USB drives), then memory, then CPU. Modern boards should be able to kick out a PCIe card that is acting up without crashing, but not older hardware - or, again, it may be a power/heat-related issue, or that PCIe card may be your boot disk controller.
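
To rule the heat/power angle in or out, a minimal sketch that logs sensor readings every few seconds so you can see the last values before a crash (assumes ipmitool on the node; use sensors from lm-sensors on boards without a BMC):

# Append temperature/voltage/fan readings to a log every 10 seconds
while true; do
    echo "=== $(date '+%F %T') ===" >> /var/log/sensor-history.log
    ipmitool sdr elist >> /var/log/sensor-history.log
    sleep 10
done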
 
Wondering if these random reboots have anything to do with the memory not being on the QVL.

Has anyone experienced random reboots even when using QVL memory for their motherboard?