The hard part is that this is not a graceful shutdown by any stretch; it just kills the machine. Is there a way to debug randomly rebooting machines, some tool that catches why the server reboots? Set and forget, so that when the server randomly reboots, you know what happened?
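(For reference, one set-and-forget option is netconsole: it streams kernel messages over UDP to another box the moment they are printed, so anything the kernel says right before the reset survives even if the local disk never flushes. A rough sketch; the IPs, interface name and log host are placeholders, and the log host is assumed to sit on the same L2 segment since the target MAC defaults to broadcast:

modprobe netconsole netconsole=6665@192.168.1.20/eno1,6666@192.168.1.50/
dmesg -n 8    # crank the console log level so debug-level messages go out too
# on the log host, capture with something like: nc -klu 6666 > netconsole.log

If the box dies in total silence even over netconsole, that points at a hardware or firmware level reset rather than a kernel crash.)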
Example
We have new machines from ASRock, the 1U2S-B650, which randomly reboot. The motherboard is the B650D4U with firmware 10.15.
Kernel and IPMI logs have nothing interesting; the server just goes down.
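(For anyone checking the same places: it is worth making sure the journal is persistent first, otherwise the messages from just before the reset may never hit disk. Roughly, assuming systemd-journald's default Storage=auto:

mkdir -p /var/log/journal && systemctl restart systemd-journald   # enable persistent journal if /var/log/journal was missing
journalctl --list-boots    # after the next crash
journalctl -b -1 -e        # jump to the tail of the previous boot
ipmitool sel elist         # BMC event log with decoded timestamps

If the SEL shows a power or watchdog event at the moment of the reset, that is a strong hint it is not the OS.)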
/proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.8-3-pve root=UUID=9342a4c5-b779-486d-b9c0-c42184f02c5b ro quiet pcie_port_pm=off pcie_aspm.policy=performance nvme_core.default_ps_max_latency_us=0
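(Another catch-all worth arming is kdump, with the caveat that it only fires if the kernel actually panics; if the board just power-cycles, as this looks like, there will be no dump. A rough Debian/Proxmox sketch, assuming GRUB manages the cmdline:

apt install kdump-tools              # on Debian this wires a crashkernel=... reservation into the cmdline
reboot
kdump-config show                    # should report it is ready to kdump
cat /sys/kernel/kexec_crash_loaded   # 1 means the crash kernel is loaded

After a panic the dump lands under /var/crash by default.)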
Tried memtest, no errors; tried Hiren's BootCD with a Prime95 blend test, which kept running with nothing wrong.
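(If it is easier to stress from inside the installed OS instead of a boot CD, stress-ng can do a rough blend-style run; the package names and duration here are just a suggestion:

apt install stress-ng lm-sensors
stress-ng --cpu 0 --vm 4 --vm-bytes 75% --timeout 4h --metrics-brief   # --cpu 0 means one worker per core
watch -n 5 sensors   # keep an eye on temperatures in a second shell

Sometimes a combined CPU plus memory load trips a marginal PSU or VRM where a pure CPU test does not.)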
I don't know, the dealer doesn't know, and ASRock tech support probably doesn't know either (they are horrible).
Do you have any examples/stories of what you use to debug these server fuckups?
Thanks
- For software: stop all VMs/LXCs and remove the node from any cluster (a shell sketch for stopping all guests follows this list). Does it still happen? If not, reintroduce the cluster. Still good? Reintroduce one VM/LXC at a time.
- For hardware: pull everything except the CPU, one drive, and one stick of RAM. Reduce the server's workload to accommodate. Does it still crash? On and on.
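A quick sketch of the "stop all VMs/LXCs" step on a Proxmox node, assuming the stock qm/pct tools and that HA is not configured to restart the guests behind your back:

for id in $(qm list | awk 'NR>1 {print $1}'); do qm shutdown "$id" --timeout 60 || qm stop "$id"; done
for id in $(pct list | awk 'NR>1 {print $1}'); do pct shutdown "$id" || pct stop "$id"; done

The shutdown commands ask the guests nicely first; stop is the fallback if a guest hangs.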
I started with software, then worked on hardware. After reducing my server load to almost nothing, I found my HBA was the "issue". Removed that, and it was fine, until it wasn't. Then I reduced the server load again, and it never rebooted. Weird. I started pulling everything on the board, but nothing added up. Eventually I found out it was my PSU. Ordered another one; it was also bad. Ordered another and it worked great after that.
Not sure why the computer could run as a gaming rig, but not as a hypervisor with a problematic PSU.