Proxmox hangs: difficulty finding the cause

Singman

Hi,

For months, my Proxmox setup has been hanging very often: sometimes it is stable for days, sometimes it cannot stay up for more than a couple of minutes.
Setup
Motherboard : ASUS Pro WS X570-ACE
CPU : AMD Ryzen 9 5900X
Memory : 2x Kingston Server Premier 16 GB (DDR4 ECC CL19 DIMM 2Rx8 server memory, Hynix D - KSM26ED8/16HD)
Memory : 2x Kingston Server Premier 8 GB (DDR4 ECC CL19 DIMM 1Rx8 server memory, Hynix D - KSM26ES8/8HD)
Boot : M.2 2280 NVMe SSD, 480 GB
Storage : 4x Crucial MX500 1 TB SSD (CT1000MX500SSD1)
Cooling : Water cooling, Fractal Design Celsius S24 Blackout

First, I checked the memory: no errors, but I used the warranty anyway and got 2 new RAM modules. I also bought 2 brand-new 8 GB modules to try (and kept them :))
Then I changed the motherboard; it still crashes.

I have absolutely NO MESSAGE in dmesg or journalctl before a crash; the system just hangs and nothing works. There is no warning beforehand. It just happens.
Memtest86+ gives no errors.
I used a Linux stress-test USB key to check that the CPU works properly: no errors, no temperature limit reached.
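
Since nothing reaches the logs before a hang, one way to at least timestamp each hang is a heartbeat file that is flushed to disk on every write. The following is only a minimal sketch under assumptions: the function names, the path `/var/log/heartbeat.log`, and the 10-second interval are hypothetical choices, not anything Proxmox provides. After a forced reboot, the last line of the file gives the approximate time the host stopped responding, which can then be compared with backup jobs, VM activity, or other scheduled tasks.

```python
import os
import time
from datetime import datetime, timezone
from typing import Optional


def write_heartbeat(path: str) -> str:
    """Append one UTC timestamp to `path` and ask the OS to write it to disk."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(stamp + "\n")
        fh.flush()
        # fsync pushes the data past the page cache, so a hard hang
        # is much less likely to lose the most recent line.
        os.fsync(fh.fileno())
    return stamp


def last_heartbeat(path: str) -> Optional[str]:
    """Return the last non-empty line of `path`, or None if there is none."""
    try:
        with open(path, encoding="utf-8") as fh:
            lines = [line.strip() for line in fh if line.strip()]
    except FileNotFoundError:
        return None
    return lines[-1] if lines else None


def run(path: str, interval_s: float = 10.0) -> None:
    """Write a heartbeat every `interval_s` seconds until the process stops."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical location; any path on persistent storage would do.
    run("/var/log/heartbeat.log")
```

After a hang, calling `last_heartbeat("/var/log/heartbeat.log")` (or simply looking at the end of the file) shows the last moment the host was known to be alive, accurate to roughly the chosen interval.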

I can only see one other possible source of the problem: software. This Proxmox installation is 5 years old and has been updated many times (current version is 8.4.1). But reinstalling everything, even though I have backups with PBS, would be hard work; it's my homelab and the configuration is not really straightforward.

Do you have any ideas? How can I debug this configuration?
 
I don't have anything similar to your setup, but in general this is what I would try:

I'd probably try updating the BIOS on that motherboard.

Try pinning a previous kernel to see if that removes the issues.

Another thing: you don't provide any GPU or network details. These can often be a source of issues.

If you are using the on-board NICs, maybe try an add-in NIC of your own instead.
(I see the board seems to have 2 NICs, Realtek® RTL8117 and Intel® I211-AT; maybe try swapping which one you use for what.)

(I also see your board uses ASUS LAN Guard. I'm not sure about this hardware-implemented feature, but if possible try deactivating it on the BIOS side.)

Anyway, good luck. As stated above, I don't use any of this hardware myself.