Random crashes of the node in cluster

drthrax232

Member
Apr 11, 2022
Hi!
Yesterday I had an issue with one of my PVE hosts. Zabbix monitoring showed the host as unresponsive, so I checked what was wrong. After logging in through the IPMI KVM I was presented with the standard PVE login screen, but it was completely frozen: not even the cursor was blinking, and CTRL + ALT + DEL did nothing either. It happened twice in total, about 2 hours apart.

I upgraded to version 8.4.1 and the issue seems to be fixed: the host has been stable for around 12 hours so far.

I found another thread describing a similar issue, https://forum.proxmox.com/threads/proxmox-mystery-random-reboots.125001/, and added this line to the GRUB configuration:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=assign-busses apicmaintimer idle=poll reboot=cold,hard"
It had no effect.
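
One thing worth ruling out is that the GRUB line never reached the running kernel. A change to /etc/default/grub only applies after update-grub and a reboot, and on hosts that boot through systemd-boot (for example ZFS root on UEFI) the command line lives in /etc/kernel/cmdline and is applied with proxmox-boot-tool refresh instead. Below is a minimal sketch for checking this; the helper name missing_params and the WANTED list are my own, not part of any Proxmox tooling.

```python
from pathlib import Path

# Parameters from the GRUB line above ("quiet" is left out, it is only cosmetic).
WANTED = ["pci=assign-busses", "apicmaintimer", "idle=poll", "reboot=cold,hard"]


def missing_params(cmdline: str, wanted: list[str]) -> list[str]:
    """Return the wanted parameters that are absent from a kernel command line."""
    active = set(cmdline.split())
    return [param for param in wanted if param not in active]


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the currently running kernel.
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        missing = missing_params(proc_cmdline.read_text(), WANTED)
        if missing:
            print("not active on the running kernel:", missing)
        else:
            print("all parameters are active")
```

If any parameter is reported as not active, the "no effect" result says nothing about the parameters themselves.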

Important things to note:
1. I have hit this issue multiple times on other servers with the same hardware configuration. Disks are healthy, RAM shows no read/write errors, and the CPUs and motherboards are fine.
2. On the same server that had these issues I installed Debian 12 and ran it for around 48 hours to see whether the problem would appear there. It was rock solid.
3. I have 5 other machines with the same setup and different PVE versions in the cluster. They are totally fine (kernel versions: 6.8.12-5 | 6.8.12-9 | 6.5.11-8 | 6.8.4-2 | 6.5.11-8); none of them shows this kind of malfunction.

Even though it looks stable for now: what could have caused it? Does anyone have any clues?
Sadly I forgot to record the package versions from the pre-upgrade state, but I will look for them in the logs.
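
For recovering the pre-upgrade versions, dpkg keeps a record in /var/log/dpkg.log (older entries end up in rotated files such as dpkg.log.1). Each upgrade line has the form "date time upgrade package:arch old-version new-version". A rough sketch of pulling those out follows; the function name upgrades_from_dpkg_log and the pve/proxmox name filter are my own choices.

```python
from pathlib import Path


def upgrades_from_dpkg_log(text: str) -> dict[str, tuple[str, str]]:
    """Map package name -> (old_version, new_version) for 'upgrade' entries.

    If a package was upgraded more than once, the last entry in the text wins.
    """
    result: dict[str, tuple[str, str]] = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected layout: date time "upgrade" package:arch old-version new-version
        if len(fields) == 6 and fields[2] == "upgrade":
            package = fields[3].split(":")[0]
            result[package] = (fields[4], fields[5])
    return result


if __name__ == "__main__":
    log = Path("/var/log/dpkg.log")
    if log.exists():
        upgrades = upgrades_from_dpkg_log(log.read_text(errors="replace"))
        for package, (old, new) in sorted(upgrades.items()):
            # Only show Proxmox-related packages (filter is an assumption).
            if package.startswith(("pve", "proxmox")):
                print(f"{package}: {old} -> {new}")
```

New kernel packages are usually logged as "install" rather than "upgrade", so they will not show up here; the old kernel version is better taken from the boot logs.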


Specs of the testbench that had this issue
MOBO: X470D4U
CPU: AMD Ryzen 7 5800X
RAM: 128GB - tested with no errors on memtest

Current pve-manager version after update - stable for approx. 12 hours
2025-04-22 00:00:10
pve-manager/8.4.1/2a5fa54a8503f96d

Previous pve-manager version before update - randomly getting stuck
2025-04-21 15:57:10
pve-manager/8.3.0/c1689ccb1065a83b

Current kernel version after update - stable for approx. 12 hours
2025-04-22 00:00:10
Linux 6.8.12-9-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-9 (2025-03-16T19:18Z)

Previous kernel version before update - randomly getting stuck
2025-04-21 15:57:10
Linux 6.8.12-4-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-4 (2024-11-06T15:04Z)
 
Just a random guess, but can you try disabling C-states in the BIOS?
Some users with AMD CPUs were able to resolve their issues by doing so.
 
I'll try that later today: I will install the same version on an identical setup and check whether it crashes, then disable C-states and update you on the issue. Thanks!
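
To confirm that the BIOS setting actually changes anything, the idle states the kernel still exposes can be read from sysfs under /sys/devices/system/cpu/cpu0/cpuidle. A small sketch, with list_idle_states being a name I made up; which states appear depends on the cpuidle driver and firmware, and the directory may be missing entirely.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir: Path) -> list[tuple[str, bool]]:
    """Return (name, disabled) for each stateN directory under cpuidle_dir."""
    states = []
    state_dirs = [p for p in cpuidle_dir.glob("state*") if p.name[5:].isdigit()]
    # Sort numerically so that state10 comes after state9.
    for state in sorted(state_dirs, key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        # The "disable" file contains 0 when the state is allowed.
        disabled = (state / "disable").read_text().strip() != "0"
        states.append((name, disabled))
    return states


if __name__ == "__main__":
    cpu0 = Path("/sys/devices/system/cpu/cpu0/cpuidle")
    if cpu0.is_dir():
        for name, disabled in list_idle_states(cpu0):
            print(f"{name}: {'disabled' if disabled else 'enabled'}")
    else:
        print("no cpuidle directory exposed for cpu0")
```

Running it before and after the BIOS change should show whether the deeper states (anything beyond POLL and C1) have gone away.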