Sudden node freezes on Ryzen 3700X / ZFS, no logs

Norman Uittenbogaart

I have a node that has been running for 6 years without any issues.
Recently, it started freezing randomly every few minutes or hours.


Hardware:
  • Ryzen 3700X on Asrock X470D4U
  • OS on SSD (ZFS mirror pool of 4 HDDs + 900P Optane)

Symptoms / Attempts:
  • Initially freezes occurred every ~5 minutes.
  • Reinstalled Proxmox VE 9.0-3; freezes now occur every few hours.
  • BIOS: Disabled C-States, no significant effect.
  • Tried different kernels (older and latest 6.17), no difference.
  • Power supply: Tried a separate PSU for main system and drives, no effect.
  • ZFS tuning: L2ARC/SLOG temporarily disabled; system still freezes.
  • kdump loaded but no crash dump is created, indicating a complete hang rather than a kernel panic.

Other notes:
  • ECC RAM (should report errors)
  • ZFS is latest version from Proxmox VE
  • No logs are generated at all; system just hangs completely.
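
Since the ECC RAM "should report errors", one thing that can be checked after each reboot is whether the kernel's EDAC driver has counted any. The following is only a minimal sketch, assuming the EDAC driver for this platform is loaded and exposes the usual `ce_count` / `ue_count` files under `/sys/devices/system/edac/mc`; the function name `read_edac_counts` is made up for illustration. The counters reset on reboot, so a hard hang may still leave nothing behind.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} per EDAC memory controller.

    "ce" = corrected errors, "ue" = uncorrected errors.
    An empty dict means no EDAC memory controller directory was found
    (assumption: driver not loaded, or ECC reporting not available).
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # file missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found")
    for name, c in result.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

A non-zero corrected count would point towards RAM or the memory controller; an empty result only means the kernel is not reporting, not that the memory is fine.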

Questions:
  • What else could be causing these sudden freezes?
  • Could it be hardware-related, ZFS/tuning, or something else?

Any suggestions for debugging would be greatly appreciated.
 
Looking around the internet and here in the forum, it almost looks like something has changed in how the latest kernels handle C-states, and it doesn't work well with Ryzen.
As mentioned above, we had a stable system for over 6 years, and this week it started freezing.
Disabling everything above C-state 1 seems to have fixed it, as the node has now been running stable again for almost 24 hours.

It took about 30 reboots of trial and error to find it though.
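
To verify which idle states the kernel actually sees after a BIOS change, the cpuidle entries in sysfs can be listed. This is a rough sketch, assuming the standard Linux cpuidle layout (`cpuidle/stateN/name`, `disable`, `usage`) under `/sys/devices/system/cpu/cpu0`; the helper name `list_idle_states` is hypothetical. States that are switched off in the BIOS may simply not show up in the list at all.

```python
from pathlib import Path


def list_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return the cpuidle states of one CPU, ordered by state index.

    Each entry holds the state name, whether it is disabled in the kernel,
    and how often it has been entered ("usage"), as far as sysfs exposes them.
    """
    idle_dir = Path(cpu_dir) / "cpuidle"
    if not idle_dir.is_dir():
        return []

    def read(path):
        try:
            return path.read_text().strip()
        except OSError:
            return None

    states = []
    for state in idle_dir.glob("state[0-9]*"):
        suffix = state.name[len("state"):]
        if not suffix.isdigit():
            continue  # skip anything that is not stateN
        usage = read(state / "usage")
        states.append({
            "index": int(suffix),
            "name": read(state / "name"),
            "disabled": read(state / "disable") == "1",
            "usage": int(usage) if usage and usage.isdigit() else None,
        })
    return sorted(states, key=lambda s: s["index"])


if __name__ == "__main__":
    for s in list_idle_states():
        flag = "disabled" if s["disabled"] else "enabled"
        print(f"state{s['index']}: {s['name']} ({flag}), entered {s['usage']} times")
```

If deeper states are still listed and their usage counters keep climbing, the BIOS setting did not take effect the way it was intended.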
 
Or your hardware has degraded after several years of use.
Don't dismiss hardware issues.
If C-states were generally a problem on Ryzen CPUs, you would see a lot more threads about similar issues.
It might be a combination of kernel, hardware and firmware.
One thing you could try, if you have the time, is to check the BIOS of the mainboard.
Is it the latest one available from ASRock, with the latest AGESA available for the processor?
The second candidate is the power supply.
Those things age, and depending on the original quality of the power supply, it may no longer cope with load changes when C-states are active (unstable voltages).
If you have another power supply, give it a try to see if the behaviour changes.
These are just my ideas on some possible factors that could play a role here.
 
Hi, thanks for your reply!
I did replace the power supply as a test; unfortunately that wasn't the issue, as the problems remained.
Then I updated the BIOS, which included the AGESA.
Unfortunately that didn't solve it either.
Disabling the C-states did help, in that the crashes were no longer immediate but took a few hours, and mostly happened when the server went idle.
But it wasn't until I disabled PBO that the server became completely stable.