PVE freezes, needing hard reset

RoelVB

New Member
Jan 20, 2025
Hi all,

I've a very annoying issue I can't seem to figure out.
One of my PVE hosts freezes every once in a while (not while under a heavy load). Currently it seems to be happening twice a day.

journalctl doesn't show anything around the time of freezing.

The issue started happening after adding two SATA SSDs to the "local-zfs"/"rpool" pool. Initially it was a single mirror; now it's two mirrors in one pool.
[Attached image: 1742052553827.png]

According to SMART the SSDs are working fine.

I'm not really sure what to do/check here. My initial thought would be to remove the recently added SSDs, but I cannot think of a reasonable way of doing this, since they're part of the pool now.

Any advice is really appreciated.

Thanks in advance!
 
You can remove a disk from the pool or just yank it, that’s what it is intended for.

You can’t always trust SMART ‘ok’ status, you have to check the output of smartctl -a /dev/… and even then it may ‘lie’ on some cheap devices.
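As a rough illustration of looking past the overall SMART "ok" status, here is a minimal Python sketch that scans the attribute table printed by smartctl -a and reports watched counters with a non-zero raw value. The choice of attributes in WATCHED and the function name are assumptions for illustration, not an official health check, and attribute names vary between vendors.

```python
import re

# Attributes to flag when their raw value is non-zero.
# This selection is an assumption for illustration; names differ per vendor.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Uncorrectable_Error_Cnt",
    "UDMA_CRC_Error_Count",
}

# One row of the ATA attribute table printed by `smartctl -a`:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+(\S+)\s+(\d+)"
)


def suspicious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    hits = {}
    for line in smartctl_output.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines, blank lines, other report sections
        name = match.group(2)
        raw = int(match.group(7))
        if name in WATCHED and raw > 0:
            hits[name] = raw
    return hits
```

The text passed in would be the captured output of smartctl -a /dev/… for each drive. A rising UDMA_CRC_Error_Count usually points at cabling or the SATA link rather than the flash itself, so it is worth comparing readings over time rather than judging a single one.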

What is on the console when it crashes/halts? A crash without a kernel dump or messages typically points to hardware issues. Is it truly ‘frozen’, or is it waiting for disk, or is the network just gone?
 
Thank you for your reply @guruevi.

You can remove a disk from the pool or just yank it, that’s what it is intended for.
I might try this at some point to make sure it's not the SSDs (although running in a degraded state makes me a little anxious).

You can’t always trust SMART ‘ok’ status, you have to check the output of smartctl -a /dev/… and even then it may ‘lie’ on some cheap devices
Sorry, I meant that I did check using smartctl -a.

What is on the console when it crashes/halts? A crash without a kernel dump or messages typically points to hardware issues. Is it truly ‘frozen’, or is it waiting for disk, or is the network just gone?
There's nothing on the console. It's still at the login screen.
And there's no response at all, nothing on input and no network connection.



Currently it seems like it might be an overheating issue with the SSDs. IMO that should not completely freeze the system without any message.
But yesterday I made sure the SSDs get a little more airflow and it has been stable since then. Fingers crossed!
 
I just got a used Dell OptiPlex 3070 Micro to supplement my cluster and have been experiencing something similar.

Within a couple of hours (sometimes just minutes) after boot, I see the following symptoms: no response on the NIC, no screen output (or perhaps just a black screen; I haven't had the screen on when it happens and only turn it on once I notice the system isn't working), and finally the case is noticeably warmer to the touch than when the system is running without issue.

I thought it might happen due to the somewhat widely spread (?) issue of the wrong Realtek NIC driver being selected (r8169 instead of r8168), since the first thing I noticed was that I lost connectivity, but after noticing the other symptoms I'm not so sure.
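To confirm which driver is actually bound to the NIC, the driver symlink in sysfs can be read. A minimal sketch follows; the function name and the sysfs_root parameter (there only to make it testable) are assumptions for illustration, and the interface name has to be whatever the host uses.

```python
import os


def bound_driver(iface, sysfs_root="/sys/class/net"):
    """Return the name of the kernel driver bound to iface, or None if unknown."""
    # /sys/class/net/<iface>/device/driver is a symlink to the driver directory,
    # for example .../bus/pci/drivers/r8169
    link = os.path.join(sysfs_root, iface, "device", "driver")
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        # interface missing, virtual interface without a device, or no driver bound
        return None
```

Calling bound_driver with the host's interface name returns a string such as r8169 or r8168, which settles the driver question before spending time on it.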

The system has no particular load when it happens, sometimes none at all. memtest86+ reports no errors on the RAM. No journalctl logs 5 minutes before or after the freeze happens (I know quite precisely when the freeze happens as I have Uptime Kuma pinging it regularly).
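When the freeze time is known from monitoring, the journal check can be made repeatable by building the journalctl call for a fixed window around that timestamp. The sketch below is only an illustration: the function name and defaults are assumptions, and reading the previous boot only works if the journal is stored persistently.

```python
from datetime import datetime, timedelta


def journal_window_cmd(freeze_time, minutes=5, previous_boot=True):
    """Build a journalctl command covering +/- `minutes` around freeze_time."""
    fmt = "%Y-%m-%d %H:%M:%S"  # timestamp format accepted by --since/--until
    since = (freeze_time - timedelta(minutes=minutes)).strftime(fmt)
    until = (freeze_time + timedelta(minutes=minutes)).strftime(fmt)
    cmd = ["journalctl", "--since", since, "--until", until, "--no-pager"]
    if previous_boot:
        # after a hard reset the interesting entries belong to the previous boot
        cmd.append("--boot=-1")
    return cmd
```

The returned list can be passed to subprocess.run. An empty result for that window, as described above, is itself a hint that the kernel never got the chance to log anything.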

I thought that maybe it was a power management issue, so my latest attempt at debugging the issue has been to disable C states, SpeedStep and ASPM in the BIOS/UEFI. Right now the system has been running for a record 6 hours without downtime.

The reason I'm responding to your post is that I found out the system has a cheap Kioxia NVMe SSD (that's mounted in the WiFi M.2 slot btw?). It's the only non-factory installed component in the system and I find it plausible that it might have problems with supporting different power states properly.

If disabling power management features works I'm of course happy - but let's see. When I have the patience I may try to enable some of the power management features one-by-one and see if one of them triggers the issue. Right now I'm thinking ASPM because of the PCIe based SSD...
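When re-enabling features one by one, it helps to verify from the running system what the kernel thinks the ASPM policy is. On Linux the active policy is shown in square brackets in /sys/module/pcie_aspm/parameters/policy. A minimal sketch to read it follows; the function names are assumptions for illustration.

```python
def active_aspm_policy(text):
    """Return the bracketed (active) policy from the sysfs policy string."""
    # Typical content: "[default] performance powersave powersupersave"
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_aspm_policy(path="/sys/module/pcie_aspm/parameters/policy"):
    """Read the active ASPM policy, or None if the file is not available."""
    try:
        with open(path) as handle:
            return active_aspm_policy(handle.read())
    except OSError:
        return None
```

Noting the reported policy after each BIOS change gives a record of which combination was actually in effect when a freeze did or did not occur.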

Did better cooling fix your issue? Which SSDs have you installed?

UPDATE 25-03-2025
For posterity: the system has now been running for more than 3 days without issues, so in my case it was for sure a power management issue. I haven't yet validated which of the 3 settings I changed actually made the difference.
 
Did better cooling fix your issue? Which SSDs have you installed?
I haven't had any issues after making sure the SSDs get some more airflow.

I have quite a few SSDs in my system, but the ones that seem to have been causing my issue are Samsung Evo (860/870 1TB) SATA SSDs.
 
Now I'm confused. The system froze again, but the SSDs didn't exceed 32 degrees.
So it seems I'm looking in the wrong place.

It still seems to be a temperature-related issue, but I have no clue how to investigate this any further. I don't see any excessive temperatures.
The reason why I'm looking for something temperature related is that when I leave the front panel open, it doesn't freeze.
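One way to investigate further would be to log every temperature sensor the kernel exposes at short intervals and force each line to disk, so the last readings before a freeze survive the hard reset. The sketch below reads the standard hwmon sysfs files (values are in millidegrees Celsius). The function names, the log format, and the 10-second interval are assumptions for illustration, and which sensors show up depends on the loaded drivers (SATA drives, for example, only appear if the drivetemp module is loaded).

```python
import glob
import os
import time


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM": degrees_celsius} for all readable sensors."""
    readings = {}
    for chip in sorted(glob.glob(os.path.join(root, "hwmon*"))):
        chip_id = os.path.basename(chip)
        try:
            with open(os.path.join(chip, "name")) as handle:
                chip_name = handle.read().strip()
        except OSError:
            chip_name = "unknown"
        for path in sorted(glob.glob(os.path.join(chip, "temp*_input"))):
            try:
                with open(path) as handle:
                    milli = int(handle.read().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now
            sensor = os.path.basename(path)[: -len("_input")]
            readings[f"{chip_id}:{chip_name}/{sensor}"] = milli / 1000.0
    return readings


def log_forever(logfile, interval=10):
    """Append one timestamped line per interval and fsync it to disk."""
    with open(logfile, "a") as out:
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            temps = read_hwmon_temps()
            fields = " ".join(f"{key}={value:.1f}" for key, value in temps.items())
            out.write(f"{stamp} {fields}\n")
            out.flush()
            os.fsync(out.fileno())  # make sure the line survives a hard reset
            time.sleep(interval)
```

Comparing the last logged line before a freeze with the readings taken while the front panel is open might show which component, if any, runs hotter with the case closed.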

I'm using the following hardware:
  • Mainboard: ASUS Pro WS TRX50-SAGE
  • CPU: Threadripper 7960X
  • RAM: 4x Kingston Fury Renegade Pro 32GB
  • PSU: Seasonic 1200W
  • ASUS IPMI Expansion card (this is still operational while frozen and can be used to reset the system)
  • Storage
    • 4x Samsung Pro NVMe (through ASUS Hyper M.2 card)
    • 4x Samsung Evo SATA (on-board SATA)
      • The freezing started to happen after adding 2 SSDs to this ZFS pool!
    • 2x SK Hynix NVMe (on-board M.2)
  • Adaptec 7 series RAID controller (passed through to VM)
    • 3x 10TB SATA HDD
    • 3x 12TB SATA HDD
  • nVidia RTX A4000 (passed through to VM)
 
This problem is not related to your SSDs; it is a known Proxmox issue. It was nearly fixed, but it seems to have come back. I'm also seeing the bug again on many servers...
 
Check the Samsung SSD firmware versions. There are models out there with a temperature bug that causes them to report higher temperatures than are actually present, and that is what causes the system to freeze.
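To collect the firmware versions across several drives, the model and firmware lines from smartctl -i can be pulled out with a few lines of Python. This is a minimal sketch: the function name is an assumption, it only handles the "Device Model" / "Firmware Version" fields that smartctl prints for SATA drives, and it does not know which firmware versions are affected.

```python
def model_and_firmware(smartctl_info):
    """Return (device_model, firmware_version) parsed from `smartctl -i` output."""
    fields = {}
    for line in smartctl_info.splitlines():
        # Lines look like "Firmware Version: XXXXXXXX"; split at the first colon.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # Either entry is None when the field is missing (e.g. NVMe drives use other labels).
    return fields.get("Device Model"), fields.get("Firmware Version")
```

The resulting pairs can then be compared against the vendor's firmware release notes for each model.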