Proxmox freezing after a few hours

aurelle

New Member
Jan 18, 2025
Hi everyone,

About a year ago, my Proxmox VE 7 instance started freezing regularly without any clear cause, and I’ve been unable to resolve the issue despite extensive troubleshooting.

The Problem

When the system freezes, the following happens:
  • The hypervisor and all VMs become unreachable from the network.
  • If a GPU is connected, the virtual console freezes, and the TTY’s blinking cursor stops.
  • The machine becomes completely unresponsive, requiring a hard reset.
Freezing occurs after a few hours, sometimes days, but without any discernible pattern or event triggering it.

What I’ve Tried So Far

Here’s a list of everything I’ve done to diagnose and resolve the issue:
  1. Upgraded to Proxmox VE 8 and ensured all components are updated to the latest stable versions.
  2. Updated the motherboard firmware to the latest non-beta version.
  3. Installed kdump to catch kernel panics, but no evidence of kernel panics has been recorded.
  4. Ran MemTest86 for a week—no errors reported.
  5. CPU Stress Testing: Booted into a live Linux environment and ran stress-testing tools (mprime) for a week with no crashes or overheating.
  6. Drive Checks:
    • Verified all drive health via S.M.A.R.T. data—everything appears normal.
    • Unplugged all SATA drives except the one running Proxmox.
  7. Reinstalled Proxmox VE on a new drive.
  8. Swapped or removed both NVMe drives.
  9. Moved the machine to a different location in the house to rule out electrical issues.

Observations

  • The issue occurs even with a clean Proxmox installation without VMs or containers, ruling out any specific VM as the cause.
  • Changing hardware components seems to slightly affect the frequency of the freezing but doesn’t solve the problem.
  • CPU hardware error logs appeared in system logs months before and after the freezing began but have not reappeared in the last 50+ boots.

System Specifications

Here are the core components that I haven’t been able to replace:
  • CPU: AMD Ryzen 9 5950X
  • RAM: Kingston Fury 2400 MHz DDR4 (4x32GB)
  • Motherboard: ASRock B550 Taichi (specs)
I’ve reached a dead end and would greatly appreciate any advice or suggestions for further troubleshooting. Have I missed anything, or is there something else I should try?

Thank you in advance for your time and support!
 
Hello aurelle! Please check the journal from shortly before a freeze by calling journalctl --since with a time at least 30 minutes before the server froze. It may show log entries that help you debug further; if in doubt, you can post them here.
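For anyone following along, here is a minimal Python sketch of how that time window could be turned into a command. The function name build_journalctl_cmd and the 30-minute default are illustrative assumptions, not part of any Proxmox tooling; --since, --until and --no-pager are standard journalctl options.

```python
import shlex
from datetime import datetime, timedelta

def build_journalctl_cmd(freeze_time: datetime, minutes_before: int = 30) -> list[str]:
    """Build a journalctl command covering the window before a freeze."""
    fmt = "%Y-%m-%d %H:%M:%S"  # timestamp format accepted by --since/--until
    since = freeze_time - timedelta(minutes=minutes_before)
    return [
        "journalctl",
        "--since", since.strftime(fmt),
        "--until", freeze_time.strftime(fmt),
        "--no-pager",
    ]

if __name__ == "__main__":
    # Hypothetical freeze time; replace it with the time your server stopped responding.
    cmd = build_journalctl_cmd(datetime(2025, 1, 23, 22, 17, 1))
    print(shlex.join(cmd))
```

The printed command can be pasted into a shell on the host. If the journal is persistent, journalctl -b -1 -e is another way to jump straight to the end of the previous boot.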

Also, I'm not sure yet if that's relevant, but do you happen to remember what CPU hardware error logs you were seeing? Since you're mentioning that they stopped showing up, did you change something (software update, hardware change, etc.) around the time they stopped?
 
I have the same problem. But I can't find any errors in the journal:

Jan 23 19:02:02 pve systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Jan 23 19:02:02 pve systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Jan 23 19:02:02 pve systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Jan 23 19:02:02 pve systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Jan 23 19:17:01 pve CRON[417832]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 23 19:17:01 pve CRON[417833]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 23 19:17:01 pve CRON[417832]: pam_unix(cron:session): session closed for user root
Jan 23 19:49:09 pve pvestatd[1057]: auth key pair too old, rotating..
Jan 23 20:17:01 pve CRON[438368]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 23 20:17:01 pve CRON[438369]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 23 20:17:01 pve CRON[438368]: pam_unix(cron:session): session closed for user root
Jan 23 21:00:03 pve pvescheduler[452895]: <root@pam> starting task UPID:pve:0006E920:12158FDC:67929FC3:vzdump::root@pam:
Jan 23 21:00:03 pve pvescheduler[452896]: INFO: starting new backup job: vzdump 100 104 103 107 102 --mailnotification failure --notes-template '{{guestname}}' --quiet 1 --mode snapshot --compress zstd --fleecing 0 --prune-backups 'keep-last=3' --notification-mode legacy-sendmail --storage NetBackup
Jan 23 21:00:03 pve pvescheduler[452896]: INFO: Starting Backup of VM 100 (qemu)
Jan 23 21:01:10 pve pvescheduler[452896]: INFO: Finished Backup of VM 100 (00:01:07)
Jan 23 21:01:10 pve pvescheduler[452896]: INFO: Starting Backup of VM 102 (lxc)
Jan 23 21:02:06 pve pvescheduler[452896]: INFO: Finished Backup of VM 102 (00:00:56)
Jan 23 21:02:06 pve pvescheduler[452896]: INFO: Starting Backup of VM 103 (lxc)
Jan 23 21:02:08 pve pvescheduler[452896]: INFO: Finished Backup of VM 103 (00:00:02)
Jan 23 21:02:08 pve pvescheduler[452896]: INFO: Starting Backup of VM 104 (lxc)
Jan 23 21:03:10 pve pvescheduler[452896]: INFO: Finished Backup of VM 104 (00:01:02)
Jan 23 21:03:10 pve pvescheduler[452896]: INFO: Starting Backup of VM 107 (lxc)
Jan 23 21:03:37 pve pvescheduler[452896]: INFO: Finished Backup of VM 107 (00:00:27)
Jan 23 21:03:37 pve pvescheduler[452896]: INFO: Backup job finished successfully
Jan 23 21:17:01 pve CRON[458883]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 23 21:17:01 pve CRON[458884]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 23 21:17:01 pve CRON[458883]: pam_unix(cron:session): session closed for user root
Jan 23 22:17:01 pve CRON[479206]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 23 22:17:01 pve CRON[479207]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 23 22:17:01 pve CRON[479206]: pam_unix(cron:session): session closed for user root
-- Reboot --
Jan 24 07:33:12 pve kernel: Linux version 6.8.12-5-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-5 (2024-12-03T10:26Z) ()
Jan 24 07:33:12 pve kernel: Command line: initrd=\EFI\proxmox\6.8.12-5-pve\initrd.img-6.8.12-5-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
After the hourly cron execution (there are no hourly tasks) the system freezes completely.
Only a hard reset is possible.

Any idea?
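A note on reading an excerpt like the one above: the last line before "-- Reboot --" only tells you the system was still alive at that point, so the freeze happened somewhere between that entry and the next one that should have been logged. The sketch below is a rough Python helper for pulling those two timestamps out of saved journal text. The names parse_stamp and freeze_window are hypothetical, and the year has to be supplied as an assumption because this journal output format omits it.

```python
from datetime import datetime

REBOOT_MARKER = "-- Reboot --"

def parse_stamp(line: str, year: int) -> datetime:
    # Lines start with "Mon DD HH:MM:SS" (15 characters) and carry no year.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def freeze_window(lines, year):
    """Return (last entry before the reboot marker, first entry after it), or None."""
    for i, line in enumerate(lines):
        if line.strip() == REBOOT_MARKER and 0 < i < len(lines) - 1:
            return parse_stamp(lines[i - 1], year), parse_stamp(lines[i + 1], year)
    return None

if __name__ == "__main__":
    excerpt = [
        "Jan 23 22:17:01 pve CRON[479206]: pam_unix(cron:session): session closed for user root",
        "-- Reboot --",
        "Jan 24 07:33:12 pve kernel: Linux version 6.8.12-5-pve",
    ]
    last_alive, first_boot = freeze_window(excerpt, 2025)
    print("last entry before freeze:", last_alive)
    print("first entry after reset: ", first_boot)
    print("unlogged gap:            ", first_boot - last_alive)
```

On a host that only logs an hourly cron entry while idle, the gap mostly shows how little was being logged, so the cron job itself is not necessarily related to the freeze.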
 
Hello smidi! I'm not sure if both of you have exactly the same issue, but aurelle has already posted some suggestions on what you could try. Please try some of them to see if they help.

Without further information about the cause of the freezes, it's hard to pinpoint the issue precisely. There are all kinds of reasons why freezes could happen.

In addition to what aurelle tried, you can also try the opt-in Kernel 6.11 and see if that helps.
 
Also, I'm not sure yet if that's relevant, but do you happen to remember what CPU hardware error logs you were seeing?

I have an example of those from last June. At some point they stopped appearing, and they didn't seem to trigger a reboot or a freeze when they occurred.
Aside from that, there are no interesting logs at all.

Code:
Jun 28 05:53:26 proxmox kernel: mce: [Hardware Error]: CPU 18: Machine Check: 0 Bank 5: bea0000001000108
Jun 28 05:53:26 proxmox kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffa9660592 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Jun 28 05:53:26 proxmox kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1719546800 SOCKET 0 APIC 5 microcode a201205
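In case it helps anyone reading lines like these, the 64-bit value after "Bank 5:" is the machine-check status register, and its top bits can be decoded by hand. Below is a minimal Python sketch that only covers the architectural flag bits I'm confident about (bits 63 down to 57) plus the low 16-bit error code. The name decode_mce_status is hypothetical, and the bank-specific meaning of the remaining bits is deliberately left out.

```python
# Bit positions of the architectural flags in the x86 machine-check
# status register (the 64-bit value printed after "Bank N:").
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds valid data
    "PCC": 57,    # processor context may be corrupt
}

def decode_mce_status(hex_status: str) -> dict:
    """Decode the flag bits and error code from a hex status string."""
    status = int(hex_status, 16)
    decoded = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    decoded["error_code"] = hex(status & 0xFFFF)  # low 16 bits: error code
    return decoded

if __name__ == "__main__":
    # Status value taken from the log lines above.
    for name, value in decode_mce_status("bea0000001000108").items():
        print(f"{name}: {value}")
```

This is only a reading aid. Tools that know the vendor-specific banks will give a more complete interpretation than this sketch does.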

That said, I'm coming back with good news!

By digging through a few forums I found that my specific processor, the 5950X, is known to sometimes produce this kind of issue. See here.

I went into the BIOS and configured a positive PBO offset of 4. It has been more than a day and Proxmox has been running perfectly stable. I still need more time to be sure that this is what really fixed it, but Proxmox used to freeze every hour before that, so it looks good!
 
Nice, thanks for the info! Glad to hear that you found a solution! Interesting: I'm using an AMD Ryzen 5950X at home myself and have never had issues with Linux across many kernel versions over the years (currently using kernel 6.12.10). I might have been lucky in the silicon lottery, I guess :)

Just in case this happens again, I'm wondering whether it could help to install the microcode updates for your CPU (which might be newer than the microcode delivered by your motherboard's BIOS). For that you would need to enable the non-free-firmware Debian repository and install the CPU-vendor specific microcode package, in your case using apt install amd64-microcode.
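To check whether the package actually changed anything, the microcode revision can be compared before and after (the new microcode is only loaded at boot, so a reboot is needed in between). The Python sketch below assumes the usual x86 Linux /proc/cpuinfo layout with a "microcode" field per logical CPU; the function name microcode_revisions is hypothetical.

```python
def microcode_revisions(cpuinfo_text: str) -> set[str]:
    """Collect the distinct microcode revisions listed in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions
```

On the host, calling microcode_revisions(open("/proc/cpuinfo").read()) should return a single revision. The MCE log earlier in this thread reported microcode a201205, so a different value after the update would mean a newer revision is active.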

But feel free to leave your system as it is if it runs stable ;)
 
Well, it froze again after 2 days. I tried increasing the offset to 6 as suggested in the article.
I also did what you suggested for the microcode.

If that doesn't fix it, I will assume it's a motherboard issue, as I have tested everything else extensively. I don't think the CPU is bad, especially since it ran for more than a year without issues before that.

Just being curious, what motherboard are you using @l.leahu-vladucu ? I'm considering this (if I can find it).
 
Sorry to hear that. Something that comes to mind is to try another kernel version (e.g. the opt-in Kernel 6.11), but given the errors you are seeing (or the lack of errors), I doubt that this would help. It can't hurt to try, though.

Otherwise you can try swapping components (CPU, motherboard) like you suggested and see if that helps.

Just being curious, what motherboard are you using @l.leahu-vladucu ? I'm considering this (if I can find it).
My AMD Ryzen 5950X is used for a desktop computer, not as a server. That being said, I'm using a Gigabyte X570 Aorus Master with the latest F39d BIOS and microcode updates, and it works really well. Currently using kernel 6.12.10, but I never had issues with any previous kernels either. I have only slightly adapted the BIOS configuration, but nothing around voltages (with the exception of enabling XMP for memory).
 
I was able to swap the memory to definitively rule it out, and it still crashed after that, so that's one thing.

I just ordered a new motherboard, the one I linked earlier. It should arrive by Friday, so I will keep you updated on whether it fixes the problem. Since increasing the PBO offset seems to matter (it defers the crash for some time), I believe a faulty motherboard would make some sense.
 
I'm done replacing the motherboard and the Proxmox server is back up and running. Now let's see what happens.
 
So after a few weeks I can update this post with what seems to be a fix to my issue. (TL;DR: CPU problem)

The new motherboard didn't fix the problem. The Proxmox server kept crashing again and again, in the same fashion as with the old board.

Since PBO+4 seemed to have slightly improved the situation I thought I could try increasing the offset to PBO+6. Well, since I did that, the server has been running more stable than ever. 15 days so far, unprecedented.

I'm not sure what to expect in the future, or whether it is going to degrade further. Anyway, this seems to corroborate what is being discussed in this thread on the AMD forums: https://community.amd.com/t5/pc-pro...y-unresponsive-crashing-hard-reset/m-p/663575
 
Thanks for the update. Since the CPU doesn't seem to be stable even without overclocking, I would usually recommend - if possible - to send it back if you still have warranty for it. However, as far as I know, using PBO voids the warranty.
 
Unfortunately the CPU was no longer under warranty, which is why I tried PBO in the first place.
Apparently people with this issue have had great success with PBO once they found the right offset for them. A little bit unsettling but hey, it works.