Persistent VM instability with Ryzen 9 9950X3D and Proxmox 8/9

No idea how similar our problems are, but I've also had these terrible VM hangs for about half a year.
I am running a 3900X on an MSI X570 Unify with 128GB. It never gave me issues as a personal PC, and then as a Proxmox host for a year; then I upgraded Proxmox some time this year and it all went downhill.
I reinstalled multiple times, but it would always start hanging VMs after a while. Nothing in dmesg/journalctl.

What fixed the issue was pinning the kernel to 6.5.13-6-pve.
Rock solid after that... until I unpinned it today and upgraded to 9.0 :).
Currently on Linux 6.14.11-1-pve (2025-08-26T16:06Z) and VMs start hanging after about an hour again.

Currently looking to see if I can find another kernel that is more stable for me and works on 9.0.
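For reference, pinning a specific kernel on Proxmox can be done roughly like this (a sketch, assuming the host is managed by proxmox-boot-tool; on plain GRUB setups the default menu entry has to be set by hand instead):

# show which kernels are installed and which one is pinned
proxmox-boot-tool kernel list
# pin the kernel to boot by default, then reboot
proxmox-boot-tool kernel pin 6.5.13-6-pve
reboot
# later, to return to the newest installed kernel
proxmox-boot-tool kernel unpin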
 
Try running your RAM at 3600 with the voltage the DOCP profile intends (1.25V?). A sign of the memory controller being right at the edge is when the initial memory training (after a CMOS reset) takes over 15 minutes.
If that does not help, try 2 modules in productive use for 2-3 days, maybe to be safe initially at 3600, and later you can also try the DOCP profile. If that fails as well, you can be 99% sure it's not the RAM.
Then try 2-3 days with ASPM off; some PCIe cards don't like it being too aggressive. If that works, you can try L0s only.
Tried 2 DIMMs at 1.20V (the modules are rated for 1.1V), and I still have VMs rebooting/locking after a few hours. I don't think this is related to the memory DIMMs, as I've tried several pairs with basically the same results. I'm going to try ASPM off per your suggestion.
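For reference, ASPM can also be forced off from the kernel side rather than the BIOS (a sketch, assuming a GRUB-booted Proxmox host; systemd-boot installs edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead):

# check the current ASPM policy and per-device link state
cat /sys/module/pcie_aspm/parameters/policy
lspci -vv | grep -i aspm
# add pcie_aspm=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then
update-grub
reboot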
 
The kernel version is the most evident pattern. With kernel 6.8 there were VM reboots and locks, but only after much more time, several days; with 6.14 it's a disaster and the same thing happens after a few hours. Excluding a hardware problem (and to be honest I'm at the end of that list, as the only thing I didn't change was the PSU), there is something broken between this kernel/OS and this hardware configuration.
 
I'm in a similar scenario with Proxmox 9 and AMD. Proxmox 8 was solid, but on the current version my VMs are restarting every few hours, and I'm not sure where to even begin to find logs that might help.
 
Can you share the hardware and software config? We can start drawing up a pattern. I now believe that this is a kernel/qemu-kvm problem.
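Something along these lines would already be useful (a sketch; replace <vmid> with the ID of a VM that crashes):

# package and kernel versions
pveversion -v
# config of the affected VM
qm config <vmid>
# warnings and errors from the current boot, last two hours
journalctl -b -p warning --since "2 hours ago"
# kernel messages that often accompany crashes/resets
dmesg -T | grep -iE "mce|oom|hung|nmi|reset"
# if the whole host went down, check the previous boot (needs a persistent journal)
journalctl -b -1 -p err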
 
No entries, ever.
Same here.
I pinned it back to 6.5.13-6-pve, but that somehow broke Docker inside the crashing VMs (it kept restarting containers), so I went back to 6.14.11-2-pve.
One VM is "solid" now; the other one keeps crashing when there's some load.
Might deploy new VMs and see if that fixes it.
 
No OOM entries, ever.

I have 5 different AMD servers with unique configurations, all with ECC and different hardware passthrough, but this is the only one with any issues.

ASRock X570 Creator - BIOS P5.61
AMD Ryzen 9 3900XT
Kingston 9965745-022.A00G 32GB * 4 = 128GB

6.14.11-1-pve
zfs-2.3.4-pve1

VMs that crash (passthrough hardware):

Ubuntu: LSI SAS2008, LSI SAS2116
Windows 11: GeForce RTX 2070, Wi-Fi 6 AX200

not sure what else to provide?
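One thing I could also check, since these boxes run ECC, is whether any corrected/uncorrected memory or machine-check events show up on the problem host (a sketch, assuming rasdaemon is installed; apt install rasdaemon if not):

# EDAC/memory error counters collected by rasdaemon
ras-mc-ctl --status
ras-mc-ctl --error-count
# or just grep the kernel log for MCE/EDAC events
journalctl -k | grep -iE "mce|edac"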
 
In my case, Windows, Linux and FreeBSD VMs end up rebooting or freezing, except for a few OSX debug VMs. No passthrough hardware whatsoever.
 
I replied to your post on Reddit the other day saying I'd had success running the memory at a lower speed. I was wrong, I hadn't...
No matter what I've done, I can't fix the freezing and crashing of the VMs. My friend has just built an almost identical system to mine:

9950X
2 x 48GB Corsair 6000MHz CL36 - CMK96GX5M2E6000Z36
ASUS TUF Gaming X870-Plus WiFi
1TB NVMe - SK Hynix Platinum P41
4TB NVMe - Samsung 990 Pro
Corsair RM850e 850W PSU (2025)

With one difference: I'm using the Asus TUF Gaming X870-Plus WiFi motherboard (on the latest BIOS, 1078), while he decided he didn't want to pay as much for a motherboard and got the Asus TUF Gaming B650-Plus WiFi (on the latest BIOS, 3281); he also opted for 2 x 64GB instead of my 2 x 48GB DIMMs. I've just had him run the command 'stress-ng --vm 8 --vm-bytes 80% --vm-keep --verify --timeout 10m --metrics-brief': mine gives errors 7/8, he gets none.
Looking at the BIOSes, his seems to have a slightly newer equivalent update than either the X870-Plus WiFi or the ProArt X870E-CREATOR WIFI. I also have an Asus ROG Strix X870E-E based gaming system that has the slightly newer-looking equivalent BIOS (I'm unsure about other vendors' motherboards and their respective BIOSes).
He's only just built his system and I'm trying to talk him into downgrading the BIOS (I have, and will reply when he's done so) to see if he has the same issues when running the equivalent BIOS (on the B650 board this looks to be 3279). I've tried everything that's been mentioned in this thread (and on the Reddit thread), and can only conclude that it's down to the X-series motherboards (regardless of vendor), or that it needs that slightly newer BIOS version.
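For anyone comparing, the BIOS version and date can also be read from the running host, which makes it easier to line up against what the vendor lists (a quick sketch; needs root):

dmidecode -s bios-version
dmidecode -s bios-release-date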
 
OK, so he tried the equivalent BIOS on the B650 and had no issue. Seems like it's just an X-series issue?
 
Hi @Luke_113, great to have a debug partner.

Didn't know about that stress-ng option, I will be using it. Just tried it now on kernel 6.8.12-13 with the 10m timeout and all the VMs passed; I'm going to run it with 60m next (command below). Have you tried kernel 6.8? It does seem more stable (but after a day or two they crash just the same), and that's the reason I'm more inclined towards a kernel/qemu/kvm issue with this CPU/chipset. However, I did notice one thing: I also have a Samsung 990 Pro and a Corsair PSU (Corsair HX1500i). Since you have two SSDs, have you tried taking the Samsung out completely? I know it's a long shot...
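The longer run is just the same command with a 60-minute timeout:

stress-ng --vm 8 --vm-bytes 80% --vm-keep --verify --timeout 60m --metrics-brief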
 
Did you guys use the CPU microcode updates (i.e. "apt install amd64-microcode" and reboot)?
Yes. I'll be honest, my knowledge of this is very limited, and most of what I've been trying in order to find a combination that works was based on what I could find online, which is obviously limited, but I've worked a lot with ChatGPT (which is surprisingly good) to help me iron out errors etc., and this was one of the things I did try.

@EventHorizon yeah, I've been following your threads daily to see if there has been any update. As per my previous post, this looks to me like a specific issue with X-class AMD motherboards. I've not tried removing the 990 Pro. I basically ended up here because I picked up a GMKtec EVO-T1 with the Intel 285H (great little box, btw) and had the same 1TB/4TB drives set up in that, but felt it wasn't powerful enough, hence I built the much more powerful 9950X-based system; I had zero issues using the drives in that setup, and no VM instability.
I tried looking for the 6.5.13-6-pve kernel that @proxiemoxie mentioned had been working previously. ChatGPT even had me check some Chinese university mirror, but I couldn't find what's needed, and I've not tried the 6.8.12-13 kernel you've mentioned being more stable. I can't tell you how many hours I've put into trying to work this out over the last 7 days, probably 30-odd hours, and tbh atm I'm just running game servers and have chucked W11 on the bare metal for the time being, hoping/waiting that this gets fixed ASAP, as I was really enjoying Proxmox but want to be able to use the system for the task I built it for, haha.
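On the microcode point, in case it's useful, the revision actually loaded can be checked like this (a quick sketch):

# microcode revision currently reported by the CPU
grep -m1 microcode /proc/cpuinfo
# whether the kernel applied an early microcode update at boot
journalctl -k | grep -i microcode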
 
@Luke_113: I would try without the Samsung SSD, as this problem must be systemic in the sense that it's a group of things causing it, not a single one. It could work great on the Intel platform but have some kind of transient fault within the AMD ecosystem. One other piece of news: today Asus released a new motherboard BIOS, at least for my board. Going to try it out tonight.
 
Well, I'm still getting reboots under Windows 11, a lot less, but they're still occurring, so I can only guess that it's BIOS-related then. Do let me know how the new BIOS works; I'm hoping that fixes this memory issue.