Sudden bulk stop of all VMs?

On our side:
- Nodes with Supermicro boards have been stable so far (32 days uptime).
- Nodes with a B650D4U and a suspicious serial number (M80-GC025XXXXXX) always end up rebooting, even with the latest BIOS/microcode/kernel parameters (max uptime 24 days).
- Nodes with a B650D4U older than 6 months and an unsuspicious serial number are stable (130+ days uptime).
- Nodes with a recent B650D4U and an unsuspicious serial number are maybe stable (23 days) with the latest microcode and kernel parameters, but 23 days is not enough to validate.

We have had no time to do CPU-related testing yet.
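Since stability seems to correlate with firmware, it may help to record which microcode revision each node actually has loaded; a generic way to check this (not a command from this thread) is:

grep -m1 microcode /proc/cpuinfo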
 
It's difficult to claim the issue is fixed until you see it isn't. When the reboot only happens after 12-14 days (in my case), you have to wait at least that long to know it's really fixed :)

Anyway, I changed the CPU of all my VMs to "qemu64" instead of "host" and also updated both of my nodes to the latest version. My problematic node now has an uptime of 7 days.
 
Thanks for sharing all those details!

One of our new Hetzner servers uses almost the same mainboard (ASRockRack B665D4U-1L) and has the same problem.
Our serial is M80-G4007900353, so somewhat below your highest good one, but that doesn't really say much, especially with the slightly different model.

Hetzner performed hardware tests yesterday with no findings (i.e. the hardware is considered OK), and we upgraded to kernel 6.8.8-4-pve and already had another reboot. I'll ask to be transferred to an ASUSTeK "Pro WS 665-ACE" or similar, which runs our other nodes smoothly.

Hetzner's response was "we don't usually do this, but we'll make an exception". The move did solve our reboots on that machine.
With some rescue-system magic (they have integrated on-the-fly installation of ZFS support), we were able to avoid a reinstallation and make the PVE install on the moved hard disks bootable.

Sorry for the late reply, I thought I had already posted that.


We also had similar problems on a 6-year-old NUC, but the verdict there, as of yesterday, is broken hardware. A few hours later that machine stopped even trying to boot.

On a third machine we seem to have solved the problem by switching the Debian and PBS VMs' network devices from virtio to e1000 (and no, network performance is still at full gigabit speed, around 930 Mbit/s).
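For reference, this kind of NIC model change can also be done from the CLI; a rough sketch, assuming VM ID 100 and bridge vmbr0 (keep the existing macaddr=... from the current config, otherwise the guest gets a new MAC address):

qm set 100 --net0 e1000,bridge=vmbr0,macaddr=XX:XX:XX:XX:XX:XX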
 
I now have an uptime of almost 16 days on my problematic node.
All I did was change the CPU type to qemu64 instead of host. I have one VM which needs host, or parts of its installation will not work. If I keep this VM with CPU type "host" on the problematic node, it will crash/restart. I moved this one VM to the other node, which has an older microcode version, and that node didn't reboot.

So I do believe that, in my case at least with an AMD Ryzen 9 5950X, it's related to the microcode version.

So if you still have issues, try setting the CPU type of all your VMs to qemu64.
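For anyone wanting to apply the same workaround from the CLI instead of the GUI, a minimal sketch (VM ID 100 is just an example; repeat per VM):

qm set 100 --cpu qemu64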
 
I feel this might actually span multiple issues that we are all experiencing.

On my end, I was struggling to get more than 5 days of uptime. I solved that through the method I described previously, but that still hasn't fixed things for me; instead, the issue now is that the machine reboots once it reaches around 28 days of total uptime.

This is annoying because it's very hard to replicate when each attempt to test a theory about a change takes that long :/

The only pattern we have is that it will reliably restart by itself around that time.
 
@fiona, long time no see.
After the instability came back, we did some digging and found a quite recent Reddit thread. People discussed setting some kmod options for KVM, and it seemed to improve their systems. We applied these as well and regained some stability. There also seems to be an open kernel bug filed by these folks.
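For reference, the option in question (kvm_amd.vls=0, see the bug report quoted further down) can be set persistently via a modprobe config; a generic sketch, not a command from this thread, and the file name is arbitrary:

echo "options kvm_amd vls=0" > /etc/modprobe.d/kvm-amd-vls.conf
update-initramfs -u

The update-initramfs step is only needed if the module is loaded from the initramfs; after a reboot the module picks up the new value.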

EDIT: I see @jeenam already referenced the stuff above; is there anything going on on your side? This does seem to affect quite a few people after all.
 
Glad to hear there is a workaround now! We are not actively working on this because, AFAIK, we don't have an affected CPU model available to reproduce/test with. And since it most likely is very low-level stuff, I'm not sure there would be much hope of fixing it without being an AMD (kernel) developer/expert.
 
I'm coming up on the magical reboot day (it happens around 28-30 days in). Is there anything I could do to help capture the reason for this reboot?
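One generic first step (my suggestion, not something discussed in this thread): right after the machine comes back up, check the kernel log of the previous boot:

journalctl -b -1 -k

With a sudden hardware-level reset there may be nothing logged at the moment of the crash at all.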
 
The kernel bug tracker has been updated - https://bugzilla.kernel.org/show_bug.cgi?id=219009

Status: RESOLVED CODE_FIX

Mario Limonciello (AMD) 2024-11-05 17:22:18 UTC
Thanks everyone for your feedback and testing.

The following change will go into 6.12 and back to the stable kernels to fix this issue. It is essentially doing the same effect that kvm_amd.vls=0 did.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=a5ca1dc46a6b610dd4627d8b633d6c84f9724ef0

Now to get the patch backported to the Proxmox kernel. Not sure who is involved with that process.
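As an aside, whether the kvm_amd.vls workaround is currently in effect on a host can be checked via the module parameter (a generic check, assuming the kvm_amd module is loaded):

cat /sys/module/kvm_amd/parameters/vls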
 
I'm not running any nested virtualisation workloads; do you think it's still possible for me to be affected by this?

It doesn't matter if you're running nested virtualization. If you use CPU type = host with a Zen 4 CPU and run a Windows VM, Proxmox will eventually cause the system to reboot.
 
Thanks for the reference to the kernel bug report! I proposed the fix for inclusion: https://lore.proxmox.com/pve-devel/20241106095535.2535-1-f.ebner@proxmox.com/T/#u
 
OK, I have applied the patch by appending the relevant files; the month-long countdown begins until I can report back on whether I get a random reboot.

Edit: also @jeenam, I can confirm that this issue exists even without Windows, because for the first 9 months or so of owning this CPU since launch I never ran anything Windows-based and still faced the issue. I pray this modification is the end of it and we can all walk away happy with working CPUs.
 

Thank you for confirming. Yes, I hope this fixes it and doesn't tank performance. Unlike the folks here running production workloads, the only thing this has kept me from doing is playing Halo Infinite, since CPU type = host is the only way Windows will pass the Easy Anti-Cheat checks.
 
As of now there are new 6.8.12-4 kernel packages available on our testing repos which include the referenced patch that should address this issue.
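For anyone who wants to try this, the test repository can be enabled with a sources.list line along these lines (my sketch, assuming PVE 8 on Debian Bookworm; adjust the suite for your release):

deb http://download.proxmox.com/debian/pve bookworm pvetest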

I added the testing repo to sources.list and performed an update to install the latest kernel from the test repository. It also updated the package pve-qemu-kvm to version 9.0.2-3. Now, when attempting to boot the Windows VM configured with GPU passthrough, it reports the following error and will not boot the VM:

TASK ERROR: can't reset PCI device '0000:03:00.0'

Someone reported in this forum thread that they believe the problem is caused by the updated pve-qemu-kvm package and suggested the following to downgrade the package version:

apt-get install pve-qemu-kvm=8.1.5-6

However, it appears the 8.1.5-6 package version will not work with the 6.8.12-4 kernel as it will report the following error when attempting to start a VM:

TASK ERROR: Installed QEMU version '8.1.5' is too old to run machine type 'pc-q35-9.0+pve0', please upgrade node '<hostname>'

Summary: I would test the new kernel from the testing repo, but it breaks GPU passthrough functionality, which is needed for the Windows VM that I am running.
 
Just found this thread that reports breakage of PCI Passthrough - https://forum.proxmox.com/threads/warning-updating-these-packages-broke-my-pci-passthrough.156848/

Hilariously, apt warned me that attempting to downgrade libpve-common-perl to v8.2.5 would remove the proxmox-ve package. I was too lazy to figure that one out, so I just nuked the entire system and started from scratch.

To keep libpve-common-perl from being upgraded again, I first installed v8.2.5 explicitly: apt-get install libpve-common-perl=8.2.5. Then I marked the package as held so it would not be updated: apt-mark hold libpve-common-perl. After that I performed apt dist-upgrade to install the 6.8.12-4 kernel from testing and update all other packages.
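For reference, the full sequence in command form (exactly the commands described above, nothing new):

apt-get install libpve-common-perl=8.2.5
apt-mark hold libpve-common-perl
apt dist-upgrade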

GPU passthrough is now working again. I will report back once I've tested kernel 6.8.12-4 to see whether it fixes the Zen 4 CPU bug.
 
However, it appears the 8.1.5-6 package version will not work with the 6.8.12-4 kernel as it will report the following error when attempting to start a VM:

TASK ERROR: Installed QEMU version '8.1.5' is too old to run machine type 'pc-q35-9.0+pve0', please upgrade node '<hostname>'
This sounds like you could try changing the VM's config at:
Select your VM on the left side > Hardware > select "Machine" > button "Edit" > tick "Advanced" > set the "Version" to an older one.

This will change the virtual hardware to an older version. Be aware that Windows may lose its activation on "hardware changes".
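The same can also be done from the CLI; a sketch, assuming VM ID 100 and pinning the q35 machine to version 8.1:

qm set 100 --machine pc-q35-8.1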
 

The problem was the busted libpve-common-perl v8.2.6 package that was recently released and borked PCI passthrough. See this thread.
 
Update on kernel 6.8.12-4, which was released to the testing repo: I'm typing this from a Windows VM running on a Proxmox server with that kernel installed. NO crashes. The problem appears to be fixed. I ran multiple programs that would previously cause the machine to reboot consistently. The machine is rock solid again.

If you're going to upgrade to test this, see my previous comments and DO NOT upgrade libpve-common-perl, or you may be in for a world of hurt.
 