Host system crash after VM power off/Reboot

WolfLink115

New Member
Nov 30, 2025
I am not sure whether this specific issue has already been asked, or whether this is the right place to ask it, so I apologize in advance.

Anyway, whenever I power off or restart a VM that has the GPU passed through, the host dmesg log starts flagging CPU soft lockup warnings, and the entire host becomes sluggish to the point of being severely unusable (the WebUI becomes unusable as well). I have tried researching the issue, and people online suggest it might be a GPU vendor-reset problem; however, dmesg doesn't mention anything about a failed vendor-reset. In fact, it reports that the vendor-reset completes without issues.
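
If it helps with reproducing this, the warnings can be followed live from a second machine over SSH, so they stay visible even once the host itself becomes unresponsive (a rough sketch; the hostname is a placeholder):

ssh root@proxmox-host 'dmesg --follow | grep -i "soft lockup"'   # the kernel prints lines like "watchdog: BUG: soft lockup - CPU#N stuck for Ns!"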

My specs are:
CPU: AMD Ryzen 9 7900X
GPU: AMD Radeon RX 7800 XT
RAM: Corsair Vengeance 64 GB DDR5 @ 6000 MT/s

VM Specs:
8 vCPUs
32 GB RAM
100 GB Storage (with other storage drives passed through to the VM)
GPU properly passed through (at least to my knowledge) with the latest AMD Drivers installed in the VM.
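
A quick host-side check of the binding, assuming the card sits at slot 03:00 as in the hostpci0 line below:

lspci -nnk -s 03:00   # both 03:00.0 (GPU) and 03:00.1 (HDMI audio) should report "Kernel driver in use: vfio-pci" before the VM starts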

/etc/pve/qemu-server/100.conf:
agent: 1
bios: ovmf
boot: order=scsi0;net0
cores: 8
cpu: host,flags=+ibpb;+virt-ssbd;+amd-ssbd;+hv-tlbflush
efidisk0: Windows:vm-100-disk-1,efitype=4m,ms-cert=2023,pre-enrolled-keys=1,size=4M
hostpci0: 0000:03:00,pcie=1
machine: pc-q35-10.1
memory: 32768
meta: creation-qemu=10.1.2,ctime=1764615215
name: WolfLink115-Win11
net0: e1000=BC:24:11:95:6B:C6,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: win11
scsi0: Windows:vm-100-disk-0,size=100G
scsi1: /dev/disk/by-id/ata-CT1000MX500SSD1_2323E6E12A7D,size=976762584K
scsi2: /dev/disk/by-id/ata-WD_Blue_SA510_2.5_500GB_232864800680,size=488386584K
scsi3: /dev/disk/by-id/ata-WDC_WD20JDRW-11C7VS0_WD-WX32A341F4HP,size=1907697M
scsihw: virtio-scsi-single
smbios1: uuid=aa5e3256-21d3-4338-943b-17e7b8597c8f,manufacturer=TWljcm8gU3RhciBJbnRlcm5hdGlvbmFsIENvLiwgTHRkLg==,product=TVMtN0Q3Mw==,version=MS4w,serial=VG8gYmUgZmlsbGVkIGluIGJ5IE8uRS5N,sku=VG8gYmUgZmlsbGVkIGluIGJ5IE8uRS5N,family=VG8gYmUgZmlsbGVkIGluIGJ5IE8uRS5N,base64=1
sockets: 1
tpmstate0: Windows:vm-100-disk-2,size=4M,version=v2.0
usb0: host=5-1.4.3
usb1: host=5-1.4.4
usb2: host=1-4
usb3: host=2-5.4
vga: none
vmgenid: 46fbca4c-f941-46b9-8b55-bca59ffc983b

Host configuration (I had some help from friends and also tried asking ChatGPT, but most of what it suggested did not work):
/etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_aspm=off amdgpu.runpm=0 initcall_blacklist=sysfb_init video=efifb:off video=vesa:off modprobe.blacklist=amdgpu,radeon pci=nommconf"
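
If anyone wants to reproduce this: after editing that line, the usual steps (assuming the host boots via GRUB rather than systemd-boot) are roughly:

update-grub                            # regenerate the GRUB config with the new cmdline (ZFS installs may need proxmox-boot-tool refresh instead)
cat /proc/cmdline                      # after a reboot, should show amd_iommu=on iommu=pt and the rest
dmesg | grep -i -e iommu -e amd-vi     # AMD-Vi lines confirm the IOMMU actually came up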

/etc/modules-load.d/modules.conf:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

/etc/modprobe.d/pve-blacklist.conf:
blacklist amdgpu
blacklist radeon

/etc/modprobe.d/kvm.conf:
options kvm ignore_msrs=1

/etc/modprobe.d/vfio.conf:
options vfio-pci ids=1002:747e,1002:ab30 disable_vga=1
softdep amdgpu pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
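
Two sanity checks that go with those ids (the group number differs per board, so treat it as a placeholder):

lspci -nn -s 03:00                                    # the [1002:747e] and [1002:ab30] ids shown here should match the vfio.conf line
find /sys/kernel/iommu_groups/ -type l | grep 03:00   # shows which IOMMU group the GPU and its audio function landed in
ls /sys/kernel/iommu_groups/<group>/devices/          # ideally only 0000:03:00.0 and 0000:03:00.1 sit in that group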

That's all the information I can think of to share. I would really appreciate it if someone could at least point me in the right direction with fixing this issue. I am new to Proxmox and love it so far. Thanks in advance!
 
The GPU probably does not reset properly (which is why you use/need initcall_blacklist=sysfb_init). In that case you cannot stop and start the VM again without it causing problems for the VM. Since the GPU is not virtual and is connected to the real PCIe bus, this can cause problems on the Proxmox host as well. People have had problems like this before.
Maybe you can find a work-around for your type of GPU (like vendor-reset for older GPUs) on the internet? Or maybe you need a different GPU, or you will have to learn to live with it.
 
Do the kernel dmesg logs show whether or not it is a vendor-reset issue? I am asking because the dmesg log specifically states that the vendor-reset executes fine and doesn't seem to error out at all. What really stands out to me are the CPU soft lockup errors that keep popping up after the VM is shut down. Another weird thing is that rebooting the VM sometimes works perfectly fine, but only sometimes.
 
Do the kernel dmesg logs show whether or not it is a vendor-reset issue?
vendor-reset is the name of a project to fix FLR for AMD GPUs and not an issue.
Maybe the system logs will show signs of the GPU not resetting properly, but your system crashes and might not write the logs to disk. Your GPU is known to potentially have reset problems.
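
One way to capture evidence despite that: make the journal persistent, so kernel messages from before the lockup can be read back on the next boot (a minimal sketch; a hard lockup can still lose the last few lines):

mkdir -p /var/log/journal              # once this directory exists, journald keeps logs across reboots
systemctl restart systemd-journald
journalctl -k -b -1 | grep -i -e "soft lockup" -e vfio -e reset   # after the next crash, read the previous boot's kernel log
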
I am asking because the dmesg log specifically states that the vendor-reset executes fine and doesn't seem to error out at all.
vendor-reset does not work for your GPU and does nothing, and therefore also doesn't show errors. See Supported Devices here: https://github.com/gnif/vendor-reset
What really stands out to me are the CPU soft lockup errors that keep popping up after the VM is shut down. Another weird thing is that rebooting the VM sometimes works perfectly fine, but only sometimes.
Unfortunately, reset issues are known for almost all AMD GPU generations (except the 6000-series), but not every make and model has them (and not every 6000-series GPU resets fine).
Maybe there is a fix or work-around for your (specific make and model of) GPU but I don't know it, unfortunately.
 
Maybe the drivers leave the GPU in a state that the same driver does not expect or cannot handle on the next VM boot. Have you tried a recent Linux VM to see if it has the same problem? Maybe starting a Linux VM after the Windows VM helps as a work-around. Or maybe ejecting the GPU before shutting down the VM helps as a work-around. Or removing the GPU from the PCIe bus and doing a rescan (before starting the VM again) sometimes helps.
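
For the remove/rescan idea, roughly the following (with the VM stopped, the slot adjusted to your card, and no guarantee it works if the GPU is already wedged):

echo 1 > /sys/bus/pci/devices/0000:03:00.0/remove   # drop the GPU function from the PCIe bus
echo 1 > /sys/bus/pci/devices/0000:03:00.1/remove   # drop its HDMI audio function
echo 1 > /sys/bus/pci/rescan                        # rediscover the devices before starting the VM again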
 
I haven't tried using a Linux VM instead of a Windows VM, though that would probably be a good thing to try. The issue is that I don't have enough space to do that at the moment, but I will test it when I can. I know that ejecting the GPU before shutting down the VM does help a bit; however, re-attaching it and then booting the VM up again has mixed results. I honestly wonder if my GPU is just dying, because it sometimes struggled with certain things even when I was running Windows normally.