Host Freezes When GPU Passthrough VM Reboots

UnknownO

Nov 6, 2022
Hello,
I'm from China. First of all, apologies for my bad English.

I pass through an RX580 2048SP GPU to a Windows VM. It works, and I can play Genshin Impact at high quality settings.
Once, when I used the Windows Start menu to restart the virtual machine, the VM froze on the first screen and the host froze as well: the web console could not be opened and the cursor on the physical console stopped blinking, so I could only hard-reset the host.
When I used the Windows Start menu to restart the VM again, it did not freeze.

Before the freeze, I did the following:
1. Installed the AMD GPU drivers (Display and Media) from Windows Update
2. Installed Windows updates (the October cumulative updates)
3. Installed the Genshin Impact client
4. Watched a video on Sakura anime

My VM Config:
Code:
args: -cpu 'host,-hypervisor,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,kvm=off,hv_vendor_id=intel' #PS: Hide VM traces to run Genshin
boot: order=sata0;ide2;net0
cores: 8
hostpci0: 0000:05:00,pcie=1,x-vga=1
ide2: hdd-disk-6:iso/Win10_22H2_Chinese_Simplified_x64.iso,media=cdrom,size=5936354K
machine: pc-q35-7.0
memory: 8192
meta: creation-qemu=7.0.0,ctime=1667724630
name: Windows10-0
net0: e1000=7E:25:E2:0E:F0:E7,bridge=vmbr2,firewall=1
numa: 0
onboot: 1
ostype: win10
sata0: ssd-disk-0:120/vm-120-disk-0.qcow2,size=64G,ssd=1
sata1: hdd-disk-0:120/vm-120-disk-0.qcow2,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=ff64687f-e1ba-4a44-b692-21d84a9916ff
sockets: 1
startup: order=8,up=15,down=60
usb0: host=1-1.2.2
usb1: host=1-1.2.3
vmgenid: 22052218-6dfa-45ed-940d-3aec221ea4c7

HOST LOGS:
Code:
Nov  6 18:28:06 pve kernel: [ 9801.522471] DMAR: DRHD: handling fault status reg 40
Nov  6 18:28:06 pve kernel: [ 9801.523647] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x10005293c
Nov  6 18:28:06 pve kernel: [ 9801.527784] DMAR: Invalidation Time-out Error (ITE) cleared
Nov  6 18:28:06 pve kernel: [ 9801.527844] DMAR: VT-d detected Invalidation Time-out Error: SID 0
Because of the post length limit, the full log cannot be posted directly in the thread. Please see this link:
https://api.llilii.cn/sharefiles/pve-logs.txt

Finally, thank you all for your discussions and answers
 
Reset issues are common for that generation of AMD GPUs. They are often solved by installing vendor-reset. Here is the guide I used a long time ago.
I also have a question: every time I update the PVE kernel, do I need to reinstall or update vendor-reset with dkms?
Or is my issue really caused by the AMD reset bug? The instructions I read say that the reset bug prevents the GPU from being used a second time, yet I can restart Windows many times from the Windows Start menu, and only occasionally does it freeze the host.
Log excerpts:
Code:
Nov  6 18:27:21 pve kernel: [ 9752.239566] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100052844
Nov  6 18:27:21 pve kernel: [ 9752.240780] DMAR: Invalidation Time-out Error (ITE) cleared
Nov  6 18:27:21 pve kernel: [ 9752.240883] DMAR: VT-d detected Invalidation Time-out Error: SID 0
Nov  6 18:27:21 pve kernel: [ 9752.240883] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x50000000003, qw1 = 0xfde0f001
....
Nov  6 18:28:06 pve kernel: [ 9801.522471] DMAR: DRHD: handling fault status reg 40
Nov  6 18:28:06 pve kernel: [ 9801.523647] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x10005293c
Nov  6 18:28:06 pve kernel: [ 9801.527784] DMAR: Invalidation Time-out Error (ITE) cleared
Nov  6 18:28:06 pve kernel: [ 9801.527844] DMAR: VT-d detected Invalidation Time-out Error: SID 0
The full log is at the link in post #1.


Looking forward to your reply!
 
I also have a question: every time I update the PVE kernel, do I need to reinstall or update vendor-reset with dkms?
dkms does that for you (if the kernel headers are also installed).
Or is my issue really caused by the AMD reset bug? The instructions I read say that the reset bug prevents the GPU from being used a second time, yet I can restart Windows many times from the Windows Start menu, and only occasionally does it freeze the host.
Most likely it's the reset bug. Rebooting from inside the VM works mostly fine but stopping and starting the VM (or Reboot from the Proxmox GUI) usually fails.
It's always a good idea to install vendor-reset with those AMD GPUs and make sure to select device_specific if you are using pve-kernel 5.15 or higher.
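To confirm the dkms behaviour on a particular host after a kernel update, a check along the following lines could be used. This is only a sketch: it assumes the module is registered with dkms under the name vendor-reset and that the headers package is named pve-headers-<kernel release>; adjust both if your setup differs.

```shell
#!/bin/sh
# Sketch: report whether vendor-reset was rebuilt for the running kernel.
# Assumption: the dkms module name is "vendor-reset".

# Reads `dkms status` output on stdin; succeeds if some line mentions the
# module, the given kernel release, and the state "installed".
vendor_reset_built_for() {
    kernel="$1"
    grep -F 'vendor-reset' | grep -F "$kernel" | grep -q 'installed'
}

# Only run the real check when dkms is available on this machine.
if command -v dkms >/dev/null 2>&1; then
    if dkms status | vendor_reset_built_for "$(uname -r)"; then
        echo "vendor-reset is built for $(uname -r)"
    else
        # Assumption: headers package is named pve-headers-<kernel release>.
        echo "vendor-reset is NOT built for $(uname -r); is pve-headers-$(uname -r) installed?"
    fi
fi
```

If the second message appears after an upgrade, missing kernel headers are the first thing to look at, since dkms cannot rebuild the module without them.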
 
Thank you!!!

Also, I have a small question:
Code:
echo 'device_specific' > /sys/bus/pci/devices/<pci_device_id_here>/reset_method
Does this command take effect permanently after running it once, or does it need to be run manually every time the system boots, for example after restarting the host or updating the system with apt?

Also, after installing vendor-reset, do I need to add amdgpu to blacklist.conf?
 
echo device_specific >/sys/bus/pci/devices/0000:05:00/reset_method
Does this command take effect permanently after running it once, or does it need to be run manually every time the system boots, for example after restarting the host or updating the system with apt?
It needs to run (once) before starting the VM, every time that the Proxmox host (re)starts. You could use a hookscript or a cron job, or whatever you find the most convenient.
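As one possible way to do this, here is a minimal hookscript sketch. It assumes the GPU's VGA function is at 0000:05:00.0 (based on the hostpci0 entry in the VM config above) and relies on Proxmox VE calling a hookscript with the VM ID and a phase name such as pre-start. The SYSFS_ROOT and GPU_ADDR overrides are illustrative additions, not something Proxmox provides.

```shell
#!/bin/sh
# Hookscript sketch: select the device_specific reset method before the VM starts.
# Proxmox VE invokes hookscripts as: <script> <vmid> <phase>

GPU_ADDR="${GPU_ADDR:-0000:05:00.0}"   # assumption: VGA function of the passed-through GPU
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"       # overridable so the logic can be tried on a fake tree

set_reset_method() {
    # $1 = PCI address, $2 = reset method name
    file="$SYSFS_ROOT/bus/pci/devices/$1/reset_method"
    if [ ! -w "$file" ]; then
        echo "cannot write $file (kernel older than 5.15, or wrong address?)" >&2
        return 1
    fi
    echo "$2" > "$file"
}

phase="${2:-}"
if [ "$phase" = "pre-start" ]; then
    set_reset_method "$GPU_ADDR" device_specific
fi
```

Saved under a hypothetical name such as gpu-reset.sh in the snippets directory of a storage that has the snippets content type enabled, and made executable, it could be attached with something like `qm set 120 --hookscript local:snippets/gpu-reset.sh`. The VM ID 120 is inferred from the disk names in the config, and the storage name local is an assumption.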
 
My AMD GPU has 0000:05:00.0 and 0000:05:00.1, but no 0000:05:00. Should I set the reset method of both 0000:05:00.0 and 0000:05:00.1 to device_specific?
Like this:
Code:
echo device_specific >/sys/bus/pci/devices/0000:05:00.0/reset_method
echo device_specific >/sys/bus/pci/devices/0000:05:00.1/reset_method
 
Sorry, my mistake, only the VGA function of the GPU: echo device_specific >/sys/bus/pci/devices/0000:05:00.0/reset_method
Check if it is working (after starting the VM) with journalctl | grep POLARIS
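To tell the VGA function apart from the audio function without guessing, the PCI class code in sysfs can be checked: display controllers report a class starting with 0x03, while the HDMI audio function reports 0x0403. A small sketch, where the function name and the SYSFS_ROOT override are illustrative assumptions:

```shell
#!/bin/sh
# Sketch: print the display (VGA) function(s) of a PCI slot such as 0000:05:00.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"   # overridable for trying this against a fake tree

vga_functions() {
    # $1 = slot address without the function suffix, e.g. 0000:05:00
    for dev in "$SYSFS_ROOT"/bus/pci/devices/"$1".*; do
        [ -r "$dev/class" ] || continue
        case "$(cat "$dev/class")" in
            0x03*) basename "$dev" ;;   # 0x03xxxx = display controller
        esac
    done
}

# Example: show the current reset method of each display function found.
for fn in $(vga_functions 0000:05:00); do
    echo "$fn: $(cat "$SYSFS_ROOT/bus/pci/devices/$fn/reset_method" 2>/dev/null)"
done
```

Reading reset_method back like this is a quick way to see whether the earlier echo took effect, in addition to the journalctl check.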
 
Thanks, I'll try this method next week
To summarize:
1. Install vendor-reset using dkms
2. Use a VM hookscript to set the reset_method of the GPU to device_specific (the audio function of the GPU does not need to be changed)
3. Disable the AMD GPU driver using a blacklist entry
Thank you again!
 
Thanks, I'll try this method next week
To summarize:
1. Install vendor-reset using dkms
It's a little more involved than just dkms. Please follow the guide I linked to. Make sure to load the vendor-reset driver in /etc/modules.
2. Use a VM hookscript to set the reset_method of the GPU to device_specific (the audio function of the GPU does not need to be changed)
Yes (but it does not need to be a hookscript, use whatever way you like).
3. Disable the AMD GPU driver using a blacklist entry
I said nothing about this. Since your passthrough is working already, you don't need to change anything about the amdgpu driver.
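For the /etc/modules part of step 1, a minimal sketch of adding the module name so that repeated runs do not duplicate the line. The module name vendor-reset is the one mentioned above; the helper name is made up for illustration, and the linked guide remains the reference for the full procedure.

```shell
#!/bin/sh
# Sketch: make sure a module name is listed exactly once in a modules file.
ensure_module_line() {
    # $1 = module name, $2 = file (e.g. /etc/modules)
    touch "$2" || return 1
    # -x matches whole lines only, -F treats the name as a literal string.
    grep -qxF "$1" "$2" || echo "$1" >> "$2"
}

# Intended use on the host (as root):
#   ensure_module_line vendor-reset /etc/modules
```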
 
