GPU Passthrough only works after suspend/resume of the host

Sep 28, 2019
So I do have this GPU passthrough working, but the process is ... weird.

Hardware:
* Ryzen 9 3900X
* ASUS Pro WS-X570 ACE
* XFX Radeon RX 580 GTS XXX Edition (passthrough)
* VisionTek Radeon 5450 (host)

I primarily followed this tutorial, and I got the following in dmesg:

Code:
[  177.198320] vfio-pci 0000:0a:00.0: enabling device (0002 -> 0003)
[  177.198649] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x19@0x270
[  177.198658] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x1b@0x2d0
[  177.198664] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x1e@0x370
[  177.220653] vfio-pci 0000:0a:00.1: enabling device (0000 -> 0002)
[  178.309130] dpc 0000:00:03.2:pcie008: DPC containment event, status:0x1f01 source:0x0000
[  178.309136] dpc 0000:00:03.2:pcie008: DPC unmasked uncorrectable error detected
[  178.309147] pcieport 0000:00:03.2: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[  178.309149] pcieport 0000:00:03.2:   device [1022:1483] error status/mask=00100000/04400000
[  178.309152] pcieport 0000:00:03.2:    [20] UnsupReq               (First)
[  178.309154] pcieport 0000:00:03.2:   TLP Header: 34000000 0a000010 00000000 80008000
[  178.444946] vfio_bar_restore: 0000:0a:00.1 reset recovery - restoring bars
[  178.460801] vfio_bar_restore: 0000:0a:00.0 reset recovery - restoring bars
[  178.564702] pcieport 0000:00:03.2: AER: Device recovery successful
The fix ... is to suspend and resume the host. If I run systemctl suspend and then wake the host, the VM boots and the GPU passes through correctly.
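
The workaround described here could be scripted roughly as follows. This is only a sketch under assumptions: it uses rtcwake in place of systemctl suspend so the host wakes itself again, it assumes VM ID 100 (taken from the qm output in this post), and the 15 second sleep and the DRY_RUN switch are illustrative choices, not something from the original setup.

```shell
#!/bin/sh
# Sketch: suspend the host briefly, then start the VM once it has resumed.
# Assumptions: VM ID 100, a 15 second sleep, and a board whose RTC can wake
# the host from suspend-to-RAM. Run as root.
VMID="${VMID:-100}"
SLEEP_SECS="${SLEEP_SECS:-15}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

suspend_then_start() {
    # rtcwake -m mem suspends to RAM and programs the RTC alarm;
    # it returns only after the host has resumed.
    run rtcwake -m mem -s "$SLEEP_SECS" &&
    run qm start "$VMID"
}
```

Calling DRY_RUN=1 suspend_then_start prints the two commands without running them, which is a safe way to check the values before suspending a real host.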


I documented my whole setup process here: https://github.com/edalquist/proxmox/blob/master/gpu_passthrough.md#debugging-

Here are the details of the VM:

Code:
# qm showcmd 100 --pretty
/usr/bin/kvm \
  -id 100 \
  -name vm100 \
  -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/100.pid \
  -daemonize \
  -smbios 'type=1,uuid=a8af83c1-41eb-4036-af14-cabadefeab30' \
  -drive 'if=pflash,unit=0,format=raw,readonly,file=/usr/share/pve-edk2-firmware//OVMF_CODE.fd' \
  -drive 'if=pflash,unit=1,format=raw,id=drive-efidisk0,file=/dev/zvol/rpool/data/vm-100-disk-1' \
  -smp '22,sockets=1,cores=22,maxcpus=22' \
  -nodefaults \
  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
  -vga none \
  -nographic \
  -no-hpet \
  -cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=proxmox,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,hv_ipi,enforce,kvm=off' \
  -m 32768 \
  -device 'vmgenid,guid=b0385b03-56d6-4a6e-b3be-48c3632e4498' \
  -readconfig /usr/share/qemu-server/pve-q35.cfg \
  -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' \
  -device 'vfio-pci,host=0a:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on' \
  -device 'vfio-pci,host=0a:00.1,id=hostpci0.1,bus=ich9-pcie-port-1,addr=0x0.1' \
  -chardev 'socket,path=/var/run/qemu-server/100.qga,server,nowait,id=qga0' \
  -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
  -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
  -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' \
  -iscsi 'initiator-name=iqn.1993-08.org.debian:01:e6a02c888316' \
  -drive 'file=/dev/zvol/rpool/data/vm-100-disk-0,if=none,id=drive-ide0,discard=on,format=raw,cache=none,aio=native,detect-zeroes=unmap' \
  -device 'ide-hd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=100' \
  -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
  -drive 'file=/var/lib/vz/template/iso/Windows.iso,if=none,id=drive-sata0,media=cdrom,aio=threads' \
  -device 'ide-cd,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=200' \
  -drive 'file=/var/lib/vz/template/iso/virtio-win-0.1.171.iso,if=none,id=drive-sata1,media=cdrom,aio=threads' \
  -device 'ide-cd,bus=ahci0.1,drive=drive-sata1,id=sata1,bootindex=201' \
  -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
  -device 'virtio-net-pci,mac=DA:01:58:D1:DD:D7,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' \
  -rtc 'driftfix=slew,base=localtime' \
  -machine 'type=pc-q35-3.1' \
  -global 'kvm-pit.lost_tick_policy=discard'
 

Stefan_R

Proxmox Staff Member
A very peculiar error indeed. Looking at the Reddit thread you linked in your setup notes, it seems that this might be an AGESA/BIOS issue. Have you checked for BIOS updates for your board?

One other thing you could try is resetting the card directly (without suspending), e.g. (as root):

Code:
echo 1 > /sys/bus/pci/devices/[your GPUs ID]/reset
and then see if it works.
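
A slightly more defensive version of that reset might look like the sketch below. It relies on the reset file being exposed in sysfs only when the kernel has a reset method for the function. The function name pci_reset and the SYSFS_ROOT override are hypothetical and exist only so the logic can be exercised outside a real /sys.

```shell
# Sketch: attempt a sysfs function reset on one PCI device (run as root).
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
pci_reset() {
    dev="$1"    # full PCI address, e.g. 0000:0a:00.0
    reset_file="${SYSFS_ROOT:-/sys}/bus/pci/devices/$dev/reset"

    # No reset file means the kernel knows no way to reset this function.
    if [ ! -e "$reset_file" ]; then
        echo "no reset method exposed for $dev"
        return 1
    fi

    # Writing 1 asks the kernel to reset the function.
    if echo 1 > "$reset_file"; then
        echo "reset requested for $dev"
    else
        echo "reset failed for $dev"
        return 1
    fi
}
```

For the card in this thread the call would be pci_reset 0000:0a:00.0, assuming the address from the dmesg output is still current.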
 
Sep 28, 2019
No luck with the reset:

Code:
$ echo 1 > /sys/bus/pci/devices/0000\:0b\:00.0/reset
-bash: echo: write error: Inappropriate ioctl for device
I'm on the latest BIOS for the motherboard and the latest driver. I'm wondering whether this is some other incarnation of "the AMD reset bug" that I keep finding mentions of.
 

Stefan_R

Proxmox Staff Member
The "reset bug" you mention usually means only that a card which has already been initialized once (mostly after a successful VM start) is not usable any more. Maybe something else is using your GPU before you start your VM, perhaps your BIOS? Try switching slots for your GPUs or toggling CSM/Legacy Support in your BIOS setup, which often causes another slot to be used for booting.
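
One way to check whether something touched the card before the VM starts is to read two sysfs attributes: boot_vga, which is 1 when the firmware used that device as the boot display, and the driver symlink, which shows the kernel driver currently bound. The sketch below is illustrative only; pci_boot_state and the SYSFS_ROOT override are hypothetical names.

```shell
# Sketch: report boot_vga and the bound driver for one PCI device.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
pci_boot_state() {
    dev="$1"    # full PCI address, e.g. 0000:0a:00.0
    dev_dir="${SYSFS_ROOT:-/sys}/bus/pci/devices/$dev"

    # boot_vga only exists for VGA-class devices.
    if [ -r "$dev_dir/boot_vga" ]; then
        boot_vga="$(cat "$dev_dir/boot_vga")"
    else
        boot_vga="n/a"
    fi

    # The driver symlink is absent when no driver is bound.
    if [ -L "$dev_dir/driver" ]; then
        driver="$(basename "$(readlink "$dev_dir/driver")")"
    else
        driver="none"
    fi

    echo "$dev boot_vga=$boot_vga driver=$driver"
}
```

If the passthrough card reports boot_vga=1, the firmware initialized it at boot, which would fit the theory above.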

Otherwise, since the '/reset' didn't work, there are other ways of resetting/power cycling a PCI device. Check the answers on this page for example, although be aware that such "hard" resets may cause severe system instability.
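
One commonly used alternative is to remove the device from the kernel's view and then rescan the bus, using the sysfs remove and rescan files. The sketch below is an assumption about what such a "hard" approach could look like and is not necessarily one of the methods on the linked page; pci_remove_rescan, SYSFS_ROOT and SETTLE_SECS are hypothetical names. The stability warning above applies.

```shell
# Sketch: detach a PCI device from the kernel and re-enumerate the bus.
# Run as root. SYSFS_ROOT and SETTLE_SECS are hypothetical knobs.
pci_remove_rescan() {
    dev="$1"    # full PCI address, e.g. 0000:0a:00.0
    pci_root="${SYSFS_ROOT:-/sys}/bus/pci"

    # Writing 1 to remove detaches the device and unbinds its driver.
    echo 1 > "$pci_root/devices/$dev/remove" || return 1

    # Give the device a moment before asking the kernel to look again.
    sleep "${SETTLE_SECS:-1}"

    # Writing 1 to rescan makes the kernel re-enumerate all PCI buses.
    echo 1 > "$pci_root/rescan"
}
```

Because the GPU here has two functions (0a:00.0 for video and 0a:00.1 for audio), both would likely need to be removed before the rescan for this to have any effect.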
 
Sep 28, 2019
Thanks for the confirmation, I figured it was more likely something initializing the card at boot. It happens with the GPU in any of the 3 slots and enabling CSM doesn't seem to fix it. I'm not having much luck finding any BIOS options to get the 2nd card ignored.
 

ThNetAdmin

New Member
I have exactly the same problem with:

* AMD 3700X
* ASRock X570M Pro
* AMD RX580 (passthrough)
* No host GPU
 
