GPU Passthrough on Ryzen w x470D4U

emlynb

I have just installed Proxmox 6.2 on an X470D4U board with a Ryzen 3650x processor. The machine previously ran Debian buster with QEMU/KVM, where passthrough worked fine (and still does if I boot back into it).

When I try to pass the WX7100 GPU (first PCIe slot) through to Windows 10 in a VM, the VM fails to boot and the web UI shows the status 'internal-error' on its icon (see attached image).

If I boot back into the old Debian system, I have no issues passing it through.

My config is:

Code:
args: -machine 'type=q35,kernel_irqchip=on' -cpu 'host,kvm=off,hv_vendor_id=null'
balloon: 0
bios: ovmf
bootdisk: virtio0
cores: 16
cpu: host
efidisk0: local-zfs:vm-110-disk-0,size=1M
hostpci0: 2b:00,pcie=1,x-vga=1,romfile=WX7100.rom
ide2: local:iso/Win10_2004_English_x64.iso,media=cdrom
machine: q35
memory: 8192
name: Windows
net0: e1000=96:06:E0:9F:D0:F9,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
sata0: local:iso/virtio-win-0.1.189.iso,media=cdrom,size=488766K
scsihw: virtio-scsi-pci
smbios1: uuid=1bb70bd5-5ea6-45df-b73b-907dfa598c98
sockets: 1
vga: none
virtio0: hdd-mirror:vm-110-disk-0,size=256G
vmgenid: 04845727-55b9-4163-9056-4c5f04741692

Kernel command line:
Code:
Command line: initrd=\EFI\proxmox\5.4.60-1-pve\initrd.img-5.4.60-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs vfio-pci.ids=1002:67c4,1002:aaf0,10de:1c03,10de:10f1 video=efifb:off vga=normal iommu=pt amd_iommu=on kvm_amd.npt=1

How do I go about seeing what is causing the internal error?

I have a second NVIDIA 1060 GPU in the system too, which passes through fine to the Linux VM that uses it.

Interestingly, a similar config works fine for passing through an NVIDIA 1030 card on another Proxmox box.
 

Attachments

  • Screen Shot 2020-09-15 at 11.12.13 AM.png
args: -machine 'type=q35,kernel_irqchip=on' -cpu 'host,kvm=off,hv_vendor_id=null'
You should not need that line.

What do dmesg and syslog say while the VM is starting?
 
I needed that line to get the NVIDIA card to pass through properly without hitting error 43 in the drivers. I left it in when passing through the WX7100 card because I saw no issue with it.

Dmesg - anything specific you're looking for?

There are two GPUs in this machine, one of which passes through fine. It is only the primary PCIe slot that does not - and only under Proxmox. Under Debian / Libvirt, it all works properly - so how can I dig into where the 'internal-error' status is coming from?

I wondered if efifb was somehow mucking it up, so I tried disabling it. Doesn't appear to work entirely.

Code:
[ 0.000000] Kernel command line: initrd=\EFI\proxmox\5.4.60-1-pve\initrd.img-5.4.60-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs vfio-pci.ids=1002:67c4,1002:aaf0,10de:1c03,10de:10f1 video=efifb:off vga=normal iommu=pt amd_iommu=on kvm_amd.npt=1
[ 0.245386] pci 0000:2b:00.0: BAR 0: assigned to efifb

However, once I had added video=efifb:off, efifb no longer grabbed the offending part of iomem:

Code:
200f300000-7fffffffff : PCI Bus 0000:00
  7fc0000000-7fd1ffffff : PCI Bus 0000:2c
    7fc0000000-7fcfffffff : 0000:2c:00.0
      7fc0000000-7fcfffffff : vfio-pci
    7fd0000000-7fd1ffffff : 0000:2c:00.0
      7fd0000000-7fd1ffffff : vfio-pci
  7fe0000000-7ff01fffff : PCI Bus 0000:2b
    7fe0000000-7fefffffff : 0000:2b:00.0
    7ff0000000-7ff01fffff : 0000:2b:00.0
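
For anyone repeating this check, here is a minimal sketch of how the iomem listing above could be summarised per device. It assumes the usual /proc/iomem layout, where nesting is shown by leading spaces and a driver that has claimed a range appears as a child entry under it. The helper name iomem_owners is made up for this example and is not a Proxmox or kernel tool.

```shell
# iomem_owners (hypothetical helper): given /proc/iomem-style text on stdin,
# print each memory range of one PCI device together with the first entry
# nested below it (for example vfio-pci or efifb), or "unclaimed" if none.
# Usage (as root, otherwise the addresses read as zeros):
#   iomem_owners 0000:2b:00.0 < /proc/iomem
iomem_owners() {
    awk -v dev="$1" '
        function emit() {
            if (range != "") print range, (owner == "" ? "unclaimed" : owner)
        }
        {
            # Nesting depth is the number of leading spaces on the line.
            lead = $0; sub(/[^ ].*$/, "", lead); depth = length(lead)
            split($0, f, " : "); r = f[1]; name = f[2]; sub(/^ +/, "", r)
            # Lines indented deeper than the device range are its children.
            if (range != "" && depth > devdepth) {
                if (owner == "") owner = name
                next
            }
            emit(); range = ""; owner = ""
            if (name == dev) { range = r; devdepth = depth }
        }
        END { emit() }
    '
}
```

If efifb had still held the framebuffer, it would be expected to show up as the owner of one of the 2b:00.0 ranges instead of the range being reported as unclaimed.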
 
So after some tinkering with CSM, boot options etc, I've now got it to the point where it will pick up the correct graphics card for efifb:
Code:
root@pve:~# dmesg | grep efi
[    0.000000] Command line: initrd=\EFI\proxmox\5.4.60-1-pve\initrd.img-5.4.60-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs vfio-pci.ids=1002:67c4,1002:aaf0,10de:1c03,10de:10f1 textonly vga=normal video=vesafb:off,astdrmfb,efifb:off amd_iommu=on rd.driver.pre=vfio-pci
[    0.000000] efi: EFI v2.70 by American Megatrends
[    0.000000] efi:  ACPI 2.0=0xbc823000  ACPI=0xbc823000  SMBIOS=0xbd24c000  SMBIOS 3.0=0xbd24b000  MEMATTR=0xb75a6018  ESRT=0xb9dbc798
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.000000] Kernel command line: initrd=\EFI\proxmox\5.4.60-1-pve\initrd.img-5.4.60-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs vfio-pci.ids=1002:67c4,1002:aaf0,10de:1c03,10de:10f1 textonly vga=normal video=vesafb:off,astdrmfb,efifb:off amd_iommu=on rd.driver.pre=vfio-pci
[    0.248800] pci 0000:22:00.0: BAR 0: assigned to efifb
[    0.253772] Registered efivars operations
[    0.649971] efifb: probing for efifb
[    0.649980] efifb: framebuffer at 0xf4000000, using 1876k, total 1875k
[    0.649981] efifb: mode is 800x600x32, linelength=3200, pages=1
[    0.649982] efifb: scrolling: redraw
[    0.649984] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[    1.656014] tsc: Refined TSC clocksource calibration: 3493.437 MHz
[   30.088235] bond0: (slave enp36s0): link status definitely up, 1000 Mbps full duplex
[   30.504232] bond0: (slave enp35s0): link status definitely up, 1000 Mbps full duplex
root@pve:~#

I still cannot pass the WX7100 card through to my VM, though. When I start the machine up I see this:
Code:
[ 2091.084019] device tap110i0 entered promiscuous mode
[ 2091.101955] fwbr110i0: port 1(fwln110i0) entered blocking state
[ 2091.101973] fwbr110i0: port 1(fwln110i0) entered disabled state
[ 2091.102325] device fwln110i0 entered promiscuous mode
[ 2091.102519] fwbr110i0: port 1(fwln110i0) entered blocking state
[ 2091.102531] fwbr110i0: port 1(fwln110i0) entered forwarding state
[ 2091.104762] vmbr0: port 5(fwpr110p0) entered blocking state
[ 2091.104783] vmbr0: port 5(fwpr110p0) entered disabled state
[ 2091.105155] device fwpr110p0 entered promiscuous mode
[ 2091.105346] vmbr0: port 5(fwpr110p0) entered blocking state
[ 2091.105359] vmbr0: port 5(fwpr110p0) entered forwarding state
[ 2091.107583] fwbr110i0: port 2(tap110i0) entered blocking state
[ 2091.107601] fwbr110i0: port 2(tap110i0) entered disabled state
[ 2091.107965] fwbr110i0: port 2(tap110i0) entered blocking state
[ 2091.107982] fwbr110i0: port 2(tap110i0) entered forwarding state
[ 2093.151028] vfio-pci 0000:2b:00.0: enabling device (0000 -> 0003)
[ 2093.151749] vfio-pci 0000:2b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[ 2093.151775] vfio-pci 0000:2b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[ 2093.151796] vfio-pci 0000:2b:00.0: vfio_ecap_init: hiding ecap 0x1e@0x370
[ 2094.291546] pcieport 0000:00:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:00:03.1
[ 2094.291576] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 2094.291597] pcieport 0000:00:03.1: AER:   device [1022:1483] error status/mask=00100000/04400000
[ 2094.291614] pcieport 0000:00:03.1: AER:    [20] UnsupReq               (First)
[ 2094.291628] pcieport 0000:00:03.1: AER:   TLP Header: 34000000 2b000010 00000000 80008000
[ 2094.291673] pcieport 0000:00:03.1: AER: Device recovery successful
root@pve:~#

This AER error also occurs under the working Debian buster install, so I had not been concerned by it.
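
To compare the two installs, a rough sketch like the following could condense the AER lines in a kernel log into one line per report. It assumes the message format shown in the dmesg output above ("AER: PCIe Bus Error: severity=..." followed later by "AER: Device recovery successful"); other kernel versions may word these differently. The helper name aer_summary is made up for this example.

```shell
# aer_summary (hypothetical helper): read kernel log text on stdin and print,
# tab-separated, the reporting device, the severity, and whether a
# "Device recovery successful" line followed the report.
# Usage: dmesg | aer_summary
aer_summary() {
    awk '
        function emit() {
            if (port != "")
                print port "\t" sev "\t" (ok ? "recovered" : "no recovery logged")
        }
        /AER: PCIe Bus Error: severity=/ {
            emit()
            # Drop the [timestamp] prefix; the second word is then the device.
            line = $0; sub(/^\[[^]]*\] */, "", line)
            split(line, w, " "); port = w[2]; sub(/:$/, "", port)
            sev = $0; sub(/^.*severity=/, "", sev); sub(/, type=.*$/, "", sev)
            ok = 0
        }
        /AER: Device recovery successful/ { ok = 1 }
        END { emit() }
    '
}
```

Running it under both Debian and Proxmox while starting the VM would show whether the same non-fatal, recovered error is logged in each case.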

So the next step seems to be trying to understand why qemu is unhappy here.
 
OK, and after even more googling and tinkering, here's what is happening:

PVE treats the AER error as fatal when in fact the device has recovered. Adding the pci=noaer parameter to the kernel command line stops the error messages from reaching syslog, which in turn keeps PVE's QEMU happy.
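
For reference, a minimal sketch of adding the parameter without duplicating it. It assumes the host boots through systemd-boot (suggested by the initrd=\EFI\proxmox\... path in the command line above), in which case Proxmox keeps the kernel command line as a single line in /etc/kernel/cmdline. The helper name add_kernel_param is made up for this example. On a GRUB-booted host the parameter would go into /etc/default/grub instead, followed by update-grub.

```shell
# add_kernel_param (hypothetical helper): append PARAM to the one-line kernel
# command line stored in FILE, unless it is already present as a whole word.
# Usage: add_kernel_param pci=noaer /etc/kernel/cmdline
add_kernel_param() {
    # Split the line into one word per line and look for an exact match.
    if tr ' \t' '\n\n' < "$2" | grep -qxF -- "$1"; then
        return 0
    fi
    # Rewrite the file as the existing first line plus the new parameter.
    line=$(head -n 1 "$2")
    printf '%s %s\n' "$line" "$1" > "$2"
}
```

After editing /etc/kernel/cmdline on Proxmox 6.x, the boot entries still need regenerating with pve-efiboot-tool refresh, followed by a reboot, before the new parameter shows up in dmesg.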
 
@r.jochum - yes I have functional passthrough on this card now too, in addition to the other card.

I guess the version of KVM/QEMU in Debian buster handles these errors differently.