[SOLVED] AMD GPU Passthrough - IOTLB_INV_TIMEOUT

Oct 16, 2020
Hello,
since I am new here, a very brief introduction first:
I ran a server based on Citrix Xen for several years, but prices have now risen sharply and licenses are no longer being renewed, so I have switched to Proxmox. I already had this combination working on Xen, as I need the GPU for running simulations.

Any help is appreciated.
Regards, Ludwig


Problem:
GPU passthrough does not work; the host logs the error IOTLB_INV_TIMEOUT.
Is this caused by the known AMD reset bug, and if so, is there a fix for it yet?
Honestly, I am at a loss as to what to try next :(

Hardware:
Ryzen 1600X
ASROCK X470D4U
PCIe_Slot_6: LSI Logic / Symbios Logic SAS2008
PCIe_Slot_5: Intel i350T4
PCIe_Slot_4: GPU AMD RX570

Software:
PVE Kernel Version: 5.4.65-1
VM_102: Win10_2004


Code:
### VENDOR:DEVICE IDs AMD RX570 ###
1002:67df
1002:aaf0


### /etc/default/grub ###
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"

tried, but not working: iommu=soft
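
To rule out that the GRUB change simply never took effect, the command line of the running kernel can be compared against the intended parameters. This is only a minimal sketch: the helper name `has_kernel_param` is made up for illustration, and it does nothing more than a whole-word match on the text it is given.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a parameter appears as a whole word
# in a kernel command line string.
#   $1 = parameter (e.g. iommu=pt), $2 = command line text
has_kernel_param() {
    for word in $2; do
        [ "$word" = "$1" ] && return 0
    done
    return 1
}

# Usage on the host: /proc/cmdline holds the command line the kernel booted with.
if [ -r /proc/cmdline ]; then
    cmdline=$(cat /proc/cmdline)
    for p in amd_iommu=on iommu=pt; do
        if has_kernel_param "$p" "$cmdline"; then
            echo "present: $p"
        else
            echo "missing: $p"
        fi
    done
fi
```

If a parameter shows up as missing, the GRUB configuration was probably not regenerated or the host was not rebooted afterwards.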


### /etc/modprobe.d/vfio.conf ###
options vfio-pci ids=1002:67df,1002:aaf0 disable_vga=1 disable_idle_d3=1
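
Before starting the VM it is worth confirming that both functions of the card are really bound to vfio-pci. A rough sketch, assuming the GPU sits at 2c:00 as in the lspci output of this post; the helper name `bound_drivers` is invented here, and it only parses the "Kernel driver in use:" lines that `lspci -nnk` prints.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -nnk` output on stdin and print
# "<slot> <driver>" for every device that reports a bound driver.
bound_drivers() {
    awk '
        /^[0-9a-f]/             { slot = $1 }        # device header line
        /Kernel driver in use:/ { print slot, $NF }  # indented detail line
    '
}

# Usage on the host: both GPU functions should report vfio-pci.
if command -v lspci >/dev/null 2>&1; then
    lspci -nnk -s 2c:00 | bound_drivers
fi
```

A function that shows another driver, or is missing from the output entirely, would suggest that the options in vfio.conf are not being applied at boot.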


### LOG ###
Oct 17 14:15:30 pve pvedaemon[5290]: start VM 102: UPID:pve:000014AA:0000402F:5F8AE062:qmstart:102:root@pam:
Oct 17 14:15:30 pve pvedaemon[2972]: <root@pam> starting task UPID:pve:000014AA:0000402F:5F8AE062:qmstart:102:root@pam:
Oct 17 14:15:30 pve systemd[1]: Created slice qemu.slice.
Oct 17 14:15:30 pve systemd[1]: Started 102.scope.
Oct 17 14:15:31 pve kernel: vfio-pci 0000:2c:00.0: enabling device (0000 -> 0003)
Oct 17 14:15:31 pve kernel: vfio-pci 0000:2c:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Oct 17 14:15:31 pve kernel: vfio-pci 0000:2c:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Oct 17 14:15:31 pve kernel: vfio-pci 0000:2c:00.0: vfio_ecap_init: hiding ecap 0x1e@0x370
Oct 17 14:15:31 pve kernel: vfio-pci 0000:2c:00.1: enabling device (0000 -> 0002)
Oct 17 14:15:32 pve kernel: vfio-pci 0000:2c:00.1: vfio_bar_restore: reset recovery - restoring BARs
Oct 17 14:15:32 pve kernel: vfio-pci 0000:2c:00.0: vfio_bar_restore: reset recovery - restoring BARs
Oct 17 14:15:32 pve kernel: pcieport 0000:00:03.2: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
Oct 17 14:15:32 pve kernel: pcieport 0000:00:03.2: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
Oct 17 14:15:32 pve kernel: pcieport 0000:00:03.2: AER:   device [1022:1453] error status/mask=00200000/04400000
Oct 17 14:15:32 pve kernel: pcieport 0000:00:03.2: AER:    [21] ACSViol                (First)
Oct 17 14:15:32 pve kernel: pcieport 0000:00:03.2: AER: Device recovery successful
Oct 17 14:15:33 pve kernel: AMD-Vi: Completion-Wait loop timed out
Oct 17 14:15:33 pve kernel: AMD-Vi: Completion-Wait loop timed out
Oct 17 14:15:33 pve kernel: AMD-Vi: Completion-Wait loop timed out
Oct 17 14:15:33 pve kernel: AMD-Vi: Completion-Wait loop timed out
Oct 17 14:15:33 pve QEMU[5307]: kvm: vfio_err_notifier_handler(0000:2c:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
Oct 17 14:15:33 pve QEMU[5307]: kvm: vfio_err_notifier_handler(0000:2c:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
Oct 17 14:15:33 pve pvedaemon[2972]: <root@pam> end task UPID:pve:000014AA:0000402F:5F8AE062:qmstart:102:root@pam: OK
Oct 17 14:15:33 pve kernel: AMD-Vi: Completion-Wait loop timed out
Oct 17 14:15:33 pve kernel: AMD-Vi: Completion-Wait loop timed out
Oct 17 14:15:33 pve kernel: AMD-Vi: Completion-Wait loop timed out
Oct 17 14:15:33 pve kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=2c:00.0 address=0x7fb59ec90]              <---
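
To pull only the relevant lines out of a longer log, something like the following could be used. It is a sketch, not a tested tool: the two function names are made up, and the patterns are simply the message prefixes visible in the log above.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines relevant to this failure.
passthrough_events() {
    grep -E 'vfio|AER:|AMD-Vi|IOTLB_INV_TIMEOUT'
}

# Hypothetical helper: count how often the IOMMU gave up waiting.
count_completion_wait_timeouts() {
    grep -c 'Completion-Wait loop timed out'
}

# Usage on the host: kernel messages of the current boot, last 40 matches.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b --no-pager 2>/dev/null | passthrough_events | tail -n 40
fi
```

In the log above the order is: BAR restore after the reset, an ACS violation reported by root port 00:03.2, and only then the Completion-Wait timeouts that end in IOTLB_INV_TIMEOUT.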


### LSPCI ###
root@pve:~# lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 59)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
01:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
03:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 43d0 (rev 01)
03:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
03:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge (rev 01)
20:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
20:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
20:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
20:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
20:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
20:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
21:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
22:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
23:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
24:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
25:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
2b:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)       <---
2c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)                <---
2c:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
2d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
2d:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
2d:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
2e:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
2e:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
2e:00.3 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller


### IOMMU GROUPS ###
root@pve:~# find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/17/devices/0000:01:00.3
/sys/kernel/iommu_groups/7/devices/0000:00:04.0
/sys/kernel/iommu_groups/25/devices/0000:2e:00.2
/sys/kernel/iommu_groups/15/devices/0000:01:00.1
/sys/kernel/iommu_groups/5/devices/0000:00:03.1
/sys/kernel/iommu_groups/23/devices/0000:2d:00.3
/sys/kernel/iommu_groups/13/devices/0000:00:18.3
/sys/kernel/iommu_groups/13/devices/0000:00:18.1
/sys/kernel/iommu_groups/13/devices/0000:00:18.6
/sys/kernel/iommu_groups/13/devices/0000:00:18.4
/sys/kernel/iommu_groups/13/devices/0000:00:18.2
/sys/kernel/iommu_groups/13/devices/0000:00:18.0
/sys/kernel/iommu_groups/13/devices/0000:00:18.7
/sys/kernel/iommu_groups/13/devices/0000:00:18.5
/sys/kernel/iommu_groups/3/devices/0000:00:02.0
/sys/kernel/iommu_groups/21/devices/0000:2d:00.0
/sys/kernel/iommu_groups/11/devices/0000:00:08.1
/sys/kernel/iommu_groups/1/devices/0000:00:01.1
/sys/kernel/iommu_groups/18/devices/0000:03:00.0
/sys/kernel/iommu_groups/18/devices/0000:20:00.0
/sys/kernel/iommu_groups/18/devices/0000:20:03.0
/sys/kernel/iommu_groups/18/devices/0000:25:00.0
/sys/kernel/iommu_groups/18/devices/0000:24:00.0
/sys/kernel/iommu_groups/18/devices/0000:20:02.0
/sys/kernel/iommu_groups/18/devices/0000:03:00.1
/sys/kernel/iommu_groups/18/devices/0000:23:00.0
/sys/kernel/iommu_groups/18/devices/0000:20:08.0
/sys/kernel/iommu_groups/18/devices/0000:22:00.0
/sys/kernel/iommu_groups/18/devices/0000:20:01.0
/sys/kernel/iommu_groups/18/devices/0000:20:04.0
/sys/kernel/iommu_groups/18/devices/0000:21:00.0
/sys/kernel/iommu_groups/18/devices/0000:03:00.2
/sys/kernel/iommu_groups/8/devices/0000:00:07.0
/sys/kernel/iommu_groups/26/devices/0000:2e:00.3
/sys/kernel/iommu_groups/16/devices/0000:01:00.2
/sys/kernel/iommu_groups/6/devices/0000:00:03.2
/sys/kernel/iommu_groups/24/devices/0000:2e:00.0
/sys/kernel/iommu_groups/14/devices/0000:01:00.0
/sys/kernel/iommu_groups/4/devices/0000:00:03.0
/sys/kernel/iommu_groups/22/devices/0000:2d:00.2
/sys/kernel/iommu_groups/12/devices/0000:00:14.3
/sys/kernel/iommu_groups/12/devices/0000:00:14.0
/sys/kernel/iommu_groups/2/devices/0000:00:01.3
/sys/kernel/iommu_groups/20/devices/0000:2c:00.1    <---
/sys/kernel/iommu_groups/20/devices/0000:2c:00.0    <---
/sys/kernel/iommu_groups/10/devices/0000:00:08.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/19/devices/0000:2b:00.0
/sys/kernel/iommu_groups/9/devices/0000:00:07.1
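
The raw find output is hard to read because it is unsorted. A small sketch for turning it into a sorted "group address" table; the function name `iommu_group_table` is made up, and it expects exactly the path layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read /sys/kernel/iommu_groups/<N>/devices/<addr> paths
# on stdin and print "<N> <addr>" lines, sorted by group number, then address.
iommu_group_table() {
    awk -F/ '{ print $5, $7 }' | sort -k1,1n -k2,2
}

# Usage on the host:
if [ -d /sys/kernel/iommu_groups ]; then
    find /sys/kernel/iommu_groups/ -type l | iommu_group_table
fi
```

Sorted this way, the listing above shows group 20 containing only 2c:00.0 and 2c:00.1, so the GPU and its audio function are isolated from other devices and the grouping itself does not look like the problem.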
 

Attachment: x470d4u_wiring.jpg
Oct 16, 2020
I spent a few more hours on this, and apparently there is no solution. A pity, because it worked well on Xen.

All modern AMD cards have a reset bug, i.e. the card cannot be reset from software. That reset is needed, however, to force the card back into a pre-boot state so that the driver in the VM can take it over.
There is a more or less good patch for this, but no help from AMD.

Surprising, since anyone running GPU passthrough with an AMD card should be affected by this. So, like it or not, I will now have to get an Nvidia Quadro; I have no desire, and above all no more time, to deal with Error 43.
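
For anyone wanting to check the reset situation on their own host: the kernel exposes a `reset` file in sysfs for PCI functions it knows how to reset. The sketch below only tests for that file; the helper name `has_pci_reset` and its optional second argument are made up, and the addresses are the ones from this post. Note that the file being present does not prove that the reset leaves the card in a usable state, which is exactly the problem described above.

```shell
#!/bin/sh
# Hypothetical helper: report whether the kernel exposes a per-function
# reset for a PCI device.
#   $1 = PCI address, $2 = sysfs root (optional, defaults to /sys)
has_pci_reset() {
    [ -e "${2:-/sys}/bus/pci/devices/$1/reset" ]
}

# Usage on the host, for both functions of the GPU:
for dev in 0000:2c:00.0 0000:2c:00.1; do
    if has_pci_reset "$dev"; then
        echo "$dev: kernel offers a reset method"
    else
        echo "$dev: no reset method exposed (or device not present)"
    fi
done
```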
 