AMD GPU Passthrough recently stopped working

machone

Member
Aug 19, 2022
I think it was related to a kernel upgrade but I can't be sure.

I'm now on Proxmox 8.4.1 and kernel 6.8.12-10. It worked solidly for over a year, but back then I was on kernel 5.13.19-6 and possibly also a 7.x version of Proxmox.
My understanding is that with newer kernel versions some of the requirements have changed, both for GRUB's `cmdline` and for blacklisting drivers. Typically I get no video output on the GPU and the Windows VM shows a Code 43 on the GPU. That's a pretty generic message and I'm not sure how to interrogate it for more information. I have tried re-installing drivers in Windows many times. At this point I'm not convinced that it's a Windows issue rather than a Proxmox/Linux driver/passthrough issue.

Asrock X570M Pro 4
Ryzen 7 5700G
AMD Radeon 6800XT
Notes: I've tried re-seating the GPU. Resizable BAR and 4G decoding are off. IOMMU is enabled.
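One way to double-check that the IOMMU is really active on the host, and to see which group the GPU lands in, is to walk the IOMMU groups in sysfs. The following is a minimal sketch, assuming the standard /sys/kernel/iommu_groups layout; the function name list_iommu_groups and its optional root argument are illustrative and not part of Proxmox.

```shell
# List every PCI device together with its IOMMU group number.
# $1 (optional, hypothetical override): sysfs root to scan instead of the default.
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local dev group
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # nothing matched: IOMMU is probably not active
        group="${dev#"$root"/}"        # drop the root prefix
        group="${group%%/*}"           # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done | sort -V
}

# Usage on the host:
#   list_iommu_groups | grep 0b:00
```

If this prints nothing at all, the IOMMU is likely not enabled as far as the kernel is concerned, regardless of the firmware setting.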

agent: 1,fstrim_cloned_disks=1
args: -cpu host,-hypervisor,kvm=off
bios: ovmf
boot: order=virtio0
cores: 16
cpu: x86-64-v2-AES
efidisk0: nvme_zfs:vm-110-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:0b:00,pcie=1,rombar=0
machine: pc-q35-7.1
memory: 16384
meta: creation-qemu=7.1.0,ctime=1678932602
name: W11-Gaming
net0: virtio=00:E0:4C:0D:BA:8E,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=e43d67ff-71bc-49f4-8581-350379242ee0,manufacturer=QVNSb2Nr,product=WDU3ME0gUHJvNA==,serial=TTgwLUYxMDEyMzAwMzE4,base64=1
sockets: 1
tablet: 1
tpmstate0: nvme_zfs:vm-110-disk-0,size=4M,version=v2.0
usb0: host=0bda:8771
vga: std
virtio0: local-lvm:vm-110-disk-1,backup=0,discard=on,iothread=1,replicate=0,size=650G
vmgenid: e6ff60a2-9b2b-483c-953b-574c01a0f5d0
Note: I've tried toggling rombar, creating new VMs with a newer/newest q35 machine version, turning off the virtual display, and re-installing graphics drivers.

Relevant host config files:
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
#GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.13.19-6-pve"
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
#GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on"
#GRUB_CMDLINE_LINUX_DEFAULT="iommu=on initcall_blacklist=sysfb_init video=simplefb:off"
GRUB_CMDLINE_LINUX_DEFAULT="iommu=on"
#GRUB_CMDLINE_LINUX_DEFAULT=""
# initcall_blacklist=sysfb_init
# amdgpu.dc=0 video=simplefb:off video=efifb:off"
# nofb video=vesafb:off video=efifb:off video=simplefb:off"
# pcie_acs_override=downstream,multifunction
# multifunction nofb nomodeset video=efifb:off"
# amd_iommu=on iommu=pt video=vesafb:off,efifb:off"
# iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off"
GRUB_CMDLINE_LINUX=""

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
vendor-reset
#vfio_pci
#vfio
#vfio_iommu_type1
#vfio_virqfd

# Generated by sensors-detect on Tue Mar 7 03:17:26 2023
# Chip drivers
nct6775

#blacklist amdgpu
blacklist radeon
blacklist nouveau
blacklist nvidia

options vfio_iommu_type1 allow_unsafe_interrupts=1

options kvm ignore_msrs=1 ignore_report_msrs=1

# This file contains a list of modules which are not supported by Proxmox VE

# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb

options vfio-pci ids=1002:73bf,1002:1638 disable_vga=1

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch Upstream
02:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
02:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
02:06.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
02:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
02:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
02:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
03:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. KC3000/FURY Renegade NVMe SSD [E18] (rev 01)
04:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
06:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
06:00.1 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
06:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
07:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
08:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
09:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c1)
0a:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c1)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
0b:00.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73a6
0b:00.3 Serial bus controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 USB
0c:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980
0d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c8)
0d:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
0d:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
0d:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
0d:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
0d:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
0e:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81)
0e:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81)
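The listing above does not include numeric IDs. Running lspci with -nn adds a [vendor:device] pair to each line, which makes it possible to compare the devices against the ids= list in the vfio-pci options line. Below is a small sketch of a filter for that output; pci_ids is a hypothetical helper name. It may be worth checking that every ID in ids= belongs to a function of the card at 0b:00 and not to some other device.

```shell
# Read `lspci -nn` output on stdin and print "address vendor:device" per line.
# The greedy match picks the last [xxxx:xxxx] pair, which is the device ID
# (class codes such as [0300] contain no colon and are skipped).
pci_ids() {
    sed -nE 's/^([0-9a-f:.]+) .*\[([0-9a-f]{4}:[0-9a-f]{4})\].*$/\1 \2/p'
}

# Usage on the host:
#   lspci -nn -s 0b:00 | pci_ids
#   lspci -nn -s 0d:00 | pci_ids
```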

I do get the following message when I `qm start 110`:
Code:
error writing '1' to '/sys/bus/pci/devices/0000:0b:00.0/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:0b:00.0', but trying to continue as not all devices need a reset
swtpm_setup: Not overwriting existing state file.

I believe this is related to vendor_reset, and I recently read somewhere that my card doesn't need that. I'm not clear on what vendor reset actually does or whether I need it.
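To see which reset mechanisms the kernel itself reports for the card, recent kernels expose a reset_method attribute next to the reset file named in the message above. This is a rough sketch, assuming that attribute is present on the running kernel; pci_reset_methods is a hypothetical helper name, and its second argument exists only so the function can be pointed at a different sysfs root.

```shell
# Print the reset methods the kernel lists for one PCI device.
# $1: full PCI address, e.g. 0000:0b:00.0
# $2 (optional, hypothetical override): sysfs device root
pci_reset_methods() {
    local addr="$1" root="${2:-/sys/bus/pci/devices}"
    local f="$root/$addr/reset_method"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "no reset_method attribute"
    fi
}

# Usage on the host:
#   pci_reset_methods 0000:0b:00.0
```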

Any help would be appreciated. Note that I have onboard gfx built into my CPU as well as the discrete AMD GPU, so that might make things a little more complex. I'm not trying to do anything with the onboard GPU, only the 6800XT.
 
> I do get the following message when I qm start 110:
> error writing '1' to '/sys/bus/pci/devices/0000:0b:00.0/reset': Inappropriate ioctl for device
> failed to reset PCI device '0000:0b:00.0', but trying to continue as not all devices need a reset

This is a common Proxmox message (for 6000-series AMD GPUs and many other devices). It is not an error, nor an indication of a problem.

> I believe this is related to vendor_reset, and I recently read somewhere that my card doesn't need that. I'm not even clear on what vendor reset even does or whether or not I need it.

It's not related to vendor-reset. vendor-reset does not support 6000-series AMD GPUs like yours (see https://github.com/gnif/vendor-reset ) and therefore you do not need it. I have not yet heard of a 6800 (and up) that does not reset properly by itself. It's always possible that a particular device does not work with passthrough, as with some lower-end 6000-series GPUs, where it seems to depend on the brand and specific model.

> Any help would be appreciated.

Maybe it's a Windows AMD GPU driver issue. Try booting your VM with an Ubuntu 24.04 LTS installer ISO (but don't install it!) to see if it shows output on a physical display connected to your GPU. Make sure to set the virtual Display to None (vga: none).
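Before such a test boot it can help to read back what the VM is actually configured with. The following is a small sketch, assuming the usual Proxmox config location /etc/pve/qemu-server/<vmid>.conf; vm_conf_get is a hypothetical helper name, not a Proxmox tool.

```shell
# Print the value of one key from a Proxmox VM config file.
# Only the first match is returned, so snapshot sections further down
# in the file do not shadow the current setting.
vm_conf_get() {
    local key="$1" conf="$2"
    sed -n "s/^${key}: *//p" "$conf" | head -n 1
}

# Usage on the host (VM 110 from this thread):
#   qm set 110 --vga none
#   vm_conf_get vga      /etc/pve/qemu-server/110.conf
#   vm_conf_get hostpci0 /etc/pve/qemu-server/110.conf
```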
 
1. Try booting into the previous working kernel and check whether passthrough works there.
Passthrough is kernel related.

If it works, then try to find a kernel that works for you. Normally you would update to the latest one; if that is broken, revert to the previous working kernel and wait for the fix.

2. I don't really understand your config. Are you trying to blacklist the AMD driver on the host?
Currently your config doesn't seem to blacklist the driver. You can do that by removing the # in front of
blacklist amdgpu in the config
and setting GRUB_CMDLINE_LINUX_DEFAULT="iommu=on iommu=pt initcall_blacklist=sysfb_init video=simplefb:off"

Otherwise the host will load the driver, and you need to bind the GPU to vfio-pci early:
use softdep xxxx pre: vfio-pci
Afterwards, lspci -k should show vfio-pci as the kernel driver in use for your AMD graphics card.
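The last check can also be done straight from sysfs, where each bound device carries a driver symlink. Here is a minimal sketch of that check; pci_driver is a hypothetical helper name, and the optional second argument only exists so the function can be pointed at a different sysfs root.

```shell
# Print the kernel driver currently bound to a PCI device, or "none".
# $1: full PCI address, e.g. 0000:0b:00.0
# $2 (optional, hypothetical override): sysfs device root
pci_driver() {
    local addr="$1" root="${2:-/sys/bus/pci/devices}"
    local link="$root/$addr/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

# Usage on the host, for all four functions of the card:
#   for fn in 0 1 2 3; do
#       echo "0b:00.$fn -> $(pci_driver "0000:0b:00.$fn")"
#   done
```

With early binding in place, each function that is passed through would be expected to report vfio-pci here before the VM is started.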
 
Okay, I removed vendor-reset. Followed your suggestion re: Ubuntu - nope, nothing. Black screen, "no signal". I messed around with rombar and primary GPU. vga: none the whole time.
 
I did not realize that this passthrough had worked for you before. I use the same PVE and kernel version with a 6950XT and that works fine (with a Linux VM). I do have ROM-Bar enabled, but no blacklisting, vfio-pci binding or extra kernel parameters, as I use the GPU for the host when the VM is not running. I do run `echo 0 | tee /sys/class/vtconsole/vtcon*/bind` before starting the VM to remove the host console from the amdgpu driver.
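The console-unbind step can be wrapped in a small function so it is easy to run before each VM start. This is a sketch equivalent to the tee one-liner above; unbind_vtconsoles is a hypothetical name, and the optional argument is only there so the function can be tried against a scratch directory instead of the real /sys/class/vtconsole.

```shell
# Write 0 to every vtconsole "bind" file so the host console lets go
# of the framebuffer provided by the GPU driver.
# $1 (optional, hypothetical override): vtconsole class directory
unbind_vtconsoles() {
    local root="${1:-/sys/class/vtconsole}" f
    for f in "$root"/vtcon*/bind; do
        [ -w "$f" ] || continue    # skip if missing or not writable (needs root)
        echo 0 > "$f"
    done
}

# Usage on the host, as root:
#   unbind_vtconsoles && qm start 110
```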
 
Thanks - I re-enabled that GRUB line and the amdgpu blacklisting. I also reverted to the older kernel. No change.
 
Interesting, it sounds pretty plug-and-play for you. I don't use the discrete GPU on the host so I've gone with blacklisting drivers.
I found that with the virtual console enabled and ROM-bar enabled, I got no output there (an active console, but just a black screen), but good output with ROM-bar disabled. I will try your way (no blacklisting/binding/kernel params, plus removing the host console) tomorrow and see how it goes.