[SOLVED] AMD GPU Passthrough recently stopped working

machone

Member
Aug 19, 2022
I think it was related to a kernel upgrade but I can't be sure.

I'm now on Proxmox 8.4.1 and kernel 6.8.12-10. It worked solidly for over a year but I was on kernel 5.13.19-6, possibly also a 7.x version of Proxmox.
My understanding is that with the newer kernel versions, some of the requirements have changed in terms of GRUB's `cmdline` as well as driver blacklisting. Typically I get no video output on the GPU and the Windows VM shows a Code 43 on the GPU. That's a pretty generic message and I'm not sure how to interrogate it for more information. I have tried re-installing drivers in Windows many times. At this point I'm not convinced it's a Windows issue rather than a Proxmox/Linux driver/passthrough issue.

ASRock X570M Pro4
Ryzen 7 5700G
AMD Radeon 6800XT
Notes: I've tried re-seating the GPU. Resizable BAR and 4G decoding are off. IOMMU is enabled.
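For reference, whether the IOMMU is actually splitting devices into groups can be checked directly in sysfs. This helper is a sketch of my own (standard sysfs paths; the alternate-root parameter exists only so it can be dry-run off the box):

```shell
# Sketch: list which IOMMU group every PCI device landed in.
# Reads the standard sysfs layout; an alternate root can be passed
# so the function can be exercised without the real hardware.
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for d in "$root"/*/devices/*; do
        [ -e "$d" ] || continue
        g=${d%/devices/*}   # drop the "/devices/<address>" tail
        g=${g##*/}          # keep only the group number
        printf 'group %s: %s\n' "$g" "${d##*/}"
    done | sort -n -k 2
}

list_iommu_groups
```

On this board, the 6800XT's four functions show up as groups 22-25 in the dmesg further down the thread.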

agent: 1,fstrim_cloned_disks=1
args: -cpu host,-hypervisor,kvm=off
bios: ovmf
boot: order=virtio0
cores: 16
cpu: x86-64-v2-AES
efidisk0: nvme_zfs:vm-110-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:0b:00,pcie=1,rombar=0
machine: pc-q35-7.1
memory: 16384
meta: creation-qemu=7.1.0,ctime=1678932602
name: W11-Gaming
net0: virtio=00:E0:4C:0D:BA:8E,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=e43d67ff-71bc-49f4-8581-350379242ee0,manufacturer=QVNSb2Nr,product=WDU3ME0gUHJvNA==,serial=TTgwLUYxMDEyMzAwMzE4,base64=1
sockets: 1
tablet: 1
tpmstate0: nvme_zfs:vm-110-disk-0,size=4M,version=v2.0
usb0: host=0bda:8771
vga: std
virtio0: local-lvm:vm-110-disk-1,backup=0,discard=on,iothread=1,replicate=0,size=650G
vmgenid: e6ff60a2-9b2b-483c-953b-574c01a0f5d0
Note: I've tried toggling rombar, creating new VMs with a newer/newest q35 machine version, turning off virtual display, re-installing graphics drivers.

Relevant host config files:
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
#GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.13.19-6-pve"
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
#GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on"
#GRUB_CMDLINE_LINUX_DEFAULT="iommu=on initcall_blacklist=sysfb_init video=simplefb:off"
GRUB_CMDLINE_LINUX_DEFAULT="iommu=on"
#GRUB_CMDLINE_LINUX_DEFAULT=""
# initcall_blacklist=sysfb_init
# amdgpu.dc=0 video=simplefb:off video=efifb:off"
# nofb video=vesafb:off video=efifb:off video=simplefb:off"
# pcie_acs_override=downstream,multifunction
# multifunction nofb nomodeset video=efifb:off"
# amd_iommu=on iommu=pt video=vesafb:off,efifb:off"
# iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off"
GRUB_CMDLINE_LINUX=""

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
vendor-reset
#vfio_pci
#vfio
#vfio_iommu_type1
#vfio_virqfd

# Generated by sensors-detect on Tue Mar 7 03:17:26 2023
# Chip drivers
nct6775

#blacklist amdgpu
blacklist radeon
blacklist nouveau
blacklist nvidia

options vfio_iommu_type1 allow_unsafe_interrupts=1

options kvm ignore_msrs=1 ignore_report_msrs=1

# This file contains a list of modules which are not supported by Proxmox VE

# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb

options vfio-pci ids=1002:73bf,1002:1638 disable_vga=1
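For reference, vendor:device pairs like `1002:73bf` come straight from `lspci -nn` output. This little filter (a sketch of mine, not part of the original config) turns that output into the comma-separated list that `vfio-pci`'s `ids=` option expects:

```shell
# Sketch: pull every [vendor:device] tag out of lspci -nn text and
# join them into the comma-separated list vfio-pci's ids= option wants.
# Written as a filter so it can be fed sample text for testing.
extract_ids() {
    grep -o '\[[0-9a-f]\{4\}:[0-9a-f]\{4\}\]' | tr -d '[]' | paste -sd, -
}

# On the real host (0b:00 is the GPU slot from the lspci dump below):
#   lspci -nn -s 0b:00 | extract_ids
```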

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch Upstream
02:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
02:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
02:06.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
02:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
02:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
02:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
03:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. KC3000/FURY Renegade NVMe SSD [E18] (rev 01)
04:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
06:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
06:00.1 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
06:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
07:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
08:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
09:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c1)
0a:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c1)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
0b:00.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73a6
0b:00.3 Serial bus controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 USB
0c:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980
0d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c8)
0d:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
0d:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
0d:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
0d:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
0d:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
0e:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81)
0e:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81)

I do get the following message when I run `qm start 110`:
Code:
error writing '1' to '/sys/bus/pci/devices/0000:0b:00.0/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:0b:00.0', but trying to continue as not all devices need a reset
swtpm_setup: Not overwriting existing state file.

I believe this is related to vendor-reset, and I recently read somewhere that my card doesn't need it. I'm not clear on what vendor-reset even does, or whether I need it.

Any help would be appreciated. Note that I have onboard gfx built into my CPU as well as the discrete AMD GPU, so that might make things a little more complex. I'm not trying to do anything with the onboard GPU, only the 6800XT.
 
I do get the following message when I qm start 110:
error writing '1' to '/sys/bus/pci/devices/0000:0b:00.0/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:0b:00.0', but trying to continue as not all devices need a reset
This is a common Proxmox message (for 6000-series AMD GPU and many other devices) and not an error (nor an indication of a problem).
I believe this is related to vendor-reset, and I recently read somewhere that my card doesn't need it. I'm not clear on what vendor-reset even does, or whether I need it.
It's not related to vendor-reset. vendor-reset does not support 6000-series AMD GPUs like yours (see https://github.com/gnif/vendor-reset ) and therefore you do not need it. I have not yet heard of a 6800 (and up) that does not reset properly by itself. It's always possible that some device does not work with passthrough like some lower 6000-series GPU where it seems to depend on the brand and specific model.
Any help would be appreciated.
Maybe it's a Windows AMD GPU driver issue. Try booting your VM with a Ubuntu 24.04 LTS installer ISO (but don't install it!) to see if it shows output on a physical display connected to your GPU. Make sure to set the virtual Display to None (vga: none).
 
1. Try booting into the previous working kernel and check whether it works.
Passthrough is kernel-related.

If it works, try to find a kernel version that works for you. Normally update to the latest (if it's broken, wait for the fix and revert to the previous working one).

2. I don't really understand your config: are you trying to blacklist the AMD driver on the host?
Currently your config doesn't seem to blacklist the driver. You can do that by removing the # in front of
blacklist amdgpu in the config,
and setting GRUB_CMDLINE_LINUX_DEFAULT="iommu=on iommu=pt initcall_blacklist=sysfb_init video=simplefb:off"

Otherwise the host will load the driver, and you need to early-bind the card to vfio-pci:
use softdep xxxx pre: vfio-pci
Then in lspci you should see vfio-pci as the driver in use for your AMD graphics card.
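Concretely, the early-bind setup described above would look something like this (the filename is conventional rather than required; the IDs are the ones from the vfio.conf shown earlier in the thread):

```
# /etc/modprobe.d/vfio.conf
# Make sure vfio-pci claims the card before amdgpu can:
softdep amdgpu pre: vfio-pci
options vfio-pci ids=1002:73bf,1002:1638 disable_vga=1
```

Then run `update-initramfs -u -k all`, reboot, and `lspci -nnk -s 0b:00.0` should report `Kernel driver in use: vfio-pci`.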
 
This is a common Proxmox message (for 6000-series AMD GPU and many other devices) and not an error (nor an indication of a problem).

It's not related to vendor-reset. vendor-reset does not support 6000-series AMD GPUs like yours (see https://github.com/gnif/vendor-reset ) and therefore you do not need it. I have not yet heard of a 6800 (and up) that does not reset properly by itself. It's always possible that some device does not work with passthrough like some lower 6000-series GPU where it seems to depend on the brand and specific model.

Maybe it's a Windows AMD GPU driver issue. Try booting your VM with a Ubuntu 24.04 LTS installer ISO (but don't install it!) to see if it shows output on a physical display connected to your GPU. Make sure to set the virtual Display to None (vga: none).
Okay, I removed vendor-reset. Followed your suggestion re: Ubuntu - nope, nothing. Black screen, "no signal". I messed around with rombar and primary GPU. vga: none the whole time.
 
Okay, I removed vendor-reset. Followed your suggestion re: Ubuntu - nope, nothing. Black screen, "no signal". I messed around with rombar and primary GPU. vga: none the whole time.
I did not realize that this passthrough worked before for you. I use the same PVE and kernel version with a 6950XT and that works fine (with a Linux VM). I do have ROM-Bar enabled but no other blacklisting or vfio-pci binding or kernel parameters as I use the GPU for the host when the VM is not running. I do run echo 0 | tee /sys/class/vtconsole/vtcon*/bind before starting the VM to remove the host console from the amdgpu driver.
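The pre-start unbind can be automated with a Proxmox hookscript. The sketch below is my own; the snippet path and the rebind-on-stop behavior are assumptions, not something from this thread:

```shell
# Sketch of a Proxmox hookscript wrapping the vtconsole trick above.
# Assumed install (not from the thread): save as
# /var/lib/vz/snippets/gpu-hook.sh, chmod +x, then
#   qm set 110 --hookscript local:snippets/gpu-hook.sh
# Proxmox invokes the script as: <script> <vmid> <phase>.
handle_phase() {
    phase=$1
    dir=${2:-/sys/class/vtconsole}   # overridable for dry runs
    case $phase in
        pre-start) val=0 ;;   # detach host console before the VM takes the GPU
        post-stop) val=1 ;;   # hand the console back after the VM stops
        *)         return 0 ;;
    esac
    for f in "$dir"/vtcon*/bind; do
        [ -e "$f" ] && echo "$val" > "$f"
    done
    return 0
}

handle_phase "$2"
```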
 
1. Try booting into the previous working kernel and check whether it works.
Passthrough is kernel-related.

If it works, try to find a kernel version that works for you. Normally update to the latest (if it's broken, wait for the fix and revert to the previous working one).

2. I don't really understand your config: are you trying to blacklist the AMD driver on the host?
Currently your config doesn't seem to blacklist the driver. You can do that by removing the # in front of
blacklist amdgpu in the config,
and setting GRUB_CMDLINE_LINUX_DEFAULT="iommu=on iommu=pt initcall_blacklist=sysfb_init video=simplefb:off"

Otherwise the host will load the driver, and you need to early-bind the card to vfio-pci:
use softdep xxxx pre: vfio-pci
Then in lspci you should see vfio-pci as the driver in use for your AMD graphics card.
Thanks - I re-enabled that GRUB line and I re-enabled the amdgpu blacklisting. I also reverted back to the older kernel. No change.
 
I did not realize that this passthrough worked before for you. I use the same PVE and kernel version with a 6950XT and that works fine (with a Linux VM). I do have ROM-Bar enabled but no other blacklisting or vfio-pci binding or kernel parameters as I use the GPU for the host when the VM is not running. I do run echo 0 | tee /sys/class/vtconsole/vtcon*/bind before starting the VM to remove the host console from the amdgpu driver.
Interesting, it sounds pretty plug-and-play for you. I don't use the discrete GPU on the host so I've gone with blacklisting drivers.
I found that with the virtual console enabled, I got no output there (active console but just a black screen) with ROM-bar enabled, but good output with it disabled. I will try your way (no blacklisting/binding/kern params + removing the host console) tomorrow and see how it goes.
 
I did not realize that this passthrough worked before for you. I use the same PVE and kernel version with a 6950XT and that works fine (with a Linux VM). I do have ROM-Bar enabled but no other blacklisting or vfio-pci binding or kernel parameters as I use the GPU for the host when the VM is not running. I do run echo 0 | tee /sys/class/vtconsole/vtcon*/bind before starting the VM to remove the host console from the amdgpu driver.
Okay - I tried this and I got the same results: black screen/no signal.
 
Thanks - I re-enabled that GRUB line and I re-enabled the amdgpu blacklisting. I also reverted back to the older kernel. No change.
Did you revert the commented-out lines for the kernel modules? Currently it looks like you only have vendor-reset enabled; all of the others are commented out.

vfio_pci
vfio
vfio_iommu_type1
vfio_virqfd

Also, you have two blacklist files? Can you merge them into one, e.g. /etc/modprobe.d/pve-blacklist.conf, and put everything in one place:
blacklist amdgpu
blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist nvidiafb


Your current config is missing iommu=pt (and it says iommu=on; shouldn't that be amd_iommu=on on the old kernel?).

Do you have a backup of the old configs? If so, do a comparison. Reverting to the working version should work, and you can then tweak from there.
 
Did you revert the commented-out lines for the kernel modules? Currently it looks like you only have vendor-reset enabled; all of the others are commented out.

vfio_pci
vfio
vfio_iommu_type1
vfio_virqfd

Also, you have two blacklist files? Can you merge them into one, e.g. /etc/modprobe.d/pve-blacklist.conf, and put everything in one place:
blacklist amdgpu
blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist nvidiafb


Your current config is missing iommu=pt (and it says iommu=on; shouldn't that be amd_iommu=on on the old kernel?).

Do you have a backup of the old configs? If so, do a comparison. Reverting to the working version should work, and you can then tweak from there.

I'm booted into the older kernel (5.13.19-6), and I threw in every kernel parameter I could to try to suppress any framebuffer output from the host:

GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.dc=0 amd_iommu=on iommu=on iommu=pt initcall_blacklist=sysfb_init video=simplefb:off,efifb:off,vesafb:off nofb nomodeset"

I had disabled vendor-reset (apparently my card is not affected) and re-enabled the vfio, vfio_iommu_type1, vfio_pci, vfio_virqfd modules;
I found that I had an incorrect device ID listed in /etc/modprobe.d/vfio.conf, so I updated it and added all of the other GPU sub-functions as well:

#softdep amdgpu pre: vfio-pci
options vfio-pci ids=1002:73bf,1002:ab28,1002:73a6,1002:73a4 disable_vga=1

I consolidated the /etc/modprobe.d/blacklist.conf and /etc/modprobe.d/pve-blacklist.conf:

# This file contains a list of modules which are not supported by Proxmox VE
# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
blacklist amdgpu
blacklist radeon
#blacklist nouveau
#blacklist nvidia

Disabled the /etc/modprobe.d/kvm.conf options as I'm not sure they're necessary;
Enabled the iommu unsafe interrupt options;

ran update-grub and update-initramfs -u -k all; rebooted.

I do get text output as the system boots. First GRUB, obviously, and I would expect to see output only until the "loading initramfs..." message, but instead I get dmesg/console output showing systemd, etc. The output pauses as vfio_pci loads, and then 10-20s later it resumes showing PCI/IOMMU-related messages, then pauses again indefinitely. The "pauses" must be the host no longer outputting to the card, as I can't get a console/login to show up even if I hit Enter a few times on my keyboard or use Ctrl+Alt+Fx.

Aside from the console/dmesg output, I'm seeing everything I would expect to see in dmesg, lspci, etc. I toggled every combination of "Primary GPU" and "ROM-Bar" with the Ubuntu VM and nothing causes any new output to show up on that GPU. I can't help but feel that this is either the host outputting video to the GPU when it shouldn't (driver blacklisting isn't working?), or reset-related, e.g. the GPU should be resetting before the VM grabs it but isn't. I'm not familiar with what's supposed to happen there, though.
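For what it's worth, the "blacklisting isn't working?" question can be answered straight from sysfs. This is a sketch of my own (standard sysfs layout; the prefix parameter exists only so it can be dry-run without the hardware):

```shell
# Sketch: print the driver bound to each function of the card in slot
# 0b:00. All four lines should read vfio-pci if the blacklisting and
# early binding worked; amdgpu/snd_hda_intel/xhci_hcd would mean the
# host grabbed the device after all.
gpu_drivers() {
    prefix=${1:-/sys/bus/pci/devices/0000:0b:00.}
    for fn in 0 1 2 3; do
        dev="$prefix$fn"
        [ -e "$dev" ] || continue
        drv=none
        [ -L "$dev/driver" ] && drv=$(basename "$(readlink "$dev/driver")")
        printf '%s -> %s\n' "${dev##*/}" "$drv"
    done
}

gpu_drivers
```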
 
I found some interesting output in dmesg:

[ 35.551523] ------------[ cut here ]------------
[ 35.551814] i2c-designware-pci 0000:0b:00.3: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0x0 flags=0x0020]
[ 35.551846] WARNING: CPU: 0 PID: 3988 at kernel/irq/devres.c:143 devm_free_irq+0x6c/0x80
[ 35.552879] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 libcurve25519_generic libchacha libblake2s_generic cfg80211 veth ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw ipt_REJECT nf_reject_ipv4 xt_mark xt_physdev xt_addr
type xt_comment xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip_set_hash_net ip_set sctp ip6_udp_tunnel udp_tunnel scsi_transport_iscsi nf_tables nvme_fabrics 8021q garp mrp bonding tls softdog intel_rapl_msr intel_rapl_common edac_mce_amd snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kv
m_amd snd_hda_codec_hdmi kvm snd_hda_intel btusb snd_intel_dspcfg btrtl crct10dif_pclmul snd_intel_sdw_acpi btbcm ghash_clmulni_intel btintel snd_hda_codec aesni_intel snd_hda_core bluetooth snd_hwdep crypto_simd snd_pcm cryptd snd_timer ecdh_generic ecc snd ucsi_ccg rapl typec_ucsi ccp
[ 35.552915] soundcore wmi_bmof typec pcspkr joydev input_leds mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap nct6775 hwmon_vid binfmt_misc nfnetlink_log vfio_pci vfio_virqfd irqbypass nfnetlink vfio_iommu_type1 vfio drm efi_pstore sunrpc dmi_sysfs ip_tables x_tables autofs4
btrfs blake2b_generic xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbkbd usbhid hid nvme mpt3sas xhci_pci xhci_pci_renesas crc32_pclmul i2c_piix4 raid_class igb ahci i2c_algo_bit i2c_designware_pci libahci xhci_hcd dca nvme_core scsi_transport_sas wmi video
[ 35.557754] CPU: 0 PID: 3988 Comm: fix_gpu_pass.sh Tainted: P O 5.13.19-6-pve #1
[ 35.558064] Hardware name: To Be Filled By O.E.M. X570M Pro4/X570M Pro4, BIOS P5.63 08/22/2024
[ 35.558372] RIP: 0010:devm_free_irq+0x6c/0x80
[ 35.558676] Code: 69 00 85 c0 75 24 4c 89 ee 44 89 e7 e8 1d d5 ff ff 48 8b 45 e8 65 48 2b 04 25 28 00 00 00 75 0e 48 83 c4 18 41 5c 41 5d 5d c3 <0f> 0b eb d8 e8 9b 1d af 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
[ 35.559589] RSP: 0018:ffff952d4790fc00 EFLAGS: 00010282
[ 35.559890] RAX: 00000000fffffffe RBX: ffff894601b1b000 RCX: ffff894601b1b368
[ 35.560181] RDX: 0000000000000001 RSI: 0000000000000286 RDI: ffff894601b1b364
[ 35.560484] RBP: ffff952d4790fc28 R08: ffff8946121f70d8 R09: 0000000000000286
[ 35.560789] R10: 0000000000000000 R11: ffff8946080654b0 R12: 000000000000002c
[ 35.561091] R13: ffff8946121f7018 R14: ffff894601b1b0c8 R15: ffff8946bad23500
[ 35.561393] FS: 00007fca0ac7d740(0000) GS:ffff8954dda00000(0000) knlGS:0000000000000000
[ 35.561697] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 35.562003] CR2: 00005585f1469628 CR3: 000000010a99c000 CR4: 0000000000750ef0
[ 35.562312] PKRU: 55555554
[ 35.562615] Call Trace:
[ 35.562910] <TASK>
[ 35.563199] i2c_dw_pci_remove+0x5f/0x70 [i2c_designware_pci]
[ 35.563490] pci_device_remove+0x3e/0xb0
[ 35.563773] __device_release_driver+0x181/0x240
[ 35.564052] device_release_driver+0x29/0x40
[ 35.564323] pci_stop_bus_device+0x79/0xa0
[ 35.564588] pci_stop_bus_device+0x30/0xa0
[ 35.564850] pci_stop_and_remove_bus_device_locked+0x1b/0x30
[ 35.565103] remove_store+0x7b/0x90
[ 35.565349] dev_attr_store+0x17/0x30
[ 35.565590] sysfs_kf_write+0x3f/0x50
[ 35.565824] kernfs_fop_write_iter+0x13b/0x1d0
[ 35.566052] new_sync_write+0x114/0x1a0
[ 35.566274] vfs_write+0x1c5/0x260
[ 35.566490] ksys_write+0x67/0xe0
[ 35.566698] __x64_sys_write+0x1a/0x20
[ 35.566899] do_syscall_64+0x61/0xb0
[ 35.567095] ? irqentry_exit_to_user_mode+0x9/0x20
[ 35.567286] ? irqentry_exit+0x19/0x30
[ 35.567469] ? exc_page_fault+0x8f/0x170
[ 35.567646] ? asm_exc_page_fault+0x8/0x30
[ 35.567820] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 35.567986] RIP: 0033:0x7fca0ad78300
[ 35.568151] Code: 40 00 48 8b 15 01 9b 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d e1 22 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
[ 35.568649] RSP: 002b:00007ffea6f912c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 35.568825] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fca0ad78300
[ 35.569000] RDX: 0000000000000002 RSI: 00005585f1468620 RDI: 0000000000000001
[ 35.569174] RBP: 00005585f1468620 R08: 00007fca0ae52c68 R09: 0000000000000073
[ 35.569348] R10: 0000000000001000 R11: 0000000000000202 R12: 0000000000000002
[ 35.569521] R13: 00007fca0ae53760 R14: 0000000000000002 R15: 00007fca0ae4e9e0
[ 35.569689] </TASK>
[ 35.569853] ---[ end trace 228ff6ddfed6c80f ]---
[ 35.570017] ------------[ cut here ]------------
[ 35.570177] Trying to free already-free IRQ 44
[ 35.570340] WARNING: CPU: 0 PID: 3988 at kernel/irq/manage.c:1832 free_irq+0x1fd/0x370
[ 35.570514] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 libcurve25519_generic libchacha libblake2s_generic cfg80211 veth ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw ipt_REJECT nf_reject_ipv4 xt_mark xt_physdev xt_addr
type xt_comment xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip_set_hash_net ip_set sctp ip6_udp_tunnel udp_tunnel scsi_transport_iscsi nf_tables nvme_fabrics 8021q garp mrp bonding tls softdog intel_rapl_msr intel_rapl_common edac_mce_amd snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kv
m_amd snd_hda_codec_hdmi kvm snd_hda_intel btusb snd_intel_dspcfg btrtl crct10dif_pclmul snd_intel_sdw_acpi btbcm ghash_clmulni_intel btintel snd_hda_codec aesni_intel snd_hda_core bluetooth snd_hwdep crypto_simd snd_pcm cryptd snd_timer ecdh_generic ecc snd ucsi_ccg rapl typec_ucsi ccp
[ 35.570529] soundcore wmi_bmof typec pcspkr joydev input_leds mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap nct6775 hwmon_vid binfmt_misc nfnetlink_log vfio_pci vfio_virqfd irqbypass nfnetlink vfio_iommu_type1 vfio drm efi_pstore sunrpc dmi_sysfs ip_tables x_tables autofs4
btrfs blake2b_generic xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbkbd usbhid hid nvme mpt3sas xhci_pci xhci_pci_renesas crc32_pclmul i2c_piix4 raid_class igb ahci i2c_algo_bit i2c_designware_pci libahci xhci_hcd dca nvme_core scsi_transport_sas wmi video
[ 35.574600] CPU: 0 PID: 3988 Comm: fix_gpu_pass.sh Tainted: P W O 5.13.19-6-pve #1
[ 35.574886] Hardware name: To Be Filled By O.E.M. X570M Pro4/X570M Pro4, BIOS P5.63 08/22/2024
[ 35.575176] RIP: 0010:free_irq+0x1fd/0x370
[ 35.575466] Code: e8 c8 bf 1c 00 48 83 c4 10 4c 89 f8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 8b 75 d0 48 c7 c7 c8 24 fb a6 4c 89 4d c8 e8 2b 8c a9 00 <0f> 0b 4c 8b 4d c8 4c 89 f7 4c 89 ce e8 62 34 b0 00 49 8b 47 40 48
[ 35.576352] RSP: 0018:ffff952d4790fbb8 EFLAGS: 00010086
[ 35.576646] RAX: 0000000000000000 RBX: ffff8946121f7018 RCX: ffff8954dda209c8
[ 35.576939] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8954dda209c0
[ 35.577230] RBP: ffff952d4790fbf0 R08: 0000000000000000 R09: ffff952d4790f998
[ 35.577519] R10: ffff952d4790f990 R11: ffffffffa7755428 R12: 000000000000002c
[ 35.577811] R13: ffff894611dff160 R14: ffff894611dff0a4 R15: ffff894611dff000
[ 35.578103] FS: 00007fca0ac7d740(0000) GS:ffff8954dda00000(0000) knlGS:0000000000000000
[ 35.578400] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 35.578695] CR2: 00005585f1469628 CR3: 000000010a99c000 CR4: 0000000000750ef0
[ 35.578993] PKRU: 55555554
[ 35.579290] Call Trace:
[ 35.579584] <TASK>
[ 35.579872] devm_free_irq+0x53/0x80
[ 35.580158] i2c_dw_pci_remove+0x5f/0x70 [i2c_designware_pci]
[ 35.580442] pci_device_remove+0x3e/0xb0
[ 35.580717] __device_release_driver+0x181/0x240
[ 35.580988] device_release_driver+0x29/0x40
[ 35.581253] pci_stop_bus_device+0x79/0xa0
[ 35.581512] pci_stop_bus_device+0x30/0xa0
[ 35.581763] pci_stop_and_remove_bus_device_locked+0x1b/0x30
[ 35.582009] remove_store+0x7b/0x90
[ 35.582249] dev_attr_store+0x17/0x30
[ 35.582482] sysfs_kf_write+0x3f/0x50
[ 35.582707] kernfs_fop_write_iter+0x13b/0x1d0
[ 35.582929] new_sync_write+0x114/0x1a0
[ 35.583144] vfs_write+0x1c5/0x260
[ 35.583351] ksys_write+0x67/0xe0
[ 35.583552] __x64_sys_write+0x1a/0x20
[ 35.583747] do_syscall_64+0x61/0xb0
[ 35.583935] ? irqentry_exit_to_user_mode+0x9/0x20
[ 35.584119] ? irqentry_exit+0x19/0x30
[ 35.584297] ? exc_page_fault+0x8f/0x170
[ 35.584465] ? asm_exc_page_fault+0x8/0x30
[ 35.584632] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 35.584797] RIP: 0033:0x7fca0ad78300
[ 35.584954] Code: 40 00 48 8b 15 01 9b 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d e1 22 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
[ 35.585439] RSP: 002b:00007ffea6f912c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 35.585611] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fca0ad78300
[ 35.585784] RDX: 0000000000000002 RSI: 00005585f1468620 RDI: 0000000000000001
[ 35.585958] RBP: 00005585f1468620 R08: 00007fca0ae52c68 R09: 0000000000000073
[ 35.586131] R10: 0000000000001000 R11: 0000000000000202 R12: 0000000000000002
[ 35.586299] R13: 00007fca0ae53760 R14: 0000000000000002 R15: 00007fca0ae4e9e0
[ 35.586464] </TASK>
[ 35.586620] ---[ end trace 228ff6ddfed6c810 ]---
[ 35.586868] xhci_hcd 0000:0b:00.2: remove, state 4
[ 35.587034] usb usb6: USB disconnect, device number 1
[ 35.587281] xhci_hcd 0000:0b:00.2: USB bus 6 deregistered
[ 35.587454] xhci_hcd 0000:0b:00.2: remove, state 1
[ 35.587620] usb usb5: USB disconnect, device number 1
[ 35.587786] usb 5-2: USB disconnect, device number 2
[ 35.799454] xhci_hcd 0000:0b:00.2: USB bus 5 deregistered
[ 35.816887] vfio-pci 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none
[ 35.836902] pci 0000:0b:00.0: Removing from iommu group 22
[ 35.837254] pci 0000:0b:00.1: Removing from iommu group 23
[ 35.837650] pci 0000:0b:00.2: Removing from iommu group 24
[ 35.838059] pci 0000:0b:00.3: Removing from iommu group 25
[ 35.838346] pci_bus 0000:0b: busn_res: [bus 0b] is released
[ 35.838689] pci 0000:0a:00.0: Removing from iommu group 21
[ 38.880855] pci 0000:0a:00.0: [1002:1479] type 01 class 0x060400
[ 38.881237] pci 0000:0a:00.0: PME# supported from D0 D3hot D3cold
[ 38.881618] pci 0000:0a:00.0: Adding to iommu group 21
[ 38.882112] pci 0000:0b:00.0: [1002:73bf] type 00 class 0x030000
[ 38.882324] pci 0000:0b:00.0: reg 0x10: [mem 0x7800000000-0x7bffffffff 64bit pref]
[ 38.882525] pci 0000:0b:00.0: reg 0x18: [mem 0x7c00000000-0x7c0fffffff 64bit pref]
[ 38.882721] pci 0000:0b:00.0: reg 0x20: [io 0xf000-0xf0ff]
[ 38.882915] pci 0000:0b:00.0: reg 0x24: [mem 0xfcb00000-0xfcbfffff]
[ 38.883109] pci 0000:0b:00.0: reg 0x30: [mem 0xfcc00000-0xfcc1ffff pref]
[ 38.883363] pci 0000:0b:00.0: PME# supported from D1 D2 D3hot D3cold
[ 38.883608] pci 0000:0b:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x16 link at 0000:00:01.3 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 38.883986] pci 0000:0b:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 38.884193] pci 0000:0b:00.0: Adding to iommu group 22
[ 38.884495] pci 0000:0b:00.1: [1002:ab28] type 00 class 0x040300
[ 38.884678] pci 0000:0b:00.1: reg 0x10: [mem 0xfcc24000-0xfcc27fff]
[ 38.884941] pci 0000:0b:00.1: PME# supported from D1 D2 D3hot D3cold
[ 38.885192] pci 0000:0b:00.1: Adding to iommu group 23
[ 38.885487] pci 0000:0b:00.2: [1002:73a6] type 00 class 0x0c0330
[ 38.885671] pci 0000:0b:00.2: reg 0x10: [mem 0xfca00000-0xfcafffff 64bit]
[ 38.885928] pci 0000:0b:00.2: PME# supported from D0 D3hot D3cold
[ 38.886173] pci 0000:0b:00.2: Adding to iommu group 24
[ 38.886463] pci 0000:0b:00.3: [1002:73a4] type 00 class 0x0c8000
[ 38.886648] pci 0000:0b:00.3: reg 0x10: [mem 0xfcc20000-0xfcc23fff 64bit]
[ 38.886899] pci 0000:0b:00.3: PME# supported from D0 D3hot
[ 38.887137] pci 0000:0b:00.3: Adding to iommu group 25
[ 38.887450] pci 0000:0a:00.0: PCI bridge to [bus 0b]
[ 38.887624] pci 0000:0a:00.0: bridge window [io 0xf000-0xffff]
[ 38.887795] pci 0000:0a:00.0: bridge window [mem 0xfca00000-0xfccfffff]
[ 38.887963] pci 0000:0a:00.0: bridge window [mem 0x7800000000-0x7c0fffffff 64bit pref]
[ 38.888172] pci 0000:0a:00.0: BAR 15: no space for [mem size 0x600000000 64bit pref]
[ 38.888330] pci 0000:0a:00.0: BAR 15: failed to assign [mem size 0x600000000 64bit pref]
[ 38.888497] pci 0000:0a:00.0: BAR 14: assigned [mem 0xfca00000-0xfccfffff]
[ 38.888667] pci 0000:0a:00.0: BAR 13: assigned [io 0xf000-0xffff]
[ 38.888843] pci 0000:0b:00.0: BAR 0: no space for [mem size 0x400000000 64bit pref]
[ 38.889039] pci 0000:0b:00.0: BAR 0: failed to assign [mem size 0x400000000 64bit pref]
[ 38.889213] pci 0000:0b:00.0: BAR 2: no space for [mem size 0x10000000 64bit pref]
[ 38.889386] pci 0000:0b:00.0: BAR 2: failed to assign [mem size 0x10000000 64bit pref]
[ 38.889561] pci 0000:0b:00.0: BAR 5: assigned [mem 0xfca00000-0xfcafffff]
[ 38.889737] pci 0000:0b:00.2: BAR 0: assigned [mem 0xfcb00000-0xfcbfffff 64bit]
[ 38.889916] pci 0000:0b:00.0: BAR 6: assigned [mem 0xfcc00000-0xfcc1ffff pref]
[ 38.890087] pci 0000:0b:00.1: BAR 0: assigned [mem 0xfcc20000-0xfcc23fff]
[ 38.890258] pci 0000:0b:00.3: BAR 0: assigned [mem 0xfcc24000-0xfcc27fff 64bit]
[ 38.890432] pci 0000:0b:00.0: BAR 4: assigned [io 0xf000-0xf0ff]
[ 38.890602] pci 0000:0a:00.0: PCI bridge to [bus 0b]
[ 38.890771] pci 0000:0a:00.0: bridge window [io 0xf000-0xffff]
[ 38.890943] pci 0000:0a:00.0: bridge window [mem 0xfca00000-0xfccfffff]

NOTE: at this point, the host's dmesg output to the GPU "pauses", but over SSH the dmesg output continues:

[ 38.891311] vfio-pci 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 38.908799] pci 0000:0b:00.1: D0 power state depends on 0000:0b:00.0
[ 38.928797] pci 0000:0b:00.2: D0 power state depends on 0000:0b:00.0
[ 38.929524] xhci_hcd 0000:0b:00.2: xHCI Host Controller
[ 38.929769] xhci_hcd 0000:0b:00.2: new USB bus registered, assigned bus number 5
[ 38.930105] xhci_hcd 0000:0b:00.2: hcc params 0x0260ffe5 hci version 0x110 quirks 0x0000000000000010
[ 38.930748] usb usb5: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 5.13
[ 38.930947] usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 38.931140] usb usb5: Product: xHCI Host Controller
[ 38.931316] usb usb5: Manufacturer: Linux 5.13.19-6-pve xhci-hcd
[ 38.931494] usb usb5: SerialNumber: 0000:0b:00.2
[ 38.931726] hub 5-0:1.0: USB hub found
[ 38.931906] hub 5-0:1.0: 2 ports detected
[ 38.932142] xhci_hcd 0000:0b:00.2: xHCI Host Controller
[ 38.932320] xhci_hcd 0000:0b:00.2: new USB bus registered, assigned bus number 6
[ 38.932497] xhci_hcd 0000:0b:00.2: Host supports USB 3.1 Enhanced SuperSpeed
[ 38.932686] usb usb6: We don't know the algorithms for LPM for this host, disabling LPM.
[ 38.932896] usb usb6: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.13
[ 38.933084] usb usb6: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 38.933267] usb usb6: Product: xHCI Host Controller
[ 38.933452] usb usb6: Manufacturer: Linux 5.13.19-6-pve xhci-hcd
[ 38.933641] usb usb6: SerialNumber: 0000:0b:00.2
[ 38.933870] hub 6-0:1.0: USB hub found
[ 38.934059] hub 6-0:1.0: 1 port detected
[ 38.934282] pci 0000:0b:00.3: D0 power state depends on 0000:0b:00.0
[ 38.934674] ucsi_ccg 0-0008: failed to get FW build information
...
 
Uhhh .. okay interesting.

I went into my BIOS and checked the Resizable BAR and Above 4G Decoding settings - they were enabled, where I had thought they were disabled.
After some reading, I decided to leave them on anyway.

Then, after booting, I thought I'd try removing and re-scanning the devices:

echo 1 | sudo tee /sys/bus/pci/devices/0000\:0b\:00.{0,1,2,3}/remove
echo 1 | sudo tee /sys/bus/pci/rescan

Interestingly, this gets me video output on both my W11 and Ubuntu VMs - all options enabled (primary GPU, ROM-bar, all functions, pcie) - but only up to the point where the Proxmox boot logo should disappear and the VM guest OS should take over. I can RDP into the W11 machine after it's booted, but there's still no video output from the guest once booted.
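Since I keep repeating it, here's the remove-and-rescan dance wrapped up as a small sketch. `cycle_gpu` is just a made-up helper name, the 0000:0b:00 slot is this machine's, and the sysfs root is a parameter only so the logic can be dry-run against a scratch directory instead of the real /sys:

```shell
#!/bin/sh
# cycle_gpu: detach every function of a PCI slot from its driver by
# removing the devices, then force a bus rescan so they re-enumerate.
# Run as root with the defaults on a real host.
cycle_gpu() {
    SYSFS="${1:-/sys/bus/pci}"   # sysfs root (overridable for dry runs)
    SLOT="${2:-0000:0b:00}"      # PCI slot of the GPU
    for fn in 0 1 2 3; do
        dev="$SYSFS/devices/$SLOT.$fn"
        # writing 1 to 'remove' unbinds the driver and deletes the device
        [ -d "$dev" ] && echo 1 > "$dev/remove"
    done
    # writing 1 to 'rescan' re-enumerates the bus and re-creates the nodes
    echo 1 > "$SYSFS/rescan"
}
# on the real host: cycle_gpu
```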
 
Booted into the older kernel (5.13.19-6), I threw in every kernel parameter I could to try and suppress any framebuffer output from the host:

GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.dc=0 amd_iommu=on iommu=on iommu=pt initcall_blacklist=sysfb_init video=simplefb:off,efifb:off,vesafb:off nofb nomodeset"

I had disabled vendor-reset (apparently my card is not affected) and re-enabled the vfio, vfio_iommu_type1, vfio_pci, vfio_virqfd modules;
I found that I had a bad hardware address listed in /etc/modprobe.d/vfio.conf, so I updated that and added all of the other GPU sub-devices as well:

#softdep amdgpu pre: vfio-pci
options vfio-pci ids=1002:73bf,1002:ab28,1002:73a6,1002:73a4 disable_vga=1
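As a sanity check that the ids= list actually matches the card, something like this works - a sketch, where `extract_ids` is just a helper name and 0b:00 is this machine's slot:

```shell
#!/bin/sh
# Print every [vendor:device] pair lspci reports for the GPU slot;
# each pair printed should appear in the ids= line of vfio.conf.
extract_ids() {
    grep -o '\[[0-9a-f]\{4\}:[0-9a-f]\{4\}\]' | tr -d '[]'
}
if command -v lspci >/dev/null 2>&1; then
    lspci -nn -s 0b:00 | extract_ids || true
fi
```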

I consolidated the /etc/modprobe.d/blacklist.conf and /etc/modprobe.d/pve-blacklist.conf:

# This file contains a list of modules which are not supported by Proxmox VE
# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
blacklist amdgpu
blacklist radeon
#blacklist nouveau
#blacklist nvidia

Disabled the /etc/modprobe.d/kvm.conf options as I'm not sure they're necessary;
Enabled the iommu unsafe interrupt options;

ran update-grub and update-initramfs -u -k all; rebooted.
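After the reboot, it's worth confirming what actually grabbed each function (expecting vfio-pci on all four). A sketch - `bound_driver` is a made-up helper, the addresses are this system's, and the sysfs root is parameterized only so the logic can be tried against a fake tree:

```shell
#!/bin/sh
# bound_driver: print the kernel driver currently bound to a PCI
# device, or "none" if nothing is bound.
bound_driver() {
    link="${2:-/sys/bus/pci}/devices/$1/driver"  # symlink into .../drivers/<name>
    if [ -e "$link" ]; then
        basename "$(readlink -f "$link")"
    else
        echo none
    fi
}
# expect "vfio-pci" on every function when early binding works:
for fn in 0 1 2 3; do
    echo "0000:0b:00.$fn -> $(bound_driver "0000:0b:00.$fn")"
done
```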

I do get text output as the system boots. First GRUB, obviously, and I would expect to see output only until the "loading initramfs..." message, but instead I get dmesg/console output showing systemd, etc. The output pauses as vfio_pci loads, then 10-20s later it resumes showing PCI iommu-related messages, then pauses again indefinitely. The "pauses" must be the host not outputting to the card, as I can't get a console/login to show up even if I hit Enter a few times on my keyboard or use Ctrl+Alt+Fx.

Aside from the console/dmesg output, I'm seeing everything I would expect to see in dmesg, lspci, etc. I toggled every combination of "primary GPU" and "ROM-bar" with the Ubuntu VM and nothing causes any new output to show up on that GPU. I can't help but feel like this is either the host outputting video to the GPU when it shouldn't (driver blacklisting isn't working?), or reset-related - e.g. the GPU should be resetting before the VM grabs it but isn't - but I'm not familiar with what's supposed to happen there.

1. I noticed your GPU has 4 things in the group. Make sure IOMMU is 'Enabled' (not AUTO) in the BIOS.

0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c1)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
0b:00.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73a6
0b:00.3 Serial bus controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 USB

With multiple devices in the group, it will be hard to pass through. This doesn't look correct. Can you check whether you can put the card into the primary GPU slot? Your motherboard seems to have non-isolated IOMMU groups.

2. your command line should be:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nomodeset amd_iommu=on iommu=pt initcall_blacklist=sysfb_init"

3. If you have made the above changes correctly and you still can't fix the IOMMU group (4 devices in the group), you can still pass through with a passed-in ROM file, I believe. (Just like an AMD iGPU: a very messy GPU hanging under a big group of other devices.)

options vfio-pci ids=1002:73bf,1002:ab28 disable_vga=1

Given the first two are the GPU + audio, don't put the rest in. Follow this guide
https://github.com/isc30/ryzen-gpu-passthrough-proxmox
to do the passthrough just like an iGPU (with messy grouping).
 
1. I noticed your GPU has 4 things in the group. Make sure IOMMU is 'Enabled' (not AUTO) in the BIOS.

0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c1)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
0b:00.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73a6
0b:00.3 Serial bus controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 USB

With multiple devices in the group, it will be hard to pass through. This doesn't look correct. Can you check whether you can put the card into the primary GPU slot? Your motherboard seems to have non-isolated IOMMU groups.
This is normal and all four function are part of the 6800/6800XT/6900XT device.
2. your command line should be:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nomodeset amd_iommu=on iommu=pt initcall_blacklist=sysfb_init"
amd_iommu=on is never needed (and is invalid, thus ignored). nomodeset and initcall_blacklist=sysfb_init are not needed when the host does not boot using the GPU (as the OP is using the iGPU), and early binding to vfio-pci for all 4 functions is probably better. iommu=pt only does something for non-passed-through devices and probably does not have a performance impact on this host. Without quiet, you'll see more information when booting the host, which might be useful for troubleshooting. I'm a little sorry for finding fault with all your suggestions, but I do genuinely think they are not needed in this case.
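For reference, early binding for this card would look something like the below - a sketch reusing the IDs already posted in this thread, not a verified config:

```
# /etc/modprobe.d/vfio.conf - claim all four functions before amdgpu loads
softdep amdgpu pre: vfio-pci
options vfio-pci ids=1002:73bf,1002:ab28,1002:73a6,1002:73a4
```

followed by update-initramfs -u -k all and a reboot.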

Uhhh .. okay interesting.

I went into my BIOS and checked the Resizable BAR and Above 4G Decoding settings - they were enabled, where I had thought they were disabled.
After some reading, I decided to leave them on anyway.
I have them on and I actually resize both BARs after removing any driver from the VGA-function and before starting a VM with it. I did not test the performance improvement of this.
Then, after booting, I thought I'd try removing and re-scanning the devices:

echo 1 | sudo tee /sys/bus/pci/devices/0000\:0b\:00.{0,1,2,3}/remove
echo 1 | sudo tee /sys/bus/pci/rescan

Interestingly, this gets me video output on both my W11 and Ubuntu VMs - all options enabled (primary GPU, ROM-bar, all functions, pcie) - but only up to the point where the Proxmox boot logo should disappear and the VM guest OS should take over. I can RDP into the W11 machine after it's booted, but there's still no video output from the guest once booted.
Primary GPU is a work-around for NVidia GPUs and probably not needed. Kicking the GPU function from the bus and rescanning is similar to a bus-reset, which is what I expect Proxmox/VFIO to use for your device anyway. Which does not match observations apparently, strange...
 
This is normal and all four function are part of the 6800/6800XT/6900XT device.

amd_iommu=on is never needed (and is invalid, thus ignored). nomodeset and initcall_blacklist=sysfb_init are not needed when the host does not boot using the GPU (as the OP is using the iGPU), and early binding to vfio-pci for all 4 functions is probably better. iommu=pt only does something for non-passed-through devices and probably does not have a performance impact on this host. Without quiet, you'll see more information when booting the host, which might be useful for troubleshooting. I'm a little sorry for finding fault with all your suggestions, but I do genuinely think they are not needed in this case.
Thanks for pointing that out. I didn't know the 6800/6800XT/6900XT has more functions. I've only used an AMD 6600, which has two, and NVIDIA GPUs all have two.

Yeah, you are right - I started the build with a CPU with no iGPU for display, and the GPU is used to display host info briefly until it passes to the VM after the initial boot sequence. I left amd_iommu=on there because it was copied from a working Intel version. And yes, it was recommended at the time, along with iommu=pt. (Of course, I didn't verify each setting against the PCI passthrough guide.)
I've used the same config for passthrough across a few platform upgrades (from the old Intel days to all-AMD now). Some of the configs might need to be updated.
Thanks for correcting those settings.
 
This is normal and all four function are part of the 6800/6800XT/6900XT device.

amd_iommu=on is never needed (and is invalid, thus ignored). nomodeset and initcall_blacklist=sysfb_init are not needed when the host does not boot using the GPU (as the OP is using the iGPU), and early binding to vfio-pci for all 4 functions is probably better. iommu=pt only does something for non-passed-through devices and probably does not have a performance impact on this host. Without quiet, you'll see more information when booting the host, which might be useful for troubleshooting. I'm a little sorry for finding fault with all your suggestions, but I do genuinely think they are not needed in this case.


I have them on and I actually resize both BARs after removing any driver from the VGA-function and before starting a VM with it. I did not test the performance improvement of this.

Primary GPU is a work-around for NVidia GPUs and probably not needed. Kicking the GPU function from the bus and rescanning is similar to a bus-reset, which is what I expect Proxmox/VFIO to use for your device anyway. Which does not match observations apparently, strange...
Okay, I cleared out all of my kernel parameters.

On the newer kernel, I get this message when starting a vm from the command line:
error writing '1' to '/sys/bus/pci/devices/0000:0b:00.0/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:0b:00.0', but trying to continue as not all devices need a reset

and
$ cat /sys/bus/pci/devices/0000\:0b\:00.0/reset_method
bus

So it seems like it's trying to do a bus reset but failing. I also notice that if I run dmesg -wH in a tmux pane while starting the VM, when it fails to do the reset I don't get any video output. When I manually do the reset and then start the VM, I get the Proxmox boot logo and then just black. There's still output on the GPU - my monitor is on, and I never get the "no signal" message, so I can tell it's active, just not actually showing anything. This is consistent between VMs and kernel versions (5.13 / 6.8).
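Since the sysfs reset fails, I've been doing the fallback by hand before starting the VM. Sketched out - `reset_or_cycle` is a hypothetical helper name, and the sysfs root is parameterized only for dry runs; run as root against the real /sys:

```shell
#!/bin/sh
# reset_or_cycle: attempt the sysfs function reset first; when the
# write fails (e.g. "Inappropriate ioctl for device"), fall back to
# removing the device and rescanning the bus.
reset_or_cycle() {
    SYSFS="${2:-/sys/bus/pci}"
    dev="$SYSFS/devices/$1"
    if { echo 1 > "$dev/reset"; } 2>/dev/null; then
        echo "reset ok"
    else
        echo 1 > "$dev/remove"   # unbind driver and drop the device
        echo 1 > "$SYSFS/rescan" # re-enumerate; device comes back fresh
        echo "removed and rescanned"
    fi
}
# on the real host: reset_or_cycle 0000:0b:00.0
```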

The reset issue feels like a passthrough or host BIOS problem, no? Even so, some output from the VM during its boot cycle but not after POST (or whatever) is odd.

Maybe it's not relevant but when this used to work, I would see no video output on the GPU after "starting initial ramdisk" on host boot.

Anyway, I decided to try using softdep and disabling driver blacklisting:
/etc/modprobe.d/vfio.conf:
softdep amdgpu pre: vfio-pci
#options vfio-pci ids=1002:73bf,1002:ab28,1002:73a6,1002:73a4 disable_vga=1

/etc/modprobe.d/pve-blacklist.conf:
Code:
# This file contains a list of modules which are not supported by Proxmox VE

# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb

#blacklist amdgpu
#blacklist radeon
#blacklist nouveau
#blacklist nvidia
#
#blacklist ucsi_ccg

and, as expected, I got proper console output, including the Proxmox login message, at a high resolution on boot. Then it stopped outputting to the GPU entirely and my monitor gave me the 'signal lost' message (great!). Same behaviour when starting VMs though - I have to manually reset the device before getting any output, and even then it's just blank after the Proxmox logo.

EDIT:
$ sudo journalctl -p err -f
Code:
May 09 12:15:18 proxmox pmxcfs[1599]: [status] crit: can't initialize service
May 09 12:15:40 proxmox smartd[1134]: Device: /dev/nvme0, number of Error Log entries increased from 359 to 363
May 09 12:15:40 proxmox kernel: i2c-designware-pci 0000:0b:00.3: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0x0 flags=0x0020]
May 09 12:15:44 proxmox kernel: [drm:amdgpu_device_resize_fb_bar [amdgpu]] *ERROR* Problem resizing BAR0 (-16).
May 09 12:15:44 proxmox kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v10_0> failed -19
May 09 12:15:44 proxmox kernel: amdgpu 0000:0b:00.0: amdgpu: amdgpu_device_ip_init failed
May 09 12:15:44 proxmox kernel: amdgpu 0000:0b:00.0: amdgpu: Fatal error during GPU init
May 09 12:15:44 proxmox kernel: ucsi_ccg 0-0008: failed to get FW build information
May 09 12:17:19 proxmox kernel: ucsi_ccg 0-0008: failed to get FW build information
May 09 12:18:05 proxmox kernel: ucsi_ccg 0-0008: failed to get FW build information
 
Maybe it's not relevant but when this used to work, I would see no video output on the GPU after "starting initial ramdisk" on host boot.
This is a sign of the driver being blacklisted on the host.
Based on this description, you're more likely to get driver blacklisting on the host working than early binding. Blacklisting is more robust than early binding because it depends less on hardware compatibility.
 
This is a sign of the driver being blacklisted on the host.
Based on this description, you're more likely to get driver blacklisting on the host working than early binding. Blacklisting is more robust than early binding because it depends less on hardware compatibility.
Hmm. And yet, it didn't seem to be working - I was getting console/dmesg output on the GPU even with drivers blacklisted.
 
I was getting console/dmesg output on the GPU even with drivers blacklisted.
Were the letters a bit bigger (or the resolution lower)? That's probably the Linux console driver using the UEFI/BIOS display-mode set by the bootloader. It can show you scrolling text without a GPU specific driver.
 
Were the letters a bit bigger (or the resolution lower)? That's probably the Linux console driver using the UEFI/BIOS display-mode set by the bootloader. It can show you scrolling text without a GPU specific driver.
Ah okay yes, that explains it. Low resolution.