[HELP] GPU Pass-through Broken After 7.2.4 Update

May 21, 2022
Hello everyone,

I have been using GPU pass-through without issue for months, ever since I first set up this system. I use it for my 3D modeling work since my personal laptop can't really handle it. Also gaming. ;)

I just did my first update since installing a few months ago, and my GPU pass-through is no longer working. The card still seems to be passed through, but Windows is now showing the dreaded Code 43 error, and I have no idea where to even begin figuring out what changed.

Here is what I checked so far:
  • I went back through all of my config changes (I noted everything I did at the time), and they all still seem to be in place.
  • I also ran update-grub and update-initramfs -u again just in case.
  • I removed and re-added the PCI device.
  • Since there had been some Windows updates, I spun up a fresh VM from my known-working template and verified it wasn't a Windows change that caused this.
  • I tried different Display settings to see if that made any difference compared to what I had before.

I am honestly at a loss and have no idea where to go from here. I know the updates included a new kernel, and I'm not sure whether this kernel behaves differently or requires something additional for pass-through. If anyone could give me a hand or set me in the right direction, I'd really appreciate it.

Thanks,

Nate
 
First thing you can try is pinning the old kernel to see if that resolves the issue: proxmox-boot-tool kernel pin 5.13.19-6-pve
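If you're not sure which kernels are installed or whether anything is already pinned, proxmox-boot-tool can tell you. These are standard subcommands, but double-check against your Proxmox version:

```shell
proxmox-boot-tool status        # whether the host boots via proxmox-boot-tool (UEFI/ESP) or legacy GRUB
proxmox-boot-tool kernel list   # installed kernels, plus any manual pin
proxmox-boot-tool kernel pin 5.13.19-6-pve
# and later, to undo the pin:
proxmox-boot-tool kernel unpin
```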

Can you dump a few things:
  • /etc/default/grub
  • /etc/modules
  • /etc/modprobe.d/blacklist.conf
  • /etc/modprobe.d/vfio.conf
  • the configuration file for your VM
  • the output of the script below
Code:
#!/bin/bash
for d in $(find /sys/kernel/iommu_groups/ -type l | sort -n -k5 -t/); do
    n=${d#*/iommu_groups/*}; n=${n%%/*}
    printf 'IOMMU Group %s ' "$n"
    lspci -nns "${d##*/}"
done;
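As an aside, the parameter expansions in that loop just pull the group number and the PCI address out of each symlink path; here is the string handling in isolation, using the group your GPU ends up in:

```shell
# Example path as found under /sys/kernel/iommu_groups/
d=/sys/kernel/iommu_groups/16/devices/0000:06:00.0
# Strip everything up to and including "/iommu_groups/", then cut at the next slash
n=${d#*/iommu_groups/*}; n=${n%%/*}
# Keep only the last path component (the PCI address)
dev=${d##*/}
echo "$n $dev"   # prints: 16 0000:06:00.0
```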
 
First off, thank you very much for taking the time to help me with this!

/etc/default/grub:
Bash:
root@proxmox:~# cat /etc/default/grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
# GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on video=vesafb:off video=efifb:off"
GRUB_CMDLINE_LINUX=""

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

/etc/modules:
Bash:
root@proxmox:~# cat /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

/etc/modprobe.d/blacklist.conf:
Bash:
root@proxmox:~# cat /etc/modprobe.d/blacklist.conf
blacklist radeon
blacklist nouveau
blacklist nvidia

/etc/modprobe.d/vfio.conf:
Bash:
root@proxmox:~# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1f03,10de:10f9 disable_vga=1

VM CONF Example:
Bash:
root@proxmox:/etc/pve/qemu-server# cat 100.conf
agent: 1
audio0: device=ich9-intel-hda,driver=none
bios: ovmf
boot: order=scsi0;net0
cores: 4
cpu: host
efidisk0: local-lvm-fast:vm-100-disk-1,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:06:00,pcie=1,rombar=0,x-vga=1
machine: pc-q35-6.1
memory: 8096
meta: creation-qemu=6.1.0,ctime=1646892218
name: windows-gaming
net0: virtio=4A:CD:C4:B6:AC:A6,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsi0: local-lvm-fast:vm-100-disk-0,cache=writeback,size=120G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=28af771c-ed4d-46d8-948e-c2c49099508c
sockets: 1
tpmstate0: local-lvm-fast:vm-100-disk-2,size=4M,version=v2.0
usb0: host=1-10.4,usb3=1
usb1: host=04b4:4009,usb3=1
vmgenid: c54b94fd-c299-4ec7-b000-f0acabfd179e

One thing I will say about the VM config: I thought I remembered having to add a long line to it, but I don't see it anymore, and of course I can't find it in my notes. Maybe I'm misremembering from back when NVIDIA didn't support pass-through, or maybe the long line was for my OSX VM.

Script Output:

Bash:
root@proxmox:~# sh awesome_script.sh
IOMMU Group 0 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 1 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 2 00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 3 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 4 00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 5 00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU Group 6 00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 7 00:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 8 00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 9 00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 10 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 11 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 12 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 61)
IOMMU Group 12 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU Group 13 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0 [1022:1440]
IOMMU Group 13 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1 [1022:1441]
IOMMU Group 13 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2 [1022:1442]
IOMMU Group 13 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3 [1022:1443]
IOMMU Group 13 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4 [1022:1444]
IOMMU Group 13 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5 [1022:1445]
IOMMU Group 13 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6 [1022:1446]
IOMMU Group 13 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7 [1022:1447]
IOMMU Group 14 01:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation Device [1987:5018] (rev 01)
IOMMU Group 15 02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ee]
IOMMU Group 15 02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] Device [1022:43eb]
IOMMU Group 15 02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43e9]
IOMMU Group 15 03:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ea]
IOMMU Group 15 03:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ea]
IOMMU Group 15 04:00.0 Non-Volatile memory controller [0108]: SK hynix Device [1c5c:174a]
IOMMU Group 15 05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05)
IOMMU Group 16 06:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1f03] (rev a1)
IOMMU Group 16 06:00.1 Audio device [0403]: NVIDIA Corporation TU106 High Definition Audio Controller [10de:10f9] (rev a1)
IOMMU Group 17 07:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
IOMMU Group 18 08:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
IOMMU Group 19 08:00.1 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP [1022:1486]
IOMMU Group 20 08:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
IOMMU Group 21 08:00.4 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller [1022:1487]
 
OK, everything looks good to me on the configuration side. I'm pretty sure this is just the new kernel's simple framebuffer staying resident in memory. It's been tripping up a lot of folks.

(Sorry about the proxmox-boot-tool kernel pin 5.13.19-6-pve suggestion above; that only works on UEFI systems.)

We can try this one of two ways:
  • There should be an entry in your GRUB boot menu to choose the old 5.13.19 kernel. Try that.
  • Disable autostart on your VMs/Containers and capture the output from cat /proc/iomem
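If you go the /proc/iomem route, the telltale sign is a boot framebuffer entry (e.g. "BOOTFB") still sitting on the GPU's memory range. A couple of diagnostic commands to narrow it down; 06:00.0 is the card's address from the IOMMU dump, so adjust for your system:

```shell
# Show who owns memory ranges; a lingering boot-framebuffer entry on the
# GPU's BAR means the kernel framebuffer never released it (run as root,
# since addresses are hidden otherwise)
grep -i -e bootfb -e simplefb -e vfio /proc/iomem
# Confirm which kernel driver is actually bound to the card
lspci -k -s 06:00.0
```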
 
@nick.kopas

Booting into the older kernel did in fact let everything work again. Thank you for that!

Can this new simple framebuffer that stays resident in memory be configured via kernel parameters to change its behavior, or are we sort of stuck at this point?
 
@nick.kopas

Booting into the older kernel did in fact let everything work again. Thank you for that!

Can this new simple framebuffer that stays resident in memory be configured via kernel parameters to change its behavior, or are we sort of stuck at this point?
This is the workaround the community came up with:
https://forum.proxmox.com/threads/gpu-passthrough-issues-after-upgrade-to-7-2.109051/post-469855

NOTE: You have to modify the script to match the address of your NVIDIA card.
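I won't copy the script verbatim here, but from the discussion it amounts to detaching the GPU's PCI device (dropping the framebuffer's claim on it) and rescanning the bus before the VM starts. A rough sketch, with 0000:06:00.0 taken from the IOMMU dump above; adjust to your card's address and treat the linked post as the authoritative version:

```shell
#!/bin/bash
# Sketch of the remove/rescan workaround (verify against the linked post).
# Removing the device releases whatever claimed its memory ranges;
# the rescan brings it back so vfio-pci can bind to it cleanly.
echo 1 > /sys/bus/pci/devices/0000:06:00.0/remove
echo 1 > /sys/bus/pci/rescan
```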
 
Nope, just the main one.
But if your problem is solved by pinning the old kernel, I would recommend just doing that.
(I've run into some other issues with networking and Windows guests that are requiring me to pin the old kernel anyway.)
 
@nick.kopas

OK, I have pinned it in GRUB for now, but we really need a more long-term solution, especially if the new kernel is bringing on other problems.
I also set up the script snippet in case I need it in the future. I don't really see it causing any issues itself; it's pretty basic.
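For anyone finding this thread later: a snippet like that can also be attached to the VM as a Proxmox hookscript so it runs automatically around VM start/stop. This assumes the script lives on a storage with the "snippets" content type enabled; the filename here is just illustrative:

```shell
# Illustrative: attach a hookscript (stored under local:snippets/) to VM 100
# so Proxmox invokes it on VM lifecycle events such as pre-start.
qm set 100 --hookscript local:snippets/gpu-reset.sh
```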