[GPU_passthrough] Lenovo M75q Gen 2 - AMD 5650GE iGPU - Host crashes

cRaZy-bisCuiT

Member
Oct 8, 2022
I would like to pass through the iGPU of my 5650GE. Without the ACS override my IOMMU groups are messed up, so I applied the patch and the groups look pretty good now. However, when I start a VM that has the GPU attached, my PVE host crashes. Since there is so much information floating around the net and I have already tried a lot of tips, I don't know what the best way to make progress is. Maybe you could help me by telling me which information you need. Thanks!

PVE kernel version
Code:
root@pve:~# uname -a
Linux pve 5.15.60-1-pve #1 SMP PVE 5.15.60-1 (Mon, 19 Sep 2022 17:53:17 +0200) x86_64 GNU/Linux

IOMMU groups
Code:
root@pve:~# ./script.sh
IOMMU Group 0:
    00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
IOMMU Group 1:
    00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
IOMMU Group 10:
    02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 1a)
IOMMU Group 11:
    03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [1002:1638] (rev db)
IOMMU Group 12:
    03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:1637]
IOMMU Group 13:
    03:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
IOMMU Group 14:
    03:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir USB 3.1 [1022:1639]
IOMMU Group 15:
    03:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir USB 3.1 [1022:1639]
IOMMU Group 16:
    03:00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/FireFlight/Renoir Audio Processor [1022:15e2] (rev 01)
IOMMU Group 2:
    00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1634]
IOMMU Group 3:
    00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1634]
IOMMU Group 4:
    00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
IOMMU Group 5:
    00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635]
IOMMU Group 6:
    00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51)
    00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU Group 7:
    00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:166a]
    00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:166b]
    00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:166c]
    00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:166d]
    00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:166e]
    00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:166f]
    00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1670]
    00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1671]
IOMMU Group 8:
    01:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD [15b7:5006]
IOMMU Group 9:
    02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device [10ec:816e] (rev 1a)
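The `./script.sh` used here is not shown; a minimal sketch of the kind of script that produces the listing above (modelled on the common PVE-wiki one-liner; the base-path parameter and the fallback to the raw PCI address are my own additions) could look like this:

```shell
#!/bin/sh
# List every PCI device grouped by IOMMU group (sketch; the base path is a
# parameter only so the function can also be exercised outside /sys).
list_iommu_groups() {
    base="${1:-/sys/kernel/iommu_groups}"
    for d in "$base"/*/devices/*; do
        [ -e "$d" ] || continue            # no groups at all: IOMMU disabled
        rel=${d#"$base"/}                  # e.g. 11/devices/0000:03:00.0
        group=${rel%%/*}                   # IOMMU group number
        dev=$(basename "$d")               # PCI address, e.g. 0000:03:00.0
        desc=$(lspci -nns "${dev#0000:}" 2>/dev/null)
        [ -n "$desc" ] || desc=$dev        # lspci missing or address unknown
        printf 'IOMMU Group %s:\n\t%s\n' "$group" "$desc"
    done
}
list_iommu_groups "$@"
```

Note the glob ordering explains why group 10 is printed between 1 and 2 in the output above.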

GRUB_CMDLINE_LINUX_DEFAULT
I read that for AMD, IOMMU is enabled anyway, so I don't need that option.
Code:
GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet pcie_acs_override=downstream,multifunction iommu=pt video=vesafb:off video=efifb:off video=simplefb:off"

/etc/modprobe.d/pve-blacklist.conf
Code:
# This file contains a list of modules which are not supported by Proxmox VE

# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
blacklist amdgpu

/etc/modprobe.d/vfio.conf
I read I don't need those anymore so I commented them out. Is that right?
Code:
#options vfio-pci ids=1002:1638,1002:1637 disable_vga=1

Host VGA adapter
Seems like no kernel driver is in use. Without commenting out the vfio config, vfio-pci is the driver in use.
Code:
root@pve:~# lspci -nnk | grep -i VGA -A2
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [1002:1638] (rev db)
    Subsystem: Lenovo Device [17aa:32e4]
    Kernel modules: amdgpu

When I start a VM with the graphics attached, dmesg looks like this before the host freezes. Sometimes it does not freeze, but the system ends up in some kind of unreliable state and needs a reboot to function well.
Code:
[ 1079.051614] vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1079.071857] vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1079.095706] vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1079.316202] xhci_hcd 0000:03:00.3: remove, state 4
[ 1079.316210] usb usb2: USB disconnect, device number 1
[ 1079.316378] xhci_hcd 0000:03:00.3: USB bus 2 deregistered
[ 1079.316383] xhci_hcd 0000:03:00.3: remove, state 1
[ 1079.316386] usb usb1: USB disconnect, device number 1
[ 1079.316387] usb 1-1: USB disconnect, device number 2
[ 1079.330514] usb 1-2: USB disconnect, device number 3
[ 1079.634870] xhci_hcd 0000:03:00.3: USB bus 1 deregistered
[ 1079.748066] xhci_hcd 0000:03:00.4: remove, state 4
[ 1079.748077] usb usb4: USB disconnect, device number 1
[ 1079.748080] usb 4-1: USB disconnect, device number 2
[ 1079.763506] xhci_hcd 0000:03:00.4: USB bus 4 deregistered
[ 1079.763518] xhci_hcd 0000:03:00.4: remove, state 4
[ 1079.763522] usb usb3: USB disconnect, device number 1
[ 1079.763524] usb 3-1: USB disconnect, device number 2
[ 1079.779814] xhci_hcd 0000:03:00.4: USB bus 3 deregistered
[ 1079.807426] vfio-pci 0000:03:00.4: refused to change power state from D0 to D3hot
[ 1080.519560] device tap101i0 entered promiscuous mode
[ 1080.543210] vmbr0: port 2(fwpr101p0) entered blocking state
[ 1080.543214] vmbr0: port 2(fwpr101p0) entered disabled state
[ 1080.543295] device fwpr101p0 entered promiscuous mode
[ 1080.543350] vmbr0: port 2(fwpr101p0) entered blocking state
[ 1080.543352] vmbr0: port 2(fwpr101p0) entered forwarding state
[ 1080.548049] fwbr101i0: port 1(fwln101i0) entered blocking state
[ 1080.548052] fwbr101i0: port 1(fwln101i0) entered disabled state
[ 1080.548103] device fwln101i0 entered promiscuous mode
[ 1080.548136] fwbr101i0: port 1(fwln101i0) entered blocking state
[ 1080.548138] fwbr101i0: port 1(fwln101i0) entered forwarding state
[ 1080.552218] fwbr101i0: port 2(tap101i0) entered blocking state
[ 1080.552220] fwbr101i0: port 2(tap101i0) entered disabled state
[ 1080.552261] fwbr101i0: port 2(tap101i0) entered blocking state
[ 1080.552262] fwbr101i0: port 2(tap101i0) entered forwarding state
[ 1081.993212] vfio-pci 0000:03:00.0: enabling device (0002 -> 0003)
[ 1081.993425] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[ 1081.993430] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[ 1081.993432] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[ 1081.993433] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[ 1081.993434] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[ 1082.056825] vfio-pci 0000:03:00.3: enabling device (0000 -> 0002)
[ 1082.112561] vfio-pci 0000:03:00.4: enabling device (0000 -> 0002)
[ 1083.787148] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1083.803167] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1083.835150] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1083.867188] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1083.883145] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1083.899144] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.261625] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.261847] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.262068] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.262288] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.262507] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.262725] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.296692] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.296908] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.297125] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.297342] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.297768] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.297991] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.325181] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.325398] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.325603] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.325824] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.326041] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.326268] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.364414] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.365669] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.365686] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.365707] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.365927] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.365942] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.365962] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.366257] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.366291] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.366311] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.366515] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.366529] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.366549] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.366754] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.366768] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.366788] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.366993] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.367008] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.367210] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.367226] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.367240] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.367255] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.367269] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.367284] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.407099] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.407407] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.407705] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.407997] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.408292] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.408583] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.430819] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.431224] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.431243] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.431603] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.431624] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.433497] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.433519] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.433892] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.433910] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.434259] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.434276] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.434621] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.463470] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.463830] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.464182] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.464533] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.464877] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.465223] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.608562] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.609004] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.609435] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.609870] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.610321] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[ 1086.610758] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs

So what can I do to help with debugging? Thanks!
 
GRUB_CMDLINE_LINUX_DEFAULT
I read that for AMD, IOMMU is enabled anyway, so I don't need that option.
Code:
GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet pcie_acs_override=downstream,multifunction iommu=pt video=vesafb:off video=efifb:off video=simplefb:off"
pcie_acs_override=downstream,multifunction invalidates all the IOMMU group information you gave. You made Proxmox ignore the original groups, but that does not guarantee that it will work. The groups are determined by the motherboard and BIOS, and unless you have an X570 motherboard, most devices are in the big chipset group.

Since the GPU is used during the system BIOS POST/boot (because it is the only one), you probably need the work-around initcall_blacklist=sysfb_init (instead of video=vesafb:off video=efifb:off video=simplefb:off). Check with cat /proc/cmdline whether your kernel parameters are active.
Note that passthrough of the boot GPU and/or the only GPU and/or integrated graphics in a system is always more difficult and sometimes impossible.
/etc/modprobe.d/vfio.conf
I read I don't need those anymore so I commented them out. Is that right?
Code:
#options vfio-pci ids=1002:1638,1002:1637 disable_vga=1
Probably, but you also did early binding to vfio-pci for the audio device, and you did not blacklist snd_hda_intel. I prefer early binding over blacklisting (as I have multiple devices), but then you need to make sure vfio-pci is loaded before the actual drivers: softdep amdgpu pre: vfio_pci and softdep snd_hda_intel pre: vfio_pci .
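A combined sketch of what /etc/modprobe.d/vfio.conf could then look like (device IDs taken from this thread; the softdep lines are the ones mentioned above):

```text
# /etc/modprobe.d/vfio.conf (sketch, IDs from this thread)
options vfio-pci ids=1002:1638,1002:1637 disable_vga=1
# make sure vfio-pci wins the race against the real drivers
softdep amdgpu pre: vfio_pci
softdep snd_hda_intel pre: vfio_pci
```

Remember to run update-initramfs -u afterwards so the change ends up in the initramfs.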
 
pcie_acs_override=downstream,multifunction invalidates all the IOMMU group information you gave. You made Proxmox ignore the original groups, but that does not guarantee that it will work. The groups are determined by the motherboard and BIOS, and unless you have an X570 motherboard, most devices are in the big chipset group.
Yes, this is the case! Everything is in the big chipset group without the patch. One question regarding that: is it always the case without the X570 chipset? So if I bought a desktop mainboard with a B550, would I have the same trouble? Is Ryzen 7000 known to also have these problems?

Since the GPU is used during the system BIOS POST/boot (because it is the only one), you probably need the work-around initcall_blacklist=sysfb_init (instead of video=vesafb:off video=efifb:off video=simplefb:off). Check with cat /proc/cmdline whether your kernel parameters are active.
Note that passthrough of the boot GPU and/or the only GPU and/or integrated graphics in a system is always more difficult and sometimes impossible.
Code:
root@pve:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.19.7-1-pve root=/dev/mapper/pve-root ro loglevel=3 quiet pcie_acs_override=downstream,multifunction iommu=pt video=vesafb:off video=efifb:off video=simplefb:off

Probably, but you also did early binding to vfio-pci for the audio device, and you did not blacklist snd_hda_intel. I prefer early binding over blacklisting (as I have multiple devices), but then you need to make sure vfio-pci is loaded before the actual drivers: softdep amdgpu pre: vfio_pci and softdep snd_hda_intel pre: vfio_pci .
Where do I need to add the softdeps? Also in "/etc/modprobe.d/vfio.conf"?
 

Yes, this is the case! Everything is in the big chipset group without the patch. One question regarding that: is it always the case without the X570 chipset?
No, the devices in slots with PCIe lanes connected to the CPU should be in separate IOMMU groups. I would expect the first PCIe x16 slot and the first M.2 slot and the integrated graphics to have separate groups.
Maybe update your motherboard BIOS, as some versions have poor groups and sometimes even break passthrough. What is the make and model of your motherboard and what is the BIOS version?
So if I bought a desktop mainboard with a B550, would I have the same trouble? Is Ryzen 7000 known to also have these problems?
The B550 should at least have the slots connected to the CPU in separate groups, but everything else is probably in the big chipset group, like on all AM4 motherboards except X570 and X570S. I don't know about the 7000 series yet.
Code:
root@pve:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.19.7-1-pve root=/dev/mapper/pve-root ro loglevel=3 quiet pcie_acs_override=downstream,multifunction iommu=pt video=vesafb:off video=efifb:off video=simplefb:off
I still think you should replace video=vesafb:off video=efifb:off video=simplefb:off with initcall_blacklist=sysfb_init.
Where do I need to add the softdeps? Also in "/etc/modprobe.d/vfio.conf"?
Yes, in vfio.conf would be fine.
 
root@pve:~# ./gen_reports.sh
This script will generate reports about the current system status for the PVE forum.

uname -a
Code:
Linux pve 5.19.7-1-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.7-1 (Tue, 06 Sep 2022 07:54:58 + x86_64 GNU/Linux

cat /proc/cmdline
Code:
BOOT_IMAGE=/boot/vmlinuz-5.19.7-1-pve root=/dev/mapper/pve-root ro loglevel=3 quiet pcie_acs_override=downstream,multifunction initcall_blacklist=sysfb_init

cat /etc/modprobe.d/vfio.conf
Code:
#options vfio-pci ids=1002:1638,1002:1637 disable_vga=1
softdep snd_hda_intel pre: vfio_pci
softdep amdgpu pre: vfio_pci

cat /etc/modprobe.d/pve-blacklist.conf
Code:
# This file contains a list of modules which are not supported by Proxmox VE 

# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
#blacklist nvidiafb
blacklist amdgpu
blacklist snd_hda_intel
 
Thanks very much, your help is much appreciated btw!

Unfortunately the softdeps seem not to work for me. What am I doing wrong?

root@pve:~# ./gen_reports.sh
This script will generate GPU passthrough debug reports about the current system status for the PVE forum.

uname -a
Code:
Linux pve 5.15.60-1-pve #1 SMP PVE 5.15.60-1 (Mon, 19 Sep 2022 17:53:17 +0200) x86_64 GNU/Linux

cat /proc/cmdline
Code:
BOOT_IMAGE=/boot/vmlinuz-5.15.60-1-pve root=/dev/mapper/pve-root ro loglevel=3 quiet pcie_acs_override=downstream,multifunction initcall_blacklist=sysfb_init

lsmod | grep -E 'amdgpu|snd_hda_intel'
Code:
amdgpu               9736192  1
iommu_v2               24576  1 amdgpu
snd_hda_intel          53248  0
gpu_sched              45056  1 amdgpu
snd_intel_dspcfg       28672  1 snd_hda_intel
drm_ttm_helper         16384  1 amdgpu
ttm                    86016  2 amdgpu,drm_ttm_helper
snd_hda_codec         159744  2 snd_hda_codec_hdmi,snd_hda_intel
drm_kms_helper        311296  1 amdgpu
snd_hda_core          106496  3 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec
snd_pcm               143360  5 snd_hda_codec_hdmi,snd_pci_acp6x,snd_hda_intel,snd_hda_codec,snd_hda_core
i2c_algo_bit           16384  1 amdgpu
snd                   106496  6 snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_hda_codec,snd_timer,snd_pcm
drm                   614400  6 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,ttm

cat /etc/modprobe.d/vfio.conf
Code:
#options vfio-pci ids=1002:1638,1002:1637 disable_vga=1
softdep snd_hda_intel pre: vfio_pci
softdep amdgpu pre: vfio_pci

cat /etc/modprobe.d/pve-blacklist.conf
Code:
# This file contains a list of modules which are not supported by Proxmox VE 

# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
#blacklist nvidiafb
#blacklist amdgpu
#blacklist snd_hda_intel


After I change modules / kernel configs I do the following:
Bash:
root@pve:~# cat update_initrd_gurb_efi.sh
#!/bin/bash
echo ""
echo "Update initrd"
echo ""
sleep 1
update-initramfs -u -k all

echo ""
echo "Update grub"
echo ""
sleep 1
update-grub

echo ""
echo "Update efi"
echo ""
pve-efiboot-tool refresh
sleep 1
 
Thanks very much, your help is much appreciated btw!

Unfortunately the softdeps seem not to work for me. What am I doing wrong?

cat /etc/modprobe.d/vfio.conf
Code:
#options vfio-pci ids=1002:1638,1002:1637 disable_vga=1
softdep snd_hda_intel pre: vfio_pci
softdep amdgpu pre: vfio_pci
The first line is still commented out; remove the #. And don't forget to run update-initramfs -u and reboot.
After I change modules / kernel configs I do the following:
Bash:
root@pve:~# cat update_initrd_gurb_efi.sh
#!/bin/bash
echo ""
echo "Update initrd"
echo ""
sleep 1
update-initramfs -u -k all

echo ""
echo "Update grub"
echo ""
sleep 1
update-grub

echo ""
echo "Update efi"
echo ""
pve-efiboot-tool refresh
sleep 1
pve-efiboot-tool suggests that you are using an old version of Proxmox, as it was renamed to proxmox-boot-tool.
 
Okay, I thought I didn't need it anymore. I will do so.
Sorry for being unclear before. We want early binding to vfio-pci because we don't want anything else to touch the device before the VM. To make this work, we also need to make sure vfio-pci gets the device before any other driver. That's why you need all three lines: to get those devices bound to vfio-pci as early as possible. You might also want to add vfio-pci to /etc/modules to make sure it is always loaded at boot.
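For reference, a sketch of the /etc/modules entries commonly used to make sure the VFIO modules are loaded at boot (module names as in the Proxmox documentation for the 5.x kernels; on newer kernels some of these have been merged into vfio itself):

```text
# /etc/modules — load VFIO modules at boot (sketch)
vfio
vfio_iommu_type1
vfio_pci
```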

After booting the system, and before starting the VM, you should not see boot messages from Proxmox (only the GRUB menu) and the devices should have vfio-pci as 'kernel driver in use' (when you run lspci -nnk).
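To check this from a script, something like the following could be used (a sketch; the per-function equivalent of looking at 'Kernel driver in use' in `lspci -nnk`, with the base path as a parameter purely so it can be tested outside /sys):

```shell
#!/bin/sh
# Print the bound driver for every function of the iGPU at 03:00.* (sketch).
show_drivers() {
    base="${1:-/sys/bus/pci/devices}"
    for dev in "$base"/0000:03:00.*; do
        [ -e "$dev" ] || continue
        name=$(basename "$dev")
        if [ -L "$dev/driver" ]; then      # 'driver' is a symlink to the module
            echo "$name: $(basename "$(readlink "$dev/driver")")"
        else
            echo "$name: (no driver bound)"
        fi
    done
}
show_drivers "$@"
```

All functions should report vfio-pci before the VM is started.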
 
Sorry for being unclear before. We want early binding to vfio-pci because we don't want anything else to touch the device before the VM. To make this work, we also need to make sure vfio-pci gets the device before any other driver. That's why you need all three lines: to get those devices bound to vfio-pci as early as possible. You might also want to add vfio-pci to /etc/modules to make sure it is always loaded at boot.
This is the case now.
Bash:
root@pve:~# lspci -nnk | grep -i VGA -A2
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [1002:1638] (rev ff)
    Kernel driver in use: vfio-pci
    Kernel modules: amdgpu
After booting the system, and before starting the VM, you should not see boot messages from Proxmox (only the GRUB menu) and the devices should have vfio-pci as 'kernel driver in use' (when you run lspci -nnk).
This is also the case.

Unfortunately the host still crashes:

dmesg (when VM starts)
Bash:
[  217.767775] usb 3-1: USB disconnect, device number 2
[  217.800254] xhci_hcd 0000:03:00.4: USB bus 3 deregistered
[  217.828819] vfio-pci 0000:03:00.4: refused to change power state from D0 to D3hot
[  218.505921] device tap101i0 entered promiscuous mode
[  218.530718] vmbr0: port 2(fwpr101p0) entered blocking state
[  218.530721] vmbr0: port 2(fwpr101p0) entered disabled state
[  218.530768] device fwpr101p0 entered promiscuous mode
[  218.530795] vmbr0: port 2(fwpr101p0) entered blocking state
[  218.530796] vmbr0: port 2(fwpr101p0) entered forwarding state
[  218.535600] fwbr101i0: port 1(fwln101i0) entered blocking state
[  218.535603] fwbr101i0: port 1(fwln101i0) entered disabled state
[  218.535647] device fwln101i0 entered promiscuous mode
[  218.535676] fwbr101i0: port 1(fwln101i0) entered blocking state
[  218.535678] fwbr101i0: port 1(fwln101i0) entered forwarding state
[  218.540100] fwbr101i0: port 2(tap101i0) entered blocking state
[  218.540103] fwbr101i0: port 2(tap101i0) entered disabled state
[  218.540155] fwbr101i0: port 2(tap101i0) entered blocking state
[  218.540156] fwbr101i0: port 2(tap101i0) entered forwarding state
[  218.559163] kvm: SMP vm created on host with unstable TSC; guest TSC will not be reliable
[  219.998385] vfio-pci 0000:03:00.0: enabling device (0002 -> 0003)
[  219.998600] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[  219.998605] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[  219.998606] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[  219.998607] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[  219.998608] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[  219.999963] vfio-pci 0000:03:00.1: enabling device (0000 -> 0002)
[  220.058694] vfio-pci 0000:03:00.3: enabling device (0000 -> 0002)
[  220.114423] vfio-pci 0000:03:00.4: enabling device (0000 -> 0002)
[  221.744553] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[  221.768558] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[  221.804580] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[  221.836568] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[  221.852551] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
...

All the output after the VM started (this time no freeze):
root@pve:~# ./gen_reports.sh
This script will generate GPU passthrough debug reports about the current system status for the PVE forum.

uname -a
Code:
Linux pve 5.15.60-1-pve #1 SMP PVE 5.15.60-1 (Mon, 19 Sep 2022 17:53:17 +0200) x86_64 GNU/Linux

cat /proc/cmdline
Code:
BOOT_IMAGE=/boot/vmlinuz-5.15.60-1-pve root=/dev/mapper/pve-root ro loglevel=3 quiet pcie_acs_override=downstream,multifunction initcall_blacklist=sysfb_init

lsmod | grep -E 'amdgpu|snd_hda_intel'
Code:
amdgpu               9736192  0
iommu_v2               24576  1 amdgpu
gpu_sched              45056  1 amdgpu
drm_ttm_helper         16384  1 amdgpu
ttm                    86016  2 amdgpu,drm_ttm_helper
snd_hda_intel          53248  0
snd_intel_dspcfg       28672  1 snd_hda_intel
drm_kms_helper        311296  1 amdgpu
snd_hda_codec         159744  1 snd_hda_intel
snd_hda_core          106496  2 snd_hda_intel,snd_hda_codec
i2c_algo_bit           16384  1 amdgpu
snd_pcm               143360  4 snd_pci_acp6x,snd_hda_intel,snd_hda_codec,snd_hda_core
snd                   106496  5 snd_hwdep,snd_hda_intel,snd_hda_codec,snd_timer,snd_pcm
drm                   614400  5 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,ttm

cat /etc/modprobe.d/vfio.conf
Code:
options vfio-pci ids=1002:1638,1002:1637 disable_vga=1
softdep snd_hda_intel pre: vfio_pci
softdep amdgpu pre: vfio_pci

cat /etc/modprobe.d/pve-blacklist.conf
Code:
# This file contains a list of modules which are not supported by Proxmox VE 

# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
#blacklist nvidiafb
#blacklist amdgpu
#blacklist snd_hda_intel
 
Unfortunately the host still crashes:

dmesg (when VM starts)
Bash:
[  217.767775] usb 3-1: USB disconnect, device number 2
[  217.800254] xhci_hcd 0000:03:00.4: USB bus 3 deregistered
[  217.828819] vfio-pci 0000:03:00.4: refused to change power state from D0 to D3hot
[  218.505921] device tap101i0 entered promiscuous mode
[  218.530718] vmbr0: port 2(fwpr101p0) entered blocking state
[  218.530721] vmbr0: port 2(fwpr101p0) entered disabled state
[  218.530768] device fwpr101p0 entered promiscuous mode
[  218.530795] vmbr0: port 2(fwpr101p0) entered blocking state
[  218.530796] vmbr0: port 2(fwpr101p0) entered forwarding state
[  218.535600] fwbr101i0: port 1(fwln101i0) entered blocking state
[  218.535603] fwbr101i0: port 1(fwln101i0) entered disabled state
[  218.535647] device fwln101i0 entered promiscuous mode
[  218.535676] fwbr101i0: port 1(fwln101i0) entered blocking state
[  218.535678] fwbr101i0: port 1(fwln101i0) entered forwarding state
[  218.540100] fwbr101i0: port 2(tap101i0) entered blocking state
[  218.540103] fwbr101i0: port 2(tap101i0) entered disabled state
[  218.540155] fwbr101i0: port 2(tap101i0) entered blocking state
[  218.540156] fwbr101i0: port 2(tap101i0) entered forwarding state
[  218.559163] kvm: SMP vm created on host with unstable TSC; guest TSC will not be reliable
[  219.998385] vfio-pci 0000:03:00.0: enabling device (0002 -> 0003)
[  219.998600] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[  219.998605] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[  219.998606] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[  219.998607] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[  219.998608] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[  219.999963] vfio-pci 0000:03:00.1: enabling device (0000 -> 0002)
[  220.058694] vfio-pci 0000:03:00.3: enabling device (0000 -> 0002)
[  220.114423] vfio-pci 0000:03:00.4: enabling device (0000 -> 0002)
[  221.744553] vfio-pci 0000:03:00.5: vfio_bar_restore: reset recovery - restoring BARs
[  221.768558] vfio-pci 0000:03:00.4: vfio_bar_restore: reset recovery - restoring BARs
[  221.804580] vfio-pci 0000:03:00.3: vfio_bar_restore: reset recovery - restoring BARs
[  221.836568] vfio-pci 0000:03:00.2: vfio_bar_restore: reset recovery - restoring BARs
[  221.852551] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
...
Try passthrough of only 03:00.0 (without All Functions) and not the other functions (03:00.1, 03:00.3, 03:00.4, 03:00.5).
 
Wow! You're a hero! No more crashes so far! I can only see "Microsoft Basic Display Adapter" in the VM now, but I will try to install the AMD drivers. Let's see what happens!
 
Wow! You're a hero! No more crashes so far! I can only see "Microsoft Basic Display Adapter" in the VM now, but I will try to install the AMD drivers. Let's see what happens!
Maybe adding 03:00.1 will also work (as you already early-bind it to vfio-pci), but I'm not certain. I assume you'll want to pass through a USB controller, but I don't know which one would be a good idea (as everything is in the same group, I can't tell which one comes from the CPU and which comes from the motherboard chipset).
 
Maybe adding 03:00.1 will also work (as you already early-bind it to vfio-pci), but I'm not certain. I assume you'll want to pass through a USB controller, but I don't know which one would be a good idea (as everything is in the same group, I can't tell which one comes from the CPU and which comes from the motherboard chipset).
Well, for USB I'll have to see; later on I might try that. The reason: this Tiny PC is supposed to be my lab server for Kubernetes / homelab, but since it has a nice GPU I'll give it a try for Steam game streaming as well as maybe rendering / transcoding / OBS. :)
 
Now I get "Code 43" within the Windows guest. I guess that's the reset bug, right? Can I work around it?
43 is a generic error. Maybe the Windows drivers don't handle this configuration (an integrated GPU without the rest)? I'm sorry but I don't know much about Windows.
Maybe create an Ubuntu VM and install Steam? Or just boot the Ubuntu live installer and see if it displays output (then it will also work when you install Ubuntu).
 
I will install Steam on Ubuntu later anyway, I guess. I heard "Code 43" could be related to the reset bug, and I found info like this:
- https://www.reddit.com/r/VFIO/comments/cnm7pe/amd_gpu_passtrough_code_43/

Code:
Use video=efifb:off in GRUB_CMDLINE_LINUX_DEFAULT instead of video=vesafb:off,efifb:off

This won't work for me, right? Because I need:
Code:
initcall_blacklist=sysfb_init
?

I will try to pass the Audio device as well and see what happens.
 
I will install Steam on Ubuntu later anyway, I guess. I heard "Code 43" could be related to the reset bug, and I found info like this:
- https://www.reddit.com/r/VFIO/comments/cnm7pe/amd_gpu_passtrough_code_43/
It's not a reset bug; otherwise it would not occur before installing the AMD drivers.
You could try virtually removing the device and re-adding it using these commands (before starting the VM):
Bash:
echo 1 > "/sys/bus/pci/devices/0000:03:00.0/remove"
echo 1 > /sys/bus/pci/rescan
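If that helps, the two commands above could be automated with a Proxmox hookscript that runs them before every VM start (a sketch; the hookscript mechanism is `qm set <vmid> --hookscript local:snippets/<name>.sh`, and the script name and GPU variable here are my own):

```shell
#!/bin/sh
# gpu-rebind.sh — hookscript sketch: virtually remove and re-add the GPU
# before every VM start. Proxmox calls hookscripts with <vmid> <phase>;
# the GPU address is an assumption taken from this thread.
gpu_rebind_hook() {
    vmid=$1
    phase=$2
    gpu="${GPU:-0000:03:00.0}"
    if [ "$phase" = "pre-start" ]; then
        echo "gpu-rebind: removing and rescanning $gpu before VM $vmid starts"
        echo 1 > "/sys/bus/pci/devices/$gpu/remove"
        echo 1 > /sys/bus/pci/rescan
    fi
}
gpu_rebind_hook "$@"
```

All other phases (post-start, pre-stop, post-stop) are deliberately no-ops.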
Code:
Use video=efifb:off in GRUB_CMDLINE_LINUX_DEFAULT instead of video=vesafb:off,efifb:off
video=vesafb:off,efifb:off no longer works; you need to write it as video=vesafb:off video=efifb:off. And you actually only need one of the two, depending on how your system boots.
This won't work for me, right? Because I need:
Code:
initcall_blacklist=sysfb_init
?
The newer kernel that you are using uses neither vesafb nor efifb; it uses simplefb instead. However, video=simplefb:off does not work because the kernel still claims some video memory. That's why you need initcall_blacklist=sysfb_init.
I will try to pass the Audio device as well and see what happens.
Also enable PCI Express if you haven't. Maybe the AMD Windows drivers assume something about PCIe.
 
Well, I can't get the Windows 10 driver running right now. I guess that's a different part of the problem, so I might just open a new thread.

Since I'd love to try this with Linux as well, how do I do that? I guess the console won't work after the passthrough, so I'll need to install something like VNC first?