[SOLVED] GPU Passthru - Attaching PCI Device with 'All Functions' enabled crashes Proxmox 8

darkpod

New Member
Apr 11, 2023
Hello Proxmox Forum -

I have been battling this issue for quite some time and have reached the point where I need to ask for help. I wrote up the details of my current setup and settings and would appreciate anyone taking the time to help out. Thank you in advance!

Objective: Passthru the attached GPU to a Windows 11 Virtual Machine.

References:
https://pve.proxmox.com/wiki/PCI(e)_Passthrough
https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
https://3os.org/infrastructure/proxmox/gpu-passthrough/gpu-passthrough-to-vm/
https://www.reddit.com/r/homelab/comments/b5xpua/the_ultimate_beginners_guide_to_gpu_passthrough/
https://www.reddit.com/r/Proxmox/comments/1118opd/psa_gpu_passthrough_on_single_gpu_systems/

Hardware:
CPU - AMD Ryzen 9 7900X
Motherboard - Asus ROG B650E
GPU - AMD Sapphire RX 6950 XT
RAM - Corsair 4x 32 GB DDR5

Versions:
BIOS is updated to version 1654
Proxmox is upgraded to latest version 8
Windows 11 works no issues - just need to get GPU to passthru

BIOS Settings: IOMMU is Enabled, ROM-Bar is Disabled.

VM Settings:
Screenshot 2023-09-04 at 8.37.34 PM.png

Currently the entire host crashes when I start the VM with 'All Functions' enabled:
Screenshot 2023-09-04 at 8.38.05 PM.png

Looking at /etc/default/grub
Code:
GRUB_DEFAULT=0
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on pcie_acs_override=downstream,multifunction video=efifb:off video=vesa:off vfio-pci.ids=1002:164e,1002:1640,1022:1649,1022:15b6,1022:15b7 vfio_iommu_type1.allow_unsafe_interrupts=1 kvm.ignore_msrs=1 mos=1 modprobe.blacklist=radeon,nouveau,nvidia,nvidiafb,nvidia-gpu"
# Also tried the line below and left it commented as I am using the line above
# GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on initcall_blacklist=sysfb_init amd_iommu=on initcall_blacklist=sysfb_init amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off"
GRUB_CMDLINE_LINUX=""
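
For anyone following along: a quick way to see which IDs the running kernel actually received through vfio-pci.ids is to parse the kernel command line. This is only a sketch. `vfio_ids` is a hypothetical helper name (not part of Proxmox), and it assumes a space-separated command line such as the contents of /proc/cmdline.

```shell
# vfio_ids CMDLINE: print each vendor:device ID listed in vfio-pci.ids=...,
# one per line. Prints nothing if the option is absent.
vfio_ids() {
    printf '%s\n' "$1" \
        | tr ' ' '\n' \
        | sed -n 's/^vfio-pci\.ids=//p' \
        | tr ',' '\n'
}

# Usage on a live host:
# vfio_ids "$(cat /proc/cmdline)"
```

The printed IDs can be compared by eye against the `lspci -nn` output. Edits to /etc/default/grub only show up in /proc/cmdline after `update-grub` and a reboot (assuming the host boots through GRUB).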

Kernel modules are all in /etc/modules
Code:
 vfio
 vfio_iommu_type1
 vfio_pci
 vfio_virqfd

dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
Code:
[    0.000000] Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
[    0.094343] AMD-Vi: Unknown option - 'on'
[    0.223018] AMD-Vi: Using global IVHD EFR:0x246577efa2254afa, EFR2:0x0
[    0.459382] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.461462] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.461463] AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA PC GA_vAPIC
[    0.461467] AMD-Vi: Interrupt remapping enabled
[    0.544277] AMD-Vi: Virtual APIC enabled
[    0.544675] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank)

find /sys/kernel/iommu_groups/ -type l
Code:
/sys/kernel/iommu_groups/17/devices/0000:02:0c.0
/sys/kernel/iommu_groups/7/devices/0000:00:08.1
/sys/kernel/iommu_groups/25/devices/0000:0b:00.1
/sys/kernel/iommu_groups/15/devices/0000:02:0a.0
/sys/kernel/iommu_groups/5/devices/0000:00:04.0
/sys/kernel/iommu_groups/23/devices/0000:0a:00.0
/sys/kernel/iommu_groups/13/devices/0000:02:08.0
/sys/kernel/iommu_groups/3/devices/0000:00:02.2
/sys/kernel/iommu_groups/21/devices/0000:08:00.0
/sys/kernel/iommu_groups/11/devices/0000:01:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/28/devices/0000:0b:00.4
/sys/kernel/iommu_groups/18/devices/0000:02:0d.0
/sys/kernel/iommu_groups/8/devices/0000:00:08.3
/sys/kernel/iommu_groups/26/devices/0000:0b:00.2
/sys/kernel/iommu_groups/16/devices/0000:02:0b.0
/sys/kernel/iommu_groups/6/devices/0000:00:08.0
/sys/kernel/iommu_groups/24/devices/0000:0b:00.0
/sys/kernel/iommu_groups/14/devices/0000:02:09.0
/sys/kernel/iommu_groups/4/devices/0000:00:03.0
/sys/kernel/iommu_groups/22/devices/0000:09:00.0
/sys/kernel/iommu_groups/12/devices/0000:02:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:02.1
/sys/kernel/iommu_groups/20/devices/0000:07:00.0
/sys/kernel/iommu_groups/10/devices/0000:00:18.3
/sys/kernel/iommu_groups/10/devices/0000:00:18.1
/sys/kernel/iommu_groups/10/devices/0000:00:18.6
/sys/kernel/iommu_groups/10/devices/0000:00:18.4
/sys/kernel/iommu_groups/10/devices/0000:00:18.2
/sys/kernel/iommu_groups/10/devices/0000:00:18.0
/sys/kernel/iommu_groups/10/devices/0000:00:18.7
/sys/kernel/iommu_groups/10/devices/0000:00:18.5
/sys/kernel/iommu_groups/29/devices/0000:0c:00.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/19/devices/0000:06:00.0
/sys/kernel/iommu_groups/9/devices/0000:00:14.3
/sys/kernel/iommu_groups/9/devices/0000:00:14.0
/sys/kernel/iommu_groups/27/devices/0000:0b:00.3
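
The raw find output is hard to scan because the groups come out unsorted. Here is a minimal sketch that prints one "group N: address" line per device in numeric group order. `list_iommu_groups` is a made-up helper name, and the optional directory argument is only there so it can be pointed at a copy of the tree.

```shell
# list_iommu_groups [ROOT]: print "group <n>: <pci address>" for every device
# entry under ROOT (default /sys/kernel/iommu_groups), sorted by group number.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        # Skip the unexpanded glob when no groups exist.
        [ -e "$dev" ] || [ -L "$dev" ] || continue
        group="${dev%/devices/*}"   # strip "/devices/<addr>"
        group="${group##*/}"        # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done | sort -k2,2n -k3,3
}

# Usage on a live host:
# list_iommu_groups
```

With the listing above, this would show 0b:00.0 through 0b:00.4 in groups 24 to 28, one function per group.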

lspci -nn
Code:
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14d8]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Device [1022:14d9]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14da]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14da]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:14db]
00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:14db]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14da]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14da]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14da]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:14dd]
00:08.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:14dd]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 71)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14e0]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14e1]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14e2]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14e3]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14e4]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14e5]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14e6]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:14e7]
01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43f4] (rev 01)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43f5] (rev 01)
02:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43f5] (rev 01)
02:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43f5] (rev 01)
02:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43f5] (rev 01)
02:0b.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43f5] (rev 01)
02:0c.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43f5] (rev 01)
02:0d.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43f5] (rev 01)
06:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 03)
07:00.0 Network controller [0280]: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz [14c3:0608]
08:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43f7] (rev 01)
09:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] Device [1022:43f6] (rev 01)
0a:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
0b:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Raphael [1002:164e] (rev c2)
0b:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt Radeon High Definition Audio Controller [1002:1640]
0b:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] VanGogh PSP/CCP [1022:1649]
0b:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:15b6]
0b:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:15b7]
0c:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:15b8]

details on 0b:00.0:
Code:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev c2) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. Raphael
        Flags: bus master, fast devsel, latency 0, IRQ 62, IOMMU group 24
        Memory at fce0000000 (64-bit, prefetchable) [size=256M]
        Memory at fcf0000000 (64-bit, prefetchable) [size=2M]
        I/O ports at f000 [size=256]
        Memory at fc900000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable- Count=4 Masked-
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270] Secondary PCI Express
        Capabilities: [2a0] Access Control Services
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [410] Physical Layer 16.0 GT/s <?>
        Capabilities: [450] Lane Margining at the Receiver <?>
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu

nano /etc/modprobe.d/pve-blacklist.conf
Code:
blacklist nvidiafb
blacklist nvidia
blacklist radeon
blacklist nouveau
 
Okay a few more details as we diagnose...

The system crashes with 'All Functions' checked for the PCI Device 0b:00. This occurs in both a Windows VM and an Ubuntu VM.

The device at 0b:00 exposes 5 functions:

Code:
lspci -nv -s 0b:00
0b:00.0 0300: 1002:164e (rev c2) (prog-if 00 [VGA controller])
        Subsystem: 1043:8877
        Flags: bus master, fast devsel, latency 0, IRQ 255, IOMMU group 24
        Memory at fce0000000 (64-bit, prefetchable) [size=256M]
        Memory at fcf0000000 (64-bit, prefetchable) [size=2M]
        I/O ports at f000 [disabled] [size=256]
        Memory at fc900000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable- Count=4 Masked-
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270] Secondary PCI Express
        Capabilities: [2a0] Access Control Services
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [410] Physical Layer 16.0 GT/s <?>
        Capabilities: [450] Lane Margining at the Receiver <?>
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu

0b:00.1 0403: 1002:1640
        Subsystem: 1043:8877
        Flags: fast devsel, IRQ 255, IOMMU group 25
        Memory at fc980000 (32-bit, non-prefetchable) [disabled] [size=16K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [2a0] Access Control Services
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

0b:00.2 1080: 1022:1649
        Subsystem: 1043:8877
        Flags: fast devsel, IRQ 255, IOMMU group 26
        Memory at fc800000 (32-bit, non-prefetchable) [disabled] [size=1M]
        Memory at fc984000 (32-bit, non-prefetchable) [disabled] [size=8K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/2 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable- Count=2 Masked-
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [2a0] Access Control Services
        Kernel driver in use: vfio-pci
        Kernel modules: ccp

0b:00.3 0c03: 1022:15b6 (prog-if 30 [XHCI])
        Subsystem: 1043:8877
        Flags: fast devsel, IRQ 26, IOMMU group 27
        Memory at fc700000 (64-bit, non-prefetchable) [size=1M]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/8 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable- Count=8 Masked-
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [2a0] Access Control Services
        Kernel driver in use: vfio-pci
        Kernel modules: xhci_pci

0b:00.4 0c03: 1022:15b7 (prog-if 30 [XHCI])
        Subsystem: 1043:8877
        Flags: fast devsel, IRQ 26, IOMMU group 28
        Memory at fc600000 (64-bit, non-prefetchable) [size=1M]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/8 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable- Count=8 Masked-
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [2a0] Access Control Services
        Kernel driver in use: vfio-pci
        Kernel modules: xhci_pci

I went ahead and blacklisted snd_hda_intel, ccp, and xhci_pci.

The system still crashes with 'All Functions'.

The VM starts with 0b:00.0, 0b:00.1, 0b:00.2, and 0b:00.3 attached.

Attaching 0b:00.4 causes the crash (even without checking 'All Functions').
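
Since the VM starts with functions .0 through .3 but not with .4, one thing to experiment with is attaching the functions individually from the CLI instead of ticking 'All Functions'. This is a sketch only. `hostpci_args` is a hypothetical helper and VM ID 100 is a placeholder. `qm set` with `--hostpciN` options is the standard Proxmox CLI, but whether the guest can actually use the GPU without function .4 is untested here.

```shell
# hostpci_args ADDR...: print one "--hostpciN ADDR" option per PCI function,
# numbering the slots from 0 in the order the addresses are given.
hostpci_args() {
    i=0
    for addr in "$@"; do
        printf -- '--hostpci%d %s\n' "$i" "$addr"
        i=$((i + 1))
    done
}

# Placeholder VM ID 100; 0b:00.4 is deliberately left out.
# The command substitution is unquoted on purpose so it splits into arguments.
# qm set 100 $(hostpci_args 0000:0b:00.0 0000:0b:00.1 0000:0b:00.2 0000:0b:00.3)
```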
 
Must be the Micro Center special; I'm running an extremely similar build and having similar problems myself. Try a CPU type other than host: I can get my VM to boot with all-functions passthrough using anything but host. Not that it's super helpful, since I would really like host passthrough, but as far as I can tell the problem is the combination of host CPU and all-functions passthrough. Also, I need to set a vbios, but that might just be a 7900xt thing.
 
It is the Micro Center special! Hopefully we can get to the bottom of this one together.

I switched the CPU configuration from host to qemu64, kvm64, and x86-64-v4. All 3 continue to crash the box when booted with 'All Functions'. Any idea which CPU configuration could work?

The Windows VM boots; it just does not see the GPU. It's specifically attaching 0b:00.4 that causes the crash (and it crashes even without checking 'All Functions').

Code:
0b:00.4 0c03: 1022:15b7 (prog-if 30 [XHCI])
        Subsystem: 1043:8877
        Flags: fast devsel, IRQ 26, IOMMU group 28
        Memory at fc600000 (64-bit, non-prefetchable) [size=1M]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/8 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable- Count=8 Masked-
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [2a0] Access Control Services
        Kernel driver in use: vfio-pci
        Kernel modules: xhci_pci
 
1693965770334.png
This currently works for me. The SCSI controller doesn't seem to matter, and hidden doesn't seem to make a difference either. I'm doing a ton of passthrough, essentially trying to run 2 independent computers in 1.

It's weird that you have 4 functions, do you know what they are? I have 2, video and audio. The only thing I can think of is passing in the vbios/romfile. https://pve.proxmox.com/wiki/PCI_Passthrough#The_.27romfile.27_option
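
For reference, the wiki section linked above describes dumping the ROM through sysfs. Here is a rough sketch of that procedure. `dump_rom` is a hypothetical wrapper, the device address and output file name are placeholders, and reading the ROM can fail if the device is the GPU the host booted on.

```shell
# dump_rom DEV_DIR OUT_FILE: copy the option ROM exposed under a PCI device's
# sysfs directory to OUT_FILE.
dump_rom() {
    dev_dir="$1"
    out_file="$2"
    echo 1 > "$dev_dir/rom"            # enable reads of the ROM
    cat "$dev_dir/rom" > "$out_file"   # copy the ROM contents
    echo 0 > "$dev_dir/rom"            # disable reads again
}

# Example (run as root); address and file name are placeholders:
# dump_rom /sys/bus/pci/devices/0000:0b:00.0 /usr/share/kvm/vbios.bin
```

The dumped file can then be referenced with romfile=vbios.bin on the VM's hostpci line; Proxmox resolves that name relative to /usr/share/kvm/.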
 
Thanks, I can try running through that to get the ROM file. Is that how you attached the 7900xt?

Since you have the same motherboard, can you confirm the BIOS settings for GPU Passthru?
 
Yeah, definitely doesn't work without it.

IOMMU on, rebar off, same for me.
 
Thank you. That worked. I can now see my GPU in the PCI Device List.
Nice, let me know if you can actually boot. That's where I'm stuck: with host CPU and GPU passthrough, I never get anything on my monitor. I think the whole system doesn't boot; I can't even access it via RDP.
 
Yessir, I could boot even before GPU passthru. If you need help there, I can point you in the right direction:
https://www.wundertech.net/how-to-install-windows-11-on-proxmox/ seems to have all the steps there.
The real trick with the Windows VMs is getting the boot order set properly.
 
