Hey all!
I have struggled with GPU passthrough many times, as many before me have. I previously had a cluster of two PVE nodes that never achieved stable GPU passthrough, so I had given up until now. I am now testing on a single PVE instance with two GPUs: both an AMD and an Nvidia card are present in the server, and I would like to pass them through to two separate VMs.

Currently the AMD GPU is successfully passed through to a Windows 10 VM; drivers are installed and it runs fine. However, occasionally after some amount of time the VM is set to the 'suspended' status. I have only observed this once, and it has not repeated in the last 8 hours. Resuming simply spikes the CPU to 100% and then stops the VM. This has only occurred once on the new Proxmox instance, but it was the reason I gave up trying to get this working a few weeks ago.
The Nvidia GPU has not been successfully passed through to the Windows 10 VM I have dedicated it to. Starting that VM fails with the error shown below.
All relevant files and errors are, hopefully, listed below; if any additional info is needed, please ask! I feel I am extremely close to getting this fully functional, I am simply ignorant of the last couple of steps to get it across the finish line.
System Info:
MoBo: Asus Prime Z390-A (latest BIOS, version 2004)
CPU: Intel i9-9900K
RAM: 4x16 GB Corsair
GPU(s):
- AMD Reference RX 6950 XT (Slot 1)
- Nvidia GTX 1650 Super (Slot 2)
OS: Proxmox PVE 8.0.3
BIOS options:
- VT-d enabled
- SR-IOV enabled
- Above 4G Decoding enabled
- Primary GPU - iGPU
- CSM disabled
- Resizable BAR disabled
cat /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt initcall_blacklist=sysfb_init video=vesafb:off video=efifb:off video=simplefb:off"
GRUB_CMDLINE_LINUX=""
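Since changes to /etc/default/grub only take effect after `update-grub` and a reboot, it may be worth confirming the flag actually reached the running kernel; a minimal sanity check (sketch):

```shell
# Hedged sketch: confirm the IOMMU flag from /etc/default/grub actually
# reached the running kernel (requires update-grub and a reboot first).
if grep -qw "intel_iommu=on" /proc/cmdline 2>/dev/null; then
    msg="intel_iommu=on is active on the running kernel"
else
    msg="intel_iommu=on missing: run update-grub and reboot"
fi
echo "$msg"
```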
lspci -nnn -s 03:00
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] [1002:73a5] (rev c0)
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
03:00.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:73a6]
03:00.3 Serial bus controller [0c80]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 USB [1002:73a4]
lspci -nnn -s 04:00
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1650 SUPER] [10de:2187] (rev a1)
04:00.1 Audio device [0403]: NVIDIA Corporation TU116 High Definition Audio Controller [10de:1aeb] (rev a1)
04:00.2 USB controller [0c03]: NVIDIA Corporation TU116 USB 3.1 Host Controller [10de:1aec] (rev a1)
04:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU116 USB Type-C UCSI Controller [10de:1aed] (rev a1)
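Both cards expose four functions each, and whether passthrough works depends on how those functions are grouped by the IOMMU. A small helper to print every group and its devices (a sketch assuming the standard sysfs layout; the directory argument exists only so the helper can be exercised against a fake tree, on a real host use the default):

```shell
# Sketch: walk the IOMMU group directories under sysfs and print every
# device in each group. The directory is parameterized purely so the
# helper can be run against a fake tree.
list_iommu_groups() {
    dir="${1:-/sys/kernel/iommu_groups}"
    for g in "$dir"/*/devices; do
        [ -d "$g" ] || continue
        grp="${g%/devices}"
        echo "IOMMU group ${grp##*/}:"
        for d in "$g"/*; do
            [ -e "$d" ] || continue
            # Show the lspci description when available, else the raw address
            desc="$(lspci -nns "${d##*/}" 2>/dev/null)"
            echo "  ${desc:-${d##*/}}"
        done
    done
}

list_iommu_groups
```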
cat /etc/modules
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
ls -a /etc/modprobe.d
. .. iommu_unsafe_interrupts.conf kvm.conf pve-blacklist.conf vfio.conf
cat /etc/modprobe.d/iommu_unsafe_interrupts.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1
cat /etc/modprobe.d/pve-blacklist.conf
blacklist nvidiafb
blacklist nvidia
blacklist nouveau
cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:73a5,10de:2187,1002:ab28,10de:1aeb disable_vga=1
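Beyond blacklisting, module load order can matter: if a framebuffer driver initializes the card before vfio-pci claims it, the host keeps a handle on it. A hedged sketch of soft dependencies that could be appended to /etc/modprobe.d/vfio.conf (followed by `update-initramfs -u -k all` and a reboot); the driver names here match the ones already blacklisted above:

```
# Ask modprobe to load vfio-pci before these drivers touch the card
softdep nvidiafb pre: vfio-pci
softdep nouveau pre: vfio-pci
```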
cat /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1
Nvidia VM config
- cat /etc/pve/qemu-server/101.conf
agent: 1
balloon: 0
bios: ovmf
boot: order=sata0;ide2;net0
cores: 8
cpu: host
efidisk0: local-lvm:vm-101-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:04:00,pcie=1,x-vga=1
machine: pc-q35-8.0
memory: 8192
meta: creation-qemu=8.0.2,ctime=1697717771
name: win10-temp-nvidia
net0: e1000=xx:xx:xx:xx:xx:xx,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
sata0: local-lvm:vm-101-disk-1,size=180G
scsihw: virtio-scsi-single
smbios1: uuid=d6d24523-7c52-4155-90fb-97541c80207f
sockets: 1
tpmstate0: local-lvm:vm-101-disk-2,size=4M,version=v2.0
vga: none
vmgenid: 0403bef3-5875-4c33-9dac-5d840e7d6c28
Error observed when starting Windows 10 VM with Nvidia GPU
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,host=0000:04:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:04:00.0: failed to open /dev/vfio/2: Device or resource busy
stopping swtpm instance (pid 46240) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1
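The failed open names /dev/vfio/2, i.e. IOMMU group 2, which suggests some other device in that group is still bound to a host driver. A small helper to list each device in a group alongside its current driver (a sketch; taking the group directory as an argument):

```shell
# Sketch: print every device in an IOMMU group together with the kernel
# driver currently bound to it; any entry not bound to vfio-pci would
# explain a "Device or resource busy" on the matching /dev/vfio node.
drivers_in_group() {
    for d in "$1"/devices/*; do
        [ -e "$d" ] || continue
        if [ -e "$d/driver" ]; then
            drv="$(basename "$(readlink -f "$d/driver")")"
        else
            drv="(no driver)"
        fi
        echo "${d##*/} -> $drv"
    done
}

drivers_in_group /sys/kernel/iommu_groups/2
```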
Checking kernel driver loaded for GPUs
- lspci -nnk -d [gpu id]
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] [1002:73a5] (rev c0)
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] [1002:0e3a]
Kernel driver in use: vfio-pci
Kernel modules: amdgpu
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1650 SUPER] [10de:2187] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd TU116 [GeForce GTX 1650 SUPER] [1458:401b]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
As best I can tell from the errors, the host (PVE) is somehow accessing the Nvidia GPU and therefore not allowing the VM to use it. I am unsure how this is the case, as I am pretty sure I added everything I could to isolate the device from the host. Obviously something I did was wrong, or else it would've worked, but I am stumped as to next steps. I would love for someone to point out where I am making my missteps! Thanks in advance for the help!