Hey all!
I have struggled with GPU passthrough many times, as many before me have. I previously had a cluster of two PVE nodes that never achieved stable GPU passthrough, so I had given up until now. I am now testing on a single PVE instance with two GPUs: both an AMD and an Nvidia card are present in the server, and I would like to pass them through to two separate VMs.

Currently the AMD GPU is successfully passed through to a Windows 10 VM; drivers are installed and it runs fine. However, occasionally after some amount of time the VM is set to the 'suspended' status. I have only observed this once, and it has not repeated in the last 8 hours. Resuming simply spikes the CPU to 100% and then stops the VM. This has only occurred once on the new Proxmox instance, but it was the reason I gave up trying to get this working a few weeks ago.
The Nvidia GPU has not been successfully passed through to the Windows 10 VM I have dedicated it to. Starting that VM fails with the error shown below.
All relevant files and errors are, hopefully, listed below; if any additional info is needed, please ask! I feel I am extremely close to getting this fully functional, I am simply ignorant of the last couple of steps to get it across the finish line.
System Info:
MoBo: Asus Prime Z390-A (latest BIOS, version 2004)
CPU: Intel i9-9900K
RAM: 4x16 GB Corsair
GPU(s):
- AMD Reference RX 6950 XT (Slot 1)
- Nvidia GTX 1650 Super (Slot 2)
OS: Proxmox PVE 8.0.3
BIOS options:
- VT-d enabled
- SR-IOV enabled
- Above 4G Decoding enabled
- Primary GPU - iGPU
- CSM disabled
- Resizable BAR disabled
cat /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt initcall_blacklist=sysfb_init video=vesafb:off video=efifb:off video=simplefb:off"
GRUB_CMDLINE_LINUX=""
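Since changes to /etc/default/grub only take effect after `update-grub` and a reboot, it may be worth confirming the flag actually reached the running kernel; a minimal sanity check (sketch):

```shell
# Hedged sketch: confirm the IOMMU flag from /etc/default/grub actually
# reached the running kernel (requires update-grub and a reboot first).
if grep -qw "intel_iommu=on" /proc/cmdline 2>/dev/null; then
    msg="intel_iommu=on is active on the running kernel"
else
    msg="intel_iommu=on missing: run update-grub and reboot"
fi
echo "$msg"
```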
lspci -nnn -s 03:00
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] [1002:73a5] (rev c0)
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
03:00.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:73a6]
03:00.3 Serial bus controller [0c80]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 USB [1002:73a4]
lspci -nnn -s 04:00
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1650 SUPER] [10de:2187] (rev a1)
04:00.1 Audio device [0403]: NVIDIA Corporation TU116 High Definition Audio Controller [10de:1aeb] (rev a1)
04:00.2 USB controller [0c03]: NVIDIA Corporation TU116 USB 3.1 Host Controller [10de:1aec] (rev a1)
04:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU116 USB Type-C UCSI Controller [10de:1aed] (rev a1)
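Both cards expose four functions each, and whether passthrough works depends on how those functions are grouped by the IOMMU. A small helper to print every group and its devices (a sketch assuming the standard sysfs layout; the directory argument exists only so the helper can be exercised against a fake tree, on a real host use the default):

```shell
# Sketch: walk the IOMMU group directories under sysfs and print every
# device in each group. The directory is parameterized purely so the
# helper can be run against a fake tree.
list_iommu_groups() {
    dir="${1:-/sys/kernel/iommu_groups}"
    for g in "$dir"/*/devices; do
        [ -d "$g" ] || continue
        grp="${g%/devices}"
        echo "IOMMU group ${grp##*/}:"
        for d in "$g"/*; do
            [ -e "$d" ] || continue
            # Show the lspci description when available, else the raw address
            desc="$(lspci -nns "${d##*/}" 2>/dev/null)"
            echo "  ${desc:-${d##*/}}"
        done
    done
}

list_iommu_groups
```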
cat /etc/modules
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
ls -a /etc/modprobe.d
. .. iommu_unsafe_interrupts.conf kvm.conf pve-blacklist.conf vfio.conf
cat /etc/modprobe.d/iommu_unsafe_interrupts.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1
cat /etc/modprobe.d/pve-blacklist.conf
blacklist nvidiafb
blacklist nvidia
blacklist nouveau
cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:73a5,10de:2187,1002:ab28,10de:1aeb disable_vga=1
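Beyond blacklisting, module load order can matter: if a framebuffer driver initializes the card before vfio-pci claims it, the host keeps a handle on it. A hedged sketch of soft dependencies that could be appended to /etc/modprobe.d/vfio.conf (followed by `update-initramfs -u -k all` and a reboot); the driver names here match the ones already blacklisted above:

```
# Ask modprobe to load vfio-pci before these drivers touch the card
softdep nvidiafb pre: vfio-pci
softdep nouveau pre: vfio-pci
```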
cat /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1
Nvidia VM config
- cat /etc/pve/qemu-server/101.conf
agent: 1
balloon: 0
bios: ovmf
boot: order=sata0;ide2;net0
cores: 8
cpu: host
efidisk0: local-lvm:vm-101-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:04:00,pcie=1,x-vga=1
machine: pc-q35-8.0
memory: 8192
meta: creation-qemu=8.0.2,ctime=1697717771
name: win10-temp-nvidia
net0: e1000=xx:xx:xx:xx:xx:xx,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
sata0: local-lvm:vm-101-disk-1,size=180G
scsihw: virtio-scsi-single
smbios1: uuid=d6d24523-7c52-4155-90fb-97541c80207f
sockets: 1
tpmstate0: local-lvm:vm-101-disk-2,size=4M,version=v2.0
vga: none
vmgenid: 0403bef3-5875-4c33-9dac-5d840e7d6c28
Error observed when starting Windows 10 VM with Nvidia GPU
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,host=0000:04:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:04:00.0: failed to open /dev/vfio/2: Device or resource busy
stopping swtpm instance (pid 46240) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1
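The failed open names /dev/vfio/2, i.e. IOMMU group 2, which suggests some other device in that group is still bound to a host driver. A small helper to list each device in a group alongside its current driver (a sketch; taking the group directory as an argument):

```shell
# Sketch: print every device in an IOMMU group together with the kernel
# driver currently bound to it; any entry not bound to vfio-pci would
# explain a "Device or resource busy" on the matching /dev/vfio node.
drivers_in_group() {
    for d in "$1"/devices/*; do
        [ -e "$d" ] || continue
        if [ -e "$d/driver" ]; then
            drv="$(basename "$(readlink -f "$d/driver")")"
        else
            drv="(no driver)"
        fi
        echo "${d##*/} -> $drv"
    done
}

drivers_in_group /sys/kernel/iommu_groups/2
```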
Checking kernel driver loaded for GPUs
- lspci -nnk -d [gpu id]
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] [1002:73a5] (rev c0)
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] [1002:0e3a]
Kernel driver in use: vfio-pci
Kernel modules: amdgpu
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1650 SUPER] [10de:2187] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd TU116 [GeForce GTX 1650 SUPER] [1458:401b]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
As best I can tell from the errors, the host (PVE) is somehow accessing the Nvidia GPU and therefore not allowing the VM to use it. I am unsure how this is the case, as I am pretty sure I added everything I could to isolate the device from the host. Obviously something I did was wrong, or else it would've worked, but I am stumped as to next steps. I would love for someone to point out where I am making my missteps! Thanks in advance for the help!