GPU Passthrough not respecting secondary GPU

crazywolf13

Member
Oct 15, 2023
Hi,

I have two GPUs:
Code:
root@tower8:~# lspci | egrep 'VGA|3D'
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)

I want the 1080 GPU passed through to a VM and the 1060 available to the host, possibly for LXC passthrough.

I've done the following setup, which has worked fine the other times I've passed NVIDIA GPUs through to VMs.

Add iommu boot argument:

Code:
root@tower8:~# cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"

Add vfio modules:
Code:
root@tower8:~# nano /etc/modules
vfio
vfio_iommu_type1
vfio_pci

Add driver blacklist:
Code:
root@tower8:~# nano /etc/modprobe.d/blacklist.conf
blacklist radeon
blacklist nouveau
# comment out nvidia-driver for gpu-separation
# blacklist nvidia

Check GPU ID:
Code:
root@tower8:~# lspci -n -s 01:00
01:00.0 0300: 10de:1b80 (rev a1)
01:00.1 0403: 10de:10f0 (rev a1)
root@tower8:~# lspci -n -s 02:00
02:00.0 0300: 10de:1c02 (rev a1)
02:00.1 0403: 10de:10f1 (rev a1)
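The third column of that output is the vendor:device ID that vfio-pci's `ids=` option expects. As a small sketch (fed with the sample lines from above rather than a live `lspci` call), the list can be built mechanically:

```shell
# Sketch: turn the vendor:device column of `lspci -n` output into the
# comma-separated list for vfio-pci's ids= option.
# The sample lines below are the 1080's two functions from this thread.
ids=$(awk '{print $3}' <<'EOF' | paste -sd, -
01:00.0 0300: 10de:1b80 (rev a1)
01:00.1 0403: 10de:10f0 (rev a1)
EOF
)
printf '%s\n' "$ids"
```

On a live system you would pipe `lspci -n -s 01:00` into the same awk/paste pipeline.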

Added the GPU to vfio:
Code:
root@tower8:~# nano /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1b80 disable_vga=1
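An alternative sketch for host/VM splits of same-vendor cards (IDs taken from this thread; the softdep lines are an assumption about this kind of setup, not something confirmed to be required here) is to bind both functions of the 1080 and let modprobe order vfio-pci before the NVIDIA and HDA drivers, instead of blacklisting them globally:

```
# /etc/modprobe.d/vfio.conf (sketch)
options vfio-pci ids=10de:1b80,10de:10f0 disable_vga=1
softdep nvidia pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
```

With this, the nvidia module stays usable for any NVIDIA device not listed in `ids=`.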
Code:
update-grub
update-initramfs -u
reboot
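After the reboot, a quick way to see which driver claimed each PCI function is to parse `lspci -nnk`. A sketch, fed with sample text here; in practice you would pipe the live command into the same awk:

```shell
# Sketch: print "address driver" pairs from `lspci -nnk`-style output.
# Sample input is abbreviated from this thread, not live data.
summary=$(awk '/^[0-9a-f]/ {dev=$1}
               /Kernel driver in use:/ {print dev, $NF}' <<'EOF'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
        Kernel driver in use: vfio-pci
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] [10de:1c02] (rev a1)
        Kernel driver in use: nvidia
EOF
)
printf '%s\n' "$summary"
```

The live form would be `lspci -nnk | awk '...'` with the same script.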

The above works fine for passing the 1080 GPU to a VM (after adding the PCIe device via the web UI); however, the second GPU, the 1060, is not available to the host:
Code:
root@tower8:~# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Code:
root@tower8:~# lspci -nnk | grep -A3 NVIDIA
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:8725]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:8725]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] [10de:1c02] (rev a1)
        Subsystem: Hewlett-Packard Company Device [103c:82fc]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidia
02:00.1 Audio device [0403]: NVIDIA Corporation GP106 High Definition Audio Controller [10de:10f1] (rev a1)
        Subsystem: Hewlett-Packard Company Device [103c:82fc]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

As the above shows, the 1060 is still claimed by vfio-pci. What do I need to do to exclude it from vfio-pci so the host can use it again?

I noticed the package pve-nvidia-vgpu-helper is installed; could it be interfering here?
 
 

Thanks a lot for this!

I've done it exactly as in the post and removed my config entry from /etc/modprobe.d/vfio.conf


However, after rebuilding the initramfs and rebooting, I still see this:


Code:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:8725]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:8725]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] [10de:1c02] (rev a1)
        Subsystem: Hewlett-Packard Company Device [103c:82fc]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidia
02:00.1 Audio device [0403]: NVIDIA Corporation GP106 High Definition Audio Controller [10de:10f1] (rev a1)
        Subsystem: Hewlett-Packard Company Device [103c:82fc]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

any ideas?
 
If it still binds to vfio-pci even after removing the blacklist, I believe early binding for vfio-pci is also configured somewhere.

A fresh Proxmox install does not use vfio-pci at all.

Please review the settings you have configured in /etc/modprobe.d/.

* I don't think the blacklist has actually been removed, but that's just a guess.
 
Add driver blacklist:
Code:
root@tower8:~# nano /etc/modprobe.d/blacklist.conf
blacklist radeon
blacklist nouveau
# comment out nvidia-driver for gpu-separation
# blacklist nvidia

Added the GPU to vfio:
Code:
root@tower8:~# nano /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1b80 disable_vga=1

Even after removing the vfio config, it doesn't look like the blacklist has been removed.

If you are using a passthrough script, the blacklist is unnecessary.
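A one-liner can audit a modprobe.d-style directory for vfio options and nvidia/nouveau blacklists. A sketch, demonstrated on a throwaway directory; on the real host you would point the grep at /etc/modprobe.d:

```shell
# Sketch: list files containing vfio options or nvidia/nouveau blacklists.
# A temp directory stands in for /etc/modprobe.d; its contents mirror the
# entries quoted in this thread.
dir=$(mktemp -d)
printf 'options vfio-pci ids=10de:1b80 disable_vga=1\n' > "$dir/vfio.conf"
printf 'blacklist nouveau\n' > "$dir/blacklist.conf"
hits=$(grep -rlE 'vfio|blacklist (nvidia|nouveau)' "$dir" | xargs -n1 basename | sort)
printf '%s\n' "$hits"
```

Also worth checking whether a stale copy is baked into the initramfs, e.g. with `lsinitramfs /boot/initrd.img-$(uname -r) | grep vfio`.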
 
I don't think I've ever touched any files manually other than the ones I showed above:


Code:
root@tower8:~# for f in /etc/modprobe.d/*; do
  echo "===== $f ====="
  cat "$f"
  echo
done
===== /etc/modprobe.d/blacklist.conf =====
blacklist radeon
blacklist nouveau

===== /etc/modprobe.d/dkms.conf =====
# modprobe information used for DKMS modules
#
# This is a stub file, should be edited when needed,
# used by default by DKMS.

===== /etc/modprobe.d/intel-microcode-blacklist.conf =====
# The microcode module attempts to apply a microcode update when
# it autoloads.  This is not always safe, so we block it by default.
blacklist microcode

===== /etc/modprobe.d/iommu_unsafe_interrupts.conf =====

===== /etc/modprobe.d/kvm.conf =====

===== /etc/modprobe.d/nvidia-blacklists-nouveau.conf =====
# You need to run "update-initramfs -u" after editing this file.

# see #580894
blacklist nouveau

===== /etc/modprobe.d/nvidia.conf =====
install nvidia modprobe -i nvidia-current $CMDLINE_OPTS

install nvidia-modeset modprobe nvidia ; modprobe -i nvidia-current-modeset $CMDLINE_OPTS

install nvidia-drm modprobe nvidia-modeset ; modprobe -i nvidia-current-drm $CMDLINE_OPTS

install nvidia-uvm modprobe nvidia ; modprobe -i nvidia-current-uvm $CMDLINE_OPTS

install nvidia-peermem modprobe nvidia ; modprobe -i nvidia-current-peermem $CMDLINE_OPTS

# unloading needs the internal names (i.e. upstream's names, not our renamed files)

remove nvidia modprobe -r -i nvidia-drm nvidia-modeset nvidia-peermem nvidia-uvm nvidia

remove nvidia-modeset modprobe -r -i nvidia-drm nvidia-modeset


alias char-major-195* nvidia

# These aliases are defined in *all* nvidia modules.
# Duplicating them here sets higher precedence and ensures the selected
# module gets loaded instead of a random first match if more than one
# version is installed. See #798207.
alias   pci:v000010DEd00000E00sv*sd*bc04sc80i00*        nvidia
alias   pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*        nvidia
alias   pci:v000010DEd*sv*sd*bc03sc02i00*               nvidia
alias   pci:v000010DEd*sv*sd*bc03sc00i00*               nvidia

===== /etc/modprobe.d/nvidia-options.conf =====
#options nvidia-current NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=44 NVreg_DeviceFileMode=0660

# To grant performance counter access to unprivileged users, uncomment the following line:
#options nvidia-current NVreg_RestrictProfilingToAdminUsers=0

# Uncomment to enable this power management feature:
#options nvidia-current NVreg_PreserveVideoMemoryAllocations=1

# Uncomment to enable this power management feature:
#options nvidia-current NVreg_EnableS0ixPowerManagement=1

===== /etc/modprobe.d/pve-blacklist.conf =====
# This file contains a list of modules which are not supported by Proxmox VE

# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb

Those seem to come from either the Intel microcode update, the NVIDIA driver, or Proxmox itself, as far as I can tell?

I've now removed the blacklist file entirely, but that did not fix it either.
 
Did you check the IOMMU groups: https://pve.proxmox.com/wiki/PCI_Passthrough#Verify_IOMMU_isolation ? If both GPUs are in the same group, then you cannot split them between different VMs and/or the Proxmox host.
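The grouping can also be read straight from sysfs. A sketch of the usual loop, run here against a mock sysfs tree so the layout is clear; on the real host, replace $root with /sys:

```shell
# Sketch: list devices per IOMMU group. The mock tree reproduces the
# "both GPUs in group 1" situation from this thread.
root=$(mktemp -d)
mkdir -p "$root/kernel/iommu_groups/1/devices/0000:01:00.0" \
         "$root/kernel/iommu_groups/1/devices/0000:02:00.0"
groups=$(for d in "$root"/kernel/iommu_groups/*/devices/*; do
  printf 'group %s: %s\n' "$(basename "$(dirname "$(dirname "$d")")")" "$(basename "$d")"
done)
printf '%s\n' "$groups"
```

Every device sharing a group must go to the same owner (one VM, or the host), which is why a shared group blocks this split.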
That seems to explain it.

So that would mean it's not possible to have one GPU for the host and one for a VM?

Code:
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────
│ 0x030000 │ 0x1b80 │ 0000:01:00.0 │          1 │ 0x10de │ GP104 [GeForce GTX 1080]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────
│ 0x030000 │ 0x1c02 │ 0000:02:00.0 │          1 │ 0x10de │ GP106 [GeForce GTX 1060 3GB]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────
 
That would mean different groups, right?
Code:
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────
│ 0x030000 │ 0x1b80 │ 0000:01:00.0 │          1 │ 0x10de │ GP104 [GeForce GTX 1080]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────
│ 0x030000 │ 0x1c02 │ 0000:02:00.0 │          1 │ 0x10de │ GP106 [GeForce GTX 1060 3GB]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────
No, the iommugroup column shows group 1 for both. Try a different PCIe slot for one or both (as there might be more devices in group 1).
 
Okay, I just tried the ACS patch since, according to some Reddit post, my motherboard doesn't have ACS, which is pretty likely as it's a Z390 Prime.

It seems to work fine with it; I now have different groups:


Code:
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────
│ 0x030000 │ 0x1b80 │ 0000:01:00.0 │         10 │ 0x10de │ GP104 [GeForce GTX 1080]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────
│ 0x030000 │ 0x1c02 │ 0000:02:00.0 │         11 │ 0x10de │ GP106 [GeForce GTX 1060 3GB]
├──────────┼────────┼──────────────┼────────────┼────────┼──────────────────────────────
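For reference, the ACS override on Proxmox kernels is typically enabled through a kernel parameter; a sketch of the GRUB line (exact option values vary by board, so treat this as an assumption rather than a known-good config for the Z390 Prime):

```
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on pcie_acs_override=downstream,multifunction"
```

Keep in mind the override only tells the kernel to pretend the groups are isolated; the hardware may still allow peer-to-peer DMA between them, so this is a security trade-off.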



Code:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:8725]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:8725]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] [10de:1c02] (rev a1)
        Subsystem: Hewlett-Packard Company Device [103c:82fc]
        Kernel modules: nvidia
02:00.1 Audio device [0403]: NVIDIA Corporation GP106 High Definition Audio Controller [10de:10f1] (rev a1)
        Subsystem: Hewlett-Packard Company Device [103c:82fc]
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel


Funnily enough, it now seems like no driver is in use at all for the second GPU?
 
Sorry, but I don't really understand what you mean.

Yeah, I guess they were added when I installed nvidia-driver. What now?

I've only deleted the file /etc/modprobe.d/blacklist.conf, as that one was created manually.
 
So it seems the proper NVIDIA kernel module is missing. I installed it with apt install nvidia-driver after enabling non-free contrib; however, as far as I remember, I ran a kernel update afterwards via apt upgrade.

Code:
root@tower8:~# find /lib/modules/$(uname -r) -name "nvidia*.ko" | grep -v fbdev
/lib/modules/6.17.4-2-pve/kernel/drivers/platform/x86/nvidia-wmi-ec-backlight.ko

DKMS shows the module is added but not built:

Code:
root@tower8:~# dkms status
nvidia-current/550.163.01: added

modprobe fails because it can't find nvidia-current:
Code:
root@tower8:~# modprobe nvidia
modprobe: FATAL: Module nvidia-current not found in directory /lib/modules/6.17.4-2-pve

What would be the suggested way to handle this, without breaking too much of the proxmox specific kernel stuff?

I'm running Proxmox VE 9.0.1 on kernel 6.17.4-2-pve with nvidia-driver 550.163.01-2 from deb trixie
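A sketch of the usual DKMS recovery path after a kernel update left the module unbuilt. The headers package name here is an assumption (older releases used pve-headers-*), so verify it with apt search first:

```
# Headers for the running kernel so DKMS can compile the module
# (package name assumed; check: apt search headers | grep $(uname -r))
apt install proxmox-headers-$(uname -r)

# Build and install the version reported by `dkms status`
dkms build nvidia-current/550.163.01
dkms install nvidia-current/550.163.01

# Pick up the module and regenerate the initramfs
update-initramfs -u
modprobe nvidia
```

This stays within the packaged DKMS flow, so it should not disturb the Proxmox kernel itself.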