GPU Passthrough to VM - Everything seems working but it doesn't

lsk

New Member
Apr 10, 2022
Hi everyone,
I have a fresh install of Proxmox VE 7.1 (UEFI) and some VMs that I migrated from an ESXi 6 server.
One of them uses an Nvidia Quadro P600 GPU for video encoding, so I needed passthrough.
I enabled VT-d in the BIOS (the machine is an HP EliteDesk 800 G5 SFF with an i5-9500).

I enabled IOMMU:

/etc/default/grub
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_aspm=off pci=noaer"
followed by
update-grub and, for good measure, pve-efiboot-tool refresh
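
A quick way to confirm that the running kernel actually received these options is to look for them in /proc/cmdline after the reboot. This matters because Proxmox hosts that boot through systemd-boot read their options from /etc/kernel/cmdline rather than /etc/default/grub. The sketch below is only an illustration: has_param is a hypothetical helper name, not a Proxmox tool, and the sample string is simply the GRUB line quoted above.

```shell
#!/bin/sh
# has_param CMDLINE PARAM
# Succeeds (exit status 0) if PARAM appears as a whole word in CMDLINE.
# Hypothetical helper, for illustration only.
has_param() {
    for word in $1; do
        [ "$word" = "$2" ] && return 0
    done
    return 1
}

# Sample input: the options set in /etc/default/grub above.
sample="quiet intel_iommu=on iommu=pt pcie_aspm=off pci=noaer"

if has_param "$sample" "intel_iommu=on"; then
    echo "intel_iommu=on is present"
else
    echo "intel_iommu=on is missing"
fi
```

On the host itself the same check would be has_param "$(cat /proc/cmdline)" intel_iommu=on.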

so dmesg | grep -e DMAR -e IOMMU -e AMD-Vi gives:
Code:
[    0.008592] ACPI: DMAR 0x00000000A3C0D000 0000C8 (v01 INTEL  CFL      00000002      01000013)
[    0.008622] ACPI: Reserving DMAR table memory at [mem 0xa3c0d000-0xa3c0d0c7]
[    0.026542] DMAR: IOMMU enabled
[    0.068632] DMAR: Host address width 39
[    0.068633] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[    0.068638] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 1c0000c40660462 ecap 19e2ff0505e
[    0.068640] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[    0.068643] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
[    0.068645] DMAR: RMRR base: 0x000000a381d000 end: 0x000000a383cfff
[    0.068646] DMAR: RMRR base: 0x000000a8000000 end: 0x000000ac7fffff
[    0.068647] DMAR: RMRR base: 0x000000a386e000 end: 0x000000a38edfff
[    0.068649] DMAR-IR: IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 1
[    0.068650] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[    0.068651] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.071854] DMAR-IR: Enabled IRQ remapping in x2apic mode
[    0.287481] DMAR: No ATSR found
[    0.287481] DMAR: No SATC found
[    0.287483] DMAR: IOMMU feature fl1gp_support inconsistent
[    0.287484] DMAR: IOMMU feature pgsel_inv inconsistent
[    0.287484] DMAR: IOMMU feature nwfs inconsistent
[    0.287485] DMAR: IOMMU feature pasid inconsistent
[    0.287485] DMAR: IOMMU feature eafs inconsistent
[    0.287486] DMAR: IOMMU feature prs inconsistent
[    0.287486] DMAR: IOMMU feature nest inconsistent
[    0.287487] DMAR: IOMMU feature mts inconsistent
[    0.287487] DMAR: IOMMU feature sc_support inconsistent
[    0.287488] DMAR: IOMMU feature dev_iotlb_support inconsistent
[    0.287489] DMAR: dmar0: Using Queued invalidation
[    0.287491] DMAR: dmar1: Using Queued invalidation
[    0.287889] DMAR: Intel(R) Virtualization Technology for Directed I/O

and find /sys/kernel/iommu_groups/ -type l works correctly:

Code:
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/1/devices/0000:01:00.0
/sys/kernel/iommu_groups/1/devices/0000:01:00.1
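
The raw paths are easier to read when they are grouped. The following is a minimal sketch, assuming the usual /sys/kernel/iommu_groups/<N>/devices/<address> layout; list_groups is a hypothetical helper name. In the sample, both GPU functions (01:00.0 and 01:00.1) share group 1 with 00:01.0.

```shell
#!/bin/sh
# list_groups: read IOMMU device paths on stdin and print
# "group <N> <PCI address>" per device, sorted by group number.
list_groups() {
    # Path layout: /sys/kernel/iommu_groups/<N>/devices/<address>
    # Split on "/": field 5 is the group number, field 7 the PCI address.
    awk -F/ 'NF >= 7 { print "group", $5, $7 }' | sort -k2,2n -k3,3
}

# Sample input: the three paths shown above.
list_groups <<'EOF'
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/1/devices/0000:01:00.0
/sys/kernel/iommu_groups/1/devices/0000:01:00.1
EOF
```

On a real host the input would come from find /sys/kernel/iommu_groups/ -type l | list_groups.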

I added the required modules to /etc/modules:

Code:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

and configured vfio:
Code:
options vfio-pci ids=10de:1cb2,10de:0fb9

followed by update-initramfs -u -k all
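
The ids= value is made of the [vendor:device] pairs that lspci -nn prints for each function of the card. As a sketch of how that line can be derived rather than typed by hand (vfio_ids_line is a hypothetical helper name, and the input is assumed to be plain lspci -nn output without -k or -v, so that subsystem IDs are not picked up):

```shell
#!/bin/sh
# vfio_ids_line: read `lspci -nn` lines on stdin and print an
# "options vfio-pci ids=..." line built from the [vendor:device] pairs.
vfio_ids_line() {
    ids=$(grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' \
        | sed -e 's/^\[//' -e 's/]$//' \
        | paste -sd, -)
    printf 'options vfio-pci ids=%s\n' "$ids"
}

# Sample input: the two functions of the Quadro P600 from this post.
vfio_ids_line <<'EOF'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107GL [Quadro P600] [10de:1cb2] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
EOF
```

For the sample this reproduces the line configured above. On the host, lspci -nn -s 01:00 | vfio_ids_line would supply the input.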

lspci -nnk shows that the GPU correctly uses vfio-pci as its kernel driver, BUT it also lists nvidiafb and nouveau as kernel modules:

Code:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107GL [Quadro P600] [10de:1cb2] (rev a1)
        Subsystem: Hewlett-Packard Company GP107GL [Quadro P600] [103c:11bd]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
01:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
        Subsystem: Hewlett-Packard Company GP107GL High Definition Audio Controller [103c:11bd]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
That looks strange to me, but I can't tell whether it is OK.
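
For what it is worth, the two fields mean different things: "Kernel driver in use" is the driver currently bound to the device, while "Kernel modules" lists modules that declare support for that device ID, whether or not they are loaded or blacklisted. The sketch below pulls out only the bound driver per device; bound_drivers is a hypothetical helper name.

```shell
#!/bin/sh
# bound_drivers: read `lspci -nnk` output on stdin and print
# "<slot> <driver in use>" for every device that has a bound driver.
bound_drivers() {
    awk '
        /^[0-9a-f]/             { slot = $1 }        # device header line
        /Kernel driver in use:/ { print slot, $NF }  # driver actually bound
    '
}

# Sample input: the host output shown above.
bound_drivers <<'EOF'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107GL [Quadro P600] [10de:1cb2] (rev a1)
        Subsystem: Hewlett-Packard Company GP107GL [Quadro P600] [103c:11bd]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
01:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
        Subsystem: Hewlett-Packard Company GP107GL High Definition Audio Controller [103c:11bd]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
EOF
```

If both functions report vfio-pci here, the host side of the binding looks as intended.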

Moving on to the VM;
this is the .conf file:
Code:
boot: order=scsi0
cores: 6
hostpci0: 0000:01:00,pcie=1
machine: q35
memory: 4096
name: Emby-Server
net0: vmxnet3=00:0c:29:3c:ee:da,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
scsi0: local-lvm:vm-203-disk-0
smbios1: uuid=c0518828-c438-488b-bce2-5c029108f54c
sockets: 1
vmgenid: fbea01a0-8e04-4ed4-8ff3-458217639eef

and in fact it seems to work correctly:
lspci -nnk (inside VM)
Code:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107GL [Quadro P600] [10de:1cb2] (rev a1)
        Subsystem: Hewlett-Packard Company GP107GL [Quadro P600] [103c:11bd]
        Kernel driver in use: nvidia
        Kernel modules: nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
        Subsystem: Hewlett-Packard Company GP107GL High Definition Audio Controller [103c:11bd]
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
nvidia-detect
Code:
Detected NVIDIA GPUs:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107GL [Quadro P600] [10de:1cb2] (rev a1)

Checking card:  NVIDIA Corporation GP107GL [Quadro P600] (rev a1)
Your card is supported by all driver versions.
Your card is also supported by the Tesla 460 drivers series.
Your card is also supported by the Tesla 450 drivers series.
Your card is also supported by the Tesla 418 drivers series.
It is recommended to install the
    nvidia-driver
package.
BUT nvidia-smi
Code:
No devices were found
and these three blocks appear during boot:
Code:
[    0.970999] PCI Interrupt Link [GSIF] enabled at IRQ 21
[    0.972495] shpchp 0000:05:01.0: pci_hp_register failed with error -16
[    0.972730] shpchp 0000:05:01.0: Slot initialization failed
[    0.974412] shpchp 0000:05:02.0: HPC vendor_id 1b36 device_id 1 ss_vid 0 ss_did 0
[    0.974566] PCI Interrupt Link [GSIG] enabled at IRQ 22
[    0.975599] shpchp 0000:05:02.0: pci_hp_register failed with error -16
[    0.975754] shpchp 0000:05:02.0: Slot initialization failed
[    0.976773] shpchp 0000:05:03.0: HPC vendor_id 1b36 device_id 1 ss_vid 0 ss_did 0
[    0.976923] PCI Interrupt Link [GSIH] enabled at IRQ 23
[    0.978003] shpchp 0000:05:03.0: pci_hp_register failed with error -16
[    0.978156] shpchp 0000:05:03.0: Slot initialization failed
[    0.979098] shpchp 0000:05:04.0: HPC vendor_id 1b36 device_id 1 ss_vid 0 ss_did 0
[    0.979247] PCI Interrupt Link [GSIE] enabled at IRQ 20
[    0.980364] shpchp 0000:05:04.0: pci_hp_register failed with error -16
[    0.980516] shpchp 0000:05:04.0: Slot initialization failed

Code:
[    5.513278] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[    5.513634] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    5.651159] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[    5.651383] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

Code:
[   11.158594] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   11.158735] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   11.298546] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[   11.298725] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  567.809457] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[  567.809587] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  567.949916] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[  567.950115] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
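
When comparing logs between boots or between machines, it can help to reduce the NVRM noise to the distinct RmInitAdapter codes and how often each occurs. This is only a sketch: nvrm_codes is a hypothetical helper name, and it does not attempt to interpret the codes.

```shell
#!/bin/sh
# nvrm_codes: read kernel log lines on stdin and print each distinct
# "RmInitAdapter failed!" code together with its number of occurrences.
nvrm_codes() {
    sed -n 's/.*RmInitAdapter failed! (\([^)]*\)).*/\1/p' \
        | sort | uniq -c \
        | awk '{ print $2, "seen", $1, "times" }'
}

# Sample input: four lines from the guest log above.
nvrm_codes <<'EOF'
[    5.513278] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[    5.513634] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    5.651159] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[    5.651383] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
EOF
```

Inside the guest the input would be dmesg | nvrm_codes.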

What am I missing?

Sorry for the long post, but I wanted to include all the relevant information.
 

cromatn5

Active Member
Mar 26, 2018
37
France
Hi,
Have you blacklisted the modules?

echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
echo "blacklist snd_hda_intel" >> /etc/modprobe.d/blacklist.conf

Code:
# cat /etc/modprobe.d/blacklist.conf
blacklist amdgpu
blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist snd_hda_intel
blacklist skl_uncore
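
Because >> appends every time it runs, repeating the echo commands leaves duplicate lines in the file. A minimal sketch of an idempotent variant follows; add_blacklist is a hypothetical helper name, and the demo writes to a temporary file rather than /etc/modprobe.d/blacklist.conf.

```shell
#!/bin/sh
# add_blacklist FILE MODULE
# Append "blacklist MODULE" to FILE unless that exact line already exists.
add_blacklist() {
    touch "$1"
    grep -qxF "blacklist $2" "$1" || echo "blacklist $2" >> "$1"
}

# Demo on a temporary file instead of the real configuration file.
conf=$(mktemp)
add_blacklist "$conf" nouveau
add_blacklist "$conf" nvidia
add_blacklist "$conf" nouveau   # second call changes nothing
cat "$conf"
rm -f "$conf"
```

After changing the real file, the initramfs still has to be rebuilt (update-initramfs -u -k all, as in the first post) and the host rebooted.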
 

lsk

New Member
Apr 10, 2022
Yes, I forgot to mention it in the OP.
/etc/modprobe.d/blacklist.conf

Code:
blacklist nouveau
blacklist nvidia

radeon doesn't apply to my case, of course.

I also tried depmod -a before update-initramfs -u.
The output of lspci -nnk still seems strange to me, since it lists those two modules (nvidiafb is blacklisted by default on Proxmox).

UPDATE:
Now I tried to run Emby in a container following this guide.
This time everything works as expected (nvidia-smi correctly recognizes the GPU), but Emby is still not using it for transcoding.
Maybe I should head to Emby support.
 

Aimsucks

New Member
May 30, 2022
Has no solution to this been found? I have the exact same problem. The GPU shows up as a VGA device in lspci, the same RmInitAdapter failed! error shows up, and nvidia-smi returns No devices were found.
 

Docop2

Member
Nov 20, 2021
43
Only two settings are needed in GRUB, nothing else: quiet intel_iommu=on video=efifb:off
Then blacklist the driver and configure vfio with the proper card IDs:
options vfio-pci ids=8086:193b,8086:a170 disable_vga=1
Reboot, edit the VM, and add the PCI device with the options rombar, pci-ex, full. Start the VM. To keep it simple, boot a bootable ISO in the VM rather than an old VM or other stuff.
 

Aimsucks

New Member
May 30, 2022
Host
pveversion -v
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-1
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.1-1
proxmox-backup-file-restore: 2.2.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-7
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

lspci -nnv -s 07:00
Code:
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP106GL [Quadro P2000] [10de:1c30] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation GP106GL [Quadro P2000] [10de:11b3]
        Flags: bus master, fast devsel, latency 0, IRQ 81, IOMMU group 21
        Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at e000 [size=128]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau

07:00.1 Audio device [0403]: NVIDIA Corporation GP106 High Definition Audio Controller [10de:10f1] (rev a1)
        Subsystem: NVIDIA Corporation GP106 High Definition Audio Controller [10de:11b3]
        Flags: bus master, fast devsel, latency 0, IRQ 82, IOMMU group 22
        Memory at fc080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

/etc/default/grub
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on pcie_acs_override=downstream,multifunction video=simplefb:off video=efifb:off video=vesa:off vfio-pci.ids=10de:1c30,10de:10f1 vfio_iommu_type1.allow_unsafe_interrupts=1 kvm.ignore_msrs=1 modprobe.blacklist=radeon,nouveau,nvidia,nvidiafb,nvidia-gpu"

/etc/modprobe.d/vfio.conf
Code:
options vfio-pci ids=10de:1c30,10de:10f1 disable_vga=1

/etc/modprobe.d/iommu_unsafe_interrupts.conf
Code:
options vfio_iommu_type1 allow_unsafe_interrupts=1

/etc/modules
Code:
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

IOMMU Groups
Code:
IOMMU Group 21:
        07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP106GL [Quadro P2000] [10de:1c30] (rev a1)
IOMMU Group 22:
        07:00.1 Audio device [0403]: NVIDIA Corporation GP106 High Definition Audio Controller [10de:10f1] (rev a1)

dmesg -w | grep vfio-pci (these messages are spammed in dmesg; also, dmesg does not show DMAR: IOMMU enabled, but it does show perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).)
Code:
[123402.528189] vfio-pci 0000:07:00.0: BAR 3: can't reserve [mem 0xe0000000-0xe1ffffff 64bit pref]
[123402.528205] vfio-pci 0000:07:00.0: BAR 3: can't reserve [mem 0xe0000000-0xe1ffffff 64bit pref]
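
One commonly reported cause of "can't reserve" messages is a host boot framebuffer that still holds part of the GPU's memory range; that is an assumption about this system, not something the log above proves. A quick check is to look for framebuffer owners in /proc/iomem (read as root, since unprivileged reads show zeroed addresses). In the sketch below, fb_claims is a hypothetical helper name and the sample excerpt is made up purely to show the shape of the input.

```shell
#!/bin/sh
# fb_claims: read /proc/iomem-style lines on stdin and print the entries
# owned by a boot framebuffer (efifb, simplefb, vesafb or BOOTFB).
fb_claims() {
    grep -E 'efifb|simplefb|vesafb|BOOTFB' | sed 's/^ *//'
}

# Hypothetical excerpt, NOT taken from the machine in this thread.
fb_claims <<'EOF'
d0000000-e1ffffff : PCI Bus 0000:07
  d0000000-dfffffff : 0000:07:00.0
  e0000000-e1ffffff : 0000:07:00.0
    e0000000-e02fffff : efifb
EOF
```

On the host the input would be fb_claims < /proc/iomem. Empty output means that none of these framebuffer names currently owns a memory range.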

/etc/pve/qemu-server/100.conf (NOTE: I also tried SeaBIOS with a BIOS install of Ubuntu Server 22.04, and it did not work either. I also tried setting the GPU as the primary device, which disables the console, but that did not work either.)
Code:
bios: ovmf
boot: order=scsi0;net0
cores: 4
efidisk0: local-lvm:vm-100-disk-1,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:07:00,pcie=1
machine: q35
memory: 16384
meta: creation-qemu=6.2.0,ctime=1653201783
name: hydra
net0: virtio=5E:F4:43:2C:BE:E4,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-100-disk-0,size=64G
scsi1: dust:vm-100-disk-0,backup=0,discard=on,size=10000G
scsihw: virtio-scsi-pci
smbios1: uuid=d58740ff-401e-4c1c-9b33-e6566a623f9d
sockets: 1
vmgenid: 3e5190c9-ccff-4a10-9b08-78dedc4bd59f

Guest
lspci -nnv -s 01:00
Code:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP106GL [Quadro P2000] [10de:1c30] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation GP106GL [Quadro P2000] [10de:11b3]
        Physical Slot: 0
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at c1000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 800000000 (64-bit, prefetchable) [size=256M]
        Memory at 810000000 (64-bit, prefetchable) [size=32M]
        I/O ports at d000 [size=128]
        Expansion ROM at c2020000 [virtual] [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

01:00.1 Audio device [0403]: NVIDIA Corporation GP106 High Definition Audio Controller [10de:10f1] (rev a1)
        Subsystem: NVIDIA Corporation GP106 High Definition Audio Controller [10de:11b3]
        Physical Slot: 0
        Flags: bus master, fast devsel, latency 0, IRQ 17
        Memory at c2000000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

dmesg
Code:
[    2.799624] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1417)
[    2.800589] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    2.800955] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[    2.801224] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device

And finally, the problem:
nvidia-smi
Code:
No devices were found

I have no idea what to do. The passthrough obviously appears to be working but the NVIDIA drivers are not detecting the GPU.

Fixed! Following the instructions in this post fixed it for me.
 

sublightnova

New Member
May 30, 2022
Following the instructions in this post fixed it for me. Thank you! I didn't see that thread when I was searching around for some reason.


Great to hear... Fortunately it only took me 2 or 3 days pulling my hair out before stumbling on the thread myself.. Glad it worked out for you too!
 
