Hi there,
I'm trying to pass a Tesla M40 through to a Windows 10 VM on my Dell R710. I've followed the official PCI passthrough guide and have looked through a bunch of other threads to try and get it working, but to no avail. The GPU shows up in Device Manager, but it has an error symbol and the message:
This device cannot start. (Code 10) Insufficient system resources exist to complete the API.
GRUB:
On the host, I've edited the GRUB file:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 video=efifb:off video=vesafb:off video=simplefb:off"
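After saving, the new command line only takes effect once the GRUB config is regenerated and the host rebooted. A minimal sketch, assuming a GRUB-booted Proxmox install:
Code:
update-grub
reboot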
IOMMU:
IOMMU is confirmed to be enabled with dmesg | grep -e DMAR -e IOMMU, which shows the line DMAR: IOMMU enabled
/etc/modules:
/etc/modules contains the following:
Code:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
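To confirm these actually load after a reboot, something like the following should list all four modules:
Code:
lsmod | grep vfio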
Interrupt remapping:
Running dmesg | grep 'remapping' reveals this:
Code:
DMAR-IR: This system BIOS has enabled interrupt remapping
on a chipset that contains an erratum making that
feature unstable. To maintain system stability
interrupt remapping is being disabled. Please
contact your BIOS vendor for an update
I'm running the latest BIOS version (6.6.0), so I enabled unsafe interrupts by editing the iommu_unsafe_interrupts.conf file to read:
Code:
options vfio_iommu_type1 allow_unsafe_interrupts=1
This seems to have fixed the interrupts, as I don't get any errors when adding the PCI device in the Proxmox GUI.
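To double-check that the option really took effect after rebooting, the parameter can be read back from sysfs (assuming the vfio_iommu_type1 module is loaded); it should print Y:
Code:
cat /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts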
Identifying the GPU:
Running lspci -s 06:00.0 -v shows my Tesla M40:
Code:
06:00.0 3D controller: NVIDIA Corporation GM200GL [Tesla M40] (rev a1)
Subsystem: NVIDIA Corporation GM200GL [Tesla M40]
Flags: bus master, fast devsel, latency 0, IRQ 67, IOMMU group 26
Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
Memory at <ignored> (64-bit, prefetchable)
Memory at d0000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
IOMMU isolation:
When running find /sys/kernel/iommu_groups/ -type l, I get this, where the isolated line (group 26) is my GPU; a helper loop for labelling the groups follows the list:
Code:
/sys/kernel/iommu_groups/17/devices/0000:00:16.6
/sys/kernel/iommu_groups/35/devices/0000:ff:03.4
/sys/kernel/iommu_groups/35/devices/0000:ff:03.2
/sys/kernel/iommu_groups/35/devices/0000:ff:03.0
/sys/kernel/iommu_groups/35/devices/0000:ff:03.1
/sys/kernel/iommu_groups/7/devices/0000:00:09.0
/sys/kernel/iommu_groups/25/devices/0000:04:00.0
/sys/kernel/iommu_groups/15/devices/0000:00:16.4
/sys/kernel/iommu_groups/33/devices/0000:ff:00.0
/sys/kernel/iommu_groups/33/devices/0000:ff:00.1
/sys/kernel/iommu_groups/5/devices/0000:00:06.0
/sys/kernel/iommu_groups/23/devices/0000:01:00.0
/sys/kernel/iommu_groups/23/devices/0000:01:00.1
/sys/kernel/iommu_groups/13/devices/0000:00:16.2
/sys/kernel/iommu_groups/31/devices/0000:fe:05.0
/sys/kernel/iommu_groups/31/devices/0000:fe:05.3
/sys/kernel/iommu_groups/31/devices/0000:fe:05.1
/sys/kernel/iommu_groups/31/devices/0000:fe:05.2
/sys/kernel/iommu_groups/3/devices/0000:00:04.0
/sys/kernel/iommu_groups/21/devices/0000:08:03.0
/sys/kernel/iommu_groups/21/devices/0000:00:1e.0
/sys/kernel/iommu_groups/11/devices/0000:00:16.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/38/devices/0000:ff:06.1
/sys/kernel/iommu_groups/38/devices/0000:ff:06.2
/sys/kernel/iommu_groups/38/devices/0000:ff:06.0
/sys/kernel/iommu_groups/38/devices/0000:ff:06.3
/sys/kernel/iommu_groups/28/devices/0000:fe:02.5
/sys/kernel/iommu_groups/28/devices/0000:fe:02.3
/sys/kernel/iommu_groups/28/devices/0000:fe:02.1
/sys/kernel/iommu_groups/28/devices/0000:fe:02.4
/sys/kernel/iommu_groups/28/devices/0000:fe:02.2
/sys/kernel/iommu_groups/28/devices/0000:fe:02.0
/sys/kernel/iommu_groups/18/devices/0000:00:16.7
/sys/kernel/iommu_groups/36/devices/0000:ff:04.2
/sys/kernel/iommu_groups/36/devices/0000:ff:04.0
/sys/kernel/iommu_groups/36/devices/0000:ff:04.3
/sys/kernel/iommu_groups/36/devices/0000:ff:04.1
/sys/kernel/iommu_groups/8/devices/0000:00:14.0
/sys/kernel/iommu_groups/26/devices/0000:06:00.0
/sys/kernel/iommu_groups/16/devices/0000:00:16.5
/sys/kernel/iommu_groups/34/devices/0000:ff:02.5
/sys/kernel/iommu_groups/34/devices/0000:ff:02.3
/sys/kernel/iommu_groups/34/devices/0000:ff:02.1
/sys/kernel/iommu_groups/34/devices/0000:ff:02.4
/sys/kernel/iommu_groups/34/devices/0000:ff:02.2
/sys/kernel/iommu_groups/34/devices/0000:ff:02.0
/sys/kernel/iommu_groups/6/devices/0000:00:07.0
/sys/kernel/iommu_groups/24/devices/0000:02:00.0
/sys/kernel/iommu_groups/24/devices/0000:02:00.1
/sys/kernel/iommu_groups/14/devices/0000:00:16.3
/sys/kernel/iommu_groups/32/devices/0000:fe:06.3
/sys/kernel/iommu_groups/32/devices/0000:fe:06.1
/sys/kernel/iommu_groups/32/devices/0000:fe:06.2
/sys/kernel/iommu_groups/32/devices/0000:fe:06.0
/sys/kernel/iommu_groups/4/devices/0000:00:05.0
/sys/kernel/iommu_groups/22/devices/0000:00:1f.2
/sys/kernel/iommu_groups/22/devices/0000:00:1f.0
/sys/kernel/iommu_groups/12/devices/0000:00:16.1
/sys/kernel/iommu_groups/30/devices/0000:fe:04.2
/sys/kernel/iommu_groups/30/devices/0000:fe:04.0
/sys/kernel/iommu_groups/30/devices/0000:fe:04.3
/sys/kernel/iommu_groups/30/devices/0000:fe:04.1
/sys/kernel/iommu_groups/2/devices/0000:00:03.0
/sys/kernel/iommu_groups/20/devices/0000:00:1d.1
/sys/kernel/iommu_groups/20/devices/0000:00:1d.0
/sys/kernel/iommu_groups/20/devices/0000:00:1d.7
/sys/kernel/iommu_groups/10/devices/0000:00:14.2
/sys/kernel/iommu_groups/29/devices/0000:fe:03.1
/sys/kernel/iommu_groups/29/devices/0000:fe:03.4
/sys/kernel/iommu_groups/29/devices/0000:fe:03.2
/sys/kernel/iommu_groups/29/devices/0000:fe:03.0
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/19/devices/0000:00:1a.1
/sys/kernel/iommu_groups/19/devices/0000:00:1a.0
/sys/kernel/iommu_groups/19/devices/0000:00:1a.7
/sys/kernel/iommu_groups/37/devices/0000:ff:05.3
/sys/kernel/iommu_groups/37/devices/0000:ff:05.1
/sys/kernel/iommu_groups/37/devices/0000:ff:05.2
/sys/kernel/iommu_groups/37/devices/0000:ff:05.0
/sys/kernel/iommu_groups/9/devices/0000:00:14.1
/sys/kernel/iommu_groups/27/devices/0000:fe:00.1
/sys/kernel/iommu_groups/27/devices/0000:fe:00.0
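For readability, each group can also be printed with a human-readable device name, e.g. with the small loop below (a sketch, assuming lspci is installed on the host):
Code:
# print every IOMMU group with the lspci description of each member
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d#/sys/kernel/iommu_groups/}
    printf 'group %s: ' "${g%%/*}"
    lspci -nns "${d##*/}"
done | sort -V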
GPU passthrough:
I found the vendor and device ID of my card using lspci -n -s 06:00, which prints 06:00.0 0302: 10de:17fd (rev a1), meaning the ID is 10de:17fd. So, I added it to the /etc/modprobe.d/vfio.conf file, which now reads options vfio-pci ids=10de:17fd
I then blacklisted the drivers in /etc/modprobe.d/blacklist.conf, so it now reads:
Code:
blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
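After editing anything under /etc/modprobe.d/, the official guide has you rebuild the initramfs so the blacklist and the vfio-pci ids apply at boot:
Code:
update-initramfs -u -k all
reboot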
GPU OVMF PCIe passthrough:
The Windows 10 VM originally used SeaBIOS, so I used a command-line tool inside the VM to convert it to UEFI so that it could boot with OVMF. I changed the machine type from the default to q35 and added the GPU as a PCI device. When adding the GPU, I checked all the boxes to make it PCIe and the primary GPU.
I attempted to use the rom-parser utility; however, when issuing echo 1 > rom, I received a "permission denied" error, even though I was root.
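For reference, the sequence I was attempting is the usual sysfs ROM dump from the vfio guides; a sketch (the device path is my card's, the dump path is just an example, and the whole thing has to run in an actual root shell, since the redirection itself needs root):
Code:
cd /sys/bus/pci/devices/0000:06:00.0
echo 1 > rom            # allow reads of the option ROM
cat rom > /tmp/m40.rom  # dump it for rom-parser
echo 0 > rom            # disable reads again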
Actual issues:
Inside Windows, I've tried installing the Windows 10 64-bit UK edition of both the Tesla M40 graphics drivers and the Quadro graphics drivers, but both produce the same Device Manager error that I described at the top.
When running dmesg | grep 06:00, I get the following:
Code:
[ 1.093275] pci 0000:06:00.0: [10de:17fd] type 00 class 0x030200
[ 1.093285] pci 0000:06:00.0: reg 0x10: [mem 0xdc000000-0xdcffffff]
[ 1.093294] pci 0000:06:00.0: reg 0x14: [mem 0xcffff00000000000-0xcffff007ffffffff 64bit pref]
[ 1.093304] pci 0000:06:00.0: reg 0x1c: [mem 0xd0000000-0xd1ffffff 64bit pref]
[ 1.093327] pci 0000:06:00.0: Enabling HDA controller
[ 1.093398] pci 0000:06:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x4 link at 0000:00:07.0 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[ 1.104463] pci 0000:06:00.0: can't claim BAR 1 [mem 0xcffff00000000000-0xcffff007ffffffff 64bit pref]: no compatible bridge window
[ 1.133780] pci 0000:06:00.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
[ 1.133783] pci 0000:06:00.0: BAR 1: trying firmware assignment [mem 0xcffff00000000000-0xcffff007ffffffff 64bit pref]
[ 1.133785] pci 0000:06:00.0: BAR 1: [mem 0xcffff00000000000-0xcffff007ffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
[ 1.133788] pci 0000:06:00.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
[ 1.930555] pci 0000:06:00.0: Adding to iommu group 26
[ 71.389962] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 71.389981] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 3493.668278] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 3493.668297] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
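For what it's worth, the failing BAR is huge; quick bash arithmetic on the size from the log (0x800000000 bytes):
Code:
printf '%d GiB\n' $(( 0x800000000 / 1024**3 ))   # prints: 32 GiB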
I have done googling of the various BAR errors, but none of the solutions have worked.
Solutions:
Something I've seen mentioned multiple times is to enable Above 4G Decoding. I have hunted high and low throughout my BIOS, but cannot find it anywhere. From my hours of googling and trial and error, my current theory is that this is the culprit: the card's BAR 1 is 32 GiB, which can never be mapped below the 4 GB boundary, so this specific server and GPU combination may simply not work. Can anyone confirm or (hopefully) deny my theory? Any help would be greatly appreciated, thanks.