Tesla K80 GPU passthrough driver error (device not found by nvidia-smi in VM)

Tutbjun · Member · Aug 4, 2022
Hi everyone

I've undertaken a project of building a Proxmox server for machine learning and remote gaming, but I'm stuck getting the Nvidia drivers to work in my Ubuntu 22.04 VM. The plan is to have a few VMs configured for either machine learning or gaming, using different GPUs.
I am new to this forum and fairly new to Linux, so please bear with me if I make mistakes, and point out if I'm missing some info :)

So far I have successfully passed through my 1060 to a Windows VM following the guide, but I can't get the K80 working properly. The VM has been set up with both of the K80's two GPUs, but I had a similar error before with a VM using a single K80 GPU.

My system has an MSI Z590 motherboard, an Intel i9-10850K, a GTX 1060, and a Tesla K80.

I have mainly used this guide as a reference:
https://3os.org/infrastructure/prox...virtual-machine-gpu-passthrough-configuration

The only thing I have done inside the VM so far is to use the built-in "Software & Updates" tool to install the Nvidia 470 display driver.
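
For reference, this should be equivalent to installing the driver from the command line; a minimal sketch, assuming the standard Ubuntu 22.04 repositories:
Code:
# List the drivers Ubuntu recommends for the detected GPUs
ubuntu-drivers devices
# Install the 470-series driver (the one "Software & Updates" selects)
sudo apt install nvidia-driver-470
# Reboot so the new kernel module is loaded
sudo reboot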

The main symptom appears when running the
Code:
nvidia-smi
command:
Code:
No devices were found

The GPUs are nevertheless listed when running
Code:
lspci -nnv
...
Code:
01:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
    Subsystem: NVIDIA Corporation GK210GL [Tesla K80] [10de:106c]
    Physical Slot: 0
    Flags: bus master, fast devsel, latency 0, IRQ 16
    Memory at c2000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 1000000000 (64-bit, prefetchable) [size=32M]
    Capabilities: <access denied>
    Kernel modules: nvidiafb, nouveau

02:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
    Subsystem: NVIDIA Corporation GK210GL [Tesla K80] [10de:106c]
    Physical Slot: 0-2
    Flags: bus master, fast devsel, latency 0, IRQ 16
    Memory at c1000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 1002000000 (64-bit, prefetchable) [size=32M]
    Capabilities: <access denied>
    Kernel modules: nvidiafb, nouveau
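
Side note: the "Capabilities: <access denied>" lines just mean I ran lspci without root privileges; running it with sudo shows the full capability list, for example:
Code:
sudo lspci -nnv -s 01:00.0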

The best clue I have is this part from the
Code:
dmesg -w
command:
Code:
[    4.749614] resource sanity check: requesting [mem 0xc2700000-0xc36fffff], which spans more than PCI Bus 0000:01 [mem 0xc2000000-0xc2ffffff]
[    4.749619] caller os_map_kernel_space.part.0+0x97/0xa0 [nvidia] mapping multiple BARs
[    4.763172] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1211)
[    4.763299] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

(same error on the other GPU; PCI bus 0000:02)
(full log attached as a .txt file)

The best suggestions I could find scouring the internet were to enable "Above 4G Decoding" and disable "CSM", both of which I have done in the host BIOS.
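
For what it's worth, one way to check that "Above 4G Decoding" actually took effect is to look at where the host placed the card's 64-bit BAR; an address above the 4 GiB boundary (0xFFFFFFFF) means it worked. A sketch, run on the host (01:00.0 is the guest-side address; substitute the host-side address of the K80 from the host's own lspci):
Code:
# Show the memory regions (BARs) assigned to the K80 GPU
sudo lspci -v -s 01:00.0 | grep "Memory at"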

Any help or clues would be appreciated, as I can't really find much info about this issue.

Update:
I've read through the forums a bit, and I found this very informative post by Lefuneste:
https://forum.proxmox.com/threads/problem-with-gpu-passthrough.55918/post-471013

There it was helpfully pointed out that when running
Code:
cat /proc/iomem
on the host, there should be a line containing "vfio-pci" directly under each GPU's PCIe address range, which I don't get. Instead, the line under my GPU addresses is empty. In fact, when running
Code:
cat /proc/iomem | grep vfio
I get no output at all. Does this mean that the Nvidia drivers are successfully blocked from grabbing my GPUs, but vfio fails to claim them?
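
In case it helps with debugging: whether vfio-pci actually claimed the cards can also be checked per device, and the usual way to force the binding on a Proxmox host is a modprobe.d entry with the card's vendor:device IDs. A sketch, assuming the 10de:102d ID from the lspci output above (substitute the host-side PCI addresses):
Code:
# On the host: show which kernel driver currently owns each K80 function
lspci -nnk -s 01:00.0
# Expected when passthrough is set up: "Kernel driver in use: vfio-pci"

# /etc/modprobe.d/vfio.conf -- bind the K80 GPUs to vfio-pci at boot
#   options vfio-pci ids=10de:102d
# Then rebuild the initramfs and reboot:
#   update-initramfs -u -k all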
 

Tutbjun said:
Hi everyone [...]

I wish I had an answer, but I will add myself as a +1 to having this exact issue.
 
Tutbjun said:
Hi everyone [...]

I have a separate problem, but are you running "i440fx" or "q35"? Do you have PCIe enabled with a UEFI boot for the VM?
 
So I returned to this project, and found that I had never replied in this thread, oops...
I have a separate problem, but are you running "i440fx" or "q35"? Do you have PCIe enabled with a UEFI boot for the VM?
To answer this: I am running q35, PCIe is enabled (on the PCI device entry in the VM config, I assume), and the VM is running with UEFI.
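
For reference, those settings can be double-checked on the host with qm config; a sketch (the VM ID 100 is just an example):
Code:
# On the Proxmox host: dump the VM's configuration
qm config 100
# Relevant lines should look roughly like:
#   machine: q35
#   bios: ovmf
#   hostpci0: 0000:01:00,pcie=1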

My current suspicion is that the Ubuntu/kernel version is too new, so I am setting up a VM with Ubuntu 20.04. I will report back if that is successful.
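
A quick way to check whether the driver module even built and loaded against the running kernel, from inside the guest:
Code:
uname -r               # kernel the guest is running
dkms status            # whether the nvidia DKMS module built for that kernel
dmesg | grep -i nvrm   # messages from the nvidia module, if it loaded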

@bkinigadner it would be nice to hear the details of your setup: version, method, etc.