Tesla K80 GPU passthrough driver error (device not found by nvidia-smi in VM)

Tutbjun

Member
Aug 4, 2022
Hi everyone

I've undertaken a project of building a Proxmox server for machine learning and remote gaming, but I'm stuck getting the Nvidia drivers to work on my Ubuntu 22.04 VM. The plan is to have a few VMs configured for either machine learning or gaming, using different GPUs.
I am both new to this forum and a bit new to Linux, so please bear with me for any mistakes, and point out if I'm missing some info :)

So far I have successfully passed through my 1060 to a Windows VM following the guide, but I can't seem to get the K80 working properly. The VM has been set up with both of the K80's two GPUs, but I had a similar error before with a VM that had only a single K80 GPU.

My system has an MSI Z590 motherboard, an Intel 10850K, a GTX 1060, and a Tesla K80.

I have mainly used this guide as a reference:
https://3os.org/infrastructure/prox...virtual-machine-gpu-passthrough-configuration

The only thing I have done inside the VM so far is to install the Nvidia 470 display driver through the built-in "Software & Updates" tool.

The main symptom shows up when running the
Code:
nvidia-smi
command:
Code:
No devices were found

Although the GPUs are listed when running
Code:
lspci -nnv
...
Code:
01:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
    Subsystem: NVIDIA Corporation GK210GL [Tesla K80] [10de:106c]
    Physical Slot: 0
    Flags: bus master, fast devsel, latency 0, IRQ 16
    Memory at c2000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 1000000000 (64-bit, prefetchable) [size=32M]
    Capabilities: <access denied>
    Kernel modules: nvidiafb, nouveau

02:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
    Subsystem: NVIDIA Corporation GK210GL [Tesla K80] [10de:106c]
    Physical Slot: 0-2
    Flags: bus master, fast devsel, latency 0, IRQ 16
    Memory at c1000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 1002000000 (64-bit, prefetchable) [size=32M]
    Capabilities: <access denied>
    Kernel modules: nvidiafb, nouveau

The best clue I have is this part from the
Code:
dmesg -w
command:
Code:
[    4.749614] resource sanity check: requesting [mem 0xc2700000-0xc36fffff], which spans more than PCI Bus 0000:01 [mem 0xc2000000-0xc2ffffff]
[    4.749619] caller os_map_kernel_space.part.0+0x97/0xa0 [nvidia] mapping multiple BARs
[    4.763172] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1211)
[    4.763299] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

(The same error appears for the other GPU on PCI bus 0000:02; the full log is in the attached dmesg.txt.)
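To make the numbers in the "resource sanity check" line concrete, here is a small sketch of the arithmetic. The helper names `range_mib` and `overrun_bytes` are made up for illustration, and the addresses are copied from the dmesg lines above. It only shows how far the requested mapping runs past the bus window; it does not explain why the driver asks for that range.

```shell
# Size of an inclusive address range in MiB: range_mib START END
range_mib() {
    echo $(( ($2 - $1 + 1) / 1048576 ))
}

# Bytes by which a request's end runs past a window's end (0 if it fits):
# overrun_bytes REQ_END WIN_END
overrun_bytes() {
    if [ $(( $1 > $2 )) -eq 1 ]; then
        echo $(( $1 - $2 ))
    else
        echo 0
    fi
}

range_mib 0xc2700000 0xc36fffff      # requested mapping: prints 16
range_mib 0xc2000000 0xc2ffffff      # PCI bus 0000:01 window: prints 16
overrun_bytes 0xc36fffff 0xc2ffffff  # prints 7340032 (7 MiB past the window)
```

So the request is a 16 MiB mapping that starts 7 MiB into a 16 MiB window, which is why the kernel flags it as spanning more than the bus resource.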

The best suggestions I could find while scouring the internet were to enable "Above 4G Decoding" and disable "CSM", both of which I have done in the host BIOS.

Any help or clues would be appreciated, as I can't really find much info about this issue.

Update:
I've read through the forums a bit, and I found this very informative post by Lefuneste:
https://forum.proxmox.com/threads/problem-with-gpu-passthrough.55918/post-471013

There it was helpfully pointed out that when running
Code:
cat /proc/iomem
on the host, there should be a "vfio-pci" line directly under the GPU's PCIe address, which I don't get. Instead, there is nothing under my GPU addresses. In fact, when running
Code:
cat /proc/iomem | grep vfio
I get nothing at all. Does this mean that the Nvidia drivers are successfully blocked from grabbing my GPUs, but vfio fails to claim them?
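Another way to look at the same question on the host is the "Kernel driver in use" line that `lspci -nnk` prints for a device. The sketch below is only an illustration: the helper name `driver_in_use` is made up, and the slot address `01:00.0` in the usage comment is an assumption, since the address on the host may differ from the one seen inside the VM.

```shell
# Print the "Kernel driver in use" value from `lspci -nnk` output on stdin,
# or "none" if the device has no driver bound.
driver_in_use() {
    awk -F': ' '
        /Kernel driver in use/ { print $2; found = 1 }
        END { if (!found) print "none" }
    '
}

# Usage on the host (slot address is a placeholder):
#   lspci -nnk -s 01:00.0 | driver_in_use
```

If this prints `vfio-pci`, the host has handed the device to vfio; `none` would mean no driver has claimed it at that moment.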
 

Attachments

  • dmesg.txt
    98.3 KB · Views: 4

I wish I had an answer, but I will add myself as a +1 to having this exact issue.
 

I have a separate problem, but are you running "i440fx" or "q35"? Do you have PCIe enabled, with UEFI boot, for the VM?
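For reference, these settings can be read from the VM configuration that `qm config` prints on the Proxmox host. The sketch below is an illustration only: the helper name `vm_passthrough_summary` and the VM ID `100` are made up. As far as I know, a missing `machine` line means the default i440fx type and a missing `bios` line means SeaBIOS, so both are reported as "default" here.

```shell
# Summarize machine type, BIOS and per-device PCIe flag from
# `qm config <vmid>` output on stdin.
vm_passthrough_summary() {
    awk -F': ' '
        $1 == "machine" { machine = $2 }
        $1 == "bios"    { bios = $2 }
        $1 ~ /^hostpci[0-9]+$/ {
            # pcie=1 on a hostpci line marks the device as PCIe
            print $1 " pcie=" (($2 ~ /pcie=1/) ? "yes" : "no")
        }
        END {
            print "machine=" (machine ? machine : "default")
            print "bios=" (bios ? bios : "default")
        }
    '
}

# Usage on the host (VM ID is a placeholder):
#   qm config 100 | vm_passthrough_summary
```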
 
