Help configuring vGPU?

proxwolfe

Well-Known Member
Jun 20, 2020
Hi,

I am trying to get vGPU to work by following this guide: https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE_7.x#cite_ref-3

My card is an RTX A5000, the same as used in the guide, so it should work.

I went through with the setup but it doesn't work.
Code:
lspci -d 10de:
results in a very short list: only the VGA controller and the audio controller are listed.

My guess is that the problem comes from my missing this on the first try: "Some supported NVIDIA GPUs don't have vGPU enabled out of the box and need to have their display ports disabled. This is the case with our RTX A5000, and can be achieved by using their display mode selector tool[3]. For a list of GPUs where this is necessary, check their documentation."

When I try to run the displaymodeselector utility, it tells me that I need to unload the NVIDIA kernel driver first with
Code:
rmmod nvidia
but this fails with
Code:
rmmod: ERROR: Module nvidia is in use
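For reference, `rmmod nvidia` usually fails because the other driver modules still depend on it; unloading the dependents first often works. A sketch (module and service names assume a standard NVIDIA driver install; anything else holding the device open, such as an X server, must be stopped too):

```shell
# Stop services that keep the driver open (may not exist on every setup)
systemctl stop nvidia-persistenced 2>/dev/null || true

# Unload the NVIDIA modules in reverse dependency order
rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia

# Confirm nothing NVIDIA-related is still loaded
lsmod | grep -i nvidia || echo "nvidia modules unloaded"
```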

Then I tried to blacklist nvidia so it would not get loaded, updated the initramfs, and rebooted.

But
Code:
lspci -nnk
still shows nvidia as the kernel driver in use.

Any ideas how I can either remove nvidia or keep it from loading?

Thanks!
 
Ah, got it disabled by adding
Code:
module_blacklist=nvidia
to the
Code:
GRUB_CMDLINE_LINUX_DEFAULT
line in /etc/default/grub.
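For anyone following along, the full sequence looks roughly like this (a sketch; you can just as well edit the file by hand, and the blacklist option should be removed again once displaymodeselector has run):

```shell
# Insert module_blacklist=nvidia right after the opening quote of the
# GRUB_CMDLINE_LINUX_DEFAULT line; '&' re-inserts the matched text
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&module_blacklist=nvidia /' /etc/default/grub

update-grub   # regenerate the GRUB configuration
reboot
```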

Fingers crossed that disabling the display ports is going to do the trick...
 
Hmm.

So disabling the card's display ports seems to have worked (the benefit of this is that I got my KVM video feed back, because the card no longer overrides the onboard KVM) - yay!

I then removed the blacklist entry from GRUB and nvidia was loaded again.

But
Code:
lspci -d 10de:
still does not show a long list of available devices. Instead, the sound part of the card has now disappeared from the list as well.

Then I tried to enable SR-IOV again (I had done that on the first try, but thought it couldn't hurt to do it again). The first time nothing had happened. This time, however, I got this
Code:
Enabling VFs on 0000:02:00.0
/usr/lib/nvidia/sriov-manage: line 184: echo: write error: Cannot allocate memory
Progress of sorts...
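The "Cannot allocate memory" write error comes from sriov-manage echoing a VF count into sysfs, so the kernel's view can be checked directly. A sketch, assuming the GPU sits at 0000:02:00.0 as in the output above:

```shell
DEV=/sys/bus/pci/devices/0000:02:00.0

# How many virtual functions the card supports vs. how many are enabled
cat "$DEV/sriov_totalvfs"
cat "$DEV/sriov_numvfs"

# Enabling VFs by hand reproduces what sriov-manage does internally;
# reset to 0 first, since writing over a non-zero value fails
echo 0 > "$DEV/sriov_numvfs"
echo "$(cat "$DEV/sriov_totalvfs")" > "$DEV/sriov_numvfs"
```

If the last write fails with ENOMEM here too, the problem is the PCI/MMIO resource allocation rather than the NVIDIA tooling.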

In dmesg I find:
Code:
[   82.077577] NVRM: GPU 0000:02:00.0: UnbindLock acquired
[   82.080350] pci-pf-stub 0000:02:00.0: claimed by pci-pf-stub
[   82.581623] pci-pf-stub 0000:02:00.0: not enough MMIO resources for SR-IOV
[   82.584829] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:02:00.0)
[   82.584833] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR2 is 0M @ 0x0 (PCI:0000:02:00.0)
[   82.584835] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR5 is 0M @ 0x0 (PCI:0000:02:00.0)
[   84.103485] NVRM: GPU at 0000:02:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
[   84.310005] resource: resource sanity check: requesting [mem 0x0000000092700000-0x00000000936fffff], which spans more than PCI Bus 0000:02 [mem 0x90000000-0x92ffffff]
[   84.310009] caller os_map_kernel_space+0x114/0x130 [nvidia] mapping multiple BARs
[   84.329477] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x24:0x72:1436)
[   84.329870] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
[   85.832073] NVRM: GPU at 0000:02:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
[   85.892661] resource: resource sanity check: requesting [mem 0x0000000092700000-0x00000000936fffff], which spans more than PCI Bus 0000:02 [mem 0x90000000-0x92ffffff]
[   85.892664] caller os_map_kernel_space+0x114/0x130 [nvidia] mapping multiple BARs
[   85.912093] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x24:0x72:1436)
[   85.912384] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
This may or may not be related to the fact that, upon rebooting, the BIOS greeted me with a message I had never seen before: that there aren't enough PCI resources available and that some had to be disabled. It let me boot after this message, and everything else seems normal.
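The "BAR1 is 0M @ 0x0" lines mean the firmware/kernel never assigned those memory windows, which matches both the "not enough MMIO resources for SR-IOV" message and the BIOS warning; enabling "Above 4G Decoding" (and, where available, "Resizable BAR" or a larger MMIO space setting) in the BIOS is the usual remedy. The current assignments can be inspected like this (a sketch):

```shell
# Show the memory regions assigned to the GPU; unassigned or
# zero-sized regions are reported accordingly
lspci -vv -s 0000:02:00.0 | grep -i region

# Kernel messages about BAR assignment failures at boot
dmesg | grep -iE 'BAR|mmio' | grep -i '02:00'
```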

Any ideas what to try next?

Thanks!
 
I have the exact same errors in my dmesg log on an NVIDIA Tesla T4. Did you find a solution to this?
 
I must have, because I got vGPU working after a while. But I don't remember the exact steps unfortunately.

But I gave up on using vGPU because

- it was flaky, hit-and-miss
- the concept of dividing my card into virtual GPUs isn't actually right for my use case: I keep experimenting, and I don't know today what kind of (virtual) card I am going to need tomorrow. This is too inflexible for me.

Since I need my GPU to power my AI apps, I created one VM to which I pass through the entire GPU and in which I run various Docker containers. They all share the card, and each container can in principle use the entire GPU and all of its VRAM that isn't allocated to another container at the time. This gives me much greater flexibility.
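For anyone wanting to replicate this: with the full GPU passed through to the VM and the NVIDIA Container Toolkit installed there, each container simply requests the card at start. A sketch (the CUDA image tag is just an example, and my-ai-app is a hypothetical placeholder for your own image):

```shell
# Verify the GPU is visible from inside a container
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Several containers can share the card this way;
# VRAM is effectively first come, first served
docker run -d --gpus all my-ai-app:latest
```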
 