[TUTORIAL] NVIDIA vGPU on Proxmox VE 7.x

dcsapak

Proxmox Staff Member
Hi all,

we recently got access to a vGPU-capable GPU (RTX A5000) and put together a short how-to on using it with a current PVE 7.x:
https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE_7.x

While Proxmox VE is not a supported platform for NVIDIA's GRID/vGPU, it seems to work quite well here. Some things in the wiki article are not fully packaged yet; those are marked with the upcoming versions that will include them.

Also note that more improvements for handling PCI passthrough (especially in clusters) are on the way (see [0] for the current state),
which will make using vGPUs even easier. I will update the wiki article once those patches are applied.

If you have any questions or suggestions, just ask here.

0: https://lists.proxmox.com/pipermail/pve-devel/2022-July/053565.html
 
I am imagining an NVR VM that can be live migrated between hosts and pick up vGPUs from the new host as easily as vCPUs.

Would this be possible in the current state?
 
No, live migration is not really supported. While we could enable a flag on the QEMU side, the driver must support it too, and the current version of the NVIDIA vGPU driver does not support live migration on Linux.

With the next version of my proposed patches (linked above), I'll introduce a mechanism that lets a VM choose a PCI device dynamically (if configured), so there is no need to hardcode the address in the VM config (still no live migration, though).
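
To illustrate the idea of choosing a device dynamically, here is a minimal Python sketch. It is not the code from the patch series: the function name `pick_free_device`, the candidate list and the `in_use` set are all hypothetical, and the PCI addresses are made up.

```python
from typing import Iterable, Optional, Set


def pick_free_device(candidates: Iterable[str], in_use: Set[str]) -> Optional[str]:
    """Return the first candidate PCI address that is not in use, or None."""
    for address in candidates:
        if address not in in_use:
            return address
    return None


# Hypothetical virtual functions of one vGPU-capable card on this node.
candidates = ["0000:01:00.4", "0000:01:00.5", "0000:01:00.6"]
# Hypothetical set of addresses already claimed by running VMs.
in_use = {"0000:01:00.4"}

chosen = pick_free_device(candidates, in_use)
print(chosen)
```

The point of such a scheme is that the VM config only needs to name a pool of acceptable devices, and the concrete address is resolved at start time on whichever node the VM runs.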
 
Thanks for this, can't wait for improved clustering. I'm trying this with an M6000 on Monday and will post the results.
 
Looking forward to testing it, as I was unable to get SR-IOV working on an RTX A6000 with vGPU splicing.
 
@dcsapak I would also suggest one key improvement to the PCI ID menu: once someone assigns a function to a VM, it should be greyed out and not be available for attaching to another VM until it is removed from the existing VM.
That will probably not work like this, otherwise we would have to hold a complete cluster lock while editing VM configs...

What we already do, though, is 'reserve' the PCI devices in a file on VM start and prevent the start of another VM that uses them as long as the first one is running.
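
A rough Python sketch of what such a start-time reservation could look like. This is an assumption-laden illustration, not the actual Proxmox VE implementation: the file format (JSON), the lock file, and the names `reserve` and `release` are all hypothetical.

```python
import fcntl
import json
import os
import tempfile


def _load(path):
    """Read the reservation table; a missing or empty file means no reservations."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def reserve(path, pci_id, vmid):
    """Reserve pci_id for vmid at VM start; return False if another VM holds it."""
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # serialize concurrent VM starts
        table = _load(path)
        owner = table.get(pci_id)
        if owner is not None and owner != vmid:
            return False
        table[pci_id] = vmid
        with open(path, "w") as fh:
            json.dump(table, fh)
        return True


def release(path, pci_id, vmid):
    """Drop the reservation when the VM stops, but only if vmid owns it."""
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        table = _load(path)
        if table.get(pci_id) == vmid:
            del table[pci_id]
            with open(path, "w") as fh:
                json.dump(table, fh)


# Two VMs may both be *configured* with the same device; only the start is guarded.
path = os.path.join(tempfile.mkdtemp(), "pci-reservations.json")
print(reserve(path, "0000:01:00.4", 100))  # first VM starts
print(reserve(path, "0000:01:00.4", 101))  # second VM is refused while 100 runs
release(path, "0000:01:00.4", 100)         # first VM stops
print(reserve(path, "0000:01:00.4", 101))  # now the second VM can start
```

Because the check happens only at start, no lock is needed while editing VM configs, which is the trade-off described above.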
 
You reserve the PCI device, but it is not shown as greyed out or in use; the conflict only surfaces as an error once the device has been assigned to a different VM.
 
Yes, as I said, preventing a user from configuring the device would need (at least) a node-wide lock for editing VM configs, which is neither ideal nor really feasible.
We could maybe implement something for the GUI only (e.g. showing the device as 'in use'), but that would not prevent users from configuring it (e.g. via the API).
 
If the API also reported that a device is active on a VM, that would solve the problem as well.
 
once someone assigns a function to a VM, it should be greyed out and not be available for attaching to another VM until it is removed from the existing VM.
IMO that would be a bad idea.
I often have different VMs that use the same resource, just not simultaneously, since I do not run them at the same time.
If your idea were implemented consistently, a VM with passthrough could not be copied.

In summary, showing resources currently in use is a useful enhancement for the GUI, but preventing duplicate assignment is a bad idea. The lock needs to happen at VM startup, not at VM definition.
 
Do vGPU drivers allow host access to the GPU, i.e. retain the ability to use the GPU in LXC containers? I have GPUs that are used in multiple LXC containers, and it would be nice to also leverage vGPU in VMs, but I’m not sure if the drivers permit both.
I guess this depends on the driver. In my example, the NVIDIA vGPU driver does not use the card in the 'normal' way, so no device nodes are created on the host for GPU output, etc.
Maybe it's possible, but I don't know how, and a quick look at the NVIDIA documentation doesn't immediately reveal anything.
 
Hi,

I followed the guide you provided, but it's not working for me.

I have an NVIDIA Quadro RTX 8000.
The host drivers installed correctly; nvidia-smi works on the host and shows the vGPU profiles.

I was also able to install the guest drivers in an Ubuntu VM, but in the guest OS nvidia-smi fails with an error saying it is not able to communicate with the hardware.

Is there anything else I need to do to get this GPU working?

Thanks for the help.
 
Can anybody help me?

I am running an AMD Ryzen 7950X with an A5000, and have installed Proxmox 7.3 with the latest updates.

I have followed the above and installed the VFIO modules, but when I run the NVIDIA host driver installer I get an error:

ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 450.216.04 -k 5.15.74-1-pve`:
Kernel preparation unnecessary for this kernel. Skipping...

the output from /var/log/nvidia-installer.log is below
----
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Dec 1 15:51:29 2022
installer version: 450.216.04

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

nvidia-installer command line:
./nvidia-installer

Using: nvidia-installer ncurses v6 user interface
-> Detected 32 CPUs online; setting concurrency level to 32.
-> Running distribution scripts
executing: '/usr/lib/nvidia/pre-unload'...
-> done.
-> Installing NVIDIA driver version 450.216.04.
-> There appears to already be a driver installed on your system (version: 450.216.04). As part of installing this driver (version: 450.216.04), the existin>
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kerne>
-> Uninstalling the previous installation with /usr/bin/nvidia-uninstall.
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (450.216.04):
executing: '/usr/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 450.216.04 -k 5.15.74-1-pve`:
Kernel preparation unnecessary for this kernel. Skipping...

Building module:
cleaning build area...
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.15.74-1-pve IGNORE_CC_MISMATCH='' modules...(bad exit status: 2)
Error! Bad return status for module build on kernel: 5.15.74-1-pve (x86_64)
Consult /var/lib/dkms/nvidia/450.216.04/build/make.log for more information.
-> error.
ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again without DKMS, or check the DKMS logs for>
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems i>
 
Consult /var/lib/dkms/nvidia/450.216.04/build/make.log for more information.
can you post that file?

EDIT: no need, I just saw that you are trying to install the 450 driver... I guess this is too old for the 5.15 kernel.
Check https://docs.nvidia.com/grid/index.html
to see which versions are supported and use a current one (e.g. we used 510.85.03 here).
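
When reading such an installer log, the driver version and target kernel can be pulled straight from the failing `dkms build` line. A small Python sketch, assuming the log line has the shape shown in the post above; the function name `parse_dkms_failure` is hypothetical, and the sketch does not decide compatibility itself (the supported versions are listed in the NVIDIA documentation).

```python
import re

# Matches the command quoted in the installer's error line, e.g.
# "Failed to run `/usr/sbin/dkms build -m nvidia -v 450.216.04 -k 5.15.74-1-pve`:"
DKMS_RE = re.compile(r"dkms build -m (\S+) -v (\S+) -k ([^\s`]+)")


def parse_dkms_failure(log_text):
    """Extract module, driver version and kernel from a failed dkms build line."""
    match = DKMS_RE.search(log_text)
    if match is None:
        return None
    module, driver, kernel = match.groups()
    return {
        "module": module,
        "driver": driver,
        "kernel": kernel,
        # The leading number of the version is the driver branch (450, 510, ...).
        "driver_branch": int(driver.split(".")[0]),
    }


line = ("ERROR: Failed to run `/usr/sbin/dkms build -m nvidia "
        "-v 450.216.04 -k 5.15.74-1-pve`:")
info = parse_dkms_failure(line)
print(info["driver"], info["kernel"], info["driver_branch"])
```

With those three values in hand it is quick to compare the driver branch against the versions listed as supported for the running kernel.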
 