[TUTORIAL] NVIDIA vGPU on Proxmox VE 7.x

dcsapak · Aug 22, 2022

Hi all,

recently we got access to a vGPU capable GPU (RTX A5000), and we put together a short how-to on how to use it with a current PVE 7.x here:
https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE_7.x

While Proxmox VE is not a supported platform for NVIDIA's GRID/vGPU, it seems to work quite well here. Some things in the wiki article are not fully packaged yet, but those are marked with the upcoming versions that will contain them.

Also note that more improvements for handling PCI passthrough (especially for clusters) are incoming (see [0] for the current state),
which will make using vGPUs even easier (I will update the wiki article once those patches are applied).

If you have any questions or suggestions, just ask here.

0: https://lists.proxmox.com/pipermail/pve-devel/2022-July/053565.html

MRosu · Aug 23, 2022

I am imagining an NVR VM which can be live migrated between hosts and can pick up vGPUs from the new host as easily as VCPUs.

In this current state, would this be possible?

dcsapak · Aug 23, 2022

No, live migration is not really supported. While we could enable a flag on the QEMU side, the driver must support it too, and the current version of the NVIDIA vGPU driver does not support live migration on Linux.

With the next version of my proposed patches linked above, I'll introduce a mechanism that lets a VM choose a PCI device dynamically (if configured), so there is no need to hardcode the address in the VM config (still no live migration though).
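
A minimal sketch of what choosing a device dynamically could look like, assuming a VM is configured with a pool of candidate PCI addresses and the node knows which ones are currently in use. The function name, the pool and the addresses are hypothetical and are not taken from the proposed patches.

```python
from typing import Iterable, Optional, Set


def pick_free_device(candidates: Iterable[str], in_use: Set[str]) -> Optional[str]:
    """Return the first candidate PCI address that is not in use, or None."""
    for address in candidates:
        if address not in in_use:
            return address
    return None


# Hypothetical pool of virtual functions configured for one VM.
pool = ["0000:01:00.4", "0000:01:00.5", "0000:01:00.6"]
# Hypothetical set of addresses already claimed by running VMs.
busy = {"0000:01:00.4"}

print(pick_free_device(pool, busy))  # 0000:01:00.5
```

The point of the design is that the VM config names a set of acceptable devices rather than one fixed address, so the same config can start on any node that has a free device from that set.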

appliedmatt · Aug 26, 2022

thanks for this, can't wait for improved clustering. I'll show results, trying this with a m6000 on Monday.

punjprateek · Aug 27, 2022

Looking forward to testing it, as I was unable to make SR-IOV work on an RTX A6000 with vGPU splicing.

punjprateek · Aug 28, 2022

@dcsapak I would also suggest one key improvement to the PCI ID menu: if someone assigns a function to a VM, it should be greyed out and not be available for assignment to another VM until it is removed from the existing VM.

spirit · Aug 29, 2022

spoiler

punjprateek · Aug 29, 2022

spirit said:
spoiler

Nice

dcsapak · Sep 1, 2022

punjprateek said:
@dcsapak I would also suggest one key improvement to the PCI ID menu: if someone assigns a function to a VM, it should be greyed out and not be available for assignment to another VM until it is removed from the existing VM.

that will probably not work like this, else we would have to hold a complete cluster lock on editing vm configs...

what we already do though, is we 'reserve' the pci devices in a file on vm start, and prevent the start of another vm as long as that first one is running
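
To illustrate the idea of reserving at start time rather than at config time, here is a rough sketch. It is not the Proxmox VE implementation: the JSON file layout, the function names and the `DeviceBusy` exception are all assumptions made for this example, and a real version would also need a lock around the read-modify-write of the file.

```python
import json
import os
import tempfile


class DeviceBusy(Exception):
    """Raised when a PCI device is already reserved by another VM."""


def load_reservations(path):
    """Read the address -> VMID mapping, or return an empty one if no file exists."""
    if not os.path.exists(path):
        return {}
    with open(path) as fh:
        return json.load(fh)


def reserve(path, address, vmid):
    """Reserve a device for a VM at start time; fail if another VM holds it."""
    reservations = load_reservations(path)
    owner = reservations.get(address)
    if owner is not None and owner != vmid:
        raise DeviceBusy(f"{address} is reserved by VM {owner}")
    reservations[address] = vmid
    with open(path, "w") as fh:
        json.dump(reservations, fh)


def release(path, vmid):
    """Drop all reservations held by a VM, e.g. when it stops."""
    remaining = {a: o for a, o in load_reservations(path).items() if o != vmid}
    with open(path, "w") as fh:
        json.dump(remaining, fh)


# Usage with hypothetical VMIDs and a hypothetical PCI address.
with tempfile.TemporaryDirectory() as tmp:
    state = os.path.join(tmp, "pci-reservations.json")
    reserve(state, "0000:01:00.4", 100)      # VM 100 starts
    try:
        reserve(state, "0000:01:00.4", 101)  # VM 101 wants the same device
    except DeviceBusy as err:
        print("start refused:", err)
    release(state, 100)                      # VM 100 stops
    reserve(state, "0000:01:00.4", 101)      # now VM 101 can start
```

Both VMs can keep the same device in their configs; only the second start is refused while the first VM is running.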

punjprateek · Sep 1, 2022

dcsapak said:
that will probably not work like this, else we would have to hold a complete cluster lock on editing vm configs...

what we already do though, is we 'reserve' the pci devices in a file on vm start, and prevent the start of another vm as long as that first one is running

You reserve the PCI device, but nothing shows it as greyed out or in use until it is assigned to a different VM and that VM fails with an error.

dcsapak · Sep 1, 2022

punjprateek said:
You reserve the PCI device, but nothing shows it as greyed out or in use until it is assigned to a different VM and that VM fails with an error.

Yes, as I said, preventing a user from configuring the device would need (at least) a node-wide lock for editing VM configs, which is neither ideal nor really feasible.
We could maybe implement something for the GUI only (e.g. showing the device as 'in use' or something like that), but that would not prevent users from configuring it (e.g. via the API).

punjprateek · Sep 1, 2022

dcsapak said:
Yes, as I said, preventing a user from configuring the device would need (at least) a node-wide lock for editing VM configs, which is neither ideal nor really feasible.
We could maybe implement something for the GUI only (e.g. showing the device as 'in use' or something like that), but that would not prevent users from configuring it (e.g. via the API).

If the API could also report that the device is active on a VM, that would solve the problem as well.

patch · Sep 3, 2022

punjprateek said:
if someone assigns a function to a VM, it should be greyed out and not be available for assignment to another VM until it is removed from the existing VM.

Imo that would be a bad idea.
I often have different VMs that use the same resource, just not simultaneously, since I do not run both VMs at the same time.
If your idea were implemented consistently, a VM with passthrough could not be copied.

In summary, showing resources currently in use is a useful enhancement for the GUI, but preventing duplicate assignment is a bad idea. The lock needs to occur at VM startup, not at VM definition.

jasonsansone · Oct 17, 2022

Do vGPU drivers allow host access to the GPU, i.e. retain the ability to use the GPU in LXC containers? I have GPUs that are used in multiple LXC containers, and it would be nice to also leverage vGPU in VMs, but I'm not sure if the drivers permit both.

dcsapak · Oct 18, 2022

jasonsansone said:
Do vGPU drivers allow host access to the GPU, i.e. retain the ability to use the GPU in LXC containers? I have GPUs that are used in multiple LXC containers, and it would be nice to also leverage vGPU in VMs, but I'm not sure if the drivers permit both.

I guess this depends on the driver. In my example, the NVIDIA vGPU driver does not use the card in a 'normal' way, so no device nodes are created on the host for GPU output, etc.
Maybe it's possible, but I don't know how, and a short look into the NVIDIA documents doesn't immediately reveal anything.

wastedolphine · Oct 23, 2022

Hi,

I followed the guide you provided, but it's not working for me.

I have an NVIDIA Quadro RTX 8000.
The host drivers installed correctly; nvidia-smi works on the host and shows the vGPU profiles.

I was also able to install the guest drivers in an Ubuntu VM, but nvidia-smi in the guest reports that it is unable to communicate with the hardware.

Is there anything else I need to do to get this GPU working?

Thanks for the help.

deepcloud · Dec 1, 2022

Can anybody help me?

I am running an AMD Ryzen 7950X with an A5000, and I have installed Proxmox 7.3 with the latest updates.

I have followed the above and installed the VFIO modules, but when I run the NVIDIA host driver installer I get an error:

ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 450.216.04 -k 5.15.74-1-pve`:
Kernel preparation unnecessary for this kernel. Skipping...

The output from /var/log/nvidia-installer.log is below:
----
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Dec 1 15:51:29 2022
installer version: 450.216.04

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

nvidia-installer command line:
./nvidia-installer

Using: nvidia-installer ncurses v6 user interface
-> Detected 32 CPUs online; setting concurrency level to 32.
-> Running distribution scripts
executing: '/usr/lib/nvidia/pre-unload'...
-> done.
-> Installing NVIDIA driver version 450.216.04.
-> There appears to already be a driver installed on your system (version: 450.216.04). As part of installing this driver (version: 450.216.04), the existin>
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kerne>
-> Uninstalling the previous installation with /usr/bin/nvidia-uninstall.
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (450.216.04):
executing: '/usr/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 450.216.04 -k 5.15.74-1-pve`:
Kernel preparation unnecessary for this kernel. Skipping...

Building module:
cleaning build area...
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.15.74-1-pve IGNORE_CC_MISMATCH='' modules...(bad exit status: 2)
Error! Bad return status for module build on kernel: 5.15.74-1-pve (x86_64)
Consult /var/lib/dkms/nvidia/450.216.04/build/make.log for more information.
-> error.
ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again without DKMS, or check the DKMS logs for>
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems i>

dcsapak · Dec 1, 2022

dcorp said:
Consult /var/lib/dkms/nvidia/450.216.04/build/make.log for more information.

can you post that file?

EDIT: no need, I just saw that you are trying to install the 450 driver... I guess this is too old for the 5.15 kernel.
Check here: https://docs.nvidia.com/grid/index.html
for which versions are supported, and use a current one (e.g. we used 510.85.03 here).
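
As a small aid for this kind of diagnosis, the following sketch pulls the driver version and the target kernel out of the failing DKMS line in nvidia-installer.log. The regular expression is written against the log line posted above; the function name is made up for this example, and other installer versions may format the line differently.

```python
import re
from typing import Optional, Tuple

# Matches the command that the installer reports as failed, e.g.
# `/usr/sbin/dkms build -m nvidia -v 450.216.04 -k 5.15.74-1-pve`
DKMS_BUILD = re.compile(
    r"dkms build -m nvidia -v (?P<driver>[\d.]+) -k (?P<kernel>[^\s`]+)"
)


def failed_dkms_build(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (driver_version, kernel_version) from the first DKMS build line, if any."""
    match = DKMS_BUILD.search(log_text)
    if match is None:
        return None
    return match.group("driver"), match.group("kernel")


line = (
    "ERROR: Failed to run `/usr/sbin/dkms build -m nvidia "
    "-v 450.216.04 -k 5.15.74-1-pve`:"
)
print(failed_dkms_build(line))  # ('450.216.04', '5.15.74-1-pve')
```

Seeing the two versions side by side makes a driver/kernel mismatch like the one in this thread easier to spot before digging into make.log.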

deepcloud · Dec 1, 2022

Hi @dcsapak,

The problem is that I have already downloaded the latest driver available, from the page below (see attached screenshot).

we get this driver
https://griddownloads.nvidia.com/em..._CHch4zYmU8X9h_J7feEwVojXEARCsLw0fes4Vall3eBQ

So it is only version 450.216.04. Where do I find version 510.85.03?

dcsapak · Dec 1, 2022

This is for 'product version' 11.10, but the RTX A5000 only supports the v13/v14 versions: https://docs.nvidia.com/grid/gpus-supported-by-vgpu.html
Search for the Linux KVM package for 'product version 14.3', for instance.
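
The check described here can be sketched as a simple lookup. The table below contains only what is stated in this thread (RTX A5000: product versions 13 and 14); it is not a copy of NVIDIA's support matrix, and the function names are invented for the example.

```python
# GPU model -> major vGPU product versions that support it.
# Only the entry mentioned in this thread; consult NVIDIA's list for real data.
SUPPORTED_MAJOR = {
    "RTX A5000": {13, 14},
}


def product_major(product_version: str) -> int:
    """Extract the major number from a product version such as '11.10'."""
    return int(product_version.split(".")[0])


def is_supported(gpu: str, product_version: str) -> bool:
    """Check whether a vGPU product version is listed as supporting the GPU."""
    return product_major(product_version) in SUPPORTED_MAJOR.get(gpu, set())


print(is_supported("RTX A5000", "11.10"))  # False
print(is_supported("RTX A5000", "14.3"))   # True
```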
