setting up NVidia vGPU is driving me mad

proxwolfe

Well-Known Member
Jun 20, 2020
I am trying to set up an Nvidia vGPU for AI workloads and I am following this guide: https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE_7.x#

My card is an RTX A5000. I obtained the NVidia grid host driver (535.129) and installed it on the PVE host. I activated SR-IOV and now have several virtual devices that I can pass through to a VM.
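For reference, enabling SR-IOV boiled down to something like this (following the wiki; the PCI address is just an example from my box, so adjust it):

Code:
# enable the virtual functions on the A5000 (PCI address is an example)
/usr/lib/nvidia/sriov-manage -e 0000:01:00.0
# check that the virtual functions actually appeared
ls -l /sys/bus/pci/devices/0000:01:00.0/ | grep virtfn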

I created a Debian 12 VM and installed the NVidia grid guest driver (535.129) in the VM. nvidia-smi is showing me the virtual card (with Cuda 12.2).

My troubles start when I try to set up Cuda.

I downloaded the official Cuda package (12.2, because it says somewhere that 12.3 is not compatible) from Nvidia. It wants to install a "driver" in addition to the Cuda Toolkit. If I let it install that driver (535.54), it uninstalls my vGPU guest driver and then tells me that it is not compatible with the card it found. So I thought, well, this seems to be a normal driver that I can do without (because I already have the grid guest driver). But if I don't let it install the driver, it complains that the "Cuda driver" was not installed. So apparently it is a Cuda driver after all. But why does it uninstall my grid guest driver then?
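For completeness, this is roughly how I tried to install just the toolkit without the bundled driver (the exact runfile name may differ depending on the download):

Code:
# install only the CUDA toolkit, keeping the existing GRID guest driver
sh cuda_12.2.0_535.54.03_linux.run --silent --toolkit
# make the toolkit visible to the shell afterwards
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc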

If I install the official Debian non-free Nvidia drivers, nvidia-smi can't communicate with the driver anymore, which is not surprising. But trying this driver was my last resort.

Can someone tell me, PLEASE!!!, how to set up Cuda in a VM with a vGPU? I'm about to lose my mind over this...

Thanks!
 
Only NVidia knows their drivers, firmware and hardware. The open-source Linux kernel holds no secrets but people outside of NVidia would just be guessing. Have you asked NVidia support?
EDIT: apologies for not being helpful. Maybe someone here has encountered the same problem and has found a solution.
 
Only NVidia knows their drivers, firmware and hardware. The open-source Linux kernel holds no secrets but people outside of NVidia would just be guessing.
Yeah, this thing remains a mystery to me. I have wasted more hours on this than on anything else in my home lab, ever.
Have you asked NVidia support?
NVidia support is going to be my next stop. But I thought I'd try here first, given that the Proxmox guys have experimented with vGPU (albeit not necessarily Cuda) with the same type of card I own, that I am already registered here, and that this is the forum with the most competent people I know.
EDIT: apologies for not being helpful. Maybe someone here has encountered the same problem and has found a solution.
That's what I'm hoping for.
 
So I managed to get this working after a (long) while. But yesterday it stopped working again. I don't know why, but my best guess is this:

When I installed the Grid drivers (both on the host and in the VM), I needed to also install the respective kernel headers and make utilities. So while I don't fully understand the mechanism, my understanding is that the driver needed to be tailored to my kernel.
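Concretely, that was something along these lines before running the .run installer (package names may differ slightly between the PVE host and the Debian guest):

Code:
# on the PVE host: headers for the running PVE kernel plus build tools
apt install pve-headers-$(uname -r) build-essential
# in the Debian 12 VM: the distro kernel headers instead
apt install linux-headers-$(uname -r) build-essential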

This worked for a while. But then I upgraded both PVE and the VM OS. After a reboot of the VM I noticed that nvidia-smi complained that the driver was not loaded. (At that time, I had no idea as to the reason.) I then rebooted the host and found the same happening there.

Therefore, could it be that the drivers don't work anymore because the kernels (of the host and of the VM) have changed?

If so, I would have expected the upgrade process to take care of this and trigger the rebuilding of the drivers for the new kernels. But maybe this expectation is naive?

If so, how can I manually trigger the rebuilding of the drivers for the new kernels now?

And is it possible to have this happen automatically in the future?

Or am I barking up the wrong tree and there's another cause for all of this?

Thanks!
 
Therefore, could it be that the drivers don't work anymore because the kernels (of the host and of the VM) have changed?
Yes, very much so. Linux kernel developers do not care about keeping internal interfaces stable for proprietary drivers (as opposed to user-space programs, where they work very hard to maintain compatibility). They prefer that vendors work with them to get drivers included in the mainline kernel, where the community can keep them up to date after the vendors lose interest.

If so, I would have expected the upgrade process to take care of this and trigger the rebuilding of the drivers for the new kernels. But maybe this expectation is naive?
Only if the proprietary drivers are packaged with DKMS and only if they are still compatible with the new kernel and you have updated the kernel headers appropriately. Otherwise it is a manual process.
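A quick way to check whether DKMS is actually managing the NVidia module, and to rebuild it by hand for the running kernel (the installer filename below is just an example, adjust it to your GRID version):

Code:
# list the modules DKMS knows about and the kernels they are built for
dkms status
# rebuild everything registered with DKMS for the currently running kernel
dkms autoinstall -k $(uname -r)
# if the driver is not registered with DKMS, install the new kernel's headers
# and re-run the GRID installer with DKMS support enabled
./NVIDIA-Linux-x86_64-535.129.03-grid.run --dkms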

ETA: NVidia is well-known in the open-source community for being one of the worst vendors to work with. Ok, maybe Broadcom is worse, but it is a close call.
 
