setting up NVidia vGPU is driving me mad

proxwolfe · Jan 28, 2024

I am trying to set up an Nvidia vGPU for AI workloads and I am following this guide: https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE_7.x#

My card is an RTX A5000. I obtained the NVidia grid host driver (535.129) and installed it on the PVE host. I activated SR-IOV and now have several virtual devices that I can pass through to a VM.

I created a Debian 12 VM and installed the NVidia grid guest driver (535.129) in the VM. nvidia-smi is showing me the virtual card (with Cuda 12.2).

My troubles start when I try to set up Cuda.

I downloaded the official Cuda package (12.2; because it says somewhere that 12.3 is not compatible) from Nvidia. It wants to install a "driver" in addition to the Cuda Toolkit. If I let it install the driver (535.54), it deinstalls my vGPU guest driver and then tells me that itself is not compatible with the card found. So I thought, well, this seems to be a normal driver that I can do without (because I already have the grid guest driver). But if I don't let it install, it complains that the "Cuda driver" was not installed. So apparently, it is a Cuda driver after all. But why does it deinstall my grid guest driver then???

If I install the official Debian non-free Nvidia drivers, nvidia-smi can't communicate with the driver anymore. Which is not surprising. But using this driver was my last straw.

Can someone tell me, PLEASE!!!, how to set up Cuda in a VM with a vGPU? I'm about to lose my mind over this...

Thanks!

leesteken · Jan 28, 2024

Only NVidia knows their drivers, firmware and hardware. The open-source Linux kernel holds no secrets but people outside of NVidia would just be guessing. Have you asked NVidia support?
EDIT: apologies for not being helpful. Maybe someone here has encountered the same problem and has found a solution.

dcsapak · Jan 29, 2024

please check nvidias documentation for this, e.g. for the GRID v16 driver see here: https://docs.nvidia.com/grid/16.0/grid-vgpu-user-guide/index.html#cuda-open-cl-support-vgpu

proxwolfe · Jan 29, 2024

leesteken said:
Only NVidia knows their drivers, firmware and hardware. The open-source Linux kernel holds no secrets but people outside of NVidia would just be guessing.

Yeah, this thing remains a mystery too me. I have wasted more hours on this than on any other thing in my home lab ever.

leesteken said:
Have you asked NVidia support?

NVidia support is going to be my next stop. But I thought I'd try here first, given that the Proxmox guys have experimented with vGPU (albeit not necessarily Cuda) with the same type of card I own, that I am already registered here, and that this is the forum with the most competent people I know.

leesteken said:
EDIT: apologies for not being helpful. Maybe someone here has encountered the same problem and has found a solution.

That's what I'm hoping for.

proxwolfe · Apr 17, 2024

So I managed to get this working after a (long) while. ABut yesterday it stopped working again. I don't know why but my best guess is this:

When I installed the Grid drivers (both on the host and in the VM), I needed to also install the respective kernel headers and make utilities. So while I don't fully understand the mechanism, my understanding is that the driver needed to be tailored to my kernel.

This worked for a while. But then I upgraded both PVE and the VM OS. After a reboot of the VM I noticed that NVidia-SMI complained that the driver was not loaded. (At that time, I did not have any idea as to the reasons.) I then rebootet the host and found the same happening on the host.

Therefore, could it be that the drivers don't work anymore because the kernels (of the host and of the VM) have changed?

If so, I would have expected the upgrade process to take care of this and trigger the rebuilding of the drivers for the new kernels. But maybe this expectation is naive?

If so, how can I manually trigger the rebuilding of the drivers for the new kernels now?

And is it possible to have this happen automatically in the future?

Or am I barking up the wrong tree and there's another cause for all of this?

Thanks!

BobhWasatch · Apr 17, 2024

proxwolfe said:
Therefore, could it be that the drivers don't work anymore because the kernels (of the host and of the VM) have changed?

Yes, very much so. Linux kernel developers do not care about keeping stable internal interfaces for proprietary drivers (as opposed to regular programs where they work very hard to maintain compatibility). They prefer that vendors work with them to include drivers in the kernel package where the community can keep them up-to-date after the vendors get bored of it.

proxwolfe said:
If so, I would have expected the upgrade process to take care of this and trigger the rebuilding of the drivers for the new kernels. But maybe this expectation is naive?

Only if the proprietary drivers are packaged with DKMS and only if they are still compatible with the new kernel and you have updated the kernel headers appropriately. Otherwise it is a manual process.

ETA: NVidia is well-known in the open-source community for being one of the worst vendors to work with. Ok, maybe Broadcom is worse, but it is a close call.

Search

Search

setting up NVidia vGPU is driving me mad

proxwolfe

Well-Known Member

leesteken

Distinguished Member

dcsapak

Proxmox Staff Member

proxwolfe

Well-Known Member

proxwolfe

Well-Known Member

BobhWasatch

Famous Member