vGPU, NVIDIA and their cryptic documentation

fenex

Apr 1, 2024
Proxmox 8 on a Dell 7920 rack workstation.
512 GB RAM
2x Xeon Gold 6154
96 TB RAID 6
100gb iSCSI
1x NVIDIA A100 (4x A100-80 planned)

The project is a fire prediction AI model. We are scrambling to get this up and running before fire season really hits. With global warming and El Niño, Canada is going to burn completely unless we can get this prediction model going. The situation is rather dire.

The good news is that our model was proven in the test cases we worked on last year. It was able to predict where a fire would start and which way it would spread, so mitigation methods can be employed to stop the fire before it even starts.

This server is the result of modest government funding to roll this out further.

OK, so here is what we are up against. We need to run a Windows 10 machine for the multispectral GIS software (Agisoft Metashape, ArcGIS), some Docker containers, and a couple of Alpine Linux VMs, all running off 4x NVIDIA A100s.

I've followed this tutorial:

https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE

and this thread:
https://forum.proxmox.com/threads/pci-passthrough-issue-on-dell-precision-7920-with-proxmox.135508/

The odd thing is I had it running... once. I had to rebuild the Windows VM, and the second time around it won't install the NVIDIA client drivers.

Because there are no mdevs coming up, I largely suspect it's a licensing issue. I have NVIDIA vGPU licences, but not the slightest clue how to install them on the Proxmox server, and their documentation, to say the least, sucks donkey.

Is there anyone out there who could help me with a walkthrough, throw me some bones, anything? I would super appreciate it.
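
For anyone landing here with the same "no mdevs coming up" symptom: a quick way to confirm it is to look at what the kernel's mediated-device interface exposes under /sys/class/mdev_bus. The following is only a minimal sketch, assuming Python 3 is available on the host. The function name list_mdev_types is made up for illustration, and on some GPUs the vGPU types may only show up on SR-IOV virtual functions once the host driver has enabled them.

```python
from pathlib import Path


def list_mdev_types(sysfs_root="/sys/class/mdev_bus"):
    """Return {device: [(type_id, name, available_instances), ...]}.

    An empty dict means no driver has registered a mediated device,
    which matches the "no mdevs" symptom described in the post.
    """
    root = Path(sysfs_root)
    result = {}
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        types_dir = dev / "mdev_supported_types"
        if not types_dir.is_dir():
            continue
        entries = []
        for mdev_type in sorted(types_dir.iterdir()):
            name_file = mdev_type / "name"
            avail_file = mdev_type / "available_instances"
            # Both attribute files are optional here; fall back to "".
            name = name_file.read_text().strip() if name_file.is_file() else ""
            avail = avail_file.read_text().strip() if avail_file.is_file() else ""
            entries.append((mdev_type.name, name, avail))
        result[dev.name] = entries
    return result


if __name__ == "__main__":
    found = list_mdev_types()
    if not found:
        print("No mdev-capable devices registered on this host.")
    for device, types in found.items():
        for type_id, name, avail in types:
            print(f"{device}: {type_id} ({name}), available: {avail}")
```

If this prints nothing but the "no mdev-capable devices" line, the host driver has not registered any vGPU types at all, which happens before guest-side licensing comes into play.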
 
Hi,

Which NVIDIA driver did you use? According to https://docs.nvidia.com/grid/gpus-supported-by-vgpu.html
the A100 is only supported up to the GRID v15 driver (no longer by v16 or v17), which will not work on Proxmox VE 8.x because of the newer kernel.
For current driver support on the A100, NVIDIA probably wants you to use NVIDIA AI Enterprise (https://www.nvidia.com/en-us/data-center/products/ai-enterprise/), but we currently don't have any experience with that, as the licensing seems even steeper than with vGPU and is even harder to get (no trial, AFAICS).

Another way would be to pass the complete GPU through to a VM without splitting it up; then you don't need the NVIDIA host driver at all.
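
Full passthrough hands the VM everything in the GPU's IOMMU group, so it is worth checking what else shares that group first. A rough sketch, assuming Python 3 on the host and that the IOMMU is enabled; the function name iommu_group_members and the PCI address 0000:3b:00.0 are placeholders, not values from this thread:

```python
from pathlib import Path


def iommu_group_members(pci_address, sysfs_root="/sys/kernel/iommu_groups"):
    """Return (group_id, [devices in that group]) for a PCI address.

    Returns None if the directory is missing (IOMMU likely disabled)
    or the address is not found in any group.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return None
    for group in root.iterdir():
        devices = group / "devices"
        if (devices / pci_address).exists():
            return group.name, sorted(p.name for p in devices.iterdir())
    return None


if __name__ == "__main__":
    # Placeholder address; substitute the one lspci reports for the GPU.
    print(iommu_group_members("0000:3b:00.0"))
```

If the GPU shares its group with unrelated devices, those would have to go to the VM as well.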
 
Oh lovely... I just spent the entire long weekend trying to force this. It's too bad the Proxmox documentation didn't reflect this.

So what is the last Proxmox that will run GRID 15?
 
That would be PVE 7.4 with the 5.15 kernel.

Ubuntu 22.04 is the highest supported version for GRID v15, and since our kernel is derived from Ubuntu's, that should probably work.
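
To check a host against that before installing anything, the running kernel can be compared with the 5.15 figure. A minimal sketch, assuming Python 3; the 5.15 limit is taken from this reply only and is not an official NVIDIA support matrix, and the function names are made up for illustration:

```python
import platform
import re


def kernel_major_minor(release):
    """Parse '5.15.108-1-pve' into (5, 15)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def within_kernel_limit(release, limit=(5, 15)):
    """True if the kernel is no newer than the assumed limit."""
    return kernel_major_minor(release) <= limit


if __name__ == "__main__":
    running = platform.release()
    verdict = "within" if within_kernel_limit(running) else "newer than"
    print(f"Kernel {running} is {verdict} the assumed 5.15 limit.")
```

This only compares version numbers; it says nothing about whether the driver's kernel module actually builds.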

Note that you can also use nested virtualization and just pass your GPU through to the nested PVE. I'm not sure if that's applicable in your situation, but I wanted to mention it anyway, even if it's just for testing whether the driver actually runs before nuking your entire system. Like @dcsapak mentioned, you wouldn't need NVIDIA's driver on the host in that case either.
 