NVIDIA MIG (Multi-Instance GPU) on Proxmox

Thank you @dcapak

This is not vGPU though, it's MIG, a newish (introduced in 2020) technology from NVIDIA.

And I can confirm that creating a virtual GPU with MIG does work correctly on the Proxmox command line. But after that, nothing else can be done because (unlike with vGPU instances, which Proxmox can attach to a VM) Proxmox isn't aware the MIG instances exist, so we can't assign them to a VM.
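For reference, the host-side MIG setup is roughly the following (a sketch based on NVIDIA's MIG documentation; the available profile names depend on the card):

Code:
# enable MIG mode on GPU 0 (the GPU must be idle; a reset may be required)
nvidia-smi -i 0 -mig 1
# list the GPU instance profiles this card offers
nvidia-smi mig -lgip
# create a 4g.20gb GPU instance plus its default compute instance
nvidia-smi mig -cgi 4g.20gb -C
# verify that the MIG device now shows up
nvidia-smi -L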
 
ah ok, sorry i only skimmed the linked article.

after looking a bit more it seems that the default deployment is via device nodes in '/dev'? so it should
simply be possible to bind mount those into containers? (there is no mention of qemu/kvm/vms on that page, so i assume vms will not work with MIG)
 
Sorry for taking so long, but I've been trying a lot of different things to get this to work, and nothing works except using Ubuntu or SUSE on bare metal.

What do you mean by bind mounting the /dev device into containers? I was under the impression that bind mounts in Proxmox were only for LXC. Can I use this in a VM?
 
so it should simply be possible to bind mount those into containers? (there is no mention of qemu/kvm/vms on that page, so i assume vms will not work with MIG)
 
I was under the impression that bind mounts in Proxmox were only for LXC. Can I use this in a VM?
yes exactly, the docs do not mention any use of vms, so i guess that will not work, but they do mention containers here: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-containers
although i did not read through the whole documentation, and they only mention their custom docker toolset, my educated guess is that they pass through the relevant device nodes generated in /dev, which the toolkit in the container can then use
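for an lxc container on pve, a minimal sketch of that (untested; the device names and major numbers are assumptions, check 'ls -l /dev/nvidia*' on the host) could look like this in /etc/pve/lxc/<vmid>.conf:

Code:
# allow the container to access the nvidia character devices
# (major 195 is typical for /dev/nvidia*, verify with ls -l /dev/nvidia*)
lxc.cgroup2.devices.allow: c 195:* rwm
# bind mount the device nodes into the container
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
# with MIG, the per-instance capability nodes live under /dev/nvidia-caps;
# their major number is allocated dynamically, so check ls -l /dev/nvidia-caps
# and add a matching lxc.cgroup2.devices.allow line for it as well
lxc.mount.entry: /dev/nvidia-caps dev/nvidia-caps none bind,optional,create=dir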
 
also, you can of course pass through the whole gpu into a vm and use ubuntu/opensuse in there

further edit: it seems those gpus also support 'vgpu', but AFAICT that requires a different driver which is not freely available (only to subscribers of nvidia licenses AFAIK)
 
I can use full passthrough, yes, and that works, but it really defeats the purpose since the GPU can then only be assigned to a single VM instead of several.

vGPU is supposed to work, and I tried with the subscriber NVIDIA drivers, but `mdevctl` complains it can't find any device, so I can't even create a vGPU as a temporary workaround.

For reference, creating a MIG virtualized instance works by creating an SR-IOV device. It even shows its UUID when queried. But on Proxmox, it doesn't show up under `/sys/bus/mdev/devices/` like it does in SUSE, for instance.

If it did show up under the mdev devices, this would be much easier to pull off with some hacking (and to properly add to the Proxmox interface in the future).
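For reference, these are the checks I'm referring to (standard mdevctl/sysfs queries; on Proxmox both come back empty for me):

Code:
# list the mediated device types the driver exposes
mdevctl types
# list any mediated device instances that exist
ls /sys/bus/mdev/devices/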

P.S.: SUSE has a nice guide about it all: https://documentation.suse.com/sles/15-SP3/html/SLES-all/article-nvidia-vgpu.html
 
vGPU is supposed to work, and I tried with the subscriber NVIDIA drivers, but `mdevctl` complains it can't find any device, so I can't even create a vGPU as a temporary workaround.
i guess this is the problem, even the suse docs want to create a vgpu with the conventional method (that we already support -> mdev)

For reference, creating a MIG virtualized instance works by creating an SR-IOV device. It even shows its UUID when queried. But on Proxmox, it doesn't show up under `/sys/bus/mdev/devices/` like it does in SUSE, for instance.
how did you create that instance?
 
sorry for the late answer..

where do you see the vgpu then? in the sysfs ?

if you want to use vgpus in pve, there has to be a pci device that exposes the 'mediated devices' in sysfs. you can then select this device in the pve gui as a pci device, and the 'mediated device' dropdown should give you the available models. pve will then create a new instance (via sysfs) on vm start and clean it up on vm stop
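roughly what happens under the hood (the pci address and type name below are just placeholders, they depend on the card/driver):

Code:
# a pci device that supports mediated devices exposes its types here
ls /sys/bus/pci/devices/0000:01:00.4/mdev_supported_types/
# on vm start, pve creates an instance by writing a uuid into the type's 'create' node
echo "$(uuidgen)" > /sys/bus/pci/devices/0000:01:00.4/mdev_supported_types/nvidia-474/create
# the new instance then appears under
ls /sys/bus/mdev/devices/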
 
No, I don't see it anywhere in the system under Proxmox.
It's `nvidia-smi` that shows it, with the command `nvidia-smi -L`.

For instance:
Code:
> nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ee14e29d-dd5b-2e8e-eeaf-9d3debd10788)
 MIG 4g.20gb     Device  0: (UUID: MIG-fed03f85-fd95-581b-837f-d582496d0260)

On SUSE this shows up under /sys/bus/mdev/devices (e.g. /sys/bus/mdev/devices/fed03f85-fd95-581b-837f-d582496d0260). In Ubuntu I don't remember exactly where. But on Proxmox I can't find any such device.
 
Right, you can do that, but you don't need to. This is for the case where you want to create a MIG device and then divide it further into vGPU device(s).

But are you saying that would be the workaround? I tried to use mdevctl directly on the MIG instance (which didn't work); maybe I'm missing the SR-IOV step there.
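For completeness, the SR-IOV step I mean is the one from NVIDIA's vGPU documentation, where a helper script shipped with the vGPU host driver enables the virtual functions first (I haven't verified this on Proxmox):

Code:
# enable the SR-IOV virtual functions on all supported GPUs
# (this script comes with the licensed vGPU host driver, not the regular datacenter driver)
/usr/lib/nvidia/sriov-manage -e ALL
# the virtual functions should then show up as extra PCI functions on the card
lspci -d 10de: -nn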
 
@jbssm did you manage to get this working?
I have an NVIDIA A100 80 GB card. I got it working in passthrough, but same as you, the mdevctl types are not loaded in Proxmox.
 
Hello! I am actively working on this for a project at my employment. There are a lot of misunderstandings in this thread. First, the A100 and A30 support MIG (Multi-Instance GPU), which is not GRID. MIG partitions the A100 and A30 into smaller contexts. GRID uses vGPU technology that creates PCI devices which can be passed through to VMs. I'm not certain that MIG utilizes SR-IOV in any capacity. The instances of the graphics card are exposed via /dev. To be honest, I don't believe GRID uses SR-IOV either; it uses software-mediated devices, i.e. not hardware virtualization like the Intel Xe or AMD PRO GPUs do.
I believe that MIG will not work with virtual machines, but will require containers, as mentioned in this thread. This is due to the devices not being exposed as hardware, like they are with GRID. GRID is a licensed feature and comes at a price.
I don't wish to encourage license-circumventing software, but if you are looking for vGPU to work with Proxmox, as a purely educational endeavour, vgpu-unlock can unlock software-mediated devices on consumer-grade hardware. The last time I researched this, for the project I'm currently on, I could not get an RTX 3080 to work, but I was able to see a drop-down list of all supported mediated devices within the Proxmox user interface. I could not get the cards to be recognized; however, in the course of getting the A100s passed into my VMs, I discovered that the VM UEFI BIOS for Proxmox is bugged and I could not get the devices to work. I would like to circle back to this and find out whether I can get vGPU with mdevctl devices to function with SeaBIOS vs OVMF, which is what seems to be bugged. For the record, the bug is that the NVIDIA kmod will not recognize the card being passed in, regardless of whether it's a full device or a virtual GPU.
I hope to report back for anyone who is stuck on this, or at least fill in the missing pieces in this thread; however, this will likely be a couple of months' project. The answer may still be to use containers only, as outlined in the aforementioned article.
*I should clarify that the Ampere architecture does support SR-IOV, which is presumably also the reason vgpu-unlock doesn't yet support Ampere devices. I was working with 1080s as well as 3080s for this.
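For anyone taking the container route in the meantime, the NVIDIA container toolkit can pin a container to a single MIG instance by its UUID. A sketch, assuming Docker with the toolkit installed; the image tag is only an example and the UUID is the one from the earlier nvidia-smi -L output:

Code:
# list the MIG devices the driver exposes
nvidia-smi -L
# run a CUDA container against one MIG instance, addressed by its UUID
docker run --rm --gpus '"device=MIG-fed03f85-fd95-581b-837f-d582496d0260"' \
    nvidia/cuda:11.0-base nvidia-smi -L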
 