vGPU with nVIDIA on Kernel 6.8

@relink: You will need to download 17.4 from your nVIDIA enterprise account to work with the current kernel versions. I haven't had much luck getting 16.8 to run, but I believe you can still run the guest with 16.8 if you have a 17.4 hypervisor.
Thank you for your advice :)

But how do I upgrade the driver properly?

I have never done this before. Can you help me out?

As far as I know, there is no simple "upgrade" option, and it is not just a matter of pulling 17.4 and installing it.

I created a thread yesterday to exchange experiences :)
https://forum.proxmox.com/threads/proper-way-to-update-upgrade-nvidia-drivers.158523/
 
Update:
We took a big step towards working vGPUs with our A100.
Out of curiosity we just tried it with the NVIDIA AI Enterprise drivers from the NGC Catalog and BAAM...we got the nvidia folder with creatable_vgpu_types, etc.

The output of nvidia-smi vgpu -c, which lists the creatable vGPU types, is now:

Code:
GPU 00000000:07:00.0
    GRID A100X-8C

GPU 00000000:0B:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:48:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:4C:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:88:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:8B:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:C8:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:CB:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

As you can see, the first GPU lists only GRID A100X-8C, because we had already written the ID for that type into current_vgpu_type of its first virtual function, virtfn0 (which points to PCI ID 0000:07:00.4). The ID must be something like 459.
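
For anyone who wants to check the same thing on their own host, here is a minimal Python sketch that reads the two files mentioned above for one virtual function. It assumes the files live under /sys/bus/pci/devices/<GPU address>/virtfn<N>/nvidia/, which is where we found them on our system. The function names and the sysfs_root parameter are made up for illustration, and the sketch only reads, it does not set a type.

```python
from pathlib import Path


def vf_nvidia_dir(pf_address, vf_index, sysfs_root="/sys"):
    """Return the 'nvidia' folder of one virtual function of a GPU.

    pf_address is the PCI address of the physical GPU, e.g. "0000:07:00.0".
    The path layout is an assumption based on what we saw on our host.
    """
    return (Path(sysfs_root) / "bus" / "pci" / "devices"
            / pf_address / f"virtfn{vf_index}" / "nvidia")


def read_vgpu_state(pf_address, vf_index, sysfs_root="/sys"):
    """Read current_vgpu_type and creatable_vgpu_types as raw text."""
    base = vf_nvidia_dir(pf_address, vf_index, sysfs_root)
    return {
        name: (base / name).read_text().strip()
        for name in ("current_vgpu_type", "creatable_vgpu_types")
    }


# Example call for the first GPU in this post (run as root on the host):
# print(read_vgpu_state("0000:07:00.0", 0))
```

The sysfs_root parameter is only there so the function can be tried against a copy of the directory tree instead of the live /sys.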

But we couldn't start the VM configured with PCI device 0000:07:00.4 because it failed with:
Code:
error writing '0000:07:00.4' to '/sys/bus/pci/drivers/vfio-pci/bind': Invalid argument
TASK ERROR: Cannot bind 0000:07:00.4 to vfio
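
When a bind to vfio-pci fails like this, it can help to see which driver currently holds the device. The sketch below is only a diagnostic, not a fix, and whether an existing binding was the cause in our case is a guess. It relies on the standard sysfs "driver" symlink of a PCI device; the function name and the sysfs_root parameter are hypothetical.

```python
import os
from pathlib import Path


def bound_driver(pci_address, sysfs_root="/sys"):
    """Return the name of the driver bound to a PCI device, or None.

    sysfs exposes the binding as a 'driver' symlink inside the device
    directory; the last path component of the link target is the driver name.
    """
    link = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_address / "driver"
    if not link.is_symlink():
        return None  # no driver is bound to this device
    return os.path.basename(os.readlink(link))


# Example call for the virtual function from the error above:
# print(bound_driver("0000:07:00.4"))
```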

We then tried another GPU (0000:0b:00): we added PCI devices 0000:0b:00.4 and 0000:0b:00.5 to the VM via the GUI, with MDev type nvidia-459, and TADAAAA...it works!!!

After a reboot, the first GPU could also be configured via the GUI again.

We tested a few things, and so far everything works correctly.

Can't believe it was the AI Enterprise drivers. This realization took us almost 5 days.
 
Hello, I have the exact same issue with the same hardware. Are you referring to this driver? https://catalog.ngc.nvidia.com/orgs/nvidia/teams/vgpu/resources/vgpu-host-driver-5 I'm waiting for a trial from NVIDIA. If it is this one, could you send it over somehow?
 
