vGPU with nVIDIA on Kernel 6.8

@relink: You will need to download 17.4 from your NVIDIA enterprise account to work with the current kernel versions. I haven't had much luck getting 16.8 to run, but I believe you can still run the guest with 16.8 if you have a 17.4 hypervisor.
Thank you for your advice :)

But how do I upgrade the driver properly?

Never done this before, can you help me out?

As far as I know, there is no simple "upgrade" option, and you can't just pull 17.4 and install it.

I created a thread yesterday to exchange experiences :)
https://forum.proxmox.com/threads/proper-way-to-update-upgrade-nvidia-drivers.158523/
 
Update:
We made a large step towards working vGPUs with our A100.
Out of curiosity we just tried it with the NVIDIA AI Enterprise drivers from the NGC Catalog and BAAM...we got the nvidia folder with creatable_vgpu_types, etc.

Output of nvidia-smi vgpu -c to display the creatable vgpu types is now:

Code:
GPU 00000000:07:00.0
    GRID A100X-8C

GPU 00000000:0B:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:48:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:4C:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:88:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:8B:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:C8:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:CB:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

As you can see, the first GPU only lists GRID A100X-8C, because we had already written the ID for that type to current_vgpu_type of its first virtual function, virtfn0 (which points to PCI ID 0000:07:00.4). The ID must be something like 459.
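
A minimal sketch of how this could be checked from the shell, assuming the sysfs layout described in this post (a virtfn0 link under the physical GPU and an nvidia folder under the virtual function). The function names and the SYSFS_ROOT variable are made up for illustration; SYSFS_ROOT only exists so the functions can be pointed at a scratch directory.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override for testing; on a real host it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the PCI address that virtfn<index> of a physical GPU points to.
# Usage: vf_address 0000:07:00.0 0
vf_address() {
    link="$SYSFS_ROOT/bus/pci/devices/$1/virtfn$2"
    [ -L "$link" ] || { echo "no such virtual function: $link" >&2; return 1; }
    basename "$(readlink "$link")"
}

# Print the vGPU type ID currently set on a virtual function (0 means none).
# Usage: vf_current_type 0000:07:00.4
vf_current_type() {
    cat "$SYSFS_ROOT/bus/pci/devices/$1/nvidia/current_vgpu_type"
}
```

On the host described here, vf_address 0000:07:00.0 0 should print 0000:07:00.4, and vf_current_type 0000:07:00.4 should print the ID that was written.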

But we couldn't start the VM configured with PCI device 0000:07:00.4 because it said:
Code:
error writing '0000:07:00.4' to '/sys/bus/pci/drivers/vfio-pci/bind': Invalid argument
TASK ERROR: Cannot bind 0000:07:00.4 to vfio

We tried with another GPU (0000:0b:00) and added PCI devices 0000:0b:00.4 and 0000:0b:00.5 with MDev Type nvidia-459 configured via GUI to the VM and TADAAAA...it works!!!
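
For those who prefer the command line over the GUI, the same assignment could presumably be made with qm set. This is only a sketch: it assumes the hostpci option accepts the mdev property in the form address,mdev=type, the helper name is made up, and VM ID 100 is hypothetical.

```shell
#!/bin/sh
# Build the value for a hostpciN option from a PCI address and an MDev type.
# Usage: hostpci_value 0000:0b:00.4 nvidia-459
hostpci_value() {
    [ -n "$1" ] && [ -n "$2" ] || { echo "usage: hostpci_value <pci-address> <mdev-type>" >&2; return 2; }
    printf '%s,mdev=%s\n' "$1" "$2"
}

# Example (VM ID 100 is hypothetical; run on the Proxmox host):
#   qm set 100 -hostpci0 "$(hostpci_value 0000:0b:00.4 nvidia-459)"
#   qm set 100 -hostpci1 "$(hostpci_value 0000:0b:00.5 nvidia-459)"
```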

After a reboot, the first GPU could also be configured via the GUI again.

We tested a few things and so far everything works correctly.

Can't believe it was the AI Enterprise drivers. Figuring this out took us almost 5 days.
 
Hello, I'm having the exact same issue with the same hardware. Are you referring to this driver? https://catalog.ngc.nvidia.com/orgs/nvidia/teams/vgpu/resources/vgpu-host-driver-5 I'm waiting for a trial from NVIDIA. If this is the one, could you send it over somehow?
 
Sorry for the late reply.

What drivers were you trying before?
We tried the regular vGPU drivers from the NVIDIA Licensing Portal (NLP), but they didn't work. They only got us very close to the solution, which was really frustrating :rolleyes:

So sorry I didn't read your request earlier. Did you manage to get the free evaluation yet?
To be precise, we used "vGPU Host Driver 4", which contains driver version 535.216.01. We haven't tried a newer version yet because we were so happy it worked and didn't want to bother with it any longer. I would really appreciate it if you could test a newer driver version and post a short reply on whether it worked.
 
We're experiencing this error now:
Code:
error writing '461' to '/sys/bus/pci/devices/0000:07:00.4/nvidia/current_vgpu_type': Invalid argument
TASK ERROR: could not set vgpu type to '461' for '0000:07:00.4'

Last time we could just reboot the whole hypervisor and it worked again, but what if we have to change something in a production environment?
This seems to happen when a GPU was previously configured as a vGPU and then disconnected from a VM. The vGPU type in current_vgpu_type is correctly reset to 0 and creatable_vgpu_types shows all possible types again, but current_vgpu_type cannot be written:
Code:
/sys/bus/pci/devices/0000:07:00.4/nvidia# ls -al
total 0
drwxr-xr-x 2 root root    0 Dec 19 09:45 .
drwxr-xr-x 6 root root    0 Dec 19 09:43 ..
-r--r--r-- 1 root root 4096 Dec 19 10:41 creatable_vgpu_types
-rw-r--r-- 1 root root 4096 Dec 19 15:52 current_vgpu_type
-rw-r--r-- 1 root root 4096 Dec 19 12:23 vgpu_params

The listing suggests the file is writable, but the nano editor says [ Error writing lock file ./.current_vgpu_type.swp: Permission denied ].
And if we edit the file anyway and try to save it, it says [ Error writing current_vgpu_type: Invalid argument ].

Do you know what we could try to restart/reset so we could assign a vGPU Type to a GPU again without the need of a hypervisor restart?

Edit: copied the wrong error message

Edit2: Figured it out!!! Seems like disabling and enabling the Virtual Functions with /usr/lib/nvidia/sriov-manage did the trick.
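
For anyone hitting the same error, the disable/enable cycle could be wrapped roughly like this. It is a sketch under assumptions: it relies on the -d (disable) and -e (enable) options of sriov-manage and takes the PCI address of the physical GPU (0000:07:00.0 here), not of the virtual function. The function name and the SRIOV_MANAGE variable are only for illustration. Presumably no VM should still be using a vGPU on that card when this runs.

```shell
#!/bin/sh
# Path taken from the post above; overridable only so the sketch can be tested.
SRIOV_MANAGE="${SRIOV_MANAGE:-/usr/lib/nvidia/sriov-manage}"

# Disable, then re-enable, the virtual functions of one physical GPU.
# Usage: reset_vfs 0000:07:00.0
reset_vfs() {
    [ -n "$1" ] || { echo "usage: reset_vfs <pci-address-of-physical-gpu>" >&2; return 2; }
    "$SRIOV_MANAGE" -d "$1" || return 1
    "$SRIOV_MANAGE" -e "$1"
}
```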
 
You shouldn’t use a text editor; /sys is not a real file system and current_vgpu_type is not a regular file. Use echo and redirects if you want to do it manually.
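
In practice that could look like the sketch below. It assumes the same sysfs path as in the error message above; the function name and SYSFS_ROOT are hypothetical and only there so it can be tried against a scratch directory. The type ID should be one that creatable_vgpu_types currently lists for that virtual function.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override for testing; on a real host it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Write a vGPU type ID to a virtual function with a plain redirect (no editor).
# Usage: set_vgpu_type 0000:07:00.4 461
set_vgpu_type() {
    attr="$SYSFS_ROOT/bus/pci/devices/$1/nvidia/current_vgpu_type"
    [ -e "$attr" ] || { echo "missing: $attr" >&2; return 1; }
    echo "$2" > "$attr" || return 1
    # Read the value back to confirm what is now set.
    cat "$attr"
}
```

If the write is rejected, as in the "Invalid argument" case above, the function returns a non-zero status instead of printing the new value.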
 
