vGPU with NVIDIA on Kernel 6.8

@relink: You will need to download 17.4 from your NVIDIA enterprise account to work with the current kernel versions. I haven't had much luck getting 16.8 to run; I believe you can still run the guest with 16.8 if you have a 17.4 hypervisor.
Thank you for your advice :)

But how to upgrade the driver properly?

Never done this before, can you help me out?

As far as I know, there is no simple "upgrade" option where you just pull 17.4 and install it.

I created a thread yesterday, asking for exchanging experiences :)
https://forum.proxmox.com/threads/proper-way-to-update-upgrade-nvidia-drivers.158523/
 
Update:
We made a large step towards working vGPUs with our A100.
Out of curiosity we just tried it with the NVIDIA AI Enterprise drivers from the NGC Catalog and BAAM...we got the nvidia folder with creatable_vgpu_types, etc.

Output of nvidia-smi vgpu -c to display the creatable vgpu types is now:

Code:
GPU 00000000:07:00.0
    GRID A100X-8C

GPU 00000000:0B:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:48:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:4C:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:88:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:8B:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:C8:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

GPU 00000000:CB:00.0
    GRID A100X-4C 
    GRID A100X-5C 
    GRID A100X-8C 
    GRID A100X-10C
    GRID A100X-20C
    GRID A100X-40C

As you can see, for the first GPU we had already set the type ID for GRID A100X-8C (which must be something like 459) in current_vgpu_type of the first virtual function virtfn0 (which points to PCI ID 0000:07:00.4).
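For reference, this can be checked directly in sysfs; just a rough sketch with the paths from our host:
Code:
# virtfn0 of the first GPU is a symlink to the VF's PCI address
ls -l /sys/bus/pci/devices/0000:07:00.0/virtfn0
# shows the numeric vGPU type ID currently set on that VF (0 = none)
cat /sys/bus/pci/devices/0000:07:00.4/nvidia/current_vgpu_type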

But we couldn't start the VM configured with PCI device 0000:07:00.4 because it said:
Code:
error writing '0000:07:00.4' to '/sys/bus/pci/drivers/vfio-pci/bind': Invalid argument
TASK ERROR: Cannot bind 0000:07:00.4 to vfio

We tried another GPU (0000:0b:00), added PCI devices 0000:0b:00.4 and 0000:0b:00.5 with MDev Type nvidia-459 to the VM via the GUI and TADAAAA... it works!!!

After a reboot, the first GPU could also be configured via the GUI again.

We tested a few things and so far everything works correctly.

Can't believe it was the AI Enterprise drivers. It took us almost 5 days to figure this out.
 
Hello, I have the exact same issue with the same hardware. Are you referring to this driver? https://catalog.ngc.nvidia.com/orgs/nvidia/teams/vgpu/resources/vgpu-host-driver-5 I'm waiting for a trial from NVIDIA; if it is this one, could you send it over somehow?
 
Sorry for the late reply.

What drivers were you trying before?
We tried the normal vGPU drivers from the NVIDIA Licensing Portal (NLP) but they didn't work. They just brought us very close to the solution, which was really frustrating :rolleyes:

Hello, I have the exact same issue with the same hardware. Are you referring to this driver? https://catalog.ngc.nvidia.com/orgs/nvidia/teams/vgpu/resources/vgpu-host-driver-5 I'm waiting for a trial from NVIDIA; if it is this one, could you send it over somehow?
So sorry I didn't read your request earlier. Did you manage to get the free evaluation yet?
To be precise, we used the "vGPU Host Driver 4", which contains driver version 535.216.01. We haven't tried a newer version yet because we were so happy it worked and didn't want to bother with it any longer. But I would really appreciate it if you could test a newer driver version and give a short reply on whether it worked or not.
 
We're experiencing this error now:
Code:
error writing '461' to '/sys/bus/pci/devices/0000:07:00.4/nvidia/current_vgpu_type': Invalid argument
TASK ERROR: could not set vgpu type to '461' for '0000:07:00.4'

Last time we could just reboot the whole hypervisor and it worked again, but what if we have to change something in a production environment?
It seems like this happens when a GPU was configured as a vGPU beforehand and then disconnected from a VM. The vGPU type in current_vgpu_type is reset to 0 correctly and the creatable_vgpu_types file shows all possible types again. But current_vgpu_type is just not writable:
Code:
/sys/bus/pci/devices/0000:07:00.4/nvidia# ls -al
total 0
drwxr-xr-x 2 root root    0 Dec 19 09:45 .
drwxr-xr-x 6 root root    0 Dec 19 09:43 ..
-r--r--r-- 1 root root 4096 Dec 19 10:41 creatable_vgpu_types
-rw-r--r-- 1 root root 4096 Dec 19 15:52 current_vgpu_type
-rw-r--r-- 1 root root 4096 Dec 19 12:23 vgpu_params

Here it seems the file is writable, but:
[screenshot]
nano editor says [ Error writing lock file ./.current_vgpu_type.swp: Permission denied ] .
And if we just try to edit the file with the editor and want to save it, it says [ Error writing current_vgpu_type: Invalid argument ].

Do you know what we could try to restart/reset so we could assign a vGPU type to a GPU again without needing to restart the hypervisor?

Edit: copied the wrong error message

Edit2: Figured it out!!! Seems like disabling and enabling the Virtual Functions with /usr/lib/nvidia/sriov-manage did the trick.
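In case someone else runs into it, the reset was roughly the following (only a sketch; adjust the PCI address of the physical GPU, or use ALL):
Code:
# disable and re-enable the virtual functions of the affected GPU
/usr/lib/nvidia/sriov-manage -d 0000:07:00.0
/usr/lib/nvidia/sriov-manage -e 0000:07:00.0
# afterwards current_vgpu_type of its VFs accepted a new type again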
 
You shouldn’t use a text editor; it is not a real file or file system. Use echo and a redirect if you want to do it manually.
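for example, something like this (a sketch, using the type ID and VF path from the posts above):
Code:
# write the numeric vGPU type ID to the VF
echo 459 > /sys/bus/pci/devices/0000:07:00.4/nvidia/current_vgpu_type
# writing 0 should clear it again
echo 0 > /sys/bus/pci/devices/0000:07:00.4/nvidia/current_vgpu_type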
 
Dear all,
NVIDIA 16.8 vGPU driver installed on Proxmox 8.3.2 with Tesla P4 GPUs!

Linux pve 6.8.12-5-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-5 (2024-12-03T10:26Z) x86_64

Code:
root@pve:~# nvidia-smi
Thu Jan 16 14:11:05 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: N/A      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       On  | 00000000:05:00.0 Off |                    0 |
| N/A   35C    P8             10W /   75W |     31MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P4                       On  | 00000000:0B:00.0 Off |                    0 |
| N/A   36C    P8             10W /   75W |     31MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla P4                       On  | 00000000:84:00.0 Off |                    0 |
| N/A   35C    P8             10W /   75W |     31MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla P4                       On  | 00000000:88:00.0 Off |                    0 |
| N/A   37C    P8             10W /   75W |     31MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=========================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
 
Hello again,

after a few days of working with the vGPUs in our VMs we came across another weird problem:
We can easily assign 4 whole GPUs in the form of 40 GB vGPUs, but if we want to attach more than 4 GPUs in total, e.g. 5 half GPUs in the form of 20 GB vGPUs, we suddenly get network issues, like the network interfaces of the VM no longer connecting to our networks. This doesn't seem to be random: as soon as more than 4 vGPUs are assigned to the VM, the VM refuses to connect to the network.

Has anyone of you ever experienced such behavior?

Edit: after a bit more researching and testing, we found the solution, so I want to share it with you. We solved it by simply leaving the PCI Express option blank when creating the PCI device, so there may be a problem with the number of PCIe lanes and our CPU. Changing to plain PCI allows us to add more than 4 vGPUs to our VMs (see the config sketch below). The Proxmox documentation says the following about the PCI Express option:
[...] This does not mean that PCIe capable devices that are passed through as PCI devices will only run at PCI speeds. Passing through devices as PCIe just sets a flag for the guest to tell it that the device is a PCIe device instead of a "really fast legacy PCI device". Some guest applications benefit from this.
So changing it to plain PCI doesn't seem to be a big deal.
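For illustration, the difference ends up as the pcie flag on the hostpci line of the VM config; the address and mdev type below are just the examples from this thread, so treat it as a sketch and not the exact line the GUI writes on every setup:
Code:
# PCI Express ticked (we could only attach 4 vGPUs this way)
hostpci0: 0000:0b:00.4,mdev=nvidia-459,pcie=1
# PCI Express left blank (plain PCI, more than 4 vGPUs work)
hostpci0: 0000:0b:00.4,mdev=nvidia-459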
 
Hi,

We installed an NVIDIA A16 GPU in one of our Proxmox hosts running Proxmox 8.2.7 and kernel 6.8.12-8-pve.
Installing the driver, NVIDIA-Linux-x86_64-550.144.02-vgpu-kvm.run, works out of the box and the GPU is shown with nvidia-smi.
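For completeness, the install itself was roughly the usual .run procedure (a sketch; package names may differ on your setup):
Code:
# kernel headers are needed so the installer can build the module via DKMS
apt install proxmox-default-headers
chmod +x NVIDIA-Linux-x86_64-550.144.02-vgpu-kvm.run
./NVIDIA-Linux-x86_64-550.144.02-vgpu-kvm.run --dkms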

We have enabled SR-IOV, IOMMU and ARI as described in the NVIDIA documentation.
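The IOMMU part is just the usual kernel command line change (a sketch, assuming an Intel host booted via GRUB; SR-IOV and ARI were BIOS/UEFI settings in our case):
Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# then apply and reboot
update-grub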
Output of nvidia-smi:
Code:
Wed Feb 26 08:59:00 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.02             Driver Version: 550.144.02     CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A16                     On  |   00000000:45:00.0 Off |                  Off |
|  0%   44C    P8             16W /   62W |       0MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A16                     On  |   00000000:46:00.0 Off |                  Off |
|  0%   46C    P8             15W /   62W |       0MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A16                     On  |   00000000:47:00.0 Off |                  Off |
|  0%   40C    P8             15W /   62W |       0MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A16                     On  |   00000000:48:00.0 Off |                  Off |
|  0%   37C    P8             15W /   62W |       0MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

We have a service running that enables the VFs:
Code:
[Unit]
Description=Enable NVIDIA SR-IOV
After=network.target nvidia-vgpud.service nvidia-vgpu-mgr.service
Before=pve-guests.service

[Service]
Type=oneshot
ExecStartPre=/bin/sleep 5
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL
#ExecStart=/usr/bin/nvidia-smi vgpu -shm 1 #Enable this if you want to use multiple vGPU profiles on the same card.

[Install]
WantedBy=multi-user.target
We see the VFs and mdev types being created in the journalctl -b | grep nvidia-vgpud:
Code:
Feb 26 08:36:26 <hostname> nvidia-vgpud[3404]: Global settings:
Feb 26 08:36:26 <hostname> nvidia-vgpud[3404]: Size: 16
Feb 26 08:36:26 <hostname> nvidia-vgpud[3404]: Homogeneous vGPUs: 1
Feb 26 08:36:26 <hostname> nvidia-vgpud[3404]: vGPU types: 492
Feb 26 08:36:26 <hostname> nvidia-vgpud[3404]:
Feb 26 08:36:26 <hostname> nvidia-vgpud[3404]: pciId of gpu [0]: 0:45:0:0
Feb 26 08:36:26 <hostname> nvidia-vgpud[3404]: pciId of gpu [1]: 0:46:0:0
Feb 26 08:36:26 <hostname> nvidia-vgpud[3404]: pciId of gpu [2]: 0:47:0:0
Feb 26 08:36:26 <hostname> nvidia-vgpud[3404]: pciId of gpu [3]: 0:48:0:0
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]:
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]: Physical GPU:
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]: PciID: 0x0000 / 0x0045 / 0x0000 / 0x0000
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]: Size: 56
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]: DevID: 0x10de / 0x25b6 / 0x10de / 0x14a9
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]: Supported vGPUs count: 12
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]: Fractional Multivgpu supported: 0x1
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]:
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]:
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]: Supported VGPU 0x2c5: max 16
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]:
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]: VGPU Type 0x2c5: NVIDIA A16-1B Class: NVS
Feb 26 08:36:30 <hostname> nvidia-vgpud[3404]: DevId: 0x10de / 0x25b6 / 0x10de / 0x159d
etc..etc..

For each card VFs are created:
Code:
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn0 -> ../0000:45:00.4
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn1 -> ../0000:45:00.5
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn10 -> ../0000:45:01.6
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn11 -> ../0000:45:01.7
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn12 -> ../0000:45:02.0
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn13 -> ../0000:45:02.1
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn14 -> ../0000:45:02.2
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn15 -> ../0000:45:02.3
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn2 -> ../0000:45:00.6
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn3 -> ../0000:45:00.7
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn4 -> ../0000:45:01.0
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn5 -> ../0000:45:01.1
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn6 -> ../0000:45:01.2
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn7 -> ../0000:45:01.3
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn8 -> ../0000:45:01.4
lrwxrwxrwx 1 root root           0 Feb 26 08:45 virtfn9 -> ../0000:45:01.5

lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn0 -> ../0000:46:00.4
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn1 -> ../0000:46:00.5
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn10 -> ../0000:46:01.6
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn11 -> ../0000:46:01.7
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn12 -> ../0000:46:02.0
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn13 -> ../0000:46:02.1
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn14 -> ../0000:46:02.2
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn15 -> ../0000:46:02.3
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn2 -> ../0000:46:00.6
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn3 -> ../0000:46:00.7
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn4 -> ../0000:46:01.0
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn5 -> ../0000:46:01.1
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn6 -> ../0000:46:01.2
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn7 -> ../0000:46:01.3
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn8 -> ../0000:46:01.4
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn9 -> ../0000:46:01.5

lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn0 -> ../0000:47:00.4
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn1 -> ../0000:47:00.5
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn10 -> ../0000:47:01.6
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn11 -> ../0000:47:01.7
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn12 -> ../0000:47:02.0
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn13 -> ../0000:47:02.1
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn14 -> ../0000:47:02.2
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn15 -> ../0000:47:02.3
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn2 -> ../0000:47:00.6
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn3 -> ../0000:47:00.7
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn4 -> ../0000:47:01.0
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn5 -> ../0000:47:01.1
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn6 -> ../0000:47:01.2
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn7 -> ../0000:47:01.3
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn8 -> ../0000:47:01.4
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn9 -> ../0000:47:01.5

lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn0 -> ../0000:48:00.4
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn1 -> ../0000:48:00.5
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn10 -> ../0000:48:01.6
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn11 -> ../0000:48:01.7
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn12 -> ../0000:48:02.0
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn13 -> ../0000:48:02.1
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn14 -> ../0000:48:02.2
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn15 -> ../0000:48:02.3
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn2 -> ../0000:48:00.6
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn3 -> ../0000:48:00.7
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn4 -> ../0000:48:01.0
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn5 -> ../0000:48:01.1
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn6 -> ../0000:48:01.2
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn7 -> ../0000:48:01.3
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn8 -> ../0000:48:01.4
lrwxrwxrwx 1 root root           0 Feb 26 08:58 virtfn9 -> ../0000:48:01.5

We also see all the profiles when running nvidia-smi vgpu -c:
Code:
GPU 00000000:45:00.0
    NVIDIA A16-1B
    NVIDIA A16-2B
    NVIDIA A16-1Q
    NVIDIA A16-2Q
    NVIDIA A16-4Q
    NVIDIA A16-8Q
    NVIDIA A16-16Q
    NVIDIA A16-1A
    NVIDIA A16-2A
    NVIDIA A16-4A
    NVIDIA A16-8A
    NVIDIA A16-16A

GPU 00000000:46:00.0
    NVIDIA A16-1B
    NVIDIA A16-2B
    NVIDIA A16-1Q
    NVIDIA A16-2Q
    NVIDIA A16-4Q
    NVIDIA A16-8Q
    NVIDIA A16-16Q
    NVIDIA A16-1A
    NVIDIA A16-2A
    NVIDIA A16-4A
    NVIDIA A16-8A
    NVIDIA A16-16A

GPU 00000000:47:00.0
    NVIDIA A16-1B
    NVIDIA A16-2B
    NVIDIA A16-1Q
    NVIDIA A16-2Q
    NVIDIA A16-4Q
    NVIDIA A16-8Q
    NVIDIA A16-16Q
    NVIDIA A16-1A
    NVIDIA A16-2A
    NVIDIA A16-4A
    NVIDIA A16-8A
    NVIDIA A16-16A

GPU 00000000:48:00.0
    NVIDIA A16-1B
    NVIDIA A16-2B
    NVIDIA A16-1Q
    NVIDIA A16-2Q
    NVIDIA A16-4Q
    NVIDIA A16-8Q
    NVIDIA A16-16Q
    NVIDIA A16-1A
    NVIDIA A16-2A
    NVIDIA A16-4A
    NVIDIA A16-8A
    NVIDIA A16-16A
When assigning a PCI device to a VM and pressing start, we get the following error:
TASK ERROR: Cannot bind 0000:45:00.4 to vfio

The weird thing is that ls /sys/class/mdev_bus/ is also empty; should this be empty?
We have been troubleshooting this for the last few days and we don't know where to look anymore.
Is there anyone who could assist us?
 
We have been troubleshooting this for the last few days and we don't know where to look anymore.
Is there anyone who could assist us?
can you post the output of pveversion -v? for the newer behavior in the nvidia driver for kernel 6.8+ you need at least qemu-server >= 8.2.6
also can you post your vm config? did you use resource mapping for the passthrough?

the error

TASK ERROR: Cannot bind 0000:45:00.4 to vfio
only happens when we try to rebind the card to the vfio driver (which we don't do for nvidia vgpus with current packages)

The weird thing is that ls /sys/class/mdev_bus/ is also empty; should this be empty?
with the kernel 6.8+ and newer nvidia drivers, the kernel interface changed and it will not use the generic mdevs anymore (although we still call them that, for backwards compatibility)
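to see what the driver exposes for a VF you can look at its sysfs directory directly, e.g. (just a sketch, using the VF from your error message):
Code:
# profiles that can currently be created on this VF
cat /sys/bus/pci/devices/0000:45:00.4/nvidia/creatable_vgpu_types
# which driver the VF is currently bound to
readlink /sys/bus/pci/devices/0000:45:00.4/driver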
 
Hi Grexo,
we had a similar issue with the same error message. You can see it in this post.
just like dcsapak said, rebinding a card that was already bound was the problem.

At first we manually wrote the vGPU type to the file, which led to this kind of problem, but when we configure it via the GUI it works.
 
can you post the output of pveversion -v? for the newer behavior in the nvidia driver for kernel 6.8+ you need at least qemu-server >= 8.2.6
also can you post your vm config? did you use resource mapping for the passthrough?
Thanks for the quick response!
Output of pveversion -v:
Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve: 6.8.12-4
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-1-pve: 6.5.13-1
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.0-1
proxmox-backup-file-restore: 3.2.0-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.1
pve-cluster: 8.0.6
pve-container: 5.0.10
pve-docs: 8.2.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
We will update the qemu-server.
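Roughly like this, assuming the standard Proxmox repositories (just a sketch):
Code:
apt update
apt install qemu-server
# or bring the whole node up to date
apt full-upgrade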

also can you post your vm config? did you use resource mapping for the passthrough?
Yes, here it is. We do not use resource mappings atm.
Code:
#{
#"customerId": "01",
#"os": "W"
#}
agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: order=scsi0;ide0
cores: 4
cpu: host
cpuunits: 1000
efidisk0: nlhaa01-nvme01:vm-4760-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:45:00.4,pcie=1
ide0: none,media=cdrom
ide2: nlhaa01-nvme01:vm-4760-cloudinit,media=cdrom,size=4M
ipconfig0: ip=<HIDDEN>/24,gw=<HIDDEN>
machine: pc-q35-7.2
memory: 8192
meta: creation-qemu=7.2.0,ctime=1702549892
name: <HIDDEN>
nameserver:  <HIDDEN>
net0: virtio= <HIDDEN>,bridge=vmbr1013
numa: 1
ostype: win11
scsi0: nlhaa01-nvme01:vm-4760-disk-1,discard=on,size=80G,ssd=1
scsi1: nlhaa01-nvme01:vm-4760-disk-3,backup=0,discard=on,size=25G,ssd=1
scsihw: virtio-scsi-pci
serial1: socket
smbios1: uuid= <HIDDEN>
sockets: 2
tpmstate0: nlhaa01-nvme01:vm-4760-disk-2,size=4M,version=v2.0
vcpus: 8
vmgenid:  <HIDDEN>
 
After updating qemu-server to 8.3.8, it all works.

EDIT: Most of our VMs work normally. The ones that work like they should are the Windows 11 22H2 VMs. Every Windows 11 24H2 VM crashes when adding a vGPU. Is anyone else experiencing this?
 
After updating qemu-server to 8.3.8, it all works.

EDIT: Most of our VMs work normally. The ones that work like they should are the Windows 11 22H2 VMs. Every Windows 11 24H2 VM crashes when adding a vGPU. Is anyone else experiencing this?
how/when does it crash? bluescreen? on boot? when you do something intensive?

can you check the host syslog/journal of that time? maybe the windows eventviewer in the guest?
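e.g. something like this on the host, around the time of the crash (just a sketch):
Code:
journalctl -b | grep -iE 'nvidia|vfio|vgpu'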
 
how/when does it crash? bluescreen? on boot? when you do something intensive?

can you check the host syslog/journal of that time? maybe the windows eventviewer in the guest?
Hi,

It's always the same blue screen, but it doesn't happen at the same moment; it crashes randomly.
But the BSOD we get is always the same:
[screenshot of the BSOD]
We think it has something to do with the NVIDIA driver.

Kind regards,
Graxo
 
can you post the vm config? did you try different cpu models?
 
We installed NVIDIA Virtual GPU Software v18.0 yesterday (it went really smoothly) on one of our T4 GPU nodes and will continue testing; if this is successful we will continue to our A16 and A40 GPU nodes and Windows 24H2.

EDIT: After updating some Windows 11 VMs to the new NVIDIA guest driver (572.60), with the CPU type `host` they get a BSOD with the error `BAD POOL HEADER`. When changing the CPU to `x86-64-v2-AES` they boot, but then WSL doesn't work anymore. I'm still troubleshooting this.
 
EDIT: After updating some Windows 11 VMs to the new NVIDIA guest driver (572.60), with the CPU type `host` they get a BSOD with the error `BAD POOL HEADER`. When changing the CPU to `x86-64-v2-AES` they boot, but then WSL doesn't work anymore. I'm still troubleshooting this.
are you sure this has something to do with the guest drivers and not e.g. an update of windows itself?