vGPU with nVIDIA on Kernel 6.8

Hi there,

we have been pulling our hair out (all that's available) as we just cannot get our NVIDIA A30 GPUs to work any more since moving to the latest kernel (6.8.12-1-pve). All steps, documents and latest drivers have been implemented, and on our hosts we can see the GPUs, but as soon as we add the GPU via the CLI to a VM (Ubuntu-only VMs) we get this:

kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:44:00.4: vfio 0000:44:00.4: group 28 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver.
TASK ERROR: start failed: QEMU exited with code 1

We even tried changing from "-device vfio-pci" to "-device nvidia" and get this:
kvm: -device nvidia,sysfsdev=/sys/bus/pci/devices/0000:44:00.4: 'nvidia' is not a valid device model name
TASK ERROR: start failed: QEMU exited with code 1
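For reference, "group N is not viable" means other functions in the same IOMMU group are still bound to a non-VFIO driver. A quick way to see what shares the VF's group (a sketch; 0000:44:00.4 is the address from the error above, adjust to your setup):

```shell
# List every PCI function in the same IOMMU group as a given device,
# plus the driver each one is bound to. "group is not viable" means at
# least one of them is not bound to vfio-pci / the vendor VFIO driver.
dev="0000:44:00.4"   # VF address from the error message; adjust as needed
link="/sys/bus/pci/devices/$dev/iommu_group"
if [ -e "$link" ]; then
    group=$(basename "$(readlink -f "$link")")
    echo "IOMMU group $group:"
    for d in "/sys/kernel/iommu_groups/$group/devices/"*; do
        if [ -e "$d/driver" ]; then
            drv=$(basename "$(readlink -f "$d/driver")")
        else
            drv=none
        fi
        printf '  %s  driver=%s\n' "$(basename "$d")" "$drv"
    done
else
    echo "no such device on this host: $dev"
fi
```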

Running "lspci -d 10de: -k" we get this:


44:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:00.4 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:00.5 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:00.6 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:00.7 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:01.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:01.1 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:01.2 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:01.3 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:00.4 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:00.5 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:00.6 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:00.7 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:01.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:01.1 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:01.2 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:01.3 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
    Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia

Running nvidia-smi we see this:



+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     On  |   00000000:44:00.0 Off |                  Off |
| N/A   30C    P0             29W /  165W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A30                     On  |   00000000:C4:00.0 Off |                    0 |
| N/A   31C    P0             32W /  165W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
but one can see the GPUs are there...
sriov-manage also states that all is good:
GPU at 0000:44:00.0 already has VFs enabled.
GPU at 0000:c4:00.0 already has VFs enabled.

In /etc/modprobe.d/blacklist.conf we have the following set:
blacklist nouveau
blacklist nvidia

Where are we going wrong?
Any tips or advice would be awesome.

Thanks in advance.
 
Hi.

It seems you are using vGPU driver version 17, and the A30 seems to be supported only on vGPU 15.
See here: https://docs.nvidia.com/vgpu/gpus-supported-by-vgpu.html
 
@Boysa22 thanks for the reply, however I am not too sure that is correct.
The A30 is managed by the NVIDIA AI Enterprise software package, and for the 6.8.x kernel this is the correct and current version.
100%.
The earlier versions are not supported by the OS; I believe this is also highlighted in previous threads.
Nevertheless, here is the product support guide, where you can see that the A30 is supported as a vGPU.
 

Attachments

  • 550.90.05-550.90.07-552.74-nvidia-ai-enterprise-product-support-matrix.pdf (270.6 KB)
FYI: I posted a patch series to our devel mailing list, so feel free to test it (if you want/can):

https://lists.proxmox.com/pipermail/pve-devel/2024-August/065046.html
Any chance you can shed some light on what the status of an official solution is? I know guruevi made that neat little hook script and you made those patches in the mailing list.

It would be nice to see an official solution that doesn't require patching or customization to work- especially for anyone on the enterprise side.
 
For the patches to be included they must be reviewed first (and possibly a second/third round if the reviewer finds issues). This can take a while sometimes, especially if there is much to do or people are on vacation, etc.
I'll see if I can nudge a colleague to review them sooner, though.
 
Awesome, thank you for all the help.

VNC works on my setup, my display does not get treated as 2 displays, but as 1 (A40/L40 Q-series vGPU). There is an "xvga" option to make the GPU an output at boot time, at which point only the UEFI type stuff will be displayed, although nVIDIA does have accelerated x11vnc etc.

Make sure you install the VirtIO drivers and the nVIDIA OpenGL patch for RDP on Windows. I'm playing around with Sunshine for a VDI streaming solution.

@AbsolutelyFree: https://docs.nvidia.com/vgpu/index.html - the R470 and R535 branches still are available. I don't know whether they've been updated for 6.8 yet since Ubuntu LTS is still on 6.5.

It is "safe" to pin 6.5 for a while depending on your security profile and features you want from the 6.8 kernel. Not sure if Canonical backports things into 6.5 for LTS.
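Pinning the older kernel on a Proxmox host can be sketched like this (the exact version string below is an example, not taken from this thread; list what's actually installed first):

```shell
# Keep booting a 6.5 kernel until the vGPU driver supports 6.8.
pin_ver="6.5.13-5-pve"   # example version; pick one from the `kernel list` output
if command -v proxmox-boot-tool >/dev/null 2>&1; then
    proxmox-boot-tool kernel list        # show installed kernels
    proxmox-boot-tool kernel pin "$pin_ver"
else
    echo "proxmox-boot-tool not found; not a Proxmox host"
fi
```

`proxmox-boot-tool kernel unpin` reverts to the default once a fixed driver is available.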
Has anyone found a way to get a second VGX virtual display? Most of the documentation I've seen suggests you change the vGPU profile to one which supports multiple displays, and I've seen documentation to suggest this can be done with vGPU_Unlock-rs custom profiles but have yet to give it a try.

Also of note, in my testing, I found that Parsec generally works well by default as long as the vGPU is licensed when you try to connect. Parsec's virtual monitor also works well and isn't noticeably different from the base VGX virtual display
 
For people dealing with crashes:
I ran into a similar problem on Linux and called nVIDIA support, and found out that the previous driver was conflicting with the new one. Uninstall the driver completely (on Linux with apt purge, not just remove). For Windows, make sure you don't have drivers "accidentally" installed by Windows Update. I also found you have to use a current QEMU machine model for Windows (I selected 8.1).

As far as the question on NUMA: you should set the number of sockets equal to those in your server, and QEMU then apparently places your memory on the right node IF there is free memory on that NUMA node. So if you have many guests all using memory on one CPU, or allocating GPUs, your guest may not get memory on the CPU the GPU is attached to despite apparently having enough, which causes all traffic to run over the QuickPath Interconnect.
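Which node a guest should sit on is easiest to read from the PCI topology; a small sketch (0000:44:00.0 is a PF address from earlier in the thread, substitute your own):

```shell
# Show which NUMA node a GPU hangs off; -1 means the platform exposes
# no affinity for it. Guests using that GPU should allocate memory there.
gpu="0000:44:00.0"   # PF address from earlier in the thread; adjust
node_file="/sys/bus/pci/devices/$gpu/numa_node"
if [ -r "$node_file" ]; then
    echo "GPU $gpu is attached to NUMA node $(cat "$node_file")"
else
    echo "no such device on this host: $gpu"
fi
```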

For Danny: please look at the earlier posts and hook scripts on how to allocate a GPU on the new kernel. It currently does not have a GUI solution.

For multiple displays: most OS can render multiple virtual displays on 1 GPU, RDP can do it. Not sure what you are trying to do, I have systems with multiple vGPU attached (2xA40-48Q) for some high end workloads.
 
That's a question you have to ask NVIDIA. I'm not sure what their current policy regarding LTS + newer kernels is, but only they can update the older branch of the driver for newer kernels...
Are there any ongoing discussions with NVIDIA regarding potential support for Proxmox as a hypervisor in the near future? We have heard that this may occur in 2025; is there any truth to this?
 
NVIDIA supports the generic Linux + KVM hypervisor. I haven’t had issues with support yet in this model, they will help you troubleshoot. The kernels Proxmox uses in the enterprise repo are functional with the current supported drivers. Their older branches get updates too and can be downloaded from their website.
 
I have the same problem as @just-danny.
Hardware is a Gigabyte G292-Z20 with a single L40S.

Code:
root@PVE-S02:~# qm start 111
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:c4:01.1: vfio 0000:c4:01.1: group 0 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver.
stopping swtpm instance (pid 5196) due to QEMU startup error
start failed: QEMU exited with code 1

Here are some details;
Code:
dmesg | grep -e DMAR -e IOMMU
[    3.621726] pci 0000:c0:00.2: AMD-Vi: IOMMU performance counters supported
[    3.623993] pci 0000:80:00.2: AMD-Vi: IOMMU performance counters supported
[    3.625446] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    3.626636] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    3.629423] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    3.629441] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    3.629459] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[    3.629474] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
Code:
nvidia-smi
Tue Sep  3 21:27:35 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:C4:00.0 Off |                    0 |
| N/A   28C    P8             36W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Code:
nvidia-sriov.service
[Unit]
Description=Enable NVIDIA SR-IOV
After=network.target nvidia-vgpud.service nvidia-vgpu-mgr.service
Before=pve-guests.service

[Service]
Type=oneshot
ExecStartPre=/bin/sleep 5
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL
ExecStart=/usr/bin/nvidia-smi vgpu -shm 1

[Install]
WantedBy=multi-user.target
root@PVE-S02:~# /usr/lib/nvidia/sriov-manage -e ALL
GPU at 0000:c4:00.0 already has VFs enabled.
root@PVE-S02:~# /usr/bin/nvidia-smi vgpu -shm 1
Enabled vGPU heterogeneous mode for GPU 00000000:C4:00.0


I am using the script from @guruevi, the general NVIDIA-GRID-17.3 installer, and the details from @polloloco.
I have disabled the scripts, enabled the scripts, and gone back and forth a lot... so now I am almost giving up...
Code:
111.conf
agent: 1
args: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:c4:01.1 -uuid 4f403d25-3ee7-42af-a713-6dc73eeda8fc
bios: ovmf
boot: order=scsi0
cores: 64
cpu: x86-64-v3
efidisk0: STORAGE:111/vm-111-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hookscript: local:snippets/nvidia_allocator.py
machine: pc-q35-9.0,viommu=virtio
memory: 774144
meta: creation-qemu=9.0.0,ctime=1721219135
name: vfx01.ad.out-post.tv
net0: virtio=BC:24:11:F5:0B:3B,bridge=vmbr0
numa: 0
ostype: win11
scsi0: STORAGE:111/vm-111-disk-2.qcow2,cache=writeback,iothread=1,size=512G
scsihw: virtio-scsi-single
smbios1: uuid=4f403d25-3ee7-42af-a713-6dc73eeda8fc
sockets: 1
tags: nvidia-1155
tpmstate0: STORAGE:111/vm-111-disk-0.raw,size=4M,version=v2.0
vga: none
vmgenid: b4a288d9-de32-4c61-82dd-a7c8c2be7549
 
Seems to be AMD specific, see here: https://forum.proxmox.com/threads/gpu-passthrough.97291/

Try pcie_acs_override=downstream,multifunction alongside amd_iommu=on in your GRUB config, and your UEFI should also have an ACS toggle (see also here: https://forum.level1techs.com/t/using-acs-to-passthrough-devices-without-whole-iommu-group/122913/6)

Basically, the error indicates that your motherboard puts multiple devices in the same IOMMU group, so all devices in group 0 must be passed through together. You can try moving the card to another PCIe slot, but this is more an issue of the motherboard/CPU design.
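On a host that boots via GRUB rather than ZFS/UEFI, the same override would go into /etc/default/grub (a sketch assuming an otherwise default file; run update-grub and reboot afterwards):

```shell
# /etc/default/grub (fragment) -- ACS override plus IOMMU passthrough mode
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt pcie_acs_override=downstream,multifunction"
```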
 
Thanks, that was a success!
So since I run ZFS and UEFI, I have now added the details in /etc/kernel/cmdline .

Code:
root=ZFS=rpool/ROOT/pve-1 boot=zfs iommu=pt pcie_acs_override=downstream,multifunction

Then I ran:
Code:
proxmox-boot-tool refresh
reboot


Looks like this workaround basically gave each device its own IOMMU group; I verified it all with this command.

Code:
for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nnks "${d##*/}"; done

For example;
Code:
IOMMU group 97 c4:01.0 3D controller [0302]: NVIDIA Corporation AD102GL [L40S] [10de:26b9] (rev a1)
    Subsystem: NVIDIA Corporation AD102GL [L40S] [10de:0000]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
IOMMU group 98 c4:01.1 3D controller [0302]: NVIDIA Corporation AD102GL [L40S] [10de:26b9] (rev a1)
    Subsystem: NVIDIA Corporation AD102GL [L40S] [10de:0000]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
IOMMU group 99 c4:01.2 3D controller [0302]: NVIDIA Corporation AD102GL [L40S] [10de:26b9] (rev a1)
    Subsystem: NVIDIA Corporation AD102GL [L40S] [10de:0000]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia

When I then started using your script @guruevi I did the following;

Code:
/var/lib/vz/snippets/nvidia_allocator.py 111 get_command -24Q

It gave an output, which I then applied:

qm set 111 --hookscript local:snippets/nvidia_allocator.py
qm set 111 --args "-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:c4:04.3 -uuid 4f403d25-3ee7-42af-a713-6dc73eeda8fc"
qm set 111 --tags "nvidia-1155"


Now the VM is running, and I'll do some heavy-duty tasks on it to see where it all ends :D
Thanks a bunch!
 
Hey guys!

I tried the fix from @guruevi but I cannot find the folder called 'nvidia' or the 'creatable_vgpu_types' file. Any idea what might be wrong with my setup?

Commands I tried:

Bash:
cd /sys/bus/pci/devices/domain\:bus\:vf-slot.v-function/nvidia
cat creatable_vgpu_types
echo <TYPE> > current_vgpu_type
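Those placeholder paths only exist once VFs are enabled; a concrete sketch of the same check (0000:02:00.4 is a hypothetical first VF address for illustration, not taken from the docs):

```shell
# Each enabled VF gets an "nvidia" directory in sysfs holding the vGPU type files.
vf="0000:02:00.4"   # hypothetical VF address; substitute your own
d="/sys/bus/pci/devices/$vf/nvidia"
if [ -d "$d" ]; then
    echo "creatable vGPU types for $vf:"
    cat "$d/creatable_vgpu_types"
else
    echo "no vGPU sysfs dir for $vf -- VFs probably not enabled yet"
fi
```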

This is what I have under the 0000:00:02.0 folder which is my nvidia gpu:

Bash:
root@pve-node1:/sys/bus/pci/devices/0000:00:02.0# ls
0000:00:02.0:pcie001          aer_rootport_total_err_nonfatal  d3cold_allowed   iommu_group     msi_bus      resource                vendor
0000:00:02.0:pcie010          ari_enabled                      device           irq             msi_irqs     revision                wakeup
0000:02:00.0                  broken_parity_status             dma_mask_bits    link            numa_node    secondary_bus_number
aer_dev_correctable           class                            driver           local_cpulist   pci_bus      subordinate_bus_number
aer_dev_fatal                 config                           driver_override  local_cpus      power        subsystem
aer_dev_nonfatal              consistent_dma_mask_bits         enable           max_link_speed  power_state  subsystem_device
aer_rootport_total_err_cor    current_link_speed               firmware_node    max_link_width  remove       subsystem_vendor
aer_rootport_total_err_fatal  current_link_width               iommu            modalias        rescan       uevent

nvidia-smi output:

Bash:
root@pve-node1:/sys/bus/pci/devices/0000:00:02.0# nvidia-smi
Sun Sep  8 21:48:42 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A5000               On  |   00000000:02:00.0 Off |                    0 |
| 30%   43C    P8             26W /  230W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Running kernel version: 6.8.12-1-pve

Thank you in advance for any tips!
 
It seems you did not enable the virtual functions (SR-IOV); see https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE for that part (this still applies even with the newer kernel+driver).
I had already enabled SR-IOV and performed all the necessary steps before (on kernel 6.5 with the older 535.54 driver or so), but now with kernel 6.8 and the latest NVIDIA vGPU driver (550.90.05) it does not seem like I can enable SR-IOV.

I have double-checked that SR-IOV is enabled in BIOS.

After I ran the commands from the documentation, I get this:

Bash:
root@pve-node1:~# /usr/lib/nvidia/sriov-manage -e 0000:02:00.0
Enabling VFs on 0000:02:00.0
Cannot obtain unbindLock for 0000:02:00.0
/usr/lib/nvidia/sriov-manage: line 90: echo: write error: Device or resource busy
root@pve-node1:~# lspci -d 10de:
02:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)

This is the same with the service that I already had created from before:

Bash:
root@pve-node1:~# systemctl status nvidia-sriov.service
○ nvidia-sriov.service - Enable NVIDIA SR-IOV
     Loaded: loaded (/etc/systemd/system/nvidia-sriov.service; enabled; preset: enabled)
     Active: inactive (dead) since Mon 2024-09-09 22:01:14 EEST; 26s ago
   Main PID: 888120 (code=exited, status=0/SUCCESS)
        CPU: 108ms

Sep 09 22:01:14 pve-node1 systemd[1]: Starting nvidia-sriov.service - Enable NVIDIA SR-IOV...
Sep 09 22:01:14 pve-node1 sriov-manage[888124]: Enabling VFs on 0000:02:00.0
Sep 09 22:01:14 pve-node1 sriov-manage[888124]: Cannot obtain unbindLock for 0000:02:00.0
Sep 09 22:01:14 pve-node1 sriov-manage[888124]: /usr/lib/nvidia/sriov-manage: line 90: echo: write>
Sep 09 22:01:14 pve-node1 systemd[1]: nvidia-sriov.service: Deactivated successfully.
Sep 09 22:01:14 pve-node1 systemd[1]: Finished nvidia-sriov.service - Enable NVIDIA SR-IOV.
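The unbindLock error usually means something still holds the physical function; a quick check of what's bound (a sketch; 0000:02:00.0 is the address from the output above):

```shell
# See which driver currently owns the PF that sriov-manage failed to unbind.
pf="0000:02:00.0"
drv_link="/sys/bus/pci/devices/$pf/driver"
if [ -e "$drv_link" ]; then
    echo "PF $pf is bound to: $(basename "$(readlink -f "$drv_link")")"
else
    echo "PF $pf has no driver bound (or no such device)"
fi
# Processes holding the device nodes would also block the unbind:
command -v fuser >/dev/null 2>&1 && fuser -v /dev/nvidia* 2>/dev/null || true
```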
 
It says device busy. Do you have the nouveau or the 'regular' nVIDIA drivers loaded? Also make sure you are not already sharing the GPU with a VM in some other way (passthrough etc.).

Make sure the card isn't in graphics (display) mode; it is a workstation card, so by default it may be.
https://developer.nvidia.com/displaymodeselector
 
Seems like everything is in order (see below):

Here it shows it is in VGPU mode with SR-IOV enabled:

Bash:
root@pve-node1:~# nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Tue Sep 10 10:48:18 2024
Driver Version                            : 550.90.05
CUDA Version                              : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU           : Supported

Attached GPUs                             : 1
GPU 00000000:02:00.0
    Product Name                          : NVIDIA RTX A5000
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : N/A
    vGPU Device Capability
        Fractional Multi-vGPU             : Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Supported
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1321721010432
    GPU UUID                              : GPU-1ef666f3-ec80-32fb-1775-e4180ac05996
    Minor Number                          : 0
    VBIOS Version                         : 94.02.6D.00.0B
    MultiGPU Board                        : No
    Board ID                              : 0x200
    Board Part Number                     : 900-5G132-2700-003
    GPU Part Number                       : 2231-850-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G132.0500.00.01
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
        vGPU Heterogeneous Mode           : Disabled
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    GSP Firmware Version                  : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x02
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x2
        Device Id                         : 0x223110DE
        Bus Id                            : 00000000:02:00.0
        Sub System Id                     : 0x147E17AA
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 30 %
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 23028 MiB
        Reserved                          : 312 MiB
        Used                              : 0 MiB
        Free                              : 22717 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 1 MiB
        Free                              : 32767 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
            SRAM Threshold Exceeded       : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : 0
            SRAM SM                       : 0
            SRAM Microcontroller          : 0
            SRAM PCIE                     : 0
            SRAM Other                    : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 40 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 90 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 26.43 W
        Current Power Limit               : 230.00 W
        Requested Power Limit             : 230.00 W
        Default Power Limit               : 230.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 230.00 W
    GPU Memory Power Readings
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1695 MHz
        Memory                            : 8001 MHz
    Default Applications Clocks
        Graphics                          : 1695 MHz
        Memory                            : 8001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 8001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 675.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes                             : None

The nouveau driver is blacklisted, and the output below shows the GPU is using the correct driver:


Bash:
root@pve-node1:~# cat /etc/modprobe.d/blacklist.conf
blacklist nouveau

root@pve-node1:~# cat /sys/module/nvidia/version
550.90.05

root@pve-node1:~# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.90.05  Mon May 27 14:20:18 UTC 2024
GCC version:  gcc version 12.2.0 (Debian 12.2.0-14)

root@pve-node1:~# nvidia-smi
Tue Sep 10 10:55:51 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A5000               On  |   00000000:02:00.0 Off |                    0 |
| 30%   40C    P8             26W /  230W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
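A quick way to confirm which kernel driver each function of the card is actually bound to is to read the sysfs `driver` symlinks directly. A minimal sketch (`show_drivers` is a hypothetical helper; the `0000:02:00` prefix is this card's bus id):

```shell
# Print "device -> driver" for every PCI function whose name starts with a
# given prefix, by resolving each device's sysfs driver symlink.
show_drivers() {
  base=$1 prefix=$2
  for dev in "$base/$prefix"*; do
    [ -e "$dev/driver" ] || continue
    printf '%s -> %s\n' "$(basename "$dev")" "$(basename "$(readlink -f "$dev/driver")")"
  done
}

# real use: show_drivers /sys/bus/pci/devices 0000:02:00
```

For vGPU with SR-IOV, the physical function should show `nvidia` here, while the virtual functions handed to VMs are driven through the vendor's vfio integration.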
 
It says display mode enabled. You can see details on modules with lsmod and details on PCIe with lspci. Once SR-IOV is enabled, you should see many (virtual) PCIe cards.
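Those virtual functions can be checked by counting the `virtfn*` links that sysfs creates under the physical device once SR-IOV is on. A sketch (`vf_count` is a hypothetical helper; `0000:01:00.0` is an example bus id):

```shell
# Count SR-IOV virtual functions exposed under a physical device's sysfs node.
# After sriov-manage -e succeeds, this should be non-zero.
vf_count() {
  ls -d "$1"/virtfn* 2>/dev/null | wc -l
}

# real use: vf_count /sys/bus/pci/devices/0000:01:00.0
```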
 
@johndoe1029 I have the same GPU as you, so I'm posting my nvidia-smi output here for your reference:
Code:
==============NVSMI LOG==============

Timestamp                                 : Tue Sep 10 21:34:07 2024
Driver Version                            : 550.90.05
CUDA Version                              : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU           : Supported

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA RTX A5000
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : N/A
    vGPU Device Capability
        Fractional Multi-vGPU             : Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Supported
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1320422014495
    GPU UUID                              : GPU-7b367052-db19-ca8d-5a60-40ef1844ed2f
    Minor Number                          : 0
    VBIOS Version                         : 94.02.6D.00.0E
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : 900-5G132-0100-001
    GPU Part Number                       : 2231-850-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G132.0500.00.01
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
        vGPU Heterogeneous Mode           : Enabled
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    GSP Firmware Version                  : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x2
        Device Id                         : 0x223110DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x147E1028
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 41 %
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 24564 MiB
        Reserved                          : 321 MiB
        Used                              : 13953 MiB
        Free                              : 10291 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 1 MiB
        Free                              : 32767 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 69 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 90 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 111.87 W
        Current Power Limit               : 230.00 W
        Requested Power Limit             : 230.00 W
        Default Power Limit               : 230.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 230.00 W
    GPU Memory Power Readings
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 1695 MHz
        SM                                : 1695 MHz
        Memory                            : 8000 MHz
        Video                             : 1485 MHz
    Applications Clocks
        Graphics                          : 1695 MHz
        Memory                            : 8001 MHz
    Default Applications Clocks
        Graphics                          : 1695 MHz
        Memory                            : 8001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 8001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 918.750 mV
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 13547
            Type                          : C+G
            Name                          : vgpu
            Used GPU Memory               : 1984 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 13584
            Type                          : C+G
            Name                          : vgpu
            Used GPU Memory               : 11968 MiB

From my experience, if something is not working, it is best to start from the beginning. So I would run the nvidia-uninstall command and start anew after a reboot; I hope that is possible in your case. The script I am using to enable vGPU is very basic. I also needed to add the SR-IOV options to GRUB, even though the BIOS has it enabled. Maybe check https://pve.proxmox.com/wiki/PCI(e)_Passthrough and make sure you have intel_iommu=on on the kernel command line, and that the correct kernel modules are loaded. This is how I enable vGPU:
Code:
bus=$(nvidia-smi -q | grep ^GPU | awk -F " 0000" '{print tolower($2)}')   # determine bus id of the physical GPU, e.g. 0000:01:00.0
/usr/lib/nvidia/sriov-manage -e $bus                                       # enable VFs on the GPU
nvidia-smi vgpu -i $bus -shm 1                                             # enable mixed-size (heterogeneous) vGPU mode

# The virtual functions appear as virtfn* links under /sys/bus/pci/devices/$bus/
fn0=$(realpath /sys/bus/pci/devices/$bus/virtfn0)   # resolve the virtfn paths into variables
fn1=$(realpath /sys/bus/pci/devices/$bus/virtfn1)

# Each VF has an nvidia/ directory for configuring its vGPU type;
# cat $fnN/nvidia/creatable_vgpu_type shows what is supported.
echo 665 > $fn1/nvidia/current_vgpu_type            # enable the NVIDIA 12Q profile on virtfn1
echo "frame_rate_limiter=0, disable_vnc=1" > $fn1/nvidia/vgpu_params   # disable frame rate limiter and VNC (for gaming)

cat $fn0/nvidia/creatable_vgpu_type                 # list the vGPU types supported on virtfn0
echo 660 > $fn0/nvidia/current_vgpu_type            # enable the NVIDIA 2Q profile (for encoding purposes)

# Attach the vGPUs to VMs: the 12Q to VMID 103, the 2Q to VMID 101
echo "args: -device vfio-pci,sysfsdev=$fn1 -uuid $(qm config 103 | grep 'uuid=' | cut -c15-)" >> /etc/pve/qemu-server/103.conf
echo "args: -device vfio-pci,sysfsdev=$fn0 -uuid $(qm config 101 | grep 'uuid=' | cut -c15-)" >> /etc/pve/qemu-server/101.conf
 
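Since the original error was "group 28 is not viable", it is also worth listing every device that shares the VF's IOMMU group; all of them must be bound to a vfio driver (or left unbound) before QEMU can start. A sketch using the standard sysfs layout (`iommu_group_members` is a hypothetical helper; `0000:44:00.4` is the VF from the first post):

```shell
# List all PCI devices in the same IOMMU group as the given device, by reading
# the iommu_group/devices directory that sysfs exposes per device.
iommu_group_members() {
  for d in "$1"/iommu_group/devices/*; do
    [ -e "$d" ] && basename "$d"
  done
}

# real use: iommu_group_members /sys/bus/pci/devices/0000:44:00.4
```

Any sibling still claimed by a non-vfio driver in that list is what makes the group "not viable".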
It says display mode enabled. You can see details on modules with lsmod and details on PCIe with lspci. Once SR-IOV is enabled, you should see many (virtual) PCIe cards.
I know what it should look like; I had it working with the previous driver version. I just checked with the displaymodeselector tool again, and it was already set to physical_display_disabled. I set it again and rebooted, but the situation is the same.

I don't think
Bash:
Display Mode                          : Enabled
from the nvidia-smi output means much in this situation (see @Boysa22's output above).

I think I will uninstall everything and start fresh; maybe I missed something, or something went wrong at some point.
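Before reinstalling, it may also be worth re-checking the kernel command line mentioned earlier in the thread. A sketch of the relevant /etc/default/grub fragment (assuming an Intel host; AMD hosts use amd_iommu instead, and the values shown are examples):

```shell
# /etc/default/grub -- enable the IOMMU for VFIO/SR-IOV,
# then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
```

After the reboot, `dmesg | grep -i -e DMAR -e IOMMU` should show the IOMMU being initialized.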
 
