vGPU with NVIDIA on Kernel 6.8

I ended up going this route...for now. Everything is working fine. This cluster is in production, with NVIDIA GRID-licensed A10 GPUs. After getting a renewal quote from Broadcom for VMware licensing at 10x last year's amount, it was a no-brainer. We have already been running Proxmox as our production server cluster for several years, but this one is for VDI, where Horizon View was king (unfortunately).
Is it just as simple as uninstalling the NVIDIA GRID drivers and then rebuilding the 6.8 kernel? I'm in that position right now, as I'm pinned to the 6.5 kernel to keep the cluster running. However, I'm running into an issue: I'm seeing kernel taint errors, which may explain why the hosts in my cluster are randomly dropping out.
 
Hello everyone, kernel 6.8.12-2-pve + NVIDIA driver 550.90.05 (17.3) + patch, everything works fine :)
NVIDIA RTX 2080
Great news. Did you uninstall the NVIDIA drivers first? I'm currently pinned to the 6.5 kernel, and every time I run apt dist-upgrade the kernel build fails; if I try to boot into the 6.8 kernel, it hangs on boot.
 
First boot into 6.8, then download the NVIDIA driver and patch it, then install the patched driver (when the installer starts, it asks you to uninstall the old one...).

proxmox-boot-tool kernel pin 6.8.12-2-pve
reboot
chmod +x NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run
./NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run --apply-patch 550.90.05.patch
./NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm-custom.run --dkms -m=kernel
reboot
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05 Driver Version: 550.90.05 CUDA Version: N/A |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 2080 On | 00000000:86:00.0 Off | N/A |
| 0% 52C P8 29W / 260W | 8163MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 28716 C+G vgpu 2024MiB |
| 0 N/A N/A 31114 C+G vgpu 2024MiB |
| 0 N/A N/A 38171 C+G vgpu 2024MiB |
| 0 N/A N/A 46857 C+G vgpu 2024MiB |
+-----------------------------------------------------------------------------------------+
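A quick way to double-check that the module really built for the pinned kernel (generic commands, not specific to this setup):
Code:
# should show an nvidia/550.90.05 entry marked "installed" for 6.8.12-2-pve
dkms status
# confirms which kernel is currently pinned
proxmox-boot-tool kernel list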
 
My problem is that on the hosts that had the 535 and 550 drivers installed and were running the 6.5 kernel, the apt dist-upgrade to get the new kernel fails: the NVIDIA DKMS module fails to compile against both the 6.8.4 and 6.8.12 kernels (based on the dkms make.log errors), and when I try to boot either kernel it hangs on boot. I suppose I could try those kernels in recovery mode...

But I think I might just do a clean uninstall of the drivers, re-run apt dist-upgrade and hope for the best.

Do you want to continue? [Y/n] y
Get:1 https://enterprise.proxmox.com/debian/pve bookworm/pve-enterprise amd64 pve-headers all 8.2.0 [2,848 B]
Fetched 2,848 B in 1s (4,934 B/s)
Selecting previously unselected package pve-headers.
(Reading database ... 247183 files and directories currently installed.)
Preparing to unpack .../pve-headers_8.2.0_all.deb ...
Unpacking pve-headers (8.2.0) ...
Setting up pve-headers (8.2.0) ...
Setting up proxmox-kernel-6.8.8-4-pve-signed (6.8.8-4) ...
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/dkms 6.8.8-4-pve /boot/vmlinuz-6.8.8-4-pve
dkms: running auto installation service for kernel 6.8.8-4-pve.
Sign command: /lib/modules/6.8.8-4-pve/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub

Building module:
Cleaning build area...
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=6.8.8-4-pve modules.....(bad exit status: 2)
Error! Bad return status for module build on kernel: 6.8.8-4-pve (x86_64)
Consult /var/lib/dkms/nvidia/535.154.02/build/make.log for more information.
Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
dkms: autoinstall for kernel: 6.8.8-4-pve failed!
run-parts: /etc/kernel/postinst.d/dkms exited with return code 11
Failed to process /etc/kernel/postinst.d at /var/lib/dpkg/info/proxmox-kernel-6.8.8-4-pve-signed.postinst line 20.
dpkg: error processing package proxmox-kernel-6.8.8-4-pve-signed (--configure):
installed proxmox-kernel-6.8.8-4-pve-signed package post-installation script subprocess returned error exit status 2
Setting up proxmox-kernel-6.8.12-2-pve-signed (6.8.12-2) ...
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/dkms 6.8.12-2-pve /boot/vmlinuz-6.8.12-2-pve
dkms: running auto installation service for kernel 6.8.12-2-pve.
Sign command: /lib/modules/6.8.12-2-pve/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub

Building module:
Cleaning build area...
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=6.8.12-2-pve modules.....(bad exit status: 2)
Error! Bad return status for module build on kernel: 6.8.12-2-pve (x86_64)
Consult /var/lib/dkms/nvidia/535.154.02/build/make.log for more information.
Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
dkms: autoinstall for kernel: 6.8.12-2-pve failed!
run-parts: /etc/kernel/postinst.d/dkms exited with return code 11
Failed to process /etc/kernel/postinst.d at /var/lib/dpkg/info/proxmox-kernel-6.8.12-2-pve-signed.postinst line 20.
dpkg: error processing package proxmox-kernel-6.8.12-2-pve-signed (--configure):
installed proxmox-kernel-6.8.12-2-pve-signed package post-installation script subprocess returned error exit status 2
dpkg: dependency problems prevent configuration of proxmox-kernel-6.8:
proxmox-kernel-6.8 depends on proxmox-kernel-6.8.12-2-pve-signed | proxmox-kernel-6.8.12-2-pve; however:
Package proxmox-kernel-6.8.12-2-pve-signed is not configured yet.
Package proxmox-kernel-6.8.12-2-pve is not installed.
Package proxmox-kernel-6.8.12-2-pve-signed which provides proxmox-kernel-6.8.12-2-pve is not configured yet.

dpkg: error processing package proxmox-kernel-6.8 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of proxmox-default-kernel:
proxmox-default-kernel depends on proxmox-kernel-6.8; however:
Package proxmox-kernel-6.8 is not configured yet.

dpkg: error processing package proxmox-default-kernel (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of proxmox-ve:
proxmox-ve depends on proxmox-default-kernel; however:
Package proxmox-default-kernel is not configured yet.

dpkg: error processing package proxmox-ve (--configure):
dependency problems - leaving unconfigured
Errors were encountered while processing:
proxmox-kernel-6.8.8-4-pve-signed
proxmox-kernel-6.8.12-2-pve-signed
proxmox-kernel-6.8
proxmox-default-kernel
proxmox-ve
E: Sub-process /usr/bin/dpkg returned an error code (1)
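For what it's worth, a rough cleanup sequence for this situation (a sketch only, assuming the 535.154.02 DKMS module from the log above is what blocks the build and that the old .run installer left its uninstaller behind):
Code:
# drop the stale DKMS module that fails to build against 6.8
dkms remove nvidia/535.154.02 --all
# remove the old vGPU host driver (uninstaller provided by the NVIDIA .run installer)
nvidia-uninstall
# let dpkg finish configuring the half-installed kernel packages, then retry
dpkg --configure -a
apt dist-upgrade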
 
Perfect. Actually, I had to patch in 6.5, install the patched driver, then run apt dist-upgrade to rebuild the 6.8.x kernels, and then reboot into the 6.8 kernel. Without the dist-upgrade step it would hang. I think if I run into a similar issue with future kernels, I'll rebuild the NVIDIA driver first.
:cool:
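In other words, the working order was roughly this (same file names as in the posts above; a sketch, not a verified recipe for every setup):
Code:
# still booted on the pinned 6.5 kernel:
./NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run --apply-patch 550.90.05.patch
./NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm-custom.run --dkms -m=kernel
# the DKMS module can now rebuild for the 6.8.x kernels during the upgrade
apt dist-upgrade
proxmox-boot-tool kernel pin 6.8.12-2-pve
reboot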
 
Hi all

I have installed the patch for kernel 6.8.12-2-pve with driver version 535.183.04. The installation worked and nvidia-smi shows the graphics card. However, mdevctl types shows nothing.

[screenshot: mdevctl types returns no output]

When starting, you can see that the devices are activated:
[screenshot: boot messages showing the devices being activated]

Does anyone know the problem and has a solution for it?

Thank you and best regards
 
Hi MisterDeeds,

Solutions are given in the first post of the thread by guruevi
 
Is it just as simple as uninstalling the NVIDIA GRID drivers and then rebuilding the 6.8 kernel? I'm in that position right now, as I'm pinned to the 6.5 kernel to keep the cluster running. However, I'm running into an issue: I'm seeing kernel taint errors, which may explain why the hosts in my cluster are randomly dropping out.

I'm still on 6.5 for the time being. It's working. I don't have a need to change it. I want to see how Proxmox handles assigning vGPU resources with the changes in 6.8 before trying to go down that road again.
 
Just FYI, my patches were applied.
I think not every package has been bumped/built yet, but most of them should be in the pvetest repository.
Awesome!

Did I get this right that the GUI will be usable again?
Should we use mdev again?

What is the proper way to use vGPU now? Sorry, I lost the overview :D
 
While the underlying sysfs paths and exact mechanism changed a bit, we opted to map it to our old interface on the PVE side,

so the PVE config is the same as previously (i.e. hostpci0: 0000:XX:YY.Z,mdev=nvidia-123, see the example below), even though there are technically no 'mediated devices' involved.
We did this because:
* we hope there won't be any more of such changes (fingers crossed)
* we want to keep the old configs working on upgrade
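As an illustration, a vGPU-backed VM config then contains something like this (placeholder values following the format above; the actual PCI address and nvidia-XXX profile depend on your GPU and on the MDev Type list offered in the GUI):
Code:
# /etc/pve/qemu-server/<vmid>.conf (illustrative values only)
hostpci0: 0000:86:00.0,mdev=nvidia-123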
 
Hi all,

I stumbled across this thread while researching how to get our NVIDIA A100 40GB GPUs working as vGPUs in Proxmox.
The problem is that NVIDIA itself says the A100 is only supported up to vGPU v15, which is driver version 525.xxx.xx. Because I couldn't build the kernel modules for that driver version on Proxmox 8.3.0 with kernel 6.8, I tried driver version 535.216.01 (vGPU v16.8), and at first glance the driver appears to be working correctly. `nvidia-smi` shows the GPUs, `nvidia-smi vgpu` shows every GPU as vGPU-compatible, and I could get SR-IOV running to activate the "Virtual Functions" on every GPU. `lspci -d 10de:` also shows multiple entries for every GPU, but `mdevctl types` keeps showing an empty line.

If I understood this correctly, @dcsapak wrote a patch so that you can use the "old" method through mediated devices again to add vGPU profiles to your VM, even though there are technically no mediated devices involved but rather the new "vendor-specific framework". So does this mean the workaround from @guruevi in his OP is no longer necessary? And should the "mediated devices" also be assignable through the GUI with this patch? Because the "MDev Type" pulldown menu is greyed out for me, too.

Could you please help me get the "mdev profiles" for the GPUs so that I can assign them to VMs, or tell me how else I can assign vGPU profiles to the VMs?
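For reference, here are the checks described above as plain commands (a sketch; the sriov-manage path is the one the GRID host driver normally installs):
Code:
# host driver sees the physical GPUs and reports them as vGPU-capable
nvidia-smi
nvidia-smi vgpu
# enable the SR-IOV virtual functions on all GPUs
/usr/lib/nvidia/sriov-manage -e ALL
# the VFs should now appear as additional NVIDIA PCI functions
lspci -d 10de:
# with the vendor-specific VFIO framework there are no mediated devices,
# so this can stay empty even on a working setup
mdevctl types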
 
Hey,

I hassled around with this as well, maybe my "guide" can help you out?
https://forum.proxmox.com/threads/nvidia-supported-gpu-with-vgpu-and-licensing.157802/post-723431
 
Hey, thank you for the fast reply.
I don't think licensing is the cause of our problem, but your guide could definitely be useful in the future, so thanks for that :)

In the meantime, we tried simply assigning one of the virtual functions to a VM as a raw PCI device (without an MDev Type) and installing the guest driver, but this also failed. The /var/log/nvidia-installer.log file says:
Code:
[...]
-> Kernel module compilation complete.
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[ 6631.912529] nvidia: probe of 0000:06:10.0 failed with error -1
[ 6631.914996] NVRM: The NVIDIA GPU 0000:06:10.0 (PCI ID: 10de:20b0)
               NVRM: installed in this system is not supported by the
               NVRM: NVIDIA 535.216.01 driver release.
               NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
               NVRM: in this release's README, available on the operating system
               NVRM: specific graphics driver download page at www.nvidia.com.
[ 6631.917432] nvidia: probe of 0000:06:10.1 failed with error -1
[ 6631.919903] NVRM: The NVIDIA GPU 0000:06:10.0 (PCI ID: 10de:20b0)
               NVRM: installed in this system is not supported by the
               NVRM: NVIDIA 535.216.01 driver release.
               NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
               NVRM: in this release's README, available on the operating system
               NVRM: specific graphics driver download page at www.nvidia.com.
[ 6631.922083] nvidia: probe of 0000:06:10.2 failed with error -1
[ 6631.924828] NVRM: The NVIDIA GPU 0000:06:10.0 (PCI ID: 10de:20b0)
               NVRM: installed in this system is not supported by the
               NVRM: NVIDIA 535.216.01 driver release.
               NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
               NVRM: in this release's README, available on the operating system
               NVRM: specific graphics driver download page at www.nvidia.com.
[ 6631.926993] nvidia: probe of 0000:06:10.3 failed with error -1
[ 6631.927066] NVRM: The NVIDIA probe routine failed for 4 device(s).
[ 6631.927068] NVRM: None of the NVIDIA devices were initialized.
[ 6631.927631] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

This is hard to understand, because if you look in Appendix A our GPU is definitely listed, including the correct device PCI ID, and the same driver version is installed and working on the host for the same GPUs.

EDIT: In your guide you mentioned
2. Get the list of supported Profiles:
root@pve-21:/sys/bus/pci/devices/0000:41:00.4/nvidia# cat creatable_vgpu_types
But none of our devices in /sys/bus/pci/devices/ has a subdirectory called "nvidia", so there's also no file called "creatable_vgpu_types".
 
@relink: You will need to download 17.4 from your NVIDIA enterprise account to work with the current kernel versions. I haven't had much luck getting 16.8 to run; I believe you can still run the guest with 16.8 if you have a 17.4 hypervisor.
 
@guruevi: I don't know if this was also meant as a suggestion for our problem, but we tried installing v17.4 with driver version 550.127.06 and the problem is still the same.

While digging deeper, we also found some notes in the NVIDIA docs (which you already shared in your post) on why there are no mediated devices; I wanted to share them to maybe clear things up for other readers:

mdev_supported_types
A directory named mdev_supported_types is required under the sysfs directory for each physical GPU that will be configured with NVIDIA vGPU. How this directory is created for a GPU depends on whether the GPU supports SR-IOV.
For a GPU that supports SR-IOV, such as a GPU based on the NVIDIA Ampere architecture, you must create this directory by enabling the virtual function for the GPU as explained in Creating an NVIDIA vGPU on a Linux with KVM Hypervisor. The mdev_supported_types directory itself is never visible on the physical function.
How to create an NVIDIA vGPU on a Linux with KVM hypervisor depends on the following factors:
- Whether the NVIDIA vGPU supports single root I/O virtualization (SR-IOV)
- Whether the hypervisor uses a vendor-specific Virtual Function I/O (VFIO) framework for an NVIDIA vGPU that supports SR-IOV.


Note:
A hypervisor that uses a vendor-specific VFIO framework uses it only for an NVIDIA vGPU that supports SR-IOV. The hypervisor still uses the mediated VFIO mdev driver framework for a legacy NVIDIA vGPU. A vendor-specific VFIO framework does not support the mediated VFIO mdev driver framework.
And also:
For GPUs that support SR-IOV, use of a vendor-specific VFIO framework is introduced in Ubuntu release 24.04.

The A100 is an Ampere card, so SR-IOV is supported and the hypervisor uses the vendor-specific VFIO framework. SR-IOV is enabled and the virtual functions are also visible:

Code:
ls -l /sys/bus/pci/devices/0000:07:00.0/ | grep virtfn

lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn0 -> ../0000:07:00.4
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn1 -> ../0000:07:00.5
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn10 -> ../0000:07:01.6
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn11 -> ../0000:07:01.7
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn12 -> ../0000:07:02.0
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn13 -> ../0000:07:02.1
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn14 -> ../0000:07:02.2
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn15 -> ../0000:07:02.3
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn2 -> ../0000:07:00.6
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn3 -> ../0000:07:00.7
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn4 -> ../0000:07:01.0
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn5 -> ../0000:07:01.1
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn6 -> ../0000:07:01.2
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn7 -> ../0000:07:01.3
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn8 -> ../0000:07:01.4
lrwxrwxrwx 1 root root           0 Dec  4 15:56 virtfn9 -> ../0000:07:01.5

But if we follow the link on creating the NVIDIA vGPU, it doesn't help either, because just like the workaround in your OP it requires the nvidia folder containing the creatable_vgpu_types file, which we don't have.

EDIT: I see now that @guruevi already shared the link to the docs in his OP, sorry.
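(For readers whose virtual functions do expose the nvidia folder, the vendor-specific flow looks roughly like this; a sketch based on the NVIDIA docs quoted above, using the first VF address from the listing above and a placeholder type ID:)
Code:
# list the vGPU type IDs this virtual function can currently create
cat /sys/bus/pci/devices/0000:07:00.4/nvidia/creatable_vgpu_types
# select one by writing its ID to current_vgpu_type (123 is a placeholder)
echo 123 > /sys/bus/pci/devices/0000:07:00.4/nvidia/current_vgpu_type
# a patched PVE reportedly handles this step itself via the mdev=... mapping described earlier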
 
Sorry, I didn't realize what you were asking for; I thought it wasn't working at all. What are the outputs of:
nvidia-vgpud.service
nvidia-vgpu-mgr.service
/usr/lib/nvidia/sriov-manage -e ALL

But A100s are different from A40s, and what most people on this thread want is a vGPU (workstation type), which requires vGPU licenses. Is your card in MIG mode? The A100 has both a time-sliced mode and a MIG mode; I think on the A100 MIG mode creates virtual functions that you have to pass through as a "regular" PCIe card, but I don't have any A100s to test this with.
 
No worries, and thanks for your help.

No, the cards aren't in MIG mode; I just checked with nvidia-smi -q:
Code:
nvidia-smi -q | grep -e "vGPU Device Capability" -A 6

    vGPU Device Capability
        Fractional Multi-vGPU             : Not Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
--
    vGPU Device Capability
        Fractional Multi-vGPU             : Not Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
--
    vGPU Device Capability
        Fractional Multi-vGPU             : Not Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
--
    vGPU Device Capability
        Fractional Multi-vGPU             : Not Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
--
    vGPU Device Capability
        Fractional Multi-vGPU             : Not Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
--
    vGPU Device Capability
        Fractional Multi-vGPU             : Not Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
--
    vGPU Device Capability
        Fractional Multi-vGPU             : Not Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled

Sorry, one of the cards is currently passed through to one of the VMs.

But the output of the services you mentioned is more interesting:
Code:
systemctl status nvidia-vgpud.service 

○ nvidia-vgpud.service - NVIDIA vGPU Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpud.service; enabled; preset: enabled)
     Active: inactive (dead) since Thu 2024-12-05 08:47:40 CET; 5s ago
    Process: 396999 ExecStart=/usr/bin/nvidia-vgpud (code=exited, status=0/SUCCESS)
   Main PID: 396999 (code=exited, status=0/SUCCESS)
        CPU: 19ms

Dec 05 08:47:40 groot nvidia-vgpud[396999]: GPU not supported by vGPU at PCI Id: 0:48:0:0 DevID: 0x10de / 0x20b0 / 0x10de / 0x0000
Dec 05 08:47:40 groot nvidia-vgpud[396999]: GPU not supported by vGPU at PCI Id: 0:4c:0:0 DevID: 0x10de / 0x20b0 / 0x10de / 0x0000
Dec 05 08:47:40 groot nvidia-vgpud[396999]: GPU not supported by vGPU at PCI Id: 0:7:0:0 DevID: 0x10de / 0x20b0 / 0x10de / 0x0000
Dec 05 08:47:40 groot nvidia-vgpud[396999]: GPU not supported by vGPU at PCI Id: 0:b:0:0 DevID: 0x10de / 0x20b0 / 0x10de / 0x0000
Dec 05 08:47:40 groot nvidia-vgpud[396999]: GPU not supported by vGPU at PCI Id: 0:c8:0:0 DevID: 0x10de / 0x20b0 / 0x10de / 0x0000
Dec 05 08:47:40 groot nvidia-vgpud[396999]: GPU not supported by vGPU at PCI Id: 0:88:0:0 DevID: 0x10de / 0x20b0 / 0x10de / 0x0000
Dec 05 08:47:40 groot nvidia-vgpud[396999]: GPU not supported by vGPU at PCI Id: 0:8b:0:0 DevID: 0x10de / 0x20b0 / 0x10de / 0x0000
Dec 05 08:47:40 groot nvidia-vgpud[396999]: error: failed to send vGPU configuration info to RM: 6
Dec 05 08:47:40 groot systemd[1]: nvidia-vgpud.service: Deactivated successfully.
Dec 05 08:47:40 groot systemd[1]: Finished nvidia-vgpud.service - NVIDIA vGPU Daemon.

Code:
systemctl status nvidia-vgpu-mgr.service 

× nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Thu 2024-12-05 08:49:15 CET; 7s ago
   Duration: 1ms
    Process: 397630 ExecStart=/usr/bin/nvidia-vgpu-mgr (code=exited, status=0/SUCCESS)
    Process: 397633 ExecStopPost=/bin/rm -rf /var/run/nvidia-vgpu-mgr (code=exited, status=0/SUCCESS)
   Main PID: 397632 (code=exited, status=1/FAILURE)
        CPU: 5ms

Dec 05 08:49:15 groot systemd[1]: Starting nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon...
Dec 05 08:49:15 groot systemd[1]: Started nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon.
Dec 05 08:49:15 groot nvidia-vgpu-mgr[397632]: notice: vmiop_env_log: Directory /var/run/nvidia-vgpu-mgr will not be removed on exit
Dec 05 08:49:15 groot nvidia-vgpu-mgr[397632]: error: vmiop_env_log: Failed to open PID file: File exists
Dec 05 08:49:15 groot systemd[1]: nvidia-vgpu-mgr.service: Main process exited, code=exited, status=1/FAILURE
Dec 05 08:49:15 groot systemd[1]: nvidia-vgpu-mgr.service: Failed with result 'exit-code'.

And here the output of sriov-manage:

Code:
/usr/lib/nvidia/sriov-manage -e ALL

GPU at 0000:07:00.0 already has VFs enabled.
GPU at 0000:0b:00.0 already has VFs enabled.
GPU at 0000:48:00.0 already has VFs enabled.
GPU at 0000:4c:00.0 already has VFs enabled.
GPU at 0000:88:00.0 already has VFs enabled.
GPU at 0000:8b:00.0 already has VFs enabled.
GPU at 0000:c8:00.0 already has VFs enabled.
GPU at 0000:cb:00.0 already has VFs enabled.

I don't know why it says the GPU doesn't support vGPU. Maybe it's because the A100 doesn't operate properly with vGPU v16/v17 and their corresponding drivers?
 
