[TUTORIAL] PVE 8.22 / Kernel 6.8 and NVidia vGPU

patrickob · Aug 10, 2024

Thanks for pointing to the vGPU-Unlock-patcher! I tried to get this to work. For it, I had to update the patch.sh file to include the L40S parameters:

Code:

vcfgclone ${TARGET}/vgpuConfig.xml 0x26B5 0x176F 0x26B9 0x0000      # L40S

this however is just one profile of the L40S.

Then I included the NVIDIA-Linux-x86_64-535.161.05-vgpu-kvm package and ran the patch and then ./nvidia-installer --dkms -m=kernel.

Sadly, it gave another error:

Code:

Unable to load the kernel module 'nvidia-vgpu-vfio.ko'.  This happens most frequently when this kernel    
         module was built against the wrong or improperly configured kernel sources, with a version of gcc that
         differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and
         prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device    
         installed in this system is supported by this NVIDIA Linux graphics driver release.
                                                                                                                   
         Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file
         '/var/log/nvidia-installer.log' for more information.

In the log, there are a ton of errors.

I found the patch github has a discord server. There I noticed that another user had a similar experience. All installed well, but mdevctl types empty with L40S and ProxMox 8.

Sadly, in his last post, while he says it's solved, he mentioned he went back to PVE 7.4 with kernel 6.2.16-20-bpo11-pve and host driver 535.183.04. That's quite the pain to do, as there is no rollback from 8 -> 7, but only full wipe and reinstall...

KciNicK · Aug 10, 2024

NVIDIA-Linux-x86_64-535.161.05-vgpu-kvm will not work, use NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm instead, and no need to edit the patch as L40S is already supported

patrickob · Aug 10, 2024

ah lol

I thought I had to patch the file, and it gave an error on 550.90. That's why I went to 535.161 as it was in the patch written already.

With 550.90, it gave the errors in attachment...

KciNicK · Aug 10, 2024

i'm really sorry, my mistake, this is the correct line to get the 550.90 patch

# git clone --recursive --branch 550.90 https://github.com/VGPU-Community-Drivers/vGPU-Unlock-patcher

patrickob · Aug 10, 2024

Super! That indeed made it that I could run the patcher. So one step further!

When I then run

Code:

./nvidia-installer --dkms -m=kernel

I get the following:

Code:

[16182.286839] NVRM: The NVIDIA GPU 0000:e3:00.0 (PCI ID: 10de:26b9)
               NVRM: installed in this vGPU host system is not supported by
               NVRM: proprietary nvidia.ko.
               NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
               NVRM: Firmware' sections in the NVIDIA Virtual GPU (vGPU)
               NVRM: Software documentation, available at docs.nvidia.com.
[16182.286902] nvidia: probe of 0000:e3:00.0 failed with error -1
[16182.287016] NVRM: The NVIDIA GPU 0000:82:00.0 (PCI ID: 10de:26b9)
               NVRM: installed in this vGPU host system is not supported by
               NVRM: proprietary nvidia.ko.
               NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
               NVRM: Firmware' sections in the NVIDIA Virtual GPU (vGPU)
               NVRM: Software documentation, available at docs.nvidia.com.
[16182.287066] nvidia: probe of 0000:82:00.0 failed with error -1
[16182.289862] NVRM: The NVIDIA probe routine was not called for 64 device(s).
[16182.289866] NVRM: This can occur when another driver was loaded and
               NVRM: obtained ownership of the NVIDIA device(s).

Maybe I should first do an uninstallation of previously installed packages? I already tried to uninstall the NVidia driver, and reinstall the patched one. It however gave the same error...

KciNicK · Aug 10, 2024

patrickob said:

Super! That indeed made it that I could run the patcher. So one step further!

When I then run

Code:

./nvidia-installer --dkms -m=kernel

I get the following:

Code:

[16182.286839] NVRM: The NVIDIA GPU 0000:e3:00.0 (PCI ID: 10de:26b9)
               NVRM: installed in this vGPU host system is not supported by
               NVRM: proprietary nvidia.ko.
               NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
               NVRM: Firmware' sections in the NVIDIA Virtual GPU (vGPU)
               NVRM: Software documentation, available at docs.nvidia.com.
[16182.286902] nvidia: probe of 0000:e3:00.0 failed with error -1
[16182.287016] NVRM: The NVIDIA GPU 0000:82:00.0 (PCI ID: 10de:26b9)
               NVRM: installed in this vGPU host system is not supported by
               NVRM: proprietary nvidia.ko.
               NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
               NVRM: Firmware' sections in the NVIDIA Virtual GPU (vGPU)
               NVRM: Software documentation, available at docs.nvidia.com.
[16182.287066] nvidia: probe of 0000:82:00.0 failed with error -1
[16182.289862] NVRM: The NVIDIA probe routine was not called for 64 device(s).
[16182.289866] NVRM: This can occur when another driver was loaded and
               NVRM: obtained ownership of the NVIDIA device(s).

Maybe I should first do an uninstallation of previously installed packages? I already tried to uninstall the NVidia driver, and reinstall the patched one. It however gave the same error...

im running out of ideas, have you blacklisted nouveau at /etc/modprobe.d/pve-blacklist.conf ?

patrickob · Aug 11, 2024

fun fact, it was blacklisted in blacklist.conf, but not in pve-blacklist.conf.

I reinstalled the patch, but still the same error message. I'm indeed also out of bright ideas

KciNicK · Aug 11, 2024

you can try the 535 with polloloco's patch (see first post, it's 535 patched to work with 6.8) also do you have proxmox-headers-`uname -r` installed?

KciNicK · Aug 11, 2024

i just upgraded to 6.8.12-1 to see if is working, and it works just fine, btw, did you reboot after installing the driver?

patrickob · Aug 11, 2024

I have installed proxmox-headers:

Code:

root@pve:~# dpkg -l | grep "proxmox-headers-$(uname -r)"
ii  proxmox-headers-6.8.12-1-pve         6.8.12-1                            amd64        Proxmox Kernel Headers

I also downgraded to 535, which works with the patch for 6.8, but sadly, still gives empty

Code:

root@pve:~# mdevctl types

root@pve:~#

Seems a touch cookie!

patrickob · Aug 13, 2024

I have downgraded Proxmox (reinstalled with 7) and tried to install 550.90 drivers.

Now I get the following error:

Code:

Unable to find the kernel source tree for the currently running kernel.    
         Please make sure you have installed the kernel source files for your kernel
         and that they are properly configured; on Red Hat Linux systems, for        
         example, be sure you have the 'kernel-source' or 'kernel-devel' RPM        
         installed.  If you know the correct kernel source files are installed, you  
         may specify the kernel source path with the '--kernel-source-path' command  
         line option.

When I do:

Code:

uname -r
5.15.102-1-pve

Code:

ls -l /usr/src
total 4
drwxr-xr-x  4 root root  138 Aug 13 21:07 linux-headers-5.10.0-32-amd64
drwxr-xr-x  4 root root   77 Aug 13 21:07 linux-headers-5.10.0-32-common
drwxr-xr-x 25 root root 4096 Aug 13 21:08 linux-headers-6.8.12-1-pve
lrwxrwxrwx  1 root root   24 Aug 10 08:09 linux-kbuild-5.10 -> ../lib/linux-kbuild-5.10

Not sure why it uses the old linux-headers, while there are the more recent 6.8 headers available. I'm not sure what to do to fix this...

PS for some reason I get an error when I do dist-upgrade, I think it is related:

Code:

root@pve:/var/log# apt-get dist-upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following packages will be REMOVED:
  proxmox-ve pve-container pve-ha-manager pve-manager qemu-server
The following NEW packages will be installed:
  ethtool pve-edk2-firmware-legacy pve-edk2-firmware-ovmf python3-distutils python3-lib2to3 python3-setuptools python3-systemd
The following packages have been kept back:
  grub-efi-amd64-bin libpve-storage-perl pve-xtermjs zfs-initramfs
The following packages will be upgraded:
  ifupdown2 libpve-cluster-api-perl libpve-cluster-perl libpve-common-perl libpve-guest-common-perl libpve-http-server-perl pve-edk2-firmware
7 upgraded, 7 newly installed, 5 to remove and 4 not upgraded.
Need to get 0 B/9,670 kB of archives.
After this operation, 279 MB disk space will be freed.
Do you want to continue? [Y/n] Y
W: (pve-apt-hook) !! WARNING !!
W: (pve-apt-hook) You are attempting to remove the meta-package 'proxmox-ve'!
W: (pve-apt-hook)
W: (pve-apt-hook) If you really want to permanently remove 'proxmox-ve' from your system, run the following command
W: (pve-apt-hook)       touch '/please-remove-proxmox-ve'
W: (pve-apt-hook) run apt purge proxmox-ve to remove the meta-package
W: (pve-apt-hook) and repeat your apt invocation.
W: (pve-apt-hook)
W: (pve-apt-hook) If you are unsure why 'proxmox-ve' would be removed, please verify
W: (pve-apt-hook)       - your APT repository settings
W: (pve-apt-hook)       - that you are using 'apt full-upgrade' to upgrade your system
E: Sub-process /usr/share/proxmox-ve/pve-apt-hook returned an error code (1)
E: Failure running script /usr/share/proxmox-ve/pve-apt-hook
root@pve:/var/log#

KciNicK · Aug 13, 2024

the headers are installed manually,

# apt install linux-headers-5.15.102-1-pve

i'm guessing here, my first proxmox was 8.2-1, so if that's not the package search for it with

# apt list linux-headers-* |grep 5.15

then install it.

KciNicK · Aug 13, 2024

patrickob said:
Not sure why it uses the old linux-headers, while there are the more recent 6.8 headers available. I'm not sure what to do to fix this...

the new headers are not compatible with the old kernel, need to use the exact same version as the kernel

KciNicK · Aug 13, 2024

i would try to install a fresh https://enterprise.proxmox.com/iso/proxmox-ve_7.4-1.iso instead of downgrade the kernel

patrickob · Aug 13, 2024

I did do a fresh install of 7.4.1. Thanks for the headers fix. Sadly, it gives the same issue as before,

Code:

root@pve:~# mdevctl types

root@pve:~#

Seems like I have no luck with this... Maybe L40S just is not supported by Proxmox yet...

KciNicK · Aug 13, 2024

try this one, put it on /usr/share/nvidia/vgpu/, replace the existing one, then reboot, then try mdevctl again.

http://kcinick.gxzone.com/vgpuConfig.xml

patrickob · Aug 16, 2024

I dove a bit further. After some trouble installing on 7.4 (kernel issues again), I managed to get as far as with PVE 8. I also installed your latest suggestion. Sadly, mdevctl types remains empty...

I however see that nvdiai-vgpud.service is dead. It does start up successfully, but shuts down.

Code:

root@pve:~# systemctl status nvidia-vgpud.service
● nvidia-vgpud.service - NVIDIA vGPU Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpud.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/nvidia-vgpud.service.d
             └─vgpu_unlock.conf
     Active: inactive (dead) since Fri 2024-08-16 13:57:39 CEST; 47s ago
    Process: 2046 ExecStart=/usr/bin/nvidia-vgpud (code=exited, status=0/SUCCESS)
    Process: 2385 ExecStopPost=/bin/rm -rf /var/run/nvidia-vgpud (code=exited, status=0/SUCCESS)
   Main PID: 2056 (code=exited, status=0/SUCCESS)
        CPU: 295ms


Aug 16 13:57:39 pve nvidia-vgpud[2056]: Frame Rate Limiter enabled: 0x1
Aug 16 13:57:39 pve nvidia-vgpud[2056]: Number of Displays: 1
Aug 16 13:57:39 pve nvidia-vgpud[2056]: Max pixels: 1310720
Aug 16 13:57:39 pve nvidia-vgpud[2056]: Display: width 1280, height 1024
Aug 16 13:57:39 pve nvidia-vgpud[2056]: Multi-vGPU Exclusive supported: 0x1
Aug 16 13:57:39 pve nvidia-vgpud[2056]: License: GRID-Virtual-Apps,3.0
Aug 16 13:57:39 pve nvidia-vgpud[2056]: PID file unlocked.
Aug 16 13:57:39 pve nvidia-vgpud[2056]: PID file closed.
Aug 16 13:57:39 pve nvidia-vgpud[2056]: Shutdown (2056)
Aug 16 13:57:39 pve systemd[1]: nvidia-vgpud.service: Succeeded.

Might this be the cause for the empty types?

Some places it's suggested that this has to do with licensing. However, licenses only should be installed on the client (VMs), not on the hypervisor, afaik? In any case, I only installed NVIDIA-Linux-x86_64-535.183.04-vgpu-kvm on the hypervisor.

KciNicK · Aug 16, 2024

the license is only for clients, the only idea left on my, something in the BIOS is disabled, don't know what.

patrickob · Aug 18, 2024

I think I kinda found out what is going wrong. It is quite the unfortunate circumstance.

I am using Proxmox 8.2 with kernel 6.8.12-1-pve.
I have L40S GPU - no consumer GPUs.
I run this on an AMD system.

Proxmox installation goes fine. The NVIDIA drive can be installed properly, using the 550.90 driver. nvidia-smi returns correct:

Code:

root@pve:~# nvidia-smi
Sun Aug 18 15:33:52 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:82:00.0 Off |                    0 |
| N/A   30C    P0            127W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    On  |   00000000:E3:00.0 Off |                    0 |
| N/A   33C    P0            125W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Sadly, the

Code:

root@pve:~# mdevctl types

root@pve:~#

returns empty.

This is the root of all troubles. Upon further investigation on what this actually does, it apparently goes search for info in the following folder:

Code:

/sys/bus/pci/devices/0000:e3:00.0

with the last folder the id of one of my GPUs. This folder should contain a folder call mdev_supported_types. This is exactly the folder that is polled to populate the MDev Type dropdown in the Proxmox GUI:

https://github.com/proxmox/pve-comm...0d526a3da4a7d1cf52/src/PVE/SysFSTools.pm#L155

Sadly, given that this folder is empty, MDev Type remains disabled for me.

Now, apparently, this folder should be populated using vfio_mdev. This kernel module however is not available in Proxmox kernel.

I see however that in the Patch 550.90 of Polloloco something seems to be done with vgpu-vfio-mdev.

I thought, maybe that is actually being activated and then will create the correct folder structure in my Proxmox.

Sadly, I cannot run the patched driver, as I get a weird error message:

Code:

-> Kernel module compilation complete.
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
               NVRM: reconfigure your kernel without the conflicting
               NVRM: driver(s)), then try loading the NVIDIA kernel module
               NVRM: again.
[ 2560.234573] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 2560.234574] NVRM: None of the NVIDIA devices were initialized.
[ 2560.234763] nvidia-nvlink: Unregistered Nvlink Core, major device number 509
[ 2677.645663] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[ 2677.645674] NVRM: GPU 0000:82:00.0 is already bound to pci-pf-stub.
[ 2677.647305] NVRM: The NVIDIA GPU 0000:e3:00.0 (PCI ID: 10de:26b9)
               NVRM: installed in this vGPU host system is not supported by
               NVRM: proprietary nvidia.ko.
               NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
               NVRM: Firmware' sections in the NVIDIA Virtual GPU (vGPU)
               NVRM: Software documentation, available at docs.nvidia.com.

Here is the full log: https://pastebin.com/raw/x6PMb2DX

Basically, it seems that the Ubuntu/Proxmox system does not support non-consumer grade GPUs. Using the tutorials online, consumer grade GPUs can be made to work, but professional GPUs seem not to be supported (yet).

Maybe someone still sees a way, but I am close to the end of the trickbox...

KciNicK · Aug 19, 2024

have you read this https://forum.proxmox.com/threads/vgpu-with-nvidia-on-kernel-6-8.150840/ ?

[TUTORIAL] PVE 8.22 / Kernel 6.8 and NVidia vGPU

New Member

Member

New Member

Attachments

Member

New Member

Member

New Member

Member

Member

New Member

New Member

Member

Member

Member

New Member

Member

New Member

Member

New Member

Member

We value your privacy