vGPU with NVIDIA on Kernel 6.8

I ended up going this route...for now. Everything is working fine. This cluster is in production, with NVIDIA GRID-licensed A10 GPUs. After getting a renewal quote from Broadcom for VMware licensing at 10x last year's amount, it was a no-brainer. We have already been running Proxmox as our production server cluster for several years, but this one is for VDI, where Horizon View was king (unfortunately).
Is it just as simple as uninstalling the NVIDIA GRID drivers and then rebuilding for the 6.8 kernel? I'm in that position right now: I'm pinned to the 6.5 kernel in order to keep the cluster running. However, I'm running into an issue where I'm seeing kernel taint errors, which may explain why the hosts in my cluster are randomly dropping out.
 
Hello everyone, kernel 6.8.12-2-pve + NVIDIA driver 550.90.05 (17.3) + patch, everything works fine :)
NVIDIA RTX 2080
Great news. Did you uninstall the NVIDIA drivers first? I'm currently pinned to the 6.5 kernel, and every time I do apt dist-upgrade the DKMS build for the new kernel fails, and if I try to boot into the 6.8 kernel it hangs on boot.
 
First boot with 6.8, then download the NVIDIA driver and patch it, then install the patched driver (when the installer starts, it asks you to uninstall the old one ...):

# pin the 6.8 kernel and reboot into it
proxmox-boot-tool kernel pin 6.8.12-2-pve
reboot
# make the vGPU installer executable and apply the community patch
chmod +x NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run
./NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run --apply-patch 550.90.05.patch
# install the resulting patched driver with DKMS registration
./NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm-custom.run --dkms -m=kernel
reboot
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080        On  |   00000000:86:00.0 Off |                  N/A |
|  0%   52C    P8             29W / 260W  |   8163MiB /  8192MiB   |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     28716    C+G   vgpu                                         2024MiB |
|    0   N/A  N/A     31114    C+G   vgpu                                         2024MiB |
|    0   N/A  N/A     38171    C+G   vgpu                                         2024MiB |
|    0   N/A  N/A     46857    C+G   vgpu                                         2024MiB |
+-----------------------------------------------------------------------------------------+
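To double-check that the module actually built and that the host driver is in vGPU mode, something like the following should work (a sketch; the exact dkms status output format varies by version):

# confirm the nvidia module was built and installed for the running kernel
dkms status nvidia
# the vgpu subcommand is only available with the vGPU host (vgpu-kvm) driver
nvidia-smi vgpu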
 
My problem is that on the hosts that had the 535 and 550 drivers installed and were running the 6.5 kernel, during the apt dist-upgrade that pulls in the new kernels, the NVIDIA DKMS module fails to compile against both 6.8.4 and 6.8.12 (based on the dkms make.log errors), and when I try to boot either kernel it hangs on boot. I suppose I could try those kernels in recovery mode...

But I think I might just do a clean uninstall of the drivers, re-run apt dist-upgrade, and hope for the best.
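For reference, a clean uninstall can be done with the installer itself; a sketch, assuming the old 535.154.02 runfile is still on disk (otherwise the nvidia-uninstall helper that the driver installs does the same):

# remove the old vGPU host driver before re-running the upgrade
./NVIDIA-Linux-x86_64-535.154.02-vgpu-kvm.run --uninstall
apt dist-upgrade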

Do you want to continue? [Y/n] y
Get:1 https://enterprise.proxmox.com/debian/pve bookworm/pve-enterprise amd64 pve-headers all 8.2.0 [2,848 B]
Fetched 2,848 B in 1s (4,934 B/s)
Selecting previously unselected package pve-headers.
(Reading database ... 247183 files and directories currently installed.)
Preparing to unpack .../pve-headers_8.2.0_all.deb ...
Unpacking pve-headers (8.2.0) ...
Setting up pve-headers (8.2.0) ...
Setting up proxmox-kernel-6.8.8-4-pve-signed (6.8.8-4) ...
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/dkms 6.8.8-4-pve /boot/vmlinuz-6.8.8-4-pve
dkms: running auto installation service for kernel 6.8.8-4-pve.
Sign command: /lib/modules/6.8.8-4-pve/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub

Building module:
Cleaning build area...
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=6.8.8-4-pve modules.....(bad exit status: 2)
Error! Bad return status for module build on kernel: 6.8.8-4-pve (x86_64)
Consult /var/lib/dkms/nvidia/535.154.02/build/make.log for more information.
Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
dkms: autoinstall for kernel: 6.8.8-4-pve failed!
run-parts: /etc/kernel/postinst.d/dkms exited with return code 11
Failed to process /etc/kernel/postinst.d at /var/lib/dpkg/info/proxmox-kernel-6.8.8-4-pve-signed.postinst line 20.
dpkg: error processing package proxmox-kernel-6.8.8-4-pve-signed (--configure):
installed proxmox-kernel-6.8.8-4-pve-signed package post-installation script subprocess returned error exit status 2
Setting up proxmox-kernel-6.8.12-2-pve-signed (6.8.12-2) ...
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/dkms 6.8.12-2-pve /boot/vmlinuz-6.8.12-2-pve
dkms: running auto installation service for kernel 6.8.12-2-pve.
Sign command: /lib/modules/6.8.12-2-pve/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub

Building module:
Cleaning build area...
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=6.8.12-2-pve modules.....(bad exit status: 2)
Error! Bad return status for module build on kernel: 6.8.12-2-pve (x86_64)
Consult /var/lib/dkms/nvidia/535.154.02/build/make.log for more information.
Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
dkms: autoinstall for kernel: 6.8.12-2-pve failed!
run-parts: /etc/kernel/postinst.d/dkms exited with return code 11
Failed to process /etc/kernel/postinst.d at /var/lib/dpkg/info/proxmox-kernel-6.8.12-2-pve-signed.postinst line 20.
dpkg: error processing package proxmox-kernel-6.8.12-2-pve-signed (--configure):
installed proxmox-kernel-6.8.12-2-pve-signed package post-installation script subprocess returned error exit status 2
dpkg: dependency problems prevent configuration of proxmox-kernel-6.8:
proxmox-kernel-6.8 depends on proxmox-kernel-6.8.12-2-pve-signed | proxmox-kernel-6.8.12-2-pve; however:
Package proxmox-kernel-6.8.12-2-pve-signed is not configured yet.
Package proxmox-kernel-6.8.12-2-pve is not installed.
Package proxmox-kernel-6.8.12-2-pve-signed which provides proxmox-kernel-6.8.12-2-pve is not configured yet.

dpkg: error processing package proxmox-kernel-6.8 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of proxmox-default-kernel:
proxmox-default-kernel depends on proxmox-kernel-6.8; however:
Package proxmox-kernel-6.8 is not configured yet.

dpkg: error processing package proxmox-default-kernel (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of proxmox-ve:
proxmox-ve depends on proxmox-default-kernel; however:
Package proxmox-default-kernel is not configured yet.

dpkg: error processing package proxmox-ve (--configure):
dependency problems - leaving unconfigured
Errors were encountered while processing:
proxmox-kernel-6.8.8-4-pve-signed
proxmox-kernel-6.8.12-2-pve-signed
proxmox-kernel-6.8
proxmox-default-kernel
proxmox-ve
E: Sub-process /usr/bin/dpkg returned an error code (1)
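If it helps anyone who ends up in this half-configured state: one way out is to drop the DKMS module that refuses to build before letting dpkg finish, roughly like this (a sketch, using the 535.154.02 version from the log above):

# remove the NVIDIA DKMS module that fails against 6.8
dkms remove nvidia/535.154.02 --all
# then let dpkg finish configuring the kernel packages
dpkg --configure -a
apt -f install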
 
Perfect. Actually, I had to patch in 6.5, install the patched driver, then run apt dist-upgrade to rebuild against the 6.8.x kernels, then reboot into the 6.8 kernel. Without the dist-upgrade step it would hang. I think if I run into a similar issue with future kernels, I'll rebuild the NVIDIA driver first.
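In command form, the order that worked here was roughly this (a sketch, assuming the same driver and patch filenames as above):

# still booted into 6.5: patch and install the driver first
./NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run --apply-patch 550.90.05.patch
./NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm-custom.run --dkms -m=kernel
# now the DKMS rebuild during the kernel upgrade succeeds
apt dist-upgrade
reboot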
:cool:
 
Hi all

I have installed the patch for kernel 6.8.12-2-pve with driver version 535.183.04. The installation worked and nvidia-smi shows the graphics card. However, mdevctl types shows nothing.

[screenshot: mdevctl types returns no output]

When starting, you can see that the devices are activated:
[screenshot: boot messages showing the vGPU devices being activated]

Does anyone know the problem and has a solution for it?

Thank you and best regards
 
Hi MisterDeeds,
Solutions are given in the first post of the thread by guruevi.
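In short: with the 6.8-era driver there are no mediated devices any more, so an empty mdevctl types output is expected. If I remember the new interface correctly, the creatable types are exposed through NVIDIA's own sysfs files instead; a sketch, assuming a GPU at 0000:86:00.0 as in the nvidia-smi output earlier in the thread:

# list the vGPU types this GPU can create (new non-mdev interface)
cat /sys/bus/pci/devices/0000:86:00.0/nvidia/creatable_vgpu_types
# or ask the vGPU manager directly
nvidia-smi vgpu -c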
 
I'm still on 6.5 for the time being. It's working. I don't have a need to change it. I want to see how Proxmox handles assigning vGPU resources with the changes in 6.8 before trying to go down that road again.
 
Just FYI, my patches were applied.
I think not every package has been bumped/built yet, but most of them should be on the pvetest repository.
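For anyone who wants to try before it reaches the regular repos, the pvetest repository can be enabled like this (the standard Proxmox test repo line for Bookworm; use with care on production hosts):

# /etc/apt/sources.list.d/pvetest.list
deb http://download.proxmox.com/debian/pve bookworm pvetest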
Awesome!

Did I get this right, that the GUI will be usable again?
Should we use mdev again?

What is the proper way to use vGPU now? Sorry, I lost the overview :D
 
While the underlying sysfs paths and exact mechanism changed a bit, we opted to map it to our old interface on the PVE side.

So the PVE config is the same as previously (i.e. hostpci0: 0000:XX:YY.Z,mdev=nvidia-123), even though there are technically no 'mediated devices' involved; see the example below.
We did this because:
* we hope there won't be any more such changes (fingers crossed)
* to keep the old configs working on upgrade
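So a VM config keeps its familiar shape; a hypothetical excerpt (the VMID, PCI address, and vGPU type here are made up for illustration):

# /etc/pve/qemu-server/100.conf (hypothetical)
hostpci0: 0000:86:00.0,mdev=nvidia-259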
 