Issues with nVidia vGPU since upgrading to kernel 6.5.13-5-pve

kesawi

Jan 20, 2024
Since upgrading from 6.5.13-3-pve to 6.5.13-5-pve I've noticed issues with the VMs that I have my GTX 1660 Ti passed through to as a vGPU.

I run Plex in a Docker container inside an Ubuntu 22.04.4 VM. When I start a media file that requires hardware transcoding from Windows, Plex makes several attempts to start the GPU transcode process. If I change the playback settings during playback, it attempts to restart the transcoding process and then fails (refer to this thread on the Plex forums for more details).

I also run an Xpenology VM and use Deep Video Analytics (DVA) for people and vehicle detection in Synology Surveillance Station. Typically after 48-72 hours the DVA tasks stop working (i.e. they no longer detect any events), even though the GPU processes still show up in nvidia-smi.

I'm using the NVIDIA GRID drivers v535.161.05 and followed the instructions at https://gitlab.com/polloloco/vgpu-proxmox to get the card working with vGPU.

I've tried uninstalling and reinstalling the NVIDIA drivers, but nothing changed.
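
When the transcodes start failing, these are the host-side checks I run (nothing exotic, just the service logs and nvidia-smi; the unit names are the ones the GRID host driver sets up):

Bash:
journalctl -b -u nvidia-vgpud -u nvidia-vgpu-mgr --no-pager | tail -n 50   # vGPU host service logs
dmesg | grep -iE 'nvidia|vgpu' | tail -n 50                                # kernel messages from the driver
nvidia-smi vgpu                                                            # should list the vGPU instances and their VMs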
 
I'm in the same boat, but I'd like to ask you about the services.
I'm using Proxmox 8.1 with an NVIDIA Tesla P4 (it supports GRID out of the box and has vGPU support, at least in the Windows 10 drivers).

The instructions at https://gitlab.com/polloloco/vgpu-proxmox have a few small steps:
1) Create the config for the nvidia-vgpud and nvidia-vgpu-mgr services. With https://github.com/wvthoog/proxmox-vgpu-installer this step finishes by enabling those services, but on Proxmox 8.1 I get an error that the services could not be enabled because /etc/systemd/system/nvidia-vgpud.service and /etc/systemd/system/nvidia-vgpu-mgr.service don't exist. I made simple templates for the services and enabled them. As I understand it, they should set the LD_PRELOAD environment variable. They run, but there is no LD_PRELOAD variable in the environment.
2) I patched driver 535.161.08 following the instructions and installed it.
3) nvidia-smi returns a normal result, but "nvidia-smi vgpu" returns "No supported devices in vGPU mode".
4) mdevctl types returns nothing.

So I'm stuck at 1) (all these services should do is set an environment variable, but they don't, and they don't load any module), and I don't understand why 3) happens.

Can you show me how you fixed 1)? And do you have the LD_PRELOAD variable in your environment?
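
For reference, this is how I checked the services on my side (unit names are the ones from the guide; if the driver installed its own units they should show up here):

Bash:
systemctl cat nvidia-vgpud nvidia-vgpu-mgr                    # shows which unit files exist and where they live
systemctl status nvidia-vgpud nvidia-vgpu-mgr --no-pager
journalctl -b -u nvidia-vgpud -u nvidia-vgpu-mgr --no-pager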
 
I followed the steps manually at https://gitlab.com/polloloco/vgpu-proxmox and did not use the script at https://github.com/wvthoog/proxmox-vgpu-installer.

Have you tried the following manually?

Bash:
mkdir /etc/systemd/system/{nvidia-vgpud.service.d,nvidia-vgpu-mgr.service.d}
echo -e "[Service]\nEnvironment=LD_PRELOAD=/opt/vgpu_unlock-rs/target/release/libvgpu_unlock_rs.so" > /etc/systemd/system/nvidia-vgpud.service.d/vgpu_unlock.conf
echo -e "[Service]\nEnvironment=LD_PRELOAD=/opt/vgpu_unlock-rs/target/release/libvgpu_unlock_rs.so" > /etc/systemd/system/nvidia-vgpu-mgr.service.d/vgpu_unlock.conf

If I run printenv from the command line, I don't see an LD_PRELOAD variable listed either.

Both services are running for me with the drop-in configs created above.
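
Note that the drop-in only sets LD_PRELOAD for those two services, not for your login shell, which is why printenv won't show it. A quick way to confirm the override actually took effect (assuming the unit names above):

Bash:
systemctl daemon-reload
systemctl restart nvidia-vgpud nvidia-vgpu-mgr
systemctl show -p Environment nvidia-vgpud nvidia-vgpu-mgr   # should print the LD_PRELOAD=.../libvgpu_unlock_rs.so line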
 
First I found https://gitlab.com/polloloco/vgpu-proxmox and went through it manually, step by step. I already have passthrough set up for my AMD 610M iGPU and didn't want the script to break it.
While experimenting with the service definitions I broke the Proxmox boot and had to reinstall Proxmox )) so after a bit of searching I found the script at https://github.com/wvthoog/proxmox-vgpu-installer. It worked on a clean installation of Proxmox 8.1. I updated the sources and upgraded; as a result the kernel moved to 6.5.13 and everything works.

uname -a
Linux h340 6.5.13-5-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-5 (2024-04-05T11:03Z) x86_64 GNU/Linux

There are some differences from what I did manually:
1) The script is pinned to older drivers. With the P4 I'm limited to the 16.x branch. I tried the latest 535.161.08 and patched it, but vGPU didn't work. The script uses 535.104.06; patched, vGPU works.
2) Before I read any articles I tried 551 from the 17.x branch, but it won't install.
3) The script installs an UNSIGNED kernel module, so it isn't loaded with Secure Boot enabled. I had to manually reinstall the patched 535.104.06 that the script created, choose the signed version, generate a key pair and import it with mokutil, then complete the MOK enrolment of the key pair on the next boot (see the sketch after this list).
4) In the profile list I have A, B and Q profiles, but I don't see any C profiles.
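
A rough sketch of the Secure Boot signing steps I mean in 3); the key and file names are only examples, and the installer flags should be double-checked against the installer's help:

Bash:
# generate a one-off signing key pair (names are just examples)
openssl req -new -x509 -newkey rsa:2048 -nodes -days 36500 \
    -subj "/CN=vgpu module signing/" -keyout vgpu-mok.key -outform DER -out vgpu-mok.der
# re-run the patched host driver and let it sign its kernel modules
# (the .run file name depends on how the patch step named it)
./NVIDIA-Linux-x86_64-535.104.06-vgpu-kvm-custom.run \
    --module-signing-secret-key=$(pwd)/vgpu-mok.key \
    --module-signing-public-key=$(pwd)/vgpu-mok.der
# queue the public key for MOK enrolment and confirm it in the MOK screen on the next boot
mokutil --import vgpu-mok.der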

Right now I don't know how to split the Tesla P4 into two instances. I want one instance with 4 GB and the other with 3.76 GB. The default option is to split it into 2 GB instances, which wastes 3.76 GB.
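
If I read the vgpu_unlock-rs notes in the polloloco guide correctly, per-VM overrides can go into /etc/vgpu_unlock/profile_override.toml, so something like this is what I plan to try (the VMIDs and the framebuffer byte values below are only guesses):

Bash:
mkdir -p /etc/vgpu_unlock
cat > /etc/vgpu_unlock/profile_override.toml <<'EOF'
[vm.100]                      # Proxmox VMID of the first guest (example)
framebuffer = 0x100000000     # ~4 GiB, adjust as needed

[vm.101]                      # second guest gets the remainder (example)
framebuffer = 0xF0000000      # ~3.75 GiB
EOF
systemctl restart nvidia-vgpud nvidia-vgpu-mgr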
 
I've stopped using vGPU with my NVIDIA card and am just doing a straight passthrough of the GTX 1660 Ti to my Xpenology VM. I'm now using vGPU for the iGPU on my i7-7700K, which I'm passing through to my Ubuntu VM for Plex.

When I was using vGPU with my GTX 1660 Ti, I had options to create Q, A and B profiles. I don't remember whether there were any C profiles. I used a Q profile and could select VRAM sizes in 1 GB increments from 1 GB to 6 GB.

I couldn't mix VRAM sizes, i.e. apply 2 GB to one VM and 4 GB to another; they all had to be the same. With the P4 you should be able to split it into two 4 GB instances.
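
If it helps, this is roughly how I checked which profiles and sizes the host offers (the PCI address is only an example; substitute your card's):

Bash:
mdevctl types                                                   # every vGPU profile the host exposes
ls /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types       # the same information straight from sysfs
cat /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/*/name
cat /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/*/available_instances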
 
A bit late, but I'm just reading this now.

I had pretty much the same experience with Proxmox 8.1 and my Tesla P4. To get mdevctl types to show anything, I had to overwrite vgpuConfig.xml with one taken from a v16.4 driver. Everything worked after I did this, and although all my profiles are for an 'Nvidia Grid P40', I can split the card into many variants/instances. I'm using a 4x 2GB profile and the 'nvidia-smi vgpu' command shows the VMs using their share.
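
Roughly what I did, from memory (the .run file names are placeholders for whichever 16.4 host package you downloaded, and the target path is where the host driver keeps its config on my install):

Bash:
# extract the 16.4 host driver package without installing it
./NVIDIA-Linux-x86_64-535.xxx.xx-vgpu-kvm.run -x
# drop its vgpuConfig.xml over the one the installed driver uses
cp NVIDIA-Linux-x86_64-535.xxx.xx-vgpu-kvm/vgpuConfig.xml /usr/share/nvidia/vgpu/vgpuConfig.xml
systemctl restart nvidia-vgpud nvidia-vgpu-mgr
mdevctl types    # the P40 profiles appeared for me after this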

One other thing to mention: I have one of the 2GB instances showing up in TrueNAS SCALE, but it won't work. I can see in the TrueNAS CLI that the NVIDIA driver fails to install correctly. I'm thinking TrueNAS may have the right drivers if I change my vgpuConfig.xml to one of the other versions.
 
The Tesla P4 has 7.86 GB of RAM, so 4x 2GB won't run. I tried it. So I have to override the RAM for the other instances, e.g. 3x 2GB and 1x 1.5GB.

Can you share your configs, like vgpuConfig.xml, and your driver version? Does 16.4 correspond to 535.161.06 or higher? I saw an example of patching 551 and mixing it with 535-branch configs, but I can't reproduce it.
 
