[SOLVED] vGPU just stopped working randomly (solution includes kernel 6.14, Pascal fixes for 17.5, and changing the mocked P4 to an A5500, thanks to GreenDamTan)

v570 is awesome if it works! My biggest gripe with the P4 is needing to downgrade CUDA versions in current packages and recompile due to backwards-compatibility issues.

But when I tried to install a v550 driver yesterday, while it installed fine on the host, I couldn't get the mdev types to show correctly as a P4 (they showed as a P40), and when I tried to install the matching GRID drivers (550) in the guests it failed, saying there wasn't a supported card installed.

So if this one already has a working vGPU XML file, are you using the same guest driver files from the v570 driver package? And it all just works as long as it's unlocked? And you don't need to patch it? I can't see any patch file for this driver version...
It takes some extra tweaking because it also needs a newer license server and a DLL / binary patch client-side, but it works.

It also spoofs the GPU as a T4, not a P4, so everything thinks it's a T4 with P4 features and it all works fine (the only reason Pascal quit working is that NVIDIA removed lines from the code; it really does still work fine with newer software versions).

Yeah, you need to patch it, and the patch is in the commits; it hasn't been merged yet. It also sadly doesn't seem to compile on Proxmox 9, you have to use 8.4 to compile it (I just booted up an 8.4 VM to patch drivers and kept it for future use).

I attached the patch file for 18.4; just remove the .txt extension.
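In case it helps, this is roughly the flow I use to turn a patch like that into an installable host driver on an 8.4 box (a sketch only; the host .run file name and the patch file name are placeholders here, and the repo's own README takes precedence over my memory):

Code:
# on a PVE 8.4 VM (the patching/build step doesn't work on PVE 9 / Trixie yet)
chmod +x NVIDIA-Linux-x86_64-<host-version>-vgpu-kvm.run
# apply the unified diff; the installer repackages itself as a *-custom.run
./NVIDIA-Linux-x86_64-<host-version>-vgpu-kvm.run --apply-patch 18.4-vgpu.patch
# copy the generated *-custom.run to the real host, install with DKMS and reboot
./NVIDIA-Linux-x86_64-<host-version>-vgpu-kvm-custom.run --dkms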
 

Cool - yeah, none of those compile on PVE 9 - I read that's because the patch binary version in Trixie is too high, so it needs Debian Bookworm or lower. Thanks for all the info, time to have a tinker and see if I can get it working!
 
Can't seem to get it to work - the patched vgpu .run file installs OK. nvidia-smi and nvidia-smi vgpu show the card OK. mdevctl types shows nothing, whether with the original vgpuConfig.xml or the ones from v535.261 and v535.230. nvidia-vgpud.service isn't running, but it says it ran successfully; it didn't seem to find any vGPU profiles though... the output is a lot less than with the working v535.261 driver:

Code:
Aug 31 14:23:43 pve nvidia-vgpud[5461]: Global settings:
Aug 31 14:23:43 pve nvidia-vgpud[5461]: Size: 16
                                        Version 1
Aug 31 14:23:43 pve nvidia-vgpud[5461]: Homogeneous vGPUs: 1
Aug 31 14:23:43 pve nvidia-vgpud[5461]: vGPU types: 586
Aug 31 14:23:43 pve nvidia-vgpud[5461]:
Aug 31 14:23:43 pve nvidia-vgpud[5461]: pciId of gpu [0]: 0:d:0:0
Aug 31 14:23:44 pve systemd[1]: nvidia-vgpud.service: Deactivated successfully.
Aug 31 14:23:44 pve systemd[1]: Finished nvidia-vgpud.service - NVIDIA vGPU Dae>
Aug 31 14:23:44 pve systemd[1]: nvidia-vgpud.service: Consumed 1.119s CPU time,>

So I don't know what I am doing wrong - it's a no-go for me so far... but the errors about the kernel are gone.
 
If you use the newer drivers with the linked repo, you no longer replace the XML file; just compile that repo and install the patched driver. I got stuck on the same thing for a minute, thinking that was still how it worked as the guide said, but that step has been patched away, I guess.
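For anyone hitting the same wall, a quick way to confirm the unlock hook is actually being loaded by the vGPU daemons (a sketch; paths assume the usual vgpu_unlock-rs layout and may differ on your install):

Code:
# does the service actually preload the unlock library?
systemctl cat nvidia-vgpud nvidia-vgpu-mgr | grep -i ld_preload
# re-read the last run of the daemon - it should enumerate vGPU types
journalctl -u nvidia-vgpud -b --no-pager | tail -n 40
# and the mdev profiles should show up once the hook is active
mdevctl types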
 
I sort of got something working, but it seems quite unstable so far, so I don't know whether I will stick with it. It looks like the vgpu_unlock_rs repo also had some updates by GreenDamTan along with the patches, and I missed those since the instructions never reference his repo directly. Once I updated that, the vGPU types were detected with the included vgpuConfig.xml, but the card is being detected as an A5500. On one of my VMs running Ubuntu 24.04 with the NVIDIA Docker runtime, ollama didn't like it at all - it locked up the VM. It seems OK on one of my other Debian 12 VMs as long as I use the standard profiles (4Q in my case), but if I try to override to balance the VRAM usage to my liking - i.e. one server with 6GB and two servers with 1GB each - it locks up the VM as well.

Still better than Frigate - I can't get that to work no matter what driver I try. Running it in Docker with nvidia-container shuts down the whole PVE host after about 10-15 minutes of running normally - no errors, nothing! I thought having a 570 driver would fix that, since their minimum driver recommendation is 570 unless you want to build your own CUDA ffmpeg binary (which I have been working on, but getting it completely static into Docker is seemingly very hard) - but anyway, no change on that; even with 570 drivers it bombed out as well.

Kind of stupid to expect it would all be smooth sailing, but this is about to get thrown in the too-hard basket!

Oh, and the original "remap_pfn_range_internal" error is still there in dmesg.
 
Use this link for your unlock repo; it should then appear as a T4. The A5500 is from the older fix repo:
https://github.com/rbqvq/vgpu_unlock-rs
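Roughly how the swap to that repo looks on the host (a sketch; paths and the LD_PRELOAD override follow the common Proxmox vGPU guides, so adjust to your setup):

Code:
# build the unlock library from the rbqvq fork (needs the Rust toolchain / cargo)
git clone https://github.com/rbqvq/vgpu_unlock-rs /opt/vgpu_unlock-rs
cd /opt/vgpu_unlock-rs && cargo build --release

# point both vGPU services at the new library
for svc in nvidia-vgpud nvidia-vgpu-mgr; do
    mkdir -p /etc/systemd/system/${svc}.service.d
    printf '[Service]\nEnvironment=LD_PRELOAD=/opt/vgpu_unlock-rs/target/release/libvgpu_unlock_rs.so\n' \
        > /etc/systemd/system/${svc}.service.d/vgpu_unlock.conf
done

systemctl daemon-reload
systemctl restart nvidia-vgpud nvidia-vgpu-mgr
mdevctl types   # the profiles should now report as a T4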

It should also have zero issues with VMs; I have tried it in a few, including Ubuntu and Windows 10/11, with zero problems, aside from the licensing expiring (and me forgetting about it), which causes it to slow to a crawl after a minute.

Do you have the license set up?

Newer versions of ollama are horrible for performance, I'm noticing; I'm trying LM Studio at the moment as an alternative. Ollama removed num_threads, which caused me a 5x slowdown by forcing thread over-commitment to the max (for some reason my CPU performs best on 7/8 cores and not the full 16), and they changed some of the code so it can't seem to properly use AVX-512 anymore at all, so I get 60% higher CPU use, lock-ups and slowdowns, higher power use, and slower responses. LM Studio is faster with only 20-40% CPU using just AVX2 and doesn't max out the CPU for long periods like ollama does. It also seems like every time people detail the performance regressions and the underlying cause in their GitHub issues, ollama gets changed in a way that negates the performance potential and bakes the problem in...

So I've moved on and would definitely recommend trying another option now; the alternatives are easier on CPU and power use, and ollama is in decline.

Kind of sounds like your issues may be license-related. I've also tried Docker on Ubuntu and never had it crash the system; I'm not sure what could be causing that part.
 
The license is seemingly working - I ran the patch and nvidia-smi says I have 3 months until expiry.

Thanks, will try that new repo and see how that goes. It could also be aging hardware; I don't think my CPU even supports AVX2, I am stuck on AVX - really old repurposed server hardware from 2014 :D I am about to get some more free hand-me-down upgrades that should hopefully do a little better.
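(For reference, checking which AVX levels a CPU actually exposes is just a flags grep; nothing vGPU-specific:)

Code:
# list the AVX-related feature flags the kernel reports for this CPU
grep -o 'avx[a-z0-9_]*' /proc/cpuinfo | sort -u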
 
Oh good, I wonder what is causing your issues then. What does your profile config look like?

Here is an example of mine for my main VM, which currently uses all 8GB (using the "GRID T4-Q8 (nvidia-233)" profile):
Code:
[vm.128]
display_width = 3840
display_height = 2160
max_pixels = 8294400
cuda_enabled = 1
frl_enabled = 0
framebuffer = 0x1DC000000
framebuffer_reservation = 0x24000000 # 8GB


Well, at least you have that P4; that's awesome compared to your CPU if you don't even have AVX2 on it. Hopefully you can get that new hardware; at least AVX2 would be great if you're running AI models.
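For reference, as I understand it the two values have to add up to the profile's total framebuffer: 0x1DC000000 + 0x24000000 = 0x200000000 = 8 GiB in the example above. A rough sketch of what the 6 GiB / 1 GiB / 1 GiB split mentioned earlier could look like in /etc/vgpu_unlock/profile_override.toml (the VM IDs and reservation sizes are illustrative only, not tested values):

Code:
[vm.100]                               # hypothetical 6 GiB VM
framebuffer = 0x160000000              # 5.5 GiB usable
framebuffer_reservation = 0x20000000   # + 0.5 GiB reserved = 6 GiB total

[vm.101]                               # hypothetical 1 GiB VM
framebuffer = 0x38000000               # 896 MiB usable
framebuffer_reservation = 0x8000000    # + 128 MiB reserved = 1 GiB total

[vm.102]                               # hypothetical 1 GiB VM
framebuffer = 0x38000000
framebuffer_reservation = 0x8000000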
 
Hi. I followed your suggested steps and have been able to successfully install vGPU driver version 18.4 along with the corresponding vgpu_unlock-rs. The whole installation went smoothly without problems. I see the expected outputs for the following commands:

nvidia-smi
mdevctl types
nvidia-smi vgpu

I have also set up a new version of the license server, though I have not tested it yet for the reason below.

I have a Red Hat RHEL 10 VM to which I have attached the vGPU with one of the available mdev profiles.

Now the question is how to install and patch the guest driver. I have both of the following files for the 18.4 guest drivers:

nvidia-linux-grid-570-570.172.08-1.x86_64.rpm
and
nvidia-gridd (binary patch for the above guest driver)

I know how to install the RPM driver itself, but how do I patch it in the guest OS?

What steps should I take? Should I just install the RPM driver and then run the patch file?
 
Yeah, it's as simple as it sounds. Install the GRID driver, which includes the nvidia-gridd binary, and patch it. You'll need to have the licensing server running so you can download your certificate from it. The repo for the patching utility includes instructions, from memory.

I normally use the .run file for drivers; haven't had much luck with the .deb file on Debian. The .rpm may play nicer?
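For the .run route in a guest, the flow I use looks roughly like this (a sketch; the file name matches the 18.4 guest driver mentioned above, and nouveau blacklisting may already be handled by your distro):

Code:
# inside the guest VM, from a text console (no X running)
echo 'blacklist nouveau' > /etc/modprobe.d/blacklist-nouveau.conf
# rebuild the initramfs so nouveau stays out on the next boot
dracut --force && reboot          # RHEL family; on Debian/Ubuntu: update-initramfs -u

# after the reboot, install the GRID guest driver with DKMS so kernel updates rebuild it
sh NVIDIA-Linux-x86_64-570.172.08-grid.run --dkms
nvidia-smi                        # the vGPU profile should show up here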
 
That is wonderful to hear. Glad you got it working.

Here is the repo for patching the GRID driver; it details the steps fairly well:
gridd-unlock-patcher

If you follow that repo and the license repo instructions it should work; then just run nvidia-smi -q | grep "License" as listed in the license repo instructions to check and make sure.
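The guest-side order ends up being roughly this (a sketch; the exact patcher command is in the gridd-unlock-patcher README, so check there rather than trusting my memory, and the server IP here is just the one from this thread):

Code:
# 1. install the GRID guest driver (rpm/run), which drops the nvidia-gridd binary
# 2. patch nvidia-gridd against your DLS instance per the gridd-unlock-patcher README
# 3. fetch a client token from your fastapi-dls server
wget --no-check-certificate -O /etc/nvidia/ClientConfigToken/client_configuration_token.tok \
    https://10.10.1.135/-/client-token
# 4. restart the licensing daemon and confirm a lease was acquired
systemctl restart nvidia-gridd
nvidia-smi -q | grep "License"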
 
Same, I always use the .run files, never had any luck with the others.
 
Many thanks for the gridd-unlock-patcher repo link.

As a first step to install the guest driver itself, I ran NVIDIA-Linux-x86_64-570.172.08-grid.run in the RHEL 10 guest VM and got the following warning. Should I continue with the installation?


Code:
You appear to be running an X server.  Installing the NVIDIA driver while X is running is not recommended, as doing so may prevent the installer from detecting some potential installation problems, and it may not be possible to start new graphics applications after a new driver is installed.  If you choose to continue installation, it is highly recommended that you reboot your computer after installation to use the newly installed driver.

I also found this relevant issue, "How to exit the X server to install the NVIDIA driver?", on the NVIDIA forums.
 
Are you running an X server? If you are, just shut it down while you install the driver.

Depending on whether you want X11 to use your vGPU or not, you can edit the X config to either point towards the vGPU or just keep it on the standard VGA adapter in the VM. I normally keep mine on the standard one so I can still use the console in PVE, because if you run X11 on the vGPU you can't use the console anymore; you have to remote-desktop into X11 instead.

Whether you want to use your vGPU for compute or display will be the decider.
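If you do keep X on the emulated VGA adapter while the vGPU handles compute, a minimal sketch of the idea (the BusID is a placeholder; check the emulated adapter's address with lspci inside the VM):

Code:
# /etc/X11/xorg.conf.d/10-console-vga.conf
Section "Device"
    Identifier "EmulatedVGA"
    Driver     "modesetting"
    BusID      "PCI:0:1:0"    # placeholder - use the bus address lspci shows for the std VGA device
EndSection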
 
Follow up: this warning:
`WARNING: CPU: 1 PID: 560 at drivers/pci/msi/msi.c:888 __pci_enable_msi_range+0x1b3/0x1d0`
was because in my VM I did not have the nouveau driver blacklisted.

I am, however, still getting this warning on the host:
`WARNING: CPU: 5 PID: 24912 at ./include/linux/rwsem.h:85 remap_pfn_range_internal+0x4af/0x5a0`

I only get that warning when I start up a VM that has the vGPU passed to it. Either the 6.14 kernel or the 17.5 drivers introduced it (or maybe the latest QEMU that was installed with 8.4?), because I didn't have it before I updated everything.

It seems to be working, so I'm not worried.
@Randell - FYI, I stopped getting this error when I did a clean install of PVE 8. Haven't worked out what the cause is yet, but so far, on a clean install with only the 570 unlocked drivers, that error is gone. Now I'm checking whether it has actually made my GPU encoding stable, and if it has, I will upgrade to PVE 9 again and see whether anything changes.
 
I have been able to install the nvidia-gridd guest driver in the RHEL 10 VM after lots of reading and troubleshooting. The X server issue was easy, as RHEL 10 does not even have the X server / Xorg stuff (it was removed in RHEL 10); the .run installer's hard-coded check was just raising a false alarm. I also patched the driver using the instructions at gridd-unlock-patcher. I downloaded the client token on the client VM using the following command:

Code:
wget --no-check-certificate -O /etc/nvidia/ClientConfigToken/client_configuration_token_$(date '+%d-%m-%Y-%H-%M-%S').tok https://10.10.1.135/-/client-token


Now the only remaining issue is licensing. Running
Code:
systemctl status nvidia-gridd.service
gives the following errors:

Sep 09 22:14:20 localhost.localdomain nvidia-gridd[3148]: vGPU Software package (0)
Sep 09 22:14:20 localhost.localdomain nvidia-gridd[3148]: Ignore service provider and node-locked licensing
Sep 09 22:14:20 localhost.localdomain nvidia-gridd[3148]: NLS initialized
Sep 09 22:14:20 localhost.localdomain nvidia-gridd[3148]: Acquiring license. (Info: 10.10.1.135; NVIDIA Virtual Applications)
Sep 09 22:14:22 localhost.localdomain nvidia-gridd[3148]: Mismatch between client and server with respect to licenses held. Returning the licenses
Sep 09 22:14:22 localhost.localdomain nvidia-gridd[3148]: License returned successfully. (Info: 10.10.1.135)
Sep 09 22:14:22 localhost.localdomain nvidia-gridd[3148]: Failed to verify signature (error:02000068:rsa routines::bad signature)
Sep 09 22:14:22 localhost.localdomain nvidia-gridd[3148]: Failed to verify signature on lease response
Sep 09 22:14:22 localhost.localdomain nvidia-gridd[3148]: Failed to validate lease response
Sep 09 22:14:22 localhost.localdomain nvidia-gridd[3148]: Failed to acquire license from 10.10.1.135


I have checked and confirmed that the time zone is identical on both the fastapi-dls license server and the client VM. The fastapi-dls server is running in an LXC container (Ubuntu 24.04). In fact, I also recreated webserver.crt and webserver.key and then patched the original nvidia-gridd file again.

What could be wrong? After so much effort, it's still not working.
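(For anyone landing here with the same signature error: it is typically what you get when the patched nvidia-gridd and the server are no longer using the same cert, e.g. after regenerating webserver.crt/key. A sketch of the client-side re-sync, with the re-patching step itself per the gridd-unlock-patcher README:)

Code:
# on the client VM, after re-patching nvidia-gridd against the server's *current* cert
rm -f /etc/nvidia/ClientConfigToken/*.tok            # drop tokens issued against the old cert
wget --no-check-certificate -O /etc/nvidia/ClientConfigToken/client_configuration_token.tok \
    https://10.10.1.135/-/client-token
systemctl restart nvidia-gridd
journalctl -u nvidia-gridd -n 20 --no-pager          # look for a successful acquire instead of the signature error
nvidia-smi -q | grep "License"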
 
The cert CN is fine, as it points to the IP address of my license server, which is 10.10.1.135. The license server hostname is just localhost (the default Ubuntu LXC container name given at the time of container creation). I am not sure why we would need DNS A records when the license server and client are on the same network (same IP range).
 
You don't have to use DNS if you aren't using an FQDN in your cert or hostname settings, but I think at least the hostname / IP in the licensing server config needs to align with the CN of your certificate - if one is 10.10.1.135 and the other is localhost (127.0.0.1), it might be causing a mismatch.

See the notes in the fastapi-dls config file:

Code:
x-dls-variables: &dls-variables
  TZ: Europe/Berlin # REQUIRED, set your timezone correctly on fastapi-dls AND YOUR CLIENTS !!!
  DLS_URL: localhost # REQUIRED, change to your ip or hostname
  DLS_PORT: 443
  LEASE_EXPIRE_DAYS: 90
  DATABASE: sqlite:////app/database/db.sqlite
  DEBUG: false
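A quick way to sanity-check that the CN in the cert and the address the clients are pointed at actually line up (a sketch; the cert path is a placeholder, and the /-/health endpoint name is from the fastapi-dls README as I remember it, so double-check there):

Code:
# on the license server - what subject does the cert actually carry?
openssl x509 -in /path/to/webserver.crt -noout -subject -enddate

# from a client VM - is the server answering on the address used in gridd.conf?
curl -k https://10.10.1.135/-/health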
 
The relevant part of /etc/fastapi-dls/env (on the fastapi-dls server) is configured as below. The server version is 2.03, as I have installed version 18.4 of nvidia-gridd:
Code:
# Cert Path
CERT_PATH="/etc/fastapi-dls/cert"

# Where the client can find the DLS server
DLS_URL=10.10.1.135
DLS_PORT=443

# CORS configuration
## comma separated list without spaces
#CORS_ORIGINS="https://$DLS_URL:$DLS_PORT"

# Lease expiration in days
LEASE_EXPIRE_DAYS=90
LEASE_RENEWAL_PERIOD=0.2

# Database location
## https://docs.sqlalchemy.org/en/14/core/engines.html
DATABASE=sqlite:////etc/fastapi-dls/db.sqlite

The relevant part of /etc/nvidia/gridd.conf (on the RHEL client VM) is configured as below:


Code:
# Description: Set License Server Address
# Data type: string
# Format:  "<address>"
ServerAddress=10.10.1.135

# Description: Set License Server port number
# Data type: integer
# Format:  <port>, default is 7070
ServerPort=443

# Description: Set Backup License Server Address

The VM and the LXC can ping each other, and I can download the client token on the VM from the license server using the IP. I am not sure what I am doing wrong, or whether there is some issue with LXC or the Ubuntu 24.04 OS running in an LXC container.

I followed the guide for the apt-based install of the license server, as described under Debian / Ubuntu (using dpkg / apt).
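(Note for anyone following along: after editing /etc/fastapi-dls/env the service has to be restarted for the new DLS_URL to be picked up, and the client then needs a fresh token. Roughly, as a sketch; the systemd unit name comes from the apt packaging and may differ on your install:)

Code:
# on the license server LXC - restart so the edited env (DLS_URL etc.) is actually applied
systemctl restart fastapi-dls
journalctl -u fastapi-dls -n 20 --no-pager
# then re-fetch the client token on the VM and restart nvidia-gridd as above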
 