[SOLVED] Trouble installing Nvidia Drivers for Tesla P4 vGPU inside Linux VM

itachi737

Member
Mar 18, 2021
Hi,

I recently bought a Tesla P4 and followed PolloLoco's vGPU install guide to set it up. The GPU is recognized on the PVE host and I can pass vGPUs to the VMs using the different profiles; nvidia-smi on the host shows the vGPUs assigned to the VMs, and both Linux and Windows VMs see the vGPU. In Windows I managed to install the drivers and everything works fine.

In Linux (Mint), however, I can't get the drivers to work properly. Driver Manager sees the GPU and lets me install the latest drivers for the P4 (the NVIDIA drivers, not nouveau), but after the install, running nvidia-smi inside the VM says there is no GPU. If I pass through the whole GPU instead of a vGPU, nvidia-smi shows the GPU, but it still isn't used when I test with Unigine Heaven. I tried both the Q and the C vGPU profiles, but neither works. I'm not quite sure what I'm doing wrong with the drivers inside the VM. Has anybody experienced this issue? If not, can somebody help me understand the right process to install the drivers in Linux? I must be doing something wrong, but I don't know what.

FYI: I'm trying to set up a Linux VM for Plex. Right now I'm using a Windows one, but HDR tone mapping isn't supported on the GPU there, so it hammers my CPU. In Linux it's supposed to be supported on the GPU.

Error message when running nvidia-smi with vgpu: "Nvidia-smi has failed because it couldn't communicate with the driver."

Thank you

I solved it by manually installing the drivers from the terminal rather than using the GUI Driver Manager in Linux Mint.
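In case it helps anyone else, a manual guest-side install on Mint/Ubuntu looks roughly like the commands below. The .run file name is a placeholder for whichever guest driver bundle you downloaded, and I'm assuming Mint's default lightdm display manager, so adjust as needed:

# stop the display manager so the installer can replace the kernel module (Mint defaults to lightdm)
sudo systemctl stop lightdm

# make sure nouveau is blacklisted; reboot once if it was still loaded
echo -e "blacklist nouveau\noptions nouveau modeset=0" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u

# build tools and headers, then run the guest installer (file name is a placeholder)
sudo apt install build-essential linux-headers-$(uname -r)
chmod +x NVIDIA-Linux-x86_64-xxx.xx.xx-grid.run
sudo ./NVIDIA-Linux-x86_64-xxx.xx.xx-grid.run

# verify the guest driver can talk to the vGPU
nvidia-smi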
 
Are you using the correct drivers?
Installation in a VM: After you create a Linux VM on the hypervisor and boot the VM, install the NVIDIA vGPU software graphics driver in the VM to fully enable GPU operation. [1]
I'd have a look and skim the whole user guide [2].

If that is not the problem, have you already tried whether the drivers work on Debian/Ubuntu/RHEL?

[1]: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#installing-vgpu-drivers-linux
[2]: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html
 
I have Proxmox 8.1 (Debian), an NVIDIA Tesla P4, and an AMD Ryzen 7745HX CPU. iGPU and GPU passthrough are working.

I'm following the PolloLoco setup and I'm stuck on the unlock service.
Prep:
Undid the GPU passthrough; IOMMU is enabled.

Step 1:
1) Downloaded the patch and the Rust unlock script from Git, built it, and tried to load it in the form of the two services nvidia-vgpud and nvidia-vgpu-mgr. The configs are the same, with the same .so lib in both. If I try to enable the service I get an error that no nvidia-vgpud.service is provided. I'm stuck here (see the sketch after this step).
2) I did not create a config file with unlock=false that would block the unlock.
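For reference, this is roughly how I wired in the unlock library following the guide: not a service of its own, but preloaded into nvidia-vgpud and nvidia-vgpu-mgr through systemd drop-ins (the .so path is just where my cargo build put it, so treat it as an assumption):

# drop-in directories for both services
mkdir -p /etc/systemd/system/nvidia-vgpud.service.d
mkdir -p /etc/systemd/system/nvidia-vgpu-mgr.service.d

# same override in both, preloading the built unlock library
cat > /etc/systemd/system/nvidia-vgpud.service.d/vgpu_unlock.conf << 'EOF'
[Service]
Environment=LD_PRELOAD=/opt/vgpu_unlock-rs/target/release/libvgpu_unlock_rs.so
EOF
cp /etc/systemd/system/nvidia-vgpud.service.d/vgpu_unlock.conf \
   /etc/systemd/system/nvidia-vgpu-mgr.service.d/vgpu_unlock.conf

systemctl daemon-reload

My understanding (which may be wrong) is that the two service units themselves come from the vGPU host driver package, so until the driver from step 2 is installed there is nothing for systemctl to enable; maybe that is where the "no nvidia-vgpud.service" error comes from.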

Step 2 (completed):
1) Downloaded the latest driver for the Tesla P4, version 535.161.08 (the 16.x branch with its CUDA), patched it with the downloaded patch (I tried without the patch first, same final result), installed it, and signed it for use with Secure Boot (see the signing sketch after this step).
2) Imported the signing key into the MOK keyring: "mokutil --import cert.der"
3) Enrolled the MOK update on the next boot and applied the key so the module can be loaded.
4) The patched module loads and Secure Boot is satisfied.
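The signing part, in case it matters, was roughly the usual MOK procedure. The key file names, module names, and sign-file path below are from my notes and may differ per setup:

# one-time machine owner key (file names are placeholders)
openssl req -new -x509 -newkey rsa:2048 -keyout MOK.priv -outform DER -out MOK.der \
    -nodes -days 36500 -subj "/CN=vgpu module signing/"

# sign the installed NVIDIA modules (module names can vary with the driver version)
for m in nvidia nvidia_vgpu_vfio; do
    /usr/src/linux-headers-$(uname -r)/scripts/sign-file sha256 MOK.priv MOK.der "$(modinfo -n $m)"
done

# enroll the key; mokutil asks for a one-time password and the MOK manager appears on the next boot
mokutil --import MOK.der
reboot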

Step 3 (final touches):
1) The driver is loaded and nvidia-smi is working.
2) nvidia-smi -q has a vGPU section:
    vGPU Software Licensed Product
        Product Name : NVIDIA Virtual Applications
        License Status : Licensed (Expiry: N/A)
But I haven't set up a gridd license server and I don't have any license, yet the driver info shows one.
3) nvidia-smi vgpu returns a funny error message, even though -q shows that support is there:
"No supported devices in vGPU mode"
4) mdevctl types and mdevctl list are empty. No mdev types are provided (see the check after this step).
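To rule out mdevctl itself, a quick way to see whether the card exposes any mdev types at all is to look directly in sysfs (the PCI address below is my card's, adjust accordingly):

# mdev types exposed by the card, straight from sysfs
ls /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/ 2>/dev/null || echo "no mdev types exposed"

# what mdevctl sees
mdevctl types
mdevctl list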

My guess is that in step 1 the unlock lib wrapped into the services doesn't get loaded, and as a result I don't get any vGPU types. But the P4 is on the supported list, so I shouldn't have any issues and potentially don't even need the driver patch. I tried with and without the patch: same result.
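For anyone who wants to double-check my setup, the unit state and the preload can be inspected with plain systemd commands like these:

# confirm the drop-in with the LD_PRELOAD line is attached to both units
systemctl cat nvidia-vgpud nvidia-vgpu-mgr | grep -B1 -A2 LD_PRELOAD

# current state and recent logs
systemctl status nvidia-vgpud nvidia-vgpu-mgr --no-pager
journalctl -b -u nvidia-vgpud -u nvidia-vgpu-mgr --no-pager | tail -n 50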


Another funny thing is in the PCI-related section of nvidia-smi -q:
    Bus                       : 0x01
    Device                    : 0x0
    Domain                    : 0x0000
    Device Id                 : 0x1BB310DE
    Bus Id                    : 00000000:01:00.0
    Sub System Id             : 0x11D810DE
    GPU Link Info
        PCIe Generation
            Max               : 3
            Current           : 1
            Device Current    : 1
            Device Max        : 3
            Host Max          : 5
        Link Width
            Max               : 16x
            Current           : 16x
    Bridge Chip
        Type                  : N/A
        Firmware              : N/A


The problem with the slot speed seems to be confirmed by the SMBIOS info:
dmidecode --type slot
# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.5.0 present.
Handle 0x0022, DMI type 9, 17 bytes
System Slot Information
    Designation: PCIE1
    Type: x8 PCI Express x8
    Current Usage: In Use
    Length: Short
    ID: 1
    Characteristics:
        3.3 V is provided
        PME signal is supported
    Bus Address: 0000:00:01.1

The slot is capable of full-speed PCIe 5.0 x16, and the P4 is designed for PCIe 3.0 x16, but the card is only running at PCIe 1.0 and limiting itself to x8.
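(I still need to re-check this under load, since the link can drop back to Gen 1 when the GPU is idle. These are the commands I plan to use, with the PCI address from the output above:)

# negotiated link status vs. capability for the P4, while it is under load
lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'

# nvidia-smi can report the same via query fields
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv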

@itachi737 and @noel, can you help me figure out what I'm doing wrong with vGPU, and how did you both solve the unlock lib loading?
 
535 also worked like a charm for me. I think that's because the vGPU patch is not fully required for those drivers. Did you manage to upgrade from 535 to the latest 550 drivers on the host? I was able to install the drivers and nvidia-smi returns correct info, but mdevctl shows no profiles and the nvidia-vgpud service is not running, saying: error: failed to send vGPU configuration info to RM: 6
 
After some research I found that NVIDIA dropped support for Pascal GPUs after the 535 branch. I can't make 535.161.08 support vGPU, but 535.106.06 works.

I had hoped that vGPU support would work out of the box on the P4, since that is what the card is meant for.
 
The problem is that NVIDIA dropped support for Pascal after vGPU version 16. We can use a patched host GRID driver to keep up with the latest Proxmox OS, but there are no matching guest drivers for Windows/Ubuntu, etc. So from this point on we have two choices:
  1. Upgrade Proxmox to the latest OS, patch the latest host vGPU driver, and run version 16 in the guest. The NVIDIA host and guest versions will not match.
  2. Upgrade Proxmox to the latest OS, install and pin kernel 6.5, and run version 16 on the host and in the guest. The NVIDIA host and guest versions will match.
I will take option 2 for now (see the kernel-pin sketch below).
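For option 2, pinning a 6.5 kernel on Proxmox 8 can be done with proxmox-boot-tool; the exact kernel version string depends on what the repos ship at the time, so the one below is only an example:

# install a 6.5 kernel and matching headers (meta-package names from the PVE 8 repos; double-check with apt search if unsure)
apt install proxmox-kernel-6.5 proxmox-headers-6.5

# list installed kernels, then pin the 6.5 one (version string is an example)
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.5.13-6-pve
reboot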
 
