vGPU Tesla P4: wrong GPU in mdevctl types

MinerAle00

Member
Apr 3, 2022
16
4
8
www.youtube.com
Good evening, I have an issue with the Tesla P4 and vGPU. I managed to install the driver and, by selecting the mdev profiles, to pass the graphics card through to both Windows and Ubuntu guests. The main problem is that in the various profiles the graphics card is recognized as a Tesla P40. What procedure should I follow to get the correct GPU reported?

root@pve:~# nvidia-smi vgpu
Wed Mar 13 19:00:37 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.06 Driver Version: 535.104.06 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 Tesla P4 | 00000000:42:00.0 | 0% |
+---------------------------------+------------------------------+------------+

root@pve:~# mdevctl types
0000:42:00.0
nvidia-156
Available instances: 12
Device API: vfio-pci
Name: GRID P40-2B
Description: num_heads=4, frl_config=45, framebuffer=2048M, max_resolution=5120x2880, max_instance=12
nvidia-215
Available instances: 12
Device API: vfio-pci
Name: GRID P40-2B4
Description: num_heads=4, frl_config=45, framebuffer=2048M, max_resolution=5120x2880, max_instance=12
nvidia-241
Available instances: 24
Device API: vfio-pci
Name: GRID P40-1B4
Description: num_heads=4, frl_config=45, framebuffer=1024M, max_resolution=5120x2880, max_instance=24
nvidia-46
Available instances: 24
Device API: vfio-pci
Name: GRID P40-1Q
Description: num_heads=4, frl_config=60, framebuffer=1024M, max_resolution=5120x2880, max_instance=24
nvidia-47
Available instances: 12
Device API: vfio-pci
Name: GRID P40-2Q
Description: num_heads=4, frl_config=60, framebuffer=2048M, max_resolution=7680x4320, max_instance=12
nvidia-48
Available instances: 8
Device API: vfio-pci
Name: GRID P40-3Q
Description: num_heads=4, frl_config=60, framebuffer=3072M, max_resolution=7680x4320, max_instance=8
nvidia-49
Available instances: 6
Device API: vfio-pci
Name: GRID P40-4Q
Description: num_heads=4, frl_config=60, framebuffer=4096M, max_resolution=7680x4320, max_instance=6
nvidia-50
Available instances: 4
Device API: vfio-pci
Name: GRID P40-6Q
Description: num_heads=4, frl_config=60, framebuffer=6144M, max_resolution=7680x4320, max_instance=4
nvidia-51
Available instances: 3
Device API: vfio-pci
Name: GRID P40-8Q
Description: num_heads=4, frl_config=60, framebuffer=8192M, max_resolution=7680x4320, max_instance=3
nvidia-52
Available instances: 2
Device API: vfio-pci
Name: GRID P40-12Q
Description: num_heads=4, frl_config=60, framebuffer=12288M, max_resolution=7680x4320, max_instance=2
nvidia-53
Available instances: 1
Device API: vfio-pci
Name: GRID P40-24Q
Description: num_heads=4, frl_config=60, framebuffer=24576M, max_resolution=7680x4320, max_instance=1
nvidia-54
Available instances: 24
Device API: vfio-pci
Name: GRID P40-1A
Description: num_heads=1, frl_config=60, framebuffer=1024M, max_resolution=1280x1024, max_instance=24
nvidia-55
Available instances: 12
Device API: vfio-pci
Name: GRID P40-2A
Description: num_heads=1, frl_config=60, framebuffer=2048M, max_resolution=1280x1024, max_instance=12
nvidia-56
Available instances: 8
Device API: vfio-pci
Name: GRID P40-3A
Description: num_heads=1, frl_config=60, framebuffer=3072M, max_resolution=1280x1024, max_instance=8
nvidia-57
Available instances: 6
Device API: vfio-pci
Name: GRID P40-4A
Description: num_heads=1, frl_config=60, framebuffer=4096M, max_resolution=1280x1024, max_instance=6
nvidia-58
Available instances: 4
Device API: vfio-pci
Name: GRID P40-6A
Description: num_heads=1, frl_config=60, framebuffer=6144M, max_resolution=1280x1024, max_instance=4
nvidia-59
Available instances: 3
Device API: vfio-pci
Name: GRID P40-8A
Description: num_heads=1, frl_config=60, framebuffer=8192M, max_resolution=1280x1024, max_instance=3
nvidia-60
Available instances: 2
Device API: vfio-pci
Name: GRID P40-12A
Description: num_heads=1, frl_config=60, framebuffer=12288M, max_resolution=1280x1024, max_instance=2
nvidia-61
Available instances: 1
Device API: vfio-pci
Name: GRID P40-24A
Description: num_heads=1, frl_config=60, framebuffer=24576M, max_resolution=1280x1024, max_instance=1
nvidia-62
Available instances: 24
Device API: vfio-pci
Name: GRID P40-1B
Description: num_heads=4, frl_config=45, framebuffer=1024M, max_resolution=5120x2880, max_instance=24

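The long `mdevctl types` listing boils down to a few numbers per type. A small awk filter (my own addition, not from the thread) tabulates type id, profile name, and framebuffer; it is run here against a trimmed sample of the output above, but in practice you would pipe `mdevctl types` straight into the awk:

```shell
# Trimmed sample of the `mdevctl types` output above.
sample='  nvidia-47
  Name: GRID P40-2Q
  Description: num_heads=4, frl_config=60, framebuffer=2048M, max_resolution=7680x4320, max_instance=12
  nvidia-50
  Name: GRID P40-6Q
  Description: num_heads=4, frl_config=60, framebuffer=6144M, max_resolution=7680x4320, max_instance=4'

# One row per mdev type: id, name, framebuffer.
# Real usage:  mdevctl types | awk '...'
table=$(printf '%s\n' "$sample" | awk '
  $1 ~ /^nvidia-/ { id = $1 }
  $1 == "Name:"   { name = $2 " " $3 }
  /framebuffer=/ {
      match($0, /framebuffer=[0-9]+M/)
      fb = substr($0, RSTART + 12, RLENGTH - 13)
      printf "%-10s %-12s %5s MiB\n", id, name, fb
  }')
printf '%s\n' "$table"
```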
 
hi,

since those names are provided by the driver, i'd guess it's a driver bug? can you post the output of 'lspci -nnk'?
in any case i'd probably ask nvidia support why your p4 reports p40 profiles...
 
Good morning, I issued that command and this is the output it generated.
root@pve:~# lspci -nnk
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450]
Subsystem: ASRock Incorporation Family 17h (Models 00h-0fh) Root Complex [1849:1450]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
Kernel driver in use: pcieport
00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
Kernel driver in use: pcieport
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
Kernel driver in use: pcieport
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
Kernel driver in use: pcieport
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
Subsystem: ASRock Incorporation FCH SMBus Controller [1849:ffff]
Kernel driver in use: piix4_smbus
Kernel modules: i2c_piix4, sp5100_tco
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
Subsystem: ASRock Incorporation FCH LPC Bridge [1849:ffff]
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
Kernel driver in use: k10temp
Kernel modules: k10temp
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
00:19.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
00:19.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
00:19.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
00:19.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
Kernel driver in use: k10temp
Kernel modules: k10temp
00:19.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
00:19.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
00:19.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
00:19.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
01:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset USB 3.1 xHCI Controller [1022:43ba] (rev 02)
Subsystem: ASMedia Technology Inc. X399 Series Chipset USB 3.1 xHCI Controller [1b21:1142]
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
01:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset SATA Controller [1022:43b6] (rev 02)
Subsystem: ASMedia Technology Inc. X399 Series Chipset SATA Controller [1b21:1062]
Kernel driver in use: ahci
Kernel modules: ahci
01:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset PCIe Bridge [1022:43b1] (rev 02)
Subsystem: ASMedia Technology Inc. X399 Series Chipset PCIe Bridge [1b21:0201]
Kernel driver in use: pcieport
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
Subsystem: ASMedia Technology Inc. 300 Series Chipset PCIe Port [1b21:3306]
Kernel driver in use: pcieport
02:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
Subsystem: ASMedia Technology Inc. 300 Series Chipset PCIe Port [1b21:3306]
Kernel driver in use: pcieport
02:05.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
Subsystem: ASMedia Technology Inc. 300 Series Chipset PCIe Port [1b21:3306]
Kernel driver in use: pcieport
02:06.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
Subsystem: ASMedia Technology Inc. 300 Series Chipset PCIe Port [1b21:3306]
Kernel driver in use: pcieport
02:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
Subsystem: ASMedia Technology Inc. 300 Series Chipset PCIe Port [1b21:3306]
Kernel driver in use: pcieport
04:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
Subsystem: ASRock Incorporation I211 Gigabit Network Connection [1849:1539]
Kernel driver in use: igb
Kernel modules: igb
05:00.0 Network controller [0280]: Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak] [8086:24fb] (rev 10)
Subsystem: Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak] [8086:2110]
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi
06:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
Subsystem: ASRock Incorporation I211 Gigabit Network Connection [1849:1539]
Kernel driver in use: igb
Kernel modules: igb
08:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology P5 NVMe PCIe SSD[SlashP5] [c0a9:5412]
Subsystem: Micron/Crucial Technology P5 NVMe PCIe SSD[SlashP5] [c0a9:0100]
Kernel driver in use: nvme
Kernel modules: nvme
09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
Subsystem: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
09:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor (PSP) 3.0 Device [1022:1456]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor (PSP) 3.0 Device [1022:1456]
Kernel driver in use: ccp
Kernel modules: ccp
09:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:145c]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:d102]
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
Subsystem: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
0a:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
Subsystem: ASRock Incorporation FCH SATA Controller [AHCI mode] [1849:ffff]
Kernel driver in use: ahci
Kernel modules: ahci
0a:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
Subsystem: ASRock Incorporation Family 17h (Models 00h-0fh) HD Audio Controller [1849:1220]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
40:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450]
40:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
40:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
40:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
Kernel driver in use: pcieport
40:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
40:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
40:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
Kernel driver in use: pcieport
40:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
40:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
40:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
Kernel driver in use: pcieport
40:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
40:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
Kernel driver in use: pcieport
41:00.0 Non-Volatile memory controller [0108]: INNOGRIT Corporation NVMe SSD Controller IG5236 [1dbe:5236] (rev 01)
Subsystem: INNOGRIT Corporation NVMe SSD Controller IG5236 [1dbe:5236]
Kernel driver in use: nvme
Kernel modules: nvme
42:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation GP104GL [Tesla P4] [10de:11d8]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
43:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
Subsystem: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
43:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor (PSP) 3.0 Device [1022:1456]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor (PSP) 3.0 Device [1022:1456]
Kernel driver in use: ccp
Kernel modules: ccp
43:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:145c]
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:145c]
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
44:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
Subsystem: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
44:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
Subsystem: ASRock Incorporation FCH SATA Controller [AHCI mode] [1849:ffff]
Kernel driver in use: ahci
Kernel modules: ahci

I installed NVIDIA drivers 16.1. Could it be a problem with that specific version? Would it be better to try a later or earlier version?
 
I installed NVIDIA drivers 16.1. Could it be a problem with that specific version? Would it be better to try a later or earlier version?
i don't know, but trying shouldn't hurt. as i said, the names/profiles come from the driver; we don't have any influence on what it reports
 
well, then i guess there is no other way to solve this than to ask nvidia support
 
After trying a bunch of configurations, I finally managed to get the correct profiles. The driver version I used is this one.
root@pve:~# nvidia-smi
Thu Mar 14 18:10:42 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.06 Driver Version: 535.104.06 CUDA Version: N/A |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Ti On | 00000000:42:00.0 Off | N/A |
| 0% 34C P8 15W / 200W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla P4 On | 00000000:43:00.0 Off | 0 |
| N/A 43C P0 26W / 75W | 1903MiB / 7680MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 1 N/A N/A 1577 C+G vgpu 1872MiB |
+---------------------------------------------------------------------------------------+


 
great! what exactly did you do to fix it? (to help future readers)
 
I redid the installation and worked out what I had changed. Initially I tried to use this script https://wvthoog.nl/proxmox-vgpu-v3/, but unfortunately the profiles were incorrect.
I tried using the original Nvidia driver (the non-custom driver), and the profiles are correct.
Everything is working fine now!
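For anyone verifying their own install, the signals mentioned in this thread come down to three host-side checks. They need the actual hardware, so they are shown as comments in this sketch (42:00.0 is this host's P4 address from the lspci output above; adjust to yours):

```shell
# Host-side sanity checks after (re)installing the vGPU host driver:
#
#   lspci -nnk -s 42:00.0        # expect "Kernel driver in use: nvidia"
#   nvidia-smi vgpu              # the card should report as Tesla P4
#   mdevctl types | grep Name:   # names should read "GRID P4-...", not P40
msg="sanity checks: lspci -nnk -s 42:00.0, nvidia-smi vgpu, mdevctl types"
echo "$msg"
```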
Did you simply download the 535.104.06 Nvidia driver and run the installation after using the script? I originally followed the tutorial here: https://gitlab.com/polloloco/vgpu-proxmox, but I ended up with the P40 profiles like you did. I tried all the 16.x driver versions, rebooting in between, and none of them showed the P4 profiles. I ran the script you linked, chose to uninstall the vgpu changes, rebooted, then ran it again as a fresh vgpu install, but ended up with P40 profiles again. I then installed the 535.104.06 Nvidia driver in place, selected Yes to uninstall the previous Nvidia drivers, and rebooted, but I still have the P40 profiles. I also tried passing the vgpu through to a VM using a P40 profile, but the VMs don't seem to recognize it correctly.
 
I just found out that if you use this script https://wvthoog.nl/proxmox-vgpu-v3/ to install the vgpu drivers, it shows the wrong mdevctl types; but when you choose option 2 in the script and "upgrade" the driver after already having it installed, it shows the right mdevctl types. I don't know how or why, but I had the wrong types on 2 systems and this solved it for both.
 
I just found out that if you use this script https://wvthoog.nl/proxmox-vgpu-v3/ to install the vgpu drivers, it shows the wrong mdevctl types; but when you choose option 2 in the script and "upgrade" the driver after already having it installed, it shows the right mdevctl types. I don't know how or why, but I had the wrong types on 2 systems and this solved it for both.
You first downloaded the original Nvidia drivers, and then you used the script to update to the latest version of the Nvidia driver.
Were the mdevctl devices correct?
I currently have the 535.161.05 drivers installed, and mdevctl is working. Which version do you have?
 
When I just "downloaded" the drivers via the wvthoog script the mdevctl types were wrong, but after I did an "upgrade" using the same script the mdevctl types were right for some reason.

I downloaded the drivers via the wvthoog script and then "upgraded" the drivers via that same script. I am now on the 550.54.10 driver on the host, using the patch for Pascal-based systems mentioned in the Polloloco vGPU guide (again, all the drivers were downloaded via the wvthoog script).

I still need to figure it out properly, but I think this method breaks the vgpu-unlock part of the script, which means you need to edit the mdevctl devices yourself.
 
When I just "downloaded" the drivers via the wvthoog script the mdevctl types were wrong, but after I did an "upgrade" using the same script the mdevctl types were right for some reason.

I downloaded the drivers via the wvthoog script and then "upgraded" the drivers via that same script. I am now on the 550.54.10 driver on the host, using the patch for Pascal-based systems mentioned in the Polloloco vGPU guide (again, all the drivers were downloaded via the wvthoog script).

I still need to figure it out properly, but I think this method breaks the vgpu-unlock part of the script, which means you need to edit the mdevctl devices yourself.
Okay, I'm trying it out now to see if it works. Which kernel are you using?
 

I'm currently using kernel 6.5.13-5-pve, that's the kernel the wvthoog script forces you to use.
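A quick way to confirm you are actually booted into the kernel the script pinned (the pin/unpin commands are the standard Proxmox helper, shown as comments in this sketch since they only make sense on the host):

```shell
# Show the running kernel; the wvthoog script expects a 6.5.13-x-pve kernel.
kernel=$(uname -r)
echo "running kernel: $kernel"

# Pinning is done with the Proxmox boot helper:
#   proxmox-boot-tool kernel pin 6.5.13-5-pve
#   proxmox-boot-tool kernel unpin    # to undo later
#   reboot
```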
I tried following your instructions, but I'm unable to make the various profiles appear. I read in Polloloco's guide this PSA for Pascal (and older) GPUs like the P4, GTX 1080:

**Starting from driver version 17.0, Nvidia in their infinite wisdom dropped support for older cards. So now, no matter if the card used to be supported (Tesla P4, etc.) or not, you have to patch the driver.**

**In addition to that, you have to copy the `vgpuConfig.xml` from 16.4 and replace the new 17.0 XML. To do that, you install and patch the 17.0 driver as described above, and then extract the 16.4 driver with `./driver.run -x`, and copy the `vgpuConfig.xml` from inside the extracted archive to `/usr/share/nvidia/vgpu/vgpuConfig.xml` (replace the existing file). Then reboot, and you should see vGPU profiles in mdevctl again.**

Now, I'll try this procedure and see if it works.
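The quoted steps, written out as commands. This is a sketch: taking 535.161.05 (mentioned earlier in this thread) as the 16.4 host package is an assumption, so adjust the filename to what you actually downloaded; only the path computation runs here, the host-only commands are comments:

```shell
# Polloloco's vgpuConfig.xml swap for Pascal cards, sketched as commands.
# Assumption: the 16.4 host driver package is the 535.161.05 .run file.
OLD_RUN=NVIDIA-Linux-x86_64-535.161.05-vgpu-kvm.run
SRC="${OLD_RUN%.run}/vgpuConfig.xml"        # where -x extracts the XML to
DST=/usr/share/nvidia/vgpu/vgpuConfig.xml   # the 17.0 XML to be replaced

# 1. install + patch the 17.0 driver as described in the guide
# 2. extract (not install) the 16.4 package:   ./"$OLD_RUN" -x
# 3. replace the XML:                          cp "$SRC" "$DST"
# 4. reboot, then check:                       mdevctl types
echo "$SRC -> $DST"
```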
 
Great, I'm glad to hear that the procedure worked, and you're welcome for the help! It would be a good idea to reach out to the script's author to see whether this process can be automated, as it is quite slow; if integrated, it would save a lot of time and effort for others facing the same issue. Good luck!

 
Yeah, I can look into it. I was already looking at the code but didn't see "the problem" at a first glance so need to look further into it.

Glad it works for you now!
 
I tried following your instructions, but I'm unable to make the various profiles appear. I read in Polloloco's guide this PSA for Pascal (and older) GPUs like the P4, GTX 1080:

**Starting from driver version 17.0, Nvidia in their infinite wisdom dropped support for older cards. So now, no matter if the card used to be supported (Tesla P4, etc.) or not, you have to patch the driver.**

**In addition to that, you have to copy the `vgpuConfig.xml` from 16.4 and replace the new 17.0 XML. To do that, you install and patch the 17.0 driver as described above, and then extract the 16.4 driver with `./driver.run -x`, and copy the `vgpuConfig.xml` from inside the extracted archive to `/usr/share/nvidia/vgpu/vgpuConfig.xml` (replace the existing file). Then reboot, and you should see vGPU profiles in mdevctl again.**

Now, I'll try this procedure and see if it works.
I am making sure that I got the process down:

You ran the script and used option 4 to download driver version 16.4
You ran the script again with option 2 and upgraded to driver version 17.0 which also patches the driver in the process
You extracted the vgpuConfig.xml file from the 16.4 driver and copied it over the vgpuConfig.xml provided by the 17.0 driver
You rebooted


Is that right? Were you able to use the mdev profiles in a vm without licensing issues?

Did you have to edit any other files, like including C libraries from the vgpu_unlock directory or customizing the vgpu profile override toml?

When you ran the VM were you able to do it from the GUI or did you have to attach the PCI device via the configuration file of the vm manually?


Thanks a lot in advance. I am facing the same issue as you were and have tried to repeat the steps as I understood them, but I am still running into problems. I am running the Tesla P4 with the 6.5.13-6-pve kernel. I am currently on driver 550.54.10, but I also tried 535.161.05 with a similar failure.
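For reference on the manual route asked about above: attaching an mdev through the VM config file looks like the fragment below (a sketch; VMID 100 and the type id are only examples, and the same thing can be done from the GUI's PCI device dialog):

```
# /etc/pve/qemu-server/100.conf  (excerpt)
# 0000:42:00.0 is the P4's address from this thread; nvidia-47 is whatever
# type id `mdevctl types` lists for the profile you want.
hostpci0: 0000:42:00.0,mdev=nvidia-47
```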
 
