Proxmox 5.3 - Tesla P40 vGPU issues

dominiaz

I am having issues getting vGPU working with a Tesla P40 on Proxmox 5.3.

root@vgpu:/# lspci -d 10de: -k
86:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
Subsystem: NVIDIA Corporation GP102GL [Tesla P40]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia


I have installed the vGPU manager driver: NVIDIA-Linux-x86_64-410.91-vgpu-kvm

I have disabled ECC memory with nvidia-smi, as the NVIDIA documentation instructs.

I have enabled the IOMMU with the intel_iommu=on kernel parameter.
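A quick way to confirm that the parameter is actually active on the running kernel (and not just written to the bootloader config) is to look at /proc/cmdline. The sketch below is a hypothetical helper, `has_intel_iommu`, not a Proxmox or NVIDIA tool; it only checks for the exact `intel_iommu=on` token.

```shell
# Hypothetical helper: succeed if the kernel command line contains intel_iommu=on.
# The file argument defaults to /proc/cmdline; passing another file only exists
# so the check can be tried without rebooting.
has_intel_iommu() {
    cmdline_file="${1:-/proc/cmdline}"
    # One parameter per line, then look for an exact whole-line match.
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'intel_iommu=on'
}
```

For example, `has_intel_iommu && echo "IOMMU parameter present"`. The current ECC state can be checked separately with `nvidia-smi -q -d ECC`.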

When I try to start the VM, I get this error:

kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:86:00.0/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: vfio error: 00000000-0000-0000-0000-000000000100: error getting device from group 79: Connection timed out

Verify all devices in group 79 are bound to vfio-<bus> or pci-stub and not already in use
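The error asks to verify every device in group 79. A sketch of how one might list the members of an IOMMU group and the driver each one is bound to, using the standard /sys/kernel/iommu_groups layout. `list_iommu_group` and the `SYSFS_ROOT` override are hypothetical names used only for illustration.

```shell
# Hypothetical helper: print "<device> <driver>" for every member of an IOMMU group.
# SYSFS_ROOT (assumption, default /sys) only exists so this can be tried on a copy
# of the sysfs tree; on the host leave it unset.
list_iommu_group() {
    group="$1"
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$root/kernel/iommu_groups/$group/devices/"*; do
        [ -e "$dev" ] || continue              # empty or missing group
        name=$(basename "$dev")
        if [ -e "$dev/driver" ]; then
            # driver is a symlink to the bound driver's directory
            drv=$(basename "$(readlink -f "$dev/driver")")
        else
            drv="(none)"
        fi
        printf '%s %s\n' "$name" "$drv"
    done
}
```

On the host this would be run as `list_iommu_group 79`; the dmesg lines below suggest the mdev in that group is handled by vfio_mdev.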

dmesg:

[ 150.834555] iommu: Adding device 00000000-0000-0000-0000-000000000100 to group 79
[ 150.834557] vfio_mdev 00000000-0000-0000-0000-000000000100: MDEV: group_id = 79
[ 161.498679] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x65 Timeout Occured
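When comparing attempts, it can help to pull just the mdev UUID and the status code out of the kernel log. A minimal sketch with sed, assuming the message format shown above; `vgpu_start_failures` is a hypothetical helper name.

```shell
# Hypothetical helper: read log text on stdin and print "<uuid> <status>" for each
# nvidia-vgpu-vfio "start failed" line. Assumes the message format shown in this thread.
vgpu_start_failures() {
    sed -n 's/.*\[nvidia-vgpu-vfio] \([0-9a-fA-F-]*\): start failed\. status: \(0x[0-9a-fA-F]*\).*/\1 \2/p'
}
```

Usage: `dmesg | vgpu_start_failures` or `journalctl -k | vgpu_start_failures`.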


Please help me :)


 
It seems the card has not activated the vGPU.

How exactly did you obtain the driver? Are you sure it is compatible with our kernel version?
 
I'm pretty sure NVIDIA needs to write drivers that support Debian for vGPU. The drivers you are using are for Red Hat KVM.

What's the output from:
Code:
nvidia-smi vgpu
 
How can I check whether the driver is compatible with the Proxmox kernel?

I think the NVIDIA driver should support Debian.

The enterprise download page offers several drivers for vGPU. I installed the first one like this:

apt-get install pve-headers-$(uname -r)
apt-get install make gcc g++
chmod +x NVIDIA-Linux-x86_64-410.91-vgpu-kvm.run
./NVIDIA-Linux-x86_64-410.91-vgpu-kvm.run


The driver installed successfully.


root@vgpu:~# nvidia-smi vgpu
Tue Feb 26 01:43:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.91 Driver Version: 410.91 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 Tesla P40 | 00000000:86:00.0 | 0% |
+---------------------------------+------------------------------+------------+
 
You're only showing the device:
Code:
86:00
On our Red Hat system with a Tesla M10, you don't pass through with the device ID. I pass through with:
Code:
2122657

The output of nvidia-smi vgpu should look like this:
Code:
|===============================+================================+============|
|   0  Tesla M10                | 00000000:88:00.0               |  52%       |
|      2122661    GRID M10-4Q   | 2122682  centos7.4-xmpl-211... |     19%    |
|      2122663    GRID M10-4Q   | 2122692  centos7.4-xmpl-211... |      0%    |
|      2122659    GRID M10-4Q   | 2122664  centos7.4-xmpl-211... |     25%    |
|      2122657    GRID M10-4Q   | 2122663  centos7.4-xmpl-211... |     24%    |
+-------------------------------+--------------------------------+------------+

Note that my device ID is 88:00, but my vGPU IDs are:
Code:
2122661  
2122663  
2122659  
2122657
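The vGPU IDs in the first column of each vGPU row can be pulled out of that table with awk. This sketch assumes the layout shown above (pipe-separated columns, vGPU rows containing a GRID profile name); `vgpu_ids` is a hypothetical helper.

```shell
# Hypothetical helper: read `nvidia-smi vgpu` output on stdin and print the vGPU ID
# from each vGPU row. Assumes vGPU rows contain "GRID" and the ID is the first
# whitespace-separated field of the first pipe-delimited column.
vgpu_ids() {
    awk -F'|' '/GRID/ { split($2, f, " "); print f[1] }'
}
```

Usage: `nvidia-smi vgpu | vgpu_ids`. On the P40 output above it prints nothing, because no vGPU rows are listed yet.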

NVIDIA needs to write a driver compatible with the Debian kernel for Proxmox to take advantage of this.


Can you run the following:
Code:
 nvidia-smi vgpu -s -i 0
 
root@vgpu:~# nvidia-smi vgpu -s -i 0
GPU 00000000:86:00.0
GRID P40-1Q
GRID P40-2Q
GRID P40-3Q
GRID P40-4Q
GRID P40-6Q
GRID P40-8Q
GRID P40-12Q
GRID P40-24Q
GRID P40-1A
GRID P40-2A
GRID P40-3A
GRID P40-4A
GRID P40-6A
GRID P40-8A
GRID P40-12A
GRID P40-24A
GRID P40-1B
GRID P40-2B
GRID P40-2B4
 
Ok,

Does Proxmox give you an option to choose Mdev Type after you choose 86:00 for Device?
 
Yes, I can choose all mdev profiles from device 86:00.

Any ideas?

Everything works until I start the VM; then I get the error.
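The MDev Type list presumably comes from the card's mdev_supported_types directory in sysfs (the kernel's mediated-device interface), so the same information can be inspected directly, including how many instances of each type are still available. A sketch; `list_mdev_types` and the `SYSFS_ROOT` override are hypothetical names.

```shell
# Hypothetical helper: list mdev types exposed by a PCI device as
# "<type-dir>\t<name>\t<available_instances>".
# SYSFS_ROOT (assumption, default /sys) only exists for trying this on a copy of sysfs.
list_mdev_types() {
    pciaddr="$1"   # full address, e.g. 0000:86:00.0
    base="${SYSFS_ROOT:-/sys}/bus/pci/devices/$pciaddr/mdev_supported_types"
    for t in "$base"/*; do
        [ -d "$t" ] || continue
        printf '%s\t%s\t%s\n' "$(basename "$t")" "$(cat "$t/name")" "$(cat "$t/available_instances")"
    done
}
```

Usage: `list_mdev_types 0000:86:00.0`. An available_instances value of 0 means no more mdevs of that type can be created on the card.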
 
I'd like to ask: was this ever solved? Can I use NVIDIA's vGPU features in Proxmox? Thank you.

 
I am currently running the same test with a P4 and have the same problem as you.

I noticed the following log messages:

nvidia-vgpu-mgr[4716]: error: vmiop_env_log: Failed to get VM UUID from QEMU command-line 0x57
kernel: [60186.757610] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000102: start failed. status: 0x65 Timeout Occured

Could the problem be that NVIDIA's vGPU manager expects the VM to be launched through libvirt?

I've tried many versions, both 6.x and 7.x.
 
Regarding "Failed to get VM UUID from QEMU command-line": it seems the vGPU manager wants to parse the VM's UUID from its command line.

Maybe it helps to set this in the VM config?
Code:
args: -uuid some-uuid-here
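`qm showcmd <vmid>` prints the KVM command line Proxmox would run for a VM, so it can be used to check whether a -uuid argument actually reaches QEMU. A minimal sketch; `has_uuid_flag` is a hypothetical helper that only does a simple token match on that output.

```shell
# Hypothetical helper: read a KVM command line on stdin (e.g. from `qm showcmd`)
# and succeed if it contains a standalone -uuid argument.
has_uuid_flag() {
    tr ' ' '\n' | grep -qx -- '-uuid'
}
```

Usage: `qm showcmd 100 | has_uuid_flag && echo "-uuid present"`.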
 
Yes, that's it. I modified my virtual machine configuration:

args: -uuid 7e6ada15-02b7-48ce-b06c-2e1c43fde32e
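The same config change can be scripted. This is a sketch under two assumptions: the VM config has no existing args: line, and it has no snapshot sections (appending to a config with snapshots would put the line into the last snapshot section rather than the current config). `add_uuid_arg` is a hypothetical helper.

```shell
# Hypothetical helper: append "args: -uuid <uuid>" to a Proxmox VM config file.
# Assumption: no existing args: line and no [snapshot] sections; otherwise it refuses,
# because a plain append would not merge correctly.
add_uuid_arg() {
    conf="$1"   # e.g. /etc/pve/qemu-server/100.conf
    uuid="$2"
    if grep -q '^args:' "$conf"; then
        echo "args: already present in $conf, edit it by hand" >&2
        return 1
    fi
    if grep -q '^\[' "$conf"; then
        echo "$conf has snapshot sections, edit it by hand" >&2
        return 1
    fi
    printf 'args: -uuid %s\n' "$uuid" >> "$conf"
}
```

Usage: `add_uuid_arg /etc/pve/qemu-server/100.conf "$(cat /proc/sys/kernel/random/uuid)"`. `qm set 100 --args '-uuid ...'` is an alternative, but it replaces any existing args string.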

A new error occurred when starting it again:

May 7 16:52:28 vcloud nvidia-vgpu-mgr[9285]: notice: vmiop_env_log: Received start call from nvidia-vgpu-vfio module: vGPU uuid 00000000-0000-0000-0000-000000000102 GPU PCI id 00:04:00.0 config params vgpu_type_id=64
May 7 16:52:28 vcloud nvidia-vgpu-mgr[9285]: notice: vmiop-env: guest_max_gpfn:0x0
May 7 16:52:28 vcloud nvidia-vgpu-mgr[9285]: notice: pluginconfig: vgpu_type_id=64,gpu-pci-id=00:04:00.0,vdev_id=0x1bb3:0x1205,vgpu_type=Quadro,framebufferlength=0x74000000
May 7 16:52:28 vcloud nvidia-vgpu-mgr[9285]: error: vmioplugin dlopen: /usr/lib64/libnvidia-vgpu.so: cannot open shared object file: No such file or directory
May 7 16:52:28 vcloud nvidia-vgpu-mgr[9285]: error: vmiope_process_configuration: plugin registration error
May 7 16:52:28 vcloud nvidia-vgpu-mgr[9285]: error: vmiop_env_log: vmiope_process_configuration failed with 0x57
May 7 16:52:28 vcloud kernel: [ 3771.066490] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000102: start failed. status: 0x57

The log says /usr/lib64/libnvidia-vgpu.so was not found, so I looked for the .so files:

root@vcloud:~# find / -name '*nvidia*.so'
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-vgxcfg.so
/usr/lib/x86_64-linux-gnu/libnvidia-vgpu.so
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so

Then I added a /usr/lib64 symlink pointing to /usr/lib/x86_64-linux-gnu:

root@vcloud:/usr# ls -lat
total 64
drwxr-xr-x 2 root root 20480 May 7 17:21 bin
drwxr-xr-x 10 root root 4096 May 7 17:12 .
lrwxrwxrwx 1 root root 25 May 7 17:12 lib64 -> /usr/lib/x86_64-linux-gnu
drwxr-xr-x 54 root root 4096 May 7 15:46 lib
drwxr-xr-x 112 root root 4096 May 7 15:46 share
drwxr-xr-x 5 root root 4096 May 7 15:46 src
drwxr-xr-x 37 root root 4096 Mar 19 10:46 include
drwxr-xr-x 2 root root 12288 Mar 18 15:20 sbin
drwxr-xr-x 22 root root 4096 Mar 18 15:20 ..
drwxr-xr-x 2 root root 4096 Feb 20 22:45 games
drwxr-xr-x 10 root root 4096 Feb 20 22:44 local

Started the VM again: it works!

Thanks!


 
It is working !!!

Thank you very much guys!

P.S. On Proxmox 6.0 there is a kernel bug in vfio: the VM sometimes freezes, with a syslog message about a kernel bug in vfio. Proxmox 5.4 works great with NVIDIA vGPU.
 
Hello!
Thanks, this topic really helped me.
 
Step-by-step instructions:

1) Enable iommu on proxmox host: https://pve.proxmox.com/wiki/Pci_passthrough
2) Download the NVIDIA vGPU software for Linux KVM (NVIDIA-Linux-x86_64-430.67-vgpu-kvm.run or newer) from https://nvid.nvidia.com/dashboard/
3) Run the following on the Proxmox host:
Code:
chmod +x NVIDIA-Linux-x86_64-430.67-vgpu-kvm.run
apt install gcc make pve-headers-`uname -r`
./NVIDIA-Linux-x86_64-430.67-vgpu-kvm.run
mkdir -p /usr/lib64
ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so /usr/lib64/libnvidia-ml.so
ln -s /usr/lib/x86_64-linux-gnu/libnvidia-vgxcfg.so /usr/lib64/libnvidia-vgxcfg.so
ln -s /usr/lib/x86_64-linux-gnu/libnvidia-vgpu.so /usr/lib64/libnvidia-vgpu.so
ln -s /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so /usr/lib64/libnvidia-cfg.so
4) Reboot the Proxmox host
5) Add a line such as 'args: -uuid 00000000-0000-0000-0000-000000000100' to each VM's config file (e.g. /etc/pve/local/qemu-server/100.conf), using a different UUID for every VM
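The four ln -s commands in step 3 can also be written as one loop. This is a sketch; `link_vgpu_libs` is a hypothetical helper, and its two parameters default to the paths from step 3 so it can be tried in a scratch directory first. It also guards against the case where /usr/lib64 is already a symlink to /usr/lib/x86_64-linux-gnu (the directory-link approach used earlier in the thread), where per-file links would overwrite the real libraries with links to themselves.

```shell
# Hypothetical helper: link the NVIDIA host libraries from SRC into DST.
# Defaults match step 3: SRC=/usr/lib/x86_64-linux-gnu, DST=/usr/lib64.
link_vgpu_libs() {
    src="${1:-/usr/lib/x86_64-linux-gnu}"
    dst="${2:-/usr/lib64}"
    mkdir -p "$dst"                        # unlike plain mkdir, fine if it already exists
    # If DST already resolves to SRC, per-file links would clobber the originals.
    if [ "$(readlink -f "$dst")" = "$(readlink -f "$src")" ]; then
        echo "$dst already resolves to $src; nothing to do"
        return 0
    fi
    for lib in libnvidia-ml.so libnvidia-vgxcfg.so libnvidia-vgpu.so libnvidia-cfg.so; do
        ln -sfn "$src/$lib" "$dst/$lib"    # -f replaces an existing link on re-runs
    done
}
```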
 
