vGPU with A16 in a PVE cluster

Sereno

Active Member
Aug 10, 2018
Hi Everyone,

We have a 3-node cluster where one of the nodes has an NVIDIA A16 GPU. The installation is working perfectly and we are able to see the GPU when running nvidia-smi.

We can even select the GPU and the profile we want when creating the VM. The problem is that every time we try to have 2 vGPUs in the same VM, it gives this error:

Code:
kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000999004004,id=hostpci0,bus=pci.0,addr=0x10: warning: vfio 00000000-0000-0000-0000-000999004004: Could not enable error recovery for the device
kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000001-0000-0000-0000-000999004004,id=hostpci1,bus=pci.0,addr=0x11: vfio 00000001-0000-0000-0000-000999004004: error getting device from group 344: Input/output error
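
A couple of host-side checks that can help narrow down where the second mdev fails (the sysfs path is the standard location for mediated devices, and the nvidia-smi vgpu subcommand ships with the NVIDIA vGPU host driver; exact output depends on the driver version):

Code:
# list the mediated devices (vGPU instances) that currently exist on the host;
# each entry is a symlink back to its parent physical GPU
ls -l /sys/bus/mdev/devices/

# ask the vGPU host driver which vGPU instances it is currently serving
nvidia-smi vgpu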

According to the NVIDIA documentation, we should be able to do this using the C and Q profiles, but we are unable to. Has anyone had the same problem?

Best Regards
 
I don't have the same issue but a related one: since vGPU 15.2, NVIDIA allows adding 2 vGPUs to the same VM, but when I only have 1 physical host GPU, Proxmox can't add 2 mdevs from the same host PCI ID, because it generates the mdev UUID from the PCI ID and the VM ID.

Could the mdev generator get a new feature, like a "dual GPU" option, to allow generating different UUIDs for the same physical hardware and the same VM?
 
Hi,

This already happens: we include the VMID and the hostpci index in the UUID (so hostpci0 and hostpci1 get different mdevs). See also the output from the original poster.
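
Judging by the UUIDs in the logs above, the hostpci index ends up in the first group and the zero-padded VMID in the last group. A rough sketch of that pattern (illustrative only, not the actual Proxmox code):

Code:
# reconstruct the UUID pattern visible in the logs:
# hostpci index in the first group, VMID in the last group
vmid=100
hostpci_index=1
printf '%08d-0000-0000-0000-%012d\n' "$hostpci_index" "$vmid"
# prints: 00000001-0000-0000-0000-000000000100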

Sorry, I missed the original post until now. Did you get it working in the meantime?

I could test it successfully, but I was only able to add multiple vGPUs of the same type to the VM (even though the NVIDIA docs say you can mix them; maybe that only works when they come from different physical GPUs?).
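
In case it helps anyone reproducing this: each physical GPU exposes its supported mdev (vGPU) types under sysfs, including how many instances of each are still creatable. Something along these lines (the PCI address is just an example; name and available_instances are standard mdev attributes):

Code:
# list the vGPU types a physical GPU offers and how many instances of each
# can still be created (this drops to 0 for types that conflict with ones in use)
gpu=/sys/bus/pci/devices/0000:01:00.0
for t in "$gpu"/mdev_supported_types/*; do
    echo "$(basename "$t"): $(cat "$t"/name) available=$(cat "$t"/available_instances)"
done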
 
Hi,

My apologies, you're correct; I misread the error message:


Code:
kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: warning: vfio 00000000-0000-0000-0000-000000000100: Could not enable error recovery for the device
kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000001-0000-0000-0000-000000000100,id=hostpci1,bus=ich9-pcie-port-2,addr=0x0: vfio 00000001-0000-0000-0000-000000000100: error getting device from group 126: Input/output error
Verify all devices in group 126 are bound to vfio-<bus> or pci-stub and not already in use
TASK ERROR: start failed: QEMU exited with code 1

The error in my case is caused by using a Tesla P100. The P100 only supports the 16C and 16Q profiles (each of which takes up the whole card) when assigning multiple vGPUs to the same VM, so I'd need a second P100.
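
For reference, the vGPU host driver's nvidia-smi can report this directly (commands only, since the output depends on the card and driver version):

Code:
nvidia-smi vgpu -s    # list the vGPU types supported by each physical GPU
nvidia-smi vgpu -c    # list the vGPU types that can currently be created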
Fractional multi-vGPU assignment only works on Ampere and newer; since the OP in this thread is using Ampere (an A16), their problem and mine are different. I'll leave the thread here, as my sub-problem is resolved. Thank you all and have a beautiful day.
 
