vGPU with A16 in a PVE cluster

Sereno · Jan 23, 2023

Hi Everyone,

We have a 3-node cluster where one of the nodes has an NVIDIA A16 GPU. The installation is working perfectly and we can see the GPU when running nvidia-smi.

We can even select the GPU and the part we want when creating the VM. The problem is that every time we try to attach 2 vGPUs to the same VM, it gives this error:

kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000999004004,id=hostpci0,bus=pci.0,addr=0x10: warning: vfio 00000000-0000-0000-0000-000999004004: Could not enable error recovery for the device
kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000001-0000-0000-0000-000999004004,id=hostpci1,bus=pci.0,addr=0x11: vfio 00000001-0000-0000-0000-000999004004: error getting device from group 344: Input/output error
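For anyone comparing logs, the failing line above can be picked apart mechanically. This is only a minimal sketch: the regular expression is written against the exact message shape shown in this post, and the function name `parse_vfio_group_error` is made up for illustration. Other QEMU versions may word the message differently.

```python
import re

# Matches the "error getting device from group N" message as printed above.
# The pattern is an assumption based on the two log lines in this post.
VFIO_GROUP_ERROR = re.compile(
    r"id=(?P<hostpci>hostpci\d+).*?"
    r"vfio (?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}): "
    r"error getting device from group (?P<group>\d+): (?P<reason>.+)$"
)


def parse_vfio_group_error(line):
    """Return hostpci id, mdev UUID, group number and reason, or None if no match."""
    match = VFIO_GROUP_ERROR.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["group"] = int(info["group"])  # group number as an integer
    return info
```

Fed the second line above, this yields `hostpci1`, the mdev UUID `00000001-0000-0000-0000-000999004004`, group `344` and the reason `Input/output error`. The first line is only a warning and returns `None`.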

According to the NVIDIA documentation, this should be possible with the C-series and Q-series vGPU types, but it does not work for us. Has anyone had the same problem?

Best Regards

insuna · May 16, 2023

I don't have the same issue but a related one: since vGPU 15.2, NVIDIA allows adding 2 vGPUs to the same VM, but when I only have 1 physical host GPU, Proxmox can't add 2 mdevs from the same host PCI ID, because it generates the mdev UUID from the PCI ID and the VM ID.

Could the mdev generator get a new feature, like a "dual GPU" option, to allow generating different UUIDs for the same physical hardware and the same VM?

dcsapak · May 17, 2023

Hi,

This already happens: we include the VM ID and the hostpci index in the UUID (so hostpci0 and hostpci1 get different mdevs). See also the output from the original poster.
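To make the scheme concrete, here is a minimal sketch of how such a UUID could be built. The format is inferred purely from the UUIDs visible in the logs in this thread (the first field looks like the zero-padded hostpci index, the last like the zero-padded VM ID); it is not taken from the Proxmox source, and the function name `mdev_uuid` is hypothetical.

```python
def mdev_uuid(vmid, hostpci_index):
    """Build an mdev UUID in the shape seen in the logs of this thread.

    Assumption: first field = hostpci index padded to 8 digits,
    last field = VM ID padded to 12 digits, middle fields all zero.
    """
    return f"{hostpci_index:08d}-0000-0000-0000-{vmid:012d}"


print(mdev_uuid(100, 0))  # 00000000-0000-0000-0000-000000000100
print(mdev_uuid(100, 1))  # 00000001-0000-0000-0000-000000000100
```

Under that assumption, two hostpci entries in the same VM always get distinct UUIDs, even when they point at the same physical GPU.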

@Sereno: sorry I missed this post until now. Did you get it working in the meantime?

I could test it successfully, but I was only able to add multiple vGPUs of the same type to the VM (even though the NVIDIA docs say you can mix them; maybe only when they come from different GPUs?).
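A quick way to check whether a VM mixes vGPU types is to look at the `mdev=` option on its `hostpciN` lines. The following is a rough sketch that works on the text of a VM config; the helper names are hypothetical, and the PCI addresses and mdev type names in the usage note are placeholders, not real profile IDs.

```python
import re

# A VM config line such as "hostpci0: <pci address>,mdev=<type>,..."
HOSTPCI_LINE = re.compile(r"^hostpci(\d+):\s*(.+)$")


def mdev_types(config_text):
    """Map hostpci index -> mdev type for every hostpci entry that sets mdev=."""
    types = {}
    for line in config_text.splitlines():
        match = HOSTPCI_LINE.match(line.strip())
        if not match:
            continue  # not a hostpci line
        for option in match.group(2).split(","):
            key, sep, value = option.partition("=")
            if sep and key.strip() == "mdev":
                types[int(match.group(1))] = value.strip()
    return types


def has_mixed_vgpu_types(config_text):
    """True if the hostpci entries use more than one distinct mdev type."""
    return len(set(mdev_types(config_text).values())) > 1
```

For example, a config containing `hostpci0: 0000:41:00.4,mdev=nvidia-aaa` and `hostpci1: 0000:41:00.5,mdev=nvidia-bbb` would be reported as mixed, which is the combination that did not work in the test described above.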

insuna · May 17, 2023

My apologies, you're correct, I misread the error message:

Code:

kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: warning: vfio 00000000-0000-0000-0000-000000000100: Could not enable error recovery for the device
kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000001-0000-0000-0000-000000000100,id=hostpci1,bus=ich9-pcie-port-2,addr=0x0: vfio 00000001-0000-0000-0000-000000000100: error getting device from group 126: Input/output error
Verify all devices in group 126 are bound to vfio-<bus> or pci-stub and not already in use
TASK ERROR: start failed: QEMU exited with code 1
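The "Verify all devices in group 126 ..." hint can be followed by listing the group's devices and the driver each one is bound to. This is a sketch that assumes the group is visible under the usual sysfs layout (`/sys/kernel/iommu_groups/<group>/devices/<device>/driver`); the function name and the `sysfs_root` parameter are illustrative additions, the latter mainly so the code can be tried against a copy of the tree.

```python
import os


def group_device_drivers(group, sysfs_root="/sys"):
    """Map each device in the IOMMU group to its bound driver name (None if unbound)."""
    devices_dir = os.path.join(
        sysfs_root, "kernel", "iommu_groups", str(group), "devices"
    )
    result = {}
    for name in sorted(os.listdir(devices_dir)):
        # In sysfs, "driver" is a symlink to the driver the device is bound to.
        driver_link = os.path.join(devices_dir, name, "driver")
        if os.path.islink(driver_link):
            result[name] = os.path.basename(os.readlink(driver_link))
        else:
            result[name] = None
    return result
```

On the host this would be called as `group_device_drivers(126)`. In my case the binding was not the real problem, so a clean result here does not rule out a vGPU type limitation.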

The error in my case is caused by using a Tesla P100. When assigning multiple vGPUs to the same VM, the Tesla P100 only supports the 16C and 16Q vGPU types, so I'd need a second P100.
Fractional vGPU assignment only works with Ampere. Since the OP in this thread is using Ampere, their problem and mine are different. I'll leave the thread, as my sub-problem is resolved. Thank you all and have a beautiful day.
