Update:
We made a large step towards working vGPUs with our A100.
Out of curiosity we just tried it with the NVIDIA AI Enterprise drivers from the NGC Catalog and BAAM... we got the nvidia folder with creatable_vgpu_types, etc.
Output of nvidia-smi vgpu -c to display the creatable vGPU types is now:
Code:
GPU 00000000:07:00.0
GRID A100X-8C
GPU 00000000:0B:00.0
GRID A100X-4C
GRID A100X-5C
GRID A100X-8C
GRID A100X-10C
GRID A100X-20C
GRID A100X-40C
GPU 00000000:48:00.0
GRID A100X-4C
GRID A100X-5C
GRID A100X-8C
GRID A100X-10C
GRID A100X-20C
GRID A100X-40C
GPU 00000000:4C:00.0
GRID A100X-4C
GRID A100X-5C
GRID A100X-8C
GRID A100X-10C
GRID A100X-20C
GRID A100X-40C
GPU 00000000:88:00.0
GRID A100X-4C
GRID A100X-5C
GRID A100X-8C
GRID A100X-10C
GRID A100X-20C
GRID A100X-40C
GPU 00000000:8B:00.0
GRID A100X-4C
GRID A100X-5C
GRID A100X-8C
GRID A100X-10C
GRID A100X-20C
GRID A100X-40C
GPU 00000000:C8:00.0
GRID A100X-4C
GRID A100X-5C
GRID A100X-8C
GRID A100X-10C
GRID A100X-20C
GRID A100X-40C
GPU 00000000:CB:00.0
GRID A100X-4C
GRID A100X-5C
GRID A100X-8C
GRID A100X-10C
GRID A100X-20C
GRID A100X-40C
As you can see, for the first GPU we already set the ID for type GRID A100X-8C in current_vgpu_type of the first virtual function virtfn0 (which points to PCI-ID 0000:07:00.4). The ID must be something like 459.
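A minimal Python sketch of that step, assuming the layout described above: each virtual function (virtfn0, virtfn1, ...) under the physical GPU's sysfs node has an nvidia folder containing creatable_vgpu_types and current_vgpu_type. The helper names and the sysfs_root parameter are made up for illustration, 459 is just the ID from our setup, and writing needs root.

```python
from pathlib import Path


def vf_nvidia_dir(gpu_addr: str, vf_index: int, sysfs_root: str = "/sys") -> Path:
    """Path to the nvidia folder of one virtual function of a physical GPU.

    gpu_addr is the physical function, e.g. "0000:07:00.0" (assumed layout).
    """
    return (Path(sysfs_root) / "bus/pci/devices" / gpu_addr
            / f"virtfn{vf_index}" / "nvidia")


def read_creatable_types(gpu_addr: str, vf_index: int, sysfs_root: str = "/sys") -> str:
    """Return the raw content of creatable_vgpu_types for this virtual function."""
    return (vf_nvidia_dir(gpu_addr, vf_index, sysfs_root)
            / "creatable_vgpu_types").read_text()


def set_vgpu_type(gpu_addr: str, vf_index: int, type_id: int,
                  sysfs_root: str = "/sys") -> str:
    """Write a vGPU type ID to current_vgpu_type and return what is stored afterwards."""
    target = vf_nvidia_dir(gpu_addr, vf_index, sysfs_root) / "current_vgpu_type"
    target.write_text(f"{type_id}\n")  # same effect as: echo 459 > current_vgpu_type
    return target.read_text().strip()
```

For our first GPU that would be set_vgpu_type("0000:07:00.0", 0, 459), after checking with read_creatable_types("0000:07:00.0", 0) that the ID is actually offered.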
But we couldn't start the VM configured with PCI device 0000:07:00.4 because it said:
Code:
error writing '0000:07:00.4' to '/sys/bus/pci/drivers/vfio-pci/bind': Invalid argument
TASK ERROR: Cannot bind 0000:07:00.4 to vfio
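When a bind like this fails, one thing worth checking is which kernel driver currently holds the function. A small sketch, assuming the standard Linux sysfs layout where /sys/bus/pci/devices/<address>/driver is a symlink to the bound driver. bound_driver and sysfs_root are names invented for this example, and the check alone does not explain the error.

```python
import os


def bound_driver(pci_addr: str, sysfs_root: str = "/sys"):
    """Return the name of the kernel driver bound to a PCI function, or None.

    pci_addr is the full address, e.g. "0000:07:00.4".
    """
    link = os.path.join(sysfs_root, "bus/pci/devices", pci_addr, "driver")
    if not os.path.islink(link):
        return None  # no driver bound (or the device does not exist)
    # The symlink target ends with the driver's directory name.
    return os.path.basename(os.readlink(link))
```

Calling bound_driver("0000:07:00.4") on the host shows whether the virtual function is already claimed by another driver before it gets handed to the VM.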
We then tried another GPU (0000:0b:00) and added the PCI devices 0000:0b:00.4 and 0000:0b:00.5 to the VM, with MDev Type nvidia-459 configured via the GUI, and TADAAAA... it works!!!
After a reboot, the first GPU could also be configured via the GUI again.
We tested a few things and so far everything works correctly.
Can't believe it was the AI Enterprise drivers. Figuring that out took us almost 5 days.