I have a VM with four 3090 GPUs passed through to it. The GPUs work and I can run things like gpu-burn, but when I try to train my models with NCCL I run into errors. My Supermicro H12SSL has no ACS option in the BIOS (I believe it's off, since there's no option to set), but I do have IOMMU on so I can pass the cards through to the VM.
https://docs.nvidia.com/deeplearnin...I switches have ACS, IO virtualization, or VT-d.
This article suggests disabling IOMMU, but I can't do that because I need to pass the GPUs through to a VM. How are other people doing GPUDirect without redirecting all PCIe peer-to-peer traffic through the CPU root complex?
I assume that when AWS or GCloud provide GPUs, they do it with VMs, so this must be possible.
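For context, here's roughly what I've been running to diagnose this. `train.py` is just a stand-in for my training script; the lspci/nvidia-smi commands and the NCCL environment variables are the ones I've found in the NCCL docs for checking whether ACS redirection is what's breaking P2P:

```shell
# Check whether ACS is active on the PCIe bridges (ACSCtl entries with
# '+' flags mean peer-to-peer traffic is being redirected upstream).
sudo lspci -vvv | grep -i "Access Control Services" -A 3 | grep ACSCtl

# Show the PCIe topology as NVIDIA tools see it (PIX/PHB vs SYS links
# determine whether direct P2P between GPUs is possible).
nvidia-smi topo -m

# Re-run training with NCCL debug logging to see where init fails.
NCCL_DEBUG=INFO python train.py

# Workaround I've seen suggested: disable NCCL P2P so it falls back to
# shared-memory/socket transport (slower, but avoids ACS-related hangs).
NCCL_P2P_DISABLE=1 python train.py
```

With `NCCL_P2P_DISABLE=1` the errors may go away at the cost of bandwidth, which would at least confirm ACS/IOMMU P2P redirection as the culprit.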