IOMMU 4 NVIDIA GPUs with NCCL

Nathan Stratton

Dec 28, 2018
I have a VM with four exported 3090 GPUs. The GPUs work and I can run things like gpuburn, but when I try to train my models with NCCL I run into errors. My Supermicro H12SSL board has no ACS option in the BIOS (I believe ACS is already off, so there is no setting for it), but I do have IOMMU enabled so I can export the cards to the VM.
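Since the BIOS has no ACS switch, one way to check whether ACS is actually off is to look at the ACSCtl line that `lspci -vvv` prints for each PCIe bridge. Below is a minimal sketch, assuming it runs as root on the hypervisor host (the VM only sees a virtual topology) and assuming that the SrcValid, ReqRedir and CmpltRedir bits are the ones worth flagging, since they are the ones that push peer-to-peer traffic up toward the root complex. The function names are hypothetical helpers, not part of any tool.

```python
import re
import subprocess
import sys

# Assumption: these ACSCtl bits, when shown as "+", indicate that ACS may be
# redirecting peer-to-peer requests/completions up to the root complex.
REDIRECT_BITS = ("SrcValid", "ReqRedir", "CmpltRedir")

# Unindented device header, e.g. "00:01.1 PCI bridge: ..." or "0000:00:01.1 ...".
DEVICE_RE = re.compile(r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s")
# Indented ACS control line, e.g. "\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ ..."
ACSCTL_RE = re.compile(r"ACSCtl:\s*(.*)")


def acs_enabled_devices(lspci_text):
    """Return {bdf: [bits]} for devices whose ACSCtl has any redirect bit set."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        dev = DEVICE_RE.match(line)
        if dev:
            current = dev.group(1)  # remember which device the following lines belong to
            continue
        ctl = ACSCTL_RE.search(line)
        if ctl and current is not None:
            enabled = [bit for bit in REDIRECT_BITS if f"{bit}+" in ctl.group(1)]
            if enabled:
                result[current] = enabled
    return result


def main():
    # lspci only prints extended capabilities such as ACS when run as root.
    try:
        out = subprocess.run(["lspci", "-vvv"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"could not run lspci: {exc}", file=sys.stderr)
        return
    hits = acs_enabled_devices(out)
    if not hits:
        print("no ACSCtl redirect bits set (or lspci was not run as root)")
    for bdf, bits in sorted(hits.items()):
        print(f"{bdf}: {' '.join(bits)} enabled")


if __name__ == "__main__":
    main()
```

If any bridge between the GPUs shows these bits set, ACS is probably still active even without a BIOS option for it.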

https://docs.nvidia.com/deeplearnin...I switches have ACS,IO virtualization or VT-d.

This article suggests disabling IOMMU, but I can't do that because I need to export GPUs to a VM. How are other people doing GPU Direct without redirecting all PCI point-to-point traffic to the CPU root complex?

I assume that when AWS or GCloud provides GPUs, they are doing it with VMs, so this must be possible.