Hello,
I am working on a server (PowerEdge XE8545) with the latest version of Proxmox. It has 4 GPUs (NVIDIA A100-SXM4-40GB).
I followed the instructions on your site, enabling IOMMU Passthrough Mode and the appropriate kernel modules.
I also did this to attach the vfio-pci driver instead of the nouveau driver:
Code:
root@compute02:~# cat /etc/modprobe.d/vfio.conf
softdep nouveau pre: vfio-pci
options vfio-pci ids=10de:20b0
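In case it helps with diagnosis, here is a minimal sketch of how the binding could be double-checked on the host. It assumes the standard Linux sysfs layout under /sys/bus/pci/devices (a `vendor` file, a `device` file, and `driver` / `iommu_group` symlinks per device). The names `list_gpu_bindings`, `_link_name` and the `sysfs_root` parameter are made up for illustration; only the 10de:20b0 ID comes from my config above.

```python
import os


def _link_name(path):
    """Return the last path component of a symlink target, or None if absent."""
    try:
        return os.path.basename(os.readlink(path))
    except OSError:
        return None


def list_gpu_bindings(sysfs_root="/sys/bus/pci/devices",
                      vendor="0x10de", device="0x20b0"):
    """Return (pci_address, driver, iommu_group) for each matching PCI device.

    driver or iommu_group is None when the corresponding symlink is missing,
    e.g. when no driver is bound to the device.
    """
    if not os.path.isdir(sysfs_root):
        return []
    results = []
    for addr in sorted(os.listdir(sysfs_root)):
        dev_dir = os.path.join(sysfs_root, addr)
        try:
            with open(os.path.join(dev_dir, "vendor")) as f:
                dev_vendor = f.read().strip()
            with open(os.path.join(dev_dir, "device")) as f:
                dev_id = f.read().strip()
        except OSError:
            continue  # not a readable PCI device entry
        if (dev_vendor, dev_id) != (vendor, device):
            continue
        results.append((
            addr,
            _link_name(os.path.join(dev_dir, "driver")),
            _link_name(os.path.join(dev_dir, "iommu_group")),
        ))
    return results
```

Running `print(list_gpu_bindings())` on the host, I would expect four entries, each bound to vfio-pci, and the IOMMU group numbers would show whether any of the GPUs share a group.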
Code:
root@compute02:~# dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
[ 1.353420] AMD-Vi: Using global IVHD EFR:0x59f77efa2094ade, EFR2:0x0
[ 2.001628] pci 0000:60:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.002787] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.003871] pci 0000:20:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.005300] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.007914] pci 0000:e0:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.009172] pci 0000:c0:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.010837] pci 0000:a0:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.011869] pci 0000:80:00.2: AMD-Vi: IOMMU performance counters supported
[ 2.013079] pci 0000:60:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.013081] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[ 2.013090] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.013091] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[ 2.013098] pci 0000:20:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.013099] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[ 2.013106] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.013107] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[ 2.013113] pci 0000:e0:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.013114] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[ 2.013121] pci 0000:c0:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.013122] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[ 2.013128] pci 0000:a0:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.013129] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[ 2.013136] pci 0000:80:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 2.013137] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[ 2.013143] AMD-Vi: Interrupt remapping enabled
[ 2.013144] AMD-Vi: X2APIC enabled
[ 2.014448] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[ 2.014456] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[ 2.014463] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[ 2.014470] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
[ 2.014478] perf/amd_iommu: Detected AMD IOMMU #4 (2 banks, 4 counters/bank).
[ 2.014485] perf/amd_iommu: Detected AMD IOMMU #5 (2 banks, 4 counters/bank).
[ 2.014492] perf/amd_iommu: Detected AMD IOMMU #6 (2 banks, 4 counters/bank).
[ 2.014499] perf/amd_iommu: Detected AMD IOMMU #7 (2 banks, 4 counters/bank).
As for the VMs, I first used the i440fx machine type with OVMF. Later, I changed to q35 with OVMF and PCIe. My guest VMs run Ubuntu 20.04 and Ubuntu 22.04. The problem is the same in both configurations.
I am trying to assign 1 GPU to the Ubuntu 22.04 VM and 3 GPUs to the Ubuntu 20.04 VM. Whichever VM I start first, everything seems to work: nvidia-smi sees all the GPUs, and torch.cuda.is_available() returns True. However, when I start the second VM, it takes about 5 minutes before it begins booting. Once it is up, nvidia-smi sees all the GPUs, but torch.cuda.is_available() hangs for about 2 minutes and then reports that CUDA is not detected. If I run nvidia-smi during that hang, it also hangs and then returns ERROR instead of the values.
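To make the hang easier to reproduce and report, the CUDA check could be run in a child process with a timeout, so the calling shell does not block. This is only a sketch: `run_check` and `CUDA_CHECK` are names I made up, and it assumes PyTorch is installed in the guest's Python environment.

```python
import subprocess
import sys

# The one-liner I have been running by hand inside the guest.
CUDA_CHECK = "import torch; print(torch.cuda.is_available())"


def run_check(code, timeout_s=30.0, python=sys.executable):
    """Run a Python snippet in a child process and return (status, output).

    status is "ok" (exit code 0), "error" (non-zero exit code) or
    "timeout" (the child did not finish within timeout_s seconds).
    Caveat: if the child is stuck in an uninterruptible driver call,
    killing it may itself block, so this is not a guaranteed escape.
    """
    try:
        proc = subprocess.run(
            [python, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ("timeout", "")
    status = "ok" if proc.returncode == 0 else "error"
    return (status, (proc.stdout + proc.stderr).strip())
```

For example, `run_check(CUDA_CHECK, timeout_s=180)` inside each VM would distinguish a clean True/False answer from the roughly 2-minute hang described above.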
So CUDA works on whichever VM I start first. If I stop both and start the second one alone, CUDA works on it. That is strange, because even though nvidia-smi sees the GPUs in both VMs, CUDA only works on the VM I started first. Have you ever experienced this issue? What can I try?
Thank you very much for any suggestion you might have!