GPU Passthrough fails - CUDA not working on second started VM

danpc

New Member
Feb 24, 2024
Hello,

I am working on a server (PowerEdge XE8545) with the latest version of Proxmox. It has 4 GPUs (NVIDIA A100-SXM4-40GB).

I followed the instructions on your site, enabling IOMMU Passthrough Mode and the appropriate kernel modules.

I also added this configuration so that the vfio-pci driver binds to the GPUs instead of the nouveau driver:
Code:
root@compute02:~# cat /etc/modprobe.d/vfio.conf
softdep nouveau pre: vfio-pci
options vfio-pci ids=10de:20b0
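
One way to confirm that the configuration above took effect is to read the driver binding of each GPU from sysfs. The following is a minimal sketch, not part of the original setup: the function name `pci_driver_bindings` and its parameters are hypothetical, and it assumes the standard Linux layout under `/sys/bus/pci/devices`, where each device directory has `vendor` and `device` files and a `driver` symlink when a driver is bound.

```python
import os
from pathlib import Path


def pci_driver_bindings(vendor_device="10de:20b0", sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: driver_name_or_None} for devices matching vendor:device.

    vendor_device uses the same "vendor:device" form as the ids= option in vfio.conf.
    sysfs_root is a parameter only so the function can be pointed at a test directory.
    """
    vendor, device = (f"0x{part.lower()}" for part in vendor_device.split(":"))
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    bindings = {}
    for dev in sorted(root.iterdir()):
        try:
            dev_vendor = (dev / "vendor").read_text().strip().lower()
            dev_device = (dev / "device").read_text().strip().lower()
        except OSError:
            continue  # not a PCI device directory, or not readable
        if (dev_vendor, dev_device) != (vendor, device):
            continue
        driver_link = dev / "driver"
        # The "driver" symlink only exists while a kernel driver is bound.
        if driver_link.is_symlink():
            bindings[dev.name] = os.path.basename(os.readlink(driver_link))
        else:
            bindings[dev.name] = None
    return bindings


if __name__ == "__main__":
    for address, driver in pci_driver_bindings().items():
        print(f"{address}: {driver or 'no driver bound'}")
```

Run on the host, each of the four A100s would be expected to report vfio-pci if the modprobe options were applied; any other driver name would suggest the binding did not happen as intended.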

Code:
root@compute02:~# dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
[    1.353420] AMD-Vi: Using global IVHD EFR:0x59f77efa2094ade, EFR2:0x0
[    2.001628] pci 0000:60:00.2: AMD-Vi: IOMMU performance counters supported
[    2.002787] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    2.003871] pci 0000:20:00.2: AMD-Vi: IOMMU performance counters supported
[    2.005300] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    2.007914] pci 0000:e0:00.2: AMD-Vi: IOMMU performance counters supported
[    2.009172] pci 0000:c0:00.2: AMD-Vi: IOMMU performance counters supported
[    2.010837] pci 0000:a0:00.2: AMD-Vi: IOMMU performance counters supported
[    2.011869] pci 0000:80:00.2: AMD-Vi: IOMMU performance counters supported
[    2.013079] pci 0000:60:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013081] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013090] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013091] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013098] pci 0000:20:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013099] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013106] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013107] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013113] pci 0000:e0:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013114] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013121] pci 0000:c0:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013122] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013128] pci 0000:a0:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013129] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013136] pci 0000:80:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013137] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013143] AMD-Vi: Interrupt remapping enabled
[    2.013144] AMD-Vi: X2APIC enabled
[    2.014448] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    2.014456] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    2.014463] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[    2.014470] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
[    2.014478] perf/amd_iommu: Detected AMD IOMMU #4 (2 banks, 4 counters/bank).
[    2.014485] perf/amd_iommu: Detected AMD IOMMU #5 (2 banks, 4 counters/bank).
[    2.014492] perf/amd_iommu: Detected AMD IOMMU #6 (2 banks, 4 counters/bank).
[    2.014499] perf/amd_iommu: Detected AMD IOMMU #7 (2 banks, 4 counters/bank).


The VMs initially used the i440fx machine type with OVMF. I later switched to q35 with OVMF and PCIe. The guests run Ubuntu 20.04 and Ubuntu 22.04. The problem is the same in both configurations.

[Attached screenshots: 1708779723002.png, 1708779702337.png]

I assign 1 GPU to the Ubuntu 22.04 VM and 3 GPUs to the Ubuntu 20.04 VM. Whichever VM I start first, everything seems to work: nvidia-smi sees all the GPUs and torch.cuda.is_available() returns True. However, when I start the second VM, it takes about 5 minutes before it begins booting. Once it is up, nvidia-smi sees all the GPUs, but torch.cuda.is_available() hangs for about 2 minutes and then reports that it does not detect CUDA. If I run nvidia-smi while that call is hanging, it also hangs and then shows ERROR instead of the usual values.

So CUDA works on whichever VM I start first. If I stop both and start the second one, CUDA works on it. That's strange, because even though nvidia-smi sees the GPUs, CUDA only works on the VM I start first. Have you ever experienced this issue? What can I try?
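
To make the hang described above easier to reproduce and time, the CUDA check can be run in a child process with a timeout, so a stuck call does not block the shell. This is only a sketch under assumptions: `run_probe` and `CUDA_PROBE` are hypothetical names, and it assumes PyTorch is installed in the guest's Python environment. If the child is stuck inside the driver in an uninterruptible state, killing it may itself be delayed.

```python
import subprocess
import sys

# The same check as in the post, executed in a separate interpreter.
CUDA_PROBE = "import torch; print(torch.cuda.is_available())"


def run_probe(code=CUDA_PROBE, timeout_s=60.0, python=sys.executable):
    """Run `code` in a child Python process and return its stdout.

    Returns "timeout" if the child does not finish within timeout_s seconds,
    and "error" if it exits with a non-zero status (for example, torch missing).
    """
    try:
        result = subprocess.run(
            [python, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if result.returncode != 0:
        return "error"
    return result.stdout.strip()
```

Calling `run_probe(timeout_s=180)` inside each guest, once as the first-started VM and once as the second-started VM, would give a comparable record of whether the check returns, fails, or times out.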

Thank you very much for any suggestion you might have!
 
Are they part of the same IOMMU group?

https://pve.proxmox.com/wiki/PCI_Passthrough

Check the section called "Verify IOMMU isolation".

Might need to separate them.
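
As a rough illustration of the isolation check, the groups can also be read directly from sysfs on the host. This is a sketch, not the method from the wiki page: `iommu_groups` and `shared_groups` are hypothetical helper names, and it assumes the usual Linux layout `/sys/kernel/iommu_groups/<group>/devices/<pci-address>`.

```python
from pathlib import Path


def iommu_groups(sysfs_root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} sorted by group number."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}  # IOMMU disabled, or not running on a Linux host
    groups = {}
    for group in root.iterdir():
        devices_dir = group / "devices"
        if group.name.isdigit() and devices_dir.is_dir():
            groups[int(group.name)] = sorted(d.name for d in devices_dir.iterdir())
    return dict(sorted(groups.items()))


def shared_groups(groups, addresses):
    """Return the groups in which a listed device is not alone."""
    wanted = set(addresses)
    return {
        number: devices
        for number, devices in groups.items()
        if wanted & set(devices) and len(devices) > 1
    }


if __name__ == "__main__":
    for number, devices in iommu_groups().items():
        print(f"group {number}: {', '.join(devices)}")
```

Passing the GPU addresses to `shared_groups` returns an empty dict when every GPU sits alone in its group, which is the case that allows each one to be passed to a different VM.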

Thank you for your reply. They are in separate groups:
Code:
┌──────────┬────────┬──────────────┬────────────┬────────┬─────────────────────────────────────────────────────────┬──────┬───────────────
│ class    │ device │ id           │ iommugroup │ vendor │ device_name                                             │ mdev │ subsystem_devi
╞══════════╪════════╪══════════════╪════════════╪════════╪═════════════════════════════════════════════════════════╪══════╪═══════════════
├──────────┼────────┼──────────────┼────────────┼────────┼─────────────────────────────────────────────────────────┼──────┼───────────────
│ 0x030200 │ 0x20b0 │ 0000:01:00.0 │         58 │ 0x10de │ GA100 [A100 SXM4 40GB]                                  │      │ 0x144e      
├──────────┼────────┼──────────────┼────────────┼────────┼─────────────────────────────────────────────────────────┼──────┼───────────────
│ 0x030200 │ 0x20b0 │ 0000:41:00.0 │         23 │ 0x10de │ GA100 [A100 SXM4 40GB]                                  │      │ 0x144e      
├──────────┼────────┼──────────────┼────────────┼────────┼─────────────────────────────────────────────────────────┼──────┼───────────────
│ 0x030200 │ 0x20b0 │ 0000:81:00.0 │        120 │ 0x10de │ GA100 [A100 SXM4 40GB]                                  │      │ 0x144e      
├──────────┼────────┼──────────────┼────────────┼────────┼─────────────────────────────────────────────────────────┼──────┼───────────────
│ 0x030200 │ 0x20b0 │ 0000:c1:00.0 │         90 │ 0x10de │ GA100 [A100 SXM4 40GB]                                  │      │ 0x144e