GPU Passthrough fails - CUDA not working on second started VM

danpc

Feb 24, 2024
Hello,

I am working on a server (PowerEdge XE8545) with the latest version of Proxmox. It has 4 GPUs (NVIDIA A100-SXM4-40GB).

I followed the instructions on your site, enabling IOMMU Passthrough Mode and the appropriate kernel modules.

I also did the following to bind the vfio-pci driver instead of the nouveau driver:
Code:
root@compute02:~# cat /etc/modprobe.d/vfio.conf
softdep nouveau pre: vfio-pci
options vfio-pci ids=10de:20b0
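As a sanity check before starting any VM, it is worth confirming that every A100 really ended up bound to vfio-pci rather than nouveau or nvidia (`lspci -nnk -d 10de:20b0` shows the same information). A small sketch of that check; the helper name is made up, and the sysfs path is parameterized only so the logic can be exercised against a fake directory tree:

```python
from pathlib import Path

def driver_bindings(vendor="0x10de", sysfs="/sys/bus/pci/devices"):
    """Return {pci_address: driver_name} for all PCI devices of one vendor.

    Illustrative helper, not Proxmox tooling: on a passthrough host,
    every GPU intended for a VM should report 'vfio-pci' here.
    """
    bindings = {}
    for dev in Path(sysfs).iterdir():
        vendor_file = dev / "vendor"
        if vendor_file.is_file() and vendor_file.read_text().strip() == vendor:
            driver_link = dev / "driver"  # symlink to the bound driver, if any
            bindings[dev.name] = driver_link.resolve().name if driver_link.exists() else None
    return bindings
```

An address mapping to `None` means no driver claimed the device at all, which is also fine for passthrough; `nouveau` or `nvidia` would mean the softdep did not take effect (e.g. the initramfs was not rebuilt after editing `/etc/modprobe.d/vfio.conf`).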

Code:
root@compute02:~# dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
[    1.353420] AMD-Vi: Using global IVHD EFR:0x59f77efa2094ade, EFR2:0x0
[    2.001628] pci 0000:60:00.2: AMD-Vi: IOMMU performance counters supported
[    2.002787] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    2.003871] pci 0000:20:00.2: AMD-Vi: IOMMU performance counters supported
[    2.005300] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    2.007914] pci 0000:e0:00.2: AMD-Vi: IOMMU performance counters supported
[    2.009172] pci 0000:c0:00.2: AMD-Vi: IOMMU performance counters supported
[    2.010837] pci 0000:a0:00.2: AMD-Vi: IOMMU performance counters supported
[    2.011869] pci 0000:80:00.2: AMD-Vi: IOMMU performance counters supported
[    2.013079] pci 0000:60:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013081] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013090] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013091] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013098] pci 0000:20:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013099] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013106] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013107] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013113] pci 0000:e0:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013114] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013121] pci 0000:c0:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013122] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013128] pci 0000:a0:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013129] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013136] pci 0000:80:00.2: AMD-Vi: Found IOMMU cap 0x40
[    2.013137] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    2.013143] AMD-Vi: Interrupt remapping enabled
[    2.013144] AMD-Vi: X2APIC enabled
[    2.014448] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    2.014456] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    2.014463] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[    2.014470] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
[    2.014478] perf/amd_iommu: Detected AMD IOMMU #4 (2 banks, 4 counters/bank).
[    2.014485] perf/amd_iommu: Detected AMD IOMMU #5 (2 banks, 4 counters/bank).
[    2.014492] perf/amd_iommu: Detected AMD IOMMU #6 (2 banks, 4 counters/bank).
[    2.014499] perf/amd_iommu: Detected AMD IOMMU #7 (2 banks, 4 counters/bank).


As for the VM, at first it used the i440fx machine type with OVMF. Later, I changed to q35 with OVMF and PCIe passthrough. My guest VMs run Ubuntu 20.04 and Ubuntu 22.04. The problem remains the same.
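For reference, the relevant lines of the q35/OVMF setup in `/etc/pve/qemu-server/<vmid>.conf` would look roughly like this (the PCI address is a placeholder; adapt it to your own GPU mapping):

```
# illustrative excerpt, not the actual config from this server
machine: q35
bios: ovmf
hostpci0: 0000:01:00.0,pcie=1
```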


I assign 1 GPU to the Ubuntu 22.04 VM and 3 GPUs to the Ubuntu 20.04 VM. Whichever VM I start first works: nvidia-smi sees all of its GPUs, and torch.cuda.is_available() returns True. However, when I start the second VM, it takes about 5 minutes before it begins booting. After that, nvidia-smi still sees all the GPUs, but torch.cuda.is_available() hangs for about 2 minutes and then reports that it does not detect CUDA. If I run nvidia-smi during that hanging period, it also hangs and then returns ERROR instead of the values.

So CUDA works on whichever VM I start first. If I stop both and start the second one, CUDA works on it instead. That's strange, because even though nvidia-smi sees the GPUs, CUDA only works on the VM I start first... Have you ever experienced this issue? What can I try?
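One way to make this kind of diagnosis less painful: run the CUDA probe in a child process with a timeout, so a wedged driver call cannot freeze the whole session. A generic sketch (the helper name is invented; in practice `fn` would be `torch.cuda.is_available`):

```python
import multiprocessing
import queue

def probe_with_timeout(fn, timeout=30.0):
    """Run fn() in a child process; return its result, or None on timeout.

    When passthrough is broken, CUDA calls such as
    torch.cuda.is_available() can block for minutes. Isolating the call
    in a separate process lets us kill it instead of waiting.
    """
    ctx = multiprocessing.get_context("fork")  # fork: fn need not be picklable
    result_q = ctx.Queue()
    proc = ctx.Process(target=lambda q: q.put(fn()), args=(result_q,))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # probe is stuck -- kill it rather than wait
        proc.join()
        return None
    try:
        return result_q.get(timeout=1.0)
    except queue.Empty:  # child died without reporting a result
        return None
```

With this, a hung second VM shows up as `None` after `timeout` seconds instead of an open-ended stall, which makes it easier to script repeated start-order experiments.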

Thank you very much for any suggestion you might have!
 
Are they part of the same IOMMU group?

https://pve.proxmox.com/wiki/PCI_Passthrough

Check the section called "Verify IOMMU isolation".

Might need to separate them.
Thank you for your reply. They are in separate groups:
Code:
┌──────────┬────────┬──────────────┬────────────┬────────┬─────────────────────────────────────────────────────────┬──────┬───────────────
│ class    │ device │ id           │ iommugroup │ vendor │ device_name                                             │ mdev │ subsystem_devi
╞══════════╪════════╪══════════════╪════════════╪════════╪═════════════════════════════════════════════════════════╪══════╪═══════════════
├──────────┼────────┼──────────────┼────────────┼────────┼─────────────────────────────────────────────────────────┼──────┼───────────────
│ 0x030200 │ 0x20b0 │ 0000:01:00.0 │         58 │ 0x10de │ GA100 [A100 SXM4 40GB]                                  │      │ 0x144e      
├──────────┼────────┼──────────────┼────────────┼────────┼─────────────────────────────────────────────────────────┼──────┼───────────────
│ 0x030200 │ 0x20b0 │ 0000:41:00.0 │         23 │ 0x10de │ GA100 [A100 SXM4 40GB]                                  │      │ 0x144e      
├──────────┼────────┼──────────────┼────────────┼────────┼─────────────────────────────────────────────────────────┼──────┼───────────────
│ 0x030200 │ 0x20b0 │ 0000:81:00.0 │        120 │ 0x10de │ GA100 [A100 SXM4 40GB]                                  │      │ 0x144e      
├──────────┼────────┼──────────────┼────────────┼────────┼─────────────────────────────────────────────────────────┼──────┼───────────────
│ 0x030200 │ 0x20b0 │ 0000:c1:00.0 │         90 │ 0x10de │ GA100 [A100 SXM4 40GB]                                  │      │ 0x144e
 
