nVidia A100-80G / Q35 / PCIE=YES query :)

Manxmann
Renowned Member · Aug 28, 2014
Hey Folks,

Quick question: we have a Dell 750xa configured with 4x A100-80G cards, IOMMU etc. is configured, and everything appears to be working, yay!

So what's my problem?

If I configure a Q35 VM (latest machine version) with 2x GPUs, with ROMBAR and PCIE enabled in the config for both GPUs, the VM will eventually boot. lspci shows the 2 nVidia 3D accelerators, however nvidia-smi only shows 1. Checking the nvidia-persistenced journal shows errors connecting to the 1st card, but it then binds happily to the 2nd.

Now if I uncheck the PCIE option on both cards, everything in the VM works perfectly: both cards are seen, and running a quick test like gpu-burn uses all cards and reports everything as OK.
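To be clear, the only thing changing between the broken and working case is the pcie flag on the hostpci lines (mapping names as in my config below), i.e.:

# broken case (PCIE box checked for both cards)
hostpci0: mapping=A100_GPU1,pcie=1
hostpci1: mapping=A100_GPU4,pcie=1

# working case (PCIE box unchecked)
hostpci0: mapping=A100_GPU1
hostpci1: mapping=A100_GPU4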

On an older HP DL380 G8 with V100-16G cards, setting PCIE on or off makes no difference and all cards work.

So my question is: what does the PCIE option actually change that breaks the guest driver for the A100s, and what impact does not setting it have?

Cheers

Context info:

root@imgpu3:~# cat /etc/pve/qemu-server/2000.conf
agent: 1
balloon: 0
boot: order=scsi0;ide2;net0
cores: 4
cpu: host,flags=+pdpe1gb
hostpci0: mapping=A100_GPU1,pcie=1
hostpci1: mapping=A100_GPU4,pcie=1
hostpci2: mapping=A100_GPU2,pcie=1
hostpci3: mapping=A100_GPU3,pcie=1
hugepages: 1024
ide2: local:iso/debian-12.8.0-amd64-netinst.iso,media=cdrom,size=631M
machine: q35,viommu=intel
memory: 81920
meta: creation-qemu=9.0.2,ctime=1731958536
name: CUSTB-GPU-ENABLED
net0: virtio=BC:24:11:68:4A:F6,bridge=vmbr1,tag=30
numa: 1
ostype: l26
scsi0: DATA:vm-2000-disk-0,iothread=1,size=200G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=09b64a6f-be5a-4792-97f2-53cc3a4405ad
sockets: 2
vmgenid: cb8fb0cb-b9a5-461e-84da-400850b2c5b8

root@imgpu3:~# cat /proc/cmdline
initrd=\EFI\proxmox\6.8.12-4-pve\initrd.img-6.8.12-4-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt hugepagesz=1G hugepages=0:200,1:200 default_hugepagesz=1G
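For completeness, a quick sanity check that the 1G pages actually got allocated on both NUMA nodes (each should report 200 with the cmdline above):

root@imgpu3:~# cat /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages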

root@imgpu3:~# pvesh get /nodes/imgpu3/hardware/pci --pci-class-blacklist "" | grep A100

│ 0x030200 │ 0x20b5 │ 0000:17:00.0 │ 12 │ 0x10de │ GA100 [A100 PCIe 80GB] │ │ 0x1533 │ │ 0x10de │ NVIDIA Corporation │ NVIDIA Corporation
│ 0x030200 │ 0x20b5 │ 0000:65:00.0 │ 1 │ 0x10de │ GA100 [A100 PCIe 80GB] │ │ 0x153 │ │ 0x10de │ NVIDIA Corporation │ NVIDIA Corporation
│ 0x030200 │ 0x20b5 │ 0000:ca:00.0 │ 16 │ 0x10de │ GA100 [A100 PCIe 80GB] │ │ 0x1533 │ │ 0x10de │ NVIDIA Corporation │ NVIDIA Corporation
│ 0x030200 │ 0x20b5 │ 0000:e3:00.0 │ 14 │ 0x10de │ GA100 [A100 PCIe 80GB] │ │ 0x1533 │ │ 0x10de │ NVIDIA Corporation │ NVIDIA Corporation


From the guest VM:

lspci
...
01:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
02:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
...

root@custagpu:~# nvidia-smi
Thu Nov 21 10:17:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:02:00.0 Off | 0 |
| N/A 33C P0 41W / 300W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+


root@custagpu:~/gpu-burn# ./gpu_burn -l
ID 0: NVIDIA A100 80GB PCIe, 85097MB
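For reference, this is roughly how I'm pulling the driver-side errors inside the guest: the nvidia-persistenced journal mentioned above, plus the kernel log for NVRM messages (probably also worth a look):

root@custagpu:~# journalctl -b -u nvidia-persistenced
root@custagpu:~# dmesg | grep -i nvrm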



P.S. Any tips on shortening the ROM BAR allocation time? Each A100 adds 4 minutes to my VM boot time!!
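The only knobs I've found so far for the ROM BAR side (untested on my end) are rombar=0 to skip the option ROM entirely, or dumping the ROM once on the host and pointing romfile= at the file; a100-17.rom below is just a placeholder name:

# per card, in the VM config:
hostpci0: mapping=A100_GPU1,pcie=1,rombar=0

# or dump the ROM on the host and reference it (file goes in /usr/share/kvm/):
echo 1 > /sys/bus/pci/devices/0000:17:00.0/rom
cat /sys/bus/pci/devices/0000:17:00.0/rom > /usr/share/kvm/a100-17.rom
echo 0 > /sys/bus/pci/devices/0000:17:00.0/rom
# then: hostpci0: mapping=A100_GPU1,pcie=1,romfile=a100-17.rom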
 