PCI Passthrough with NVIDIA DGX A100 80GB: 4 VMs, GPU only works on one

vcasadei · May 9, 2023

Hi, I'm new to the forum, but am a long term user of Proxmox and really, really need help, because I don't have any clue of what I should do to make this work.

I'll try to make it short: I followed the official documentation (https://pve.proxmox.com/wiki/PCI_Passthrough) and did all steps necessary:

I'm working on a beast of a server with:

112 x Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (2 Sockets)
400GB of RAM
4 x A100 80GB (DGX card with 4 cards connected via NVLink) (the docs are here: https://docs.nvidia.com/datacenter/tesla/hgx-software-guide/index.html#abstract)

My goal is to have a Proxmox server configured with 4 VMs: each VM having its own GPU via PCI Passthrough, so I created 4 identical VMs, all running Ubuntu Server 22.04.2LTS, a good amount of RAM and Processors. Also the BIOS is OVMF (UEFI) and Machine q35. You can check each configuration on the images below:

As you can see below, the result of lspci on the host machine, I have 4 GPUs showing, they all have separate PCI slots showing.

Bash:

0000:17:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
    Subsystem: NVIDIA Corporation Device 147f
    Physical Slot: 5
    Flags: fast devsel, IRQ 18, NUMA node 0
    Memory at d4000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 24000000000 (64-bit, prefetchable) [size=128G]
    Memory at 27428000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] Null
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Capabilities: [bb0] Physical Resizable BAR
    Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
    Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00] Lane Margining at the Receiver <?>
    Capabilities: [e00] Data Link Feature <?>
    Kernel modules: nvidiafb, nouveau

0000:31:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
    Subsystem: NVIDIA Corporation Device 147f
    Physical Slot: 6
    Flags: fast devsel, IRQ 18, NUMA node 0
    Memory at d8000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 28000000000 (64-bit, prefetchable) [size=128G]
    Memory at 2b428000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] Null
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Capabilities: [bb0] Physical Resizable BAR
    Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
    Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00] Lane Margining at the Receiver <?>
    Capabilities: [e00] Data Link Feature <?>
    Kernel modules: nvidiafb, nouveau

0000:b1:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
    Subsystem: NVIDIA Corporation Device 147f
    Physical Slot: 3
    Flags: fast devsel, IRQ 18, NUMA node 1
    Memory at ee000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 3c000000000 (64-bit, prefetchable) [size=128G]
    Memory at 3f428000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] Null
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Capabilities: [bb0] Physical Resizable BAR
    Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
    Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00] Lane Margining at the Receiver <?>
    Capabilities: [e00] Data Link Feature <?>
    Kernel modules: nvidiafb, nouveau

0000:ca:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
    Subsystem: NVIDIA Corporation Device 147f
    Physical Slot: 4
    Flags: fast devsel, IRQ 18, NUMA node 1
    Memory at f2000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 40000000000 (64-bit, prefetchable) [size=128G]
    Memory at 43428000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] Null
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Capabilities: [bb0] Physical Resizable BAR
    Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
    Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00] Lane Margining at the Receiver <?>
    Capabilities: [e00] Data Link Feature <?>
    Kernel modules: nvidiafb, nouveau

However, there is an issue, that I think might be the root of my problems: When I run lspci -n -s ID to get the vendor id and set it on /etc/modprobe.d/vfio.conf as shown on the documentation here: https://pve.proxmox.com/wiki/PCI_Passthrough, I get the same vendor id to all the GPUs:

Bash:

0000:17:00.0 0302: 10de:20b2 (rev a1)
0000:b1:00.0 0302: 10de:20b2 (rev a1)
0000:31:00.0 0302: 10de:20b2 (rev a1)
0000:ca:00.0 0302: 10de:20b2 (rev a1)

I configured each GPU to it's own VM, and again configured it and it's shwon on the documentation I cited above (you can check the configuration on the spoiler below:

Ok, now that I showed all the configuration, I can tell about the VM installation: I installed CUDA-Toolkit version 12.1 and NVidia Driver version 530.30.02, but I have also tried everything witj CUDA 11.8 and 11.7 and drivers 525 and 520 for example.

In all machines I get the GPU recognized on nvidia-smi:

Bash:

$ nvidia-smi
Mon May  8 23:18:50 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           On | 00000000:01:00.0 Off |                    0 |
| N/A   33C    P0               73W / 500W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

However, this is when the problems beggin:
I use cuda-samples (https://github.com/NVIDIA/cuda-samples) to check my installation and when I run deviceQuery, on all but the first VM, I get the following error. These errors happen only on the other VMs, the first one I created works fine and I can use the GPU:

Bash:

$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL

Also, deviceQueryDrv returns a similar error:

Bash:

$ ./deviceQueryDrv
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version
checkCudaErrors() Driver API error = 0003 "initialization error" from file <deviceQueryDrv.cpp>, line 54.

And of course bandwidthTest:

Bash:

$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

cudaGetDeviceProperties returned 3
-> initialization error
CUDA error at bandwidthTest.cu:256 code=3(cudaErrorInitializationError) "cudaSetDevice(currentDevice)"

Also I tried installing Pytorch and Tensorflow. The installation goes without a problem. For instance, I used Pytorch 2.0 on conda, following the directions at the official website (https://pytorch.org/get-started/locally/). However, when I try to test the installation, I get the following error as well (again, this happens only on the other three VMs, the first one works without a problem):

Bash:

~$ python
Python 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>> available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
>>> available_gpus
[]
>>> torch.cuda.device_count()
0

Now, this is the whole story and I'm at a loss as to why it works on the first VM and not on the other three. And this is when I ask for your help.

Again, I suspect that it has something to do with the way the GPUs are from the DGX configuration and having the same vendor_id, but if this is the problem, I also don't know how to overcome it and found nothing online about this problem on a similar hardware configuration.

If anyone can help me fix this or if I somehow solve this problem myself, I will create a blogpost and documentation about it.

Please, I need some help.

noel. · May 9, 2023

Hi,

Could you send the output of pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist "" to see if the GPUs are in separate IOMMU groups as they should be?

Also, please send the output of journalctl including the timestamps when you try running something like /deviceQuery.

vcasadei · May 9, 2023

noel. said:
Hi,

Could you send the output of pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist "" to see if the GPUs are in separate IOMMU groups as they should be?

Also, please send the output of journalctl including the timestamps when you try running something like /deviceQuery.

I tried running the command on the host machine, but did not get a good output:

Bash:

$ pvesh get /nodes/100/hardware/pci --pci-class-blacklist ""
ipcc_send_rec[1] failed: Is a directory
ipcc_send_rec[2] failed: Is a directory
ipcc_send_rec[3] failed: Is a directory
Unable to load access control list: Is a directory

# with sudo

$ sudo pvesh get /nodes/101/hardware/pci --pci-class-blacklist ""
proxy handler failed: ssh: connect to host 0.0.0.101 port 22: Connection timed out

_gabriel · May 9, 2023

if I'm not read too fast,
nodename is not the vmid !

kingwilliam · May 10, 2023

Why you can see 4 physical card, but only first one can use?
because, you do not have enable "lane replication via PLX" function on your bios, or your mainboard does NOT have any lane replication via PLX.
if your mainboard does not have PLX function, you cannot separate 4card from 1 physical card.
I have same problem on chelsio T520 10G NIC, It have 8card inside 1physical card, but my mainboard does not have PLX function, so.....i cannot seprate 8card to VM.

if your mainboard support PLX, you can found some setting like this

BUT, I suppose you read below two link.
1. "NVIDIA vGPU on Proxmox"
https://gitlab.com/polloloco/vgpu-proxmox#nvidia-vgpu-on-proxmox

2. if you have interest on upper link, I think you have same interest on next link
https://gitea.publichub.eu/oscar.krause/fastapi-dls

noel. · May 11, 2023

vcasadei said:

I tried running the command on the host machine, but did not get a good output:

Bash:

$ pvesh get /nodes/100/hardware/pci --pci-class-blacklist ""
ipcc_send_rec[1] failed: Is a directory
ipcc_send_rec[2] failed: Is a directory
ipcc_send_rec[3] failed: Is a directory
Unable to load access control list: Is a directory

# with sudo

$ sudo pvesh get /nodes/101/hardware/pci --pci-class-blacklist ""
proxy handler failed: ssh: connect to host 0.0.0.101 port 22: Connection timed out

As @_gabriel mentioned, this is not the nodename, that is the VM-id you plugged in. You can see the names of your node(s) by running sudo pvesh get nodes.

Also, please attach some journal logs

noel. · May 11, 2023

kingwilliam said:
Why you can see 4 physical card, but only first one can use?
because, you do not have enable "lane replication via PLX" function on your bios, or your mainboard does NOT have any lane replication via PLX.
if your mainboard does not have PLX function, you cannot separate 4card from 1 physical card.
I have same problem on chelsio T520 10G NIC, It have 8card inside 1physical card, but my mainboard does not have PLX function, so.....i cannot seprate 8card to VM.

if your mainboard support PLX, you can found some setting like this
View attachment 50199

BUT, I suppose you read below two link.
1. "NVIDIA vGPU on Proxmox"
https://gitlab.com/polloloco/vgpu-proxmox#nvidia-vgpu-on-proxmox

2. if you have interest on upper link, I think you have same interest on next link
https://gitea.publichub.eu/oscar.krause/fastapi-dls

As far as I understand @vcasadei is running a HGX server which has 4 physical GPUs on it. They are not trying to split one physical GPU into 4 vGPUs. So they should not enable lane replication.

ychto · May 18, 2023

According to Nvidia's GRID documentation:
"All GPUs directly connected to each other through NVLink must be assigned to the same VM."

https://docs.nvidia.com/grid/14.0/g...ndex.html#preparing-vgpu-gpu-for-pass-through

noel. · May 19, 2023

ychto said:
According to Nvidia's GRID documentation:
"All GPUs directly connected to each other through NVLink must be assigned to the same VM."

https://docs.nvidia.com/grid/14.0/g...ndex.html#preparing-vgpu-gpu-for-pass-through

good hint, thank you.

I am guessing you will need to split your GPUs into vGPUs to pass them through individually

JasonFeng · Jan 19, 2024

I got the same problem now ,on a supermicro server with 4 A100s .Also did the follow neccesary steps,here's the output of pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""

dcsapak · Jan 19, 2024

JasonFeng said:
I got the same problem now ,on a supermicro server with 4 A100s .Also did the follow neccesary steps,here's the output of pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""

the iommugroup -1 indicates that IOMMU/VT-d/AMD-v is not enabled, did you do so in your bios and on the kernel commandline? see https://pve.proxmox.com/wiki/PCI(e)_Passthrough

leesteken · Jan 19, 2024

JasonFeng said:
I got the same problem now ,on a supermicro server with 4 A100s .Also did the follow neccesary steps,here's the output of pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""

What is the output of cat /proc/cmdline? Is intel_iommu=on active? What is the motherboard chipset and the CPU? Do they both support Intel VT-d and it is enabled in the motherboard BIOS?

JasonFeng · Jan 20, 2024

leesteken said:
What is the output of cat /proc/cmdline? Is intel_iommu=on active? What is the motherboard chipset and the CPU? Do they both support Intel VT-d and it is enabled in the motherboard BIOS?

output of /proc/cmdline

I'm not using passthrough on VM,I just want LXCs can use GPUs.So I didn't enable iommu on my server.I'm sure Intel VT-D is enabled in BIOS

leesteken · Jan 20, 2024

JasonFeng said:
output of /proc/cmdlineView attachment 61783
I'm not using passthrough on VM,I just want LXCs can use GPUs.So I didn't enable iommu on my server.I'm sure Intel VT-D is enabled in BIOS

intel_iommu=on is not present and therefore IOMMU is not enabled in Proxmox, but you only need that for VMs.
For containers (LXC), you do not need IOMMU, so everything look fine. I guess I just got the wrong impression of the -1 values in the screenshot you showed.
Maybe start a new thread because this one originally was about passthrough to VMs with IOMMU?

JasonFeng · Feb 5, 2024

Same thing happened again when pass through H100s GPU card into VM,here’s the output of pvesh get /nodes/node/hardware/pci command

JasonFeng · Feb 5, 2024

Error occurs when VM machine install Nvidia driver and got cannot connect to driver, but lspci | grep Nvidia could find pci device on VM

dcsapak · Feb 6, 2024

JasonFeng said:
Error occurs when VM machine install Nvidia driver and got cannot connect to driver, but lspci | grep Nvidia could find pci device on VM

whats the exact error in the vm ?

JasonFeng · Feb 6, 2024

dcsapak said:
whats the exact error in the vm ?

on windows ,host may frozen and crash to reboot when installing driver.on ubuntu 20.04/22.04 ,nvidia-smi return a can not communicate to driver log,even the latest driver have been installed

JasonFeng · Feb 6, 2024

dcsapak said:
whats the exact error in the vm ?

Feb 06 20:39:41 H100 kernel: vfio-pci 0000:98:00.0: vfio_bar_restore: reset recovery - restoring BARs
this log appear when VM freeze and host reboot,which 0000:98:00.0 is the H100 gpu's pcie bus

dcsapak · Feb 6, 2024

can you post the dmesg log from the linux vm ?

PCI Passthrough with NVIDIA DGX A100 80GB: 4 VMs, GPU only works on one

New Member

Active Member

New Member

Well-Known Member

New Member

Active Member

Active Member

New Member

Active Member

New Member

Proxmox Staff Member

Distinguished Member

New Member

Distinguished Member

New Member

New Member

Proxmox Staff Member

New Member

New Member

Attachments

Proxmox Staff Member