PCI Passthrough with NVIDIA DGX A100 80GB: 4 VMs, GPU only works on one

vcasadei

May 9, 2023

Hi, I'm new to the forum but a long-time Proxmox user, and I really, really need help, because I have no clue what I should do to make this work.

I'll try to keep it short: I followed the official documentation (https://pve.proxmox.com/wiki/PCI_Passthrough) and did all the necessary steps:

  1. https://pve.proxmox.com/wiki/PCI_Passthrough#Enable_the_IOMMU
  2. https://pve.proxmox.com/wiki/PCI_Passthrough#Required_Modules
  3. https://pve.proxmox.com/wiki/PCI_Passthrough#IOMMU_Interrupt_Remapping
  4. https://pve.proxmox.com/wiki/PCI_Passthrough#Verify_IOMMU_Isolation
  5. etc ...
I'm working on a beast of a server with:
My goal is a Proxmox server with 4 VMs, each with its own GPU via PCI passthrough. I created 4 identical VMs, all running Ubuntu Server 22.04.2 LTS with a good amount of RAM and CPU cores. The BIOS is OVMF (UEFI) and the machine type is q35. You can check each configuration in the images below:

[Images: vm1.png, vm2.png, vm3.png, vm4.png]

As the lspci output from the host below shows, all 4 GPUs are visible, and each one sits in its own PCI slot.

Bash:
0000:17:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
    Subsystem: NVIDIA Corporation Device 147f
    Physical Slot: 5
    Flags: fast devsel, IRQ 18, NUMA node 0
    Memory at d4000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 24000000000 (64-bit, prefetchable) [size=128G]
    Memory at 27428000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] Null
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Capabilities: [bb0] Physical Resizable BAR
    Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
    Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00] Lane Margining at the Receiver <?>
    Capabilities: [e00] Data Link Feature <?>
    Kernel modules: nvidiafb, nouveau

0000:31:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
    Subsystem: NVIDIA Corporation Device 147f
    Physical Slot: 6
    Flags: fast devsel, IRQ 18, NUMA node 0
    Memory at d8000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 28000000000 (64-bit, prefetchable) [size=128G]
    Memory at 2b428000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] Null
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Capabilities: [bb0] Physical Resizable BAR
    Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
    Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00] Lane Margining at the Receiver <?>
    Capabilities: [e00] Data Link Feature <?>
    Kernel modules: nvidiafb, nouveau

0000:b1:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
    Subsystem: NVIDIA Corporation Device 147f
    Physical Slot: 3
    Flags: fast devsel, IRQ 18, NUMA node 1
    Memory at ee000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 3c000000000 (64-bit, prefetchable) [size=128G]
    Memory at 3f428000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] Null
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Capabilities: [bb0] Physical Resizable BAR
    Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
    Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00] Lane Margining at the Receiver <?>
    Capabilities: [e00] Data Link Feature <?>
    Kernel modules: nvidiafb, nouveau

0000:ca:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
    Subsystem: NVIDIA Corporation Device 147f
    Physical Slot: 4
    Flags: fast devsel, IRQ 18, NUMA node 1
    Memory at f2000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 40000000000 (64-bit, prefetchable) [size=128G]
    Memory at 43428000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] Null
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Capabilities: [bb0] Physical Resizable BAR
    Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
    Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00] Lane Margining at the Receiver <?>
    Capabilities: [e00] Data Link Feature <?>
    Kernel modules: nvidiafb, nouveau

However, there is one thing that I think might be the root of my problems: when I run lspci -n -s ID to get the vendor ID and set it in /etc/modprobe.d/vfio.conf, as shown in the documentation (https://pve.proxmox.com/wiki/PCI_Passthrough), I get the same vendor ID for all the GPUs:

Bash:
0000:17:00.0 0302: 10de:20b2 (rev a1)
0000:b1:00.0 0302: 10de:20b2 (rev a1)
0000:31:00.0 0302: 10de:20b2 (rev a1)
0000:ca:00.0 0302: 10de:20b2 (rev a1)
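
For reference, `lspci -n` prints the class code followed by a vendor:device pair, and that pair identifies the card model rather than the individual card, so four identical GPUs showing the same `10de:20b2` is expected. A minimal sketch, assuming the `options vfio-pci ids=...` line format from the Proxmox wiki, of collapsing the output above into the single entry that vfio.conf needs (the function names are made up for illustration):

```python
import re

# Matches lines such as "0000:17:00.0 0302: 10de:20b2 (rev a1)".
LSPCI_N_LINE = re.compile(
    r"^(?P<addr>\S+)\s+(?P<cls>[0-9a-f]{4}):\s+"
    r"(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})"
)


def unique_ids(lspci_n_output):
    """Return the distinct vendor:device pairs in first-seen order."""
    seen = []
    for line in lspci_n_output.splitlines():
        match = LSPCI_N_LINE.match(line.strip())
        if match:
            pair = f"{match['vendor']}:{match['device']}"
            if pair not in seen:
                seen.append(pair)
    return seen


def vfio_options_line(ids):
    """Build the modprobe options line (hypothetical helper)."""
    return "options vfio-pci ids=" + ",".join(ids)


SAMPLE = """\
0000:17:00.0 0302: 10de:20b2 (rev a1)
0000:b1:00.0 0302: 10de:20b2 (rev a1)
0000:31:00.0 0302: 10de:20b2 (rev a1)
0000:ca:00.0 0302: 10de:20b2 (rev a1)
"""

print(vfio_options_line(unique_ids(SAMPLE)))
```

The ID only tells vfio-pci which device model to claim on the host; which GPU ends up in which VM is decided by the PCI address (0000:17:00.0 and so on) in each VM's hostpci entry.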

I assigned each GPU to its own VM, again configured as shown in the documentation I cited above (you can check the configuration in the screenshots below):

[Images: opera_2zkaoMlOg3.png, opera_cZWcJVe1QN.png, opera_FVRACnkVej.png, opera_Zn6PVbBZ4D.png]

OK, now that I have shown all the configuration, on to the VM installation: I installed CUDA Toolkit 12.1 and NVIDIA driver 530.30.02, but I have also tried everything with CUDA 11.8 and 11.7 and with drivers 525 and 520, for example.

On all machines the GPU is recognized by nvidia-smi:

Bash:
$ nvidia-smi
Mon May  8 23:18:50 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           On | 00000000:01:00.0 Off |                    0 |
| N/A   33C    P0               73W / 500W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

However, this is where the problems begin:
I use cuda-samples (https://github.com/NVIDIA/cuda-samples) to check my installation, and when I run deviceQuery on any VM but the first, I get the following error. The first VM I created works fine and I can use the GPU there:

Bash:
$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL

Also, deviceQueryDrv returns a similar error:
Bash:
$ ./deviceQueryDrv
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version
checkCudaErrors() Driver API error = 0003 "initialization error" from file <deviceQueryDrv.cpp>, line 54.

And of course bandwidthTest:
Bash:
$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

cudaGetDeviceProperties returned 3
-> initialization error
CUDA error at bandwidthTest.cu:256 code=3(cudaErrorInitializationError) "cudaSetDevice(currentDevice)"

I also tried installing PyTorch and TensorFlow. The installation goes through without a problem. For instance, I installed PyTorch 2.0 with conda, following the directions on the official website (https://pytorch.org/get-started/locally/). However, when I test the installation, I get the following result (again, this happens only on the other three VMs; the first one works without a problem):

Bash:
~$ python
Python 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>> available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
>>> available_gpus
[]
>>> torch.cuda.device_count()
0

Now, this is the whole story and I'm at a loss as to why it works on the first VM and not on the other three. And this is when I ask for your help.

Again, I suspect that it has something to do with how the GPUs are set up in the DGX and with all of them having the same vendor_id, but if that is the problem, I don't know how to overcome it, and I found nothing online about this problem on a similar hardware configuration.

If anyone can help me fix this or if I somehow solve this problem myself, I will create a blogpost and documentation about it.

Please, I need some help.
 
Hi,

Could you send the output of pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist "" to see if the GPUs are in separate IOMMU groups as they should be?

Also, please send the output of journalctl, including the timestamps, from when you try running something like ./deviceQuery.
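
As a cross-check that does not depend on pvesh, the kernel exposes the IOMMU groups under /sys/kernel/iommu_groups on the host. A minimal sketch, assuming that sysfs layout (one numbered directory per group, each with a `devices` subdirectory named by PCI address); the function names are hypothetical:

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses in it."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # directory missing: no IOMMU groups to report
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


def shared_groups(groups, addresses):
    """Return only the groups that contain more than one of the addresses."""
    wanted = set(addresses)
    return {
        number: [dev for dev in devices if dev in wanted]
        for number, devices in groups.items()
        if len(wanted.intersection(devices)) > 1
    }
```

Calling `shared_groups(iommu_groups(), ["0000:17:00.0", "0000:31:00.0", "0000:b1:00.0", "0000:ca:00.0"])` on the host should return an empty dict if each GPU from the lspci output has a group of its own. An empty result from `iommu_groups()` itself would suggest the IOMMU is not active.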
 
I tried running the command on the host machine, but did not get any useful output:

Bash:
$ pvesh get /nodes/100/hardware/pci --pci-class-blacklist ""
ipcc_send_rec[1] failed: Is a directory
ipcc_send_rec[2] failed: Is a directory
ipcc_send_rec[3] failed: Is a directory
Unable to load access control list: Is a directory

# with sudo

$ sudo pvesh get /nodes/101/hardware/pci --pci-class-blacklist ""
proxy handler failed: ssh: connect to host 0.0.0.101 port 22: Connection timed out
 
Why can you see 4 physical cards but use only the first one?
Because you have not enabled the "lane replication via PLX" function in your BIOS, or your mainboard does NOT have lane replication via PLX at all.
If your mainboard does not have the PLX function, you cannot separate the 4 cards from 1 physical card.
I have the same problem with a Chelsio T520 10G NIC: it has 8 cards inside 1 physical card, but my mainboard does not have the PLX function, so I cannot split the 8 cards across VMs.

If your mainboard supports PLX, you can find a setting like this:
[Attachment: 4x4.jpg]

You may also want to read the two links below.
1. "NVIDIA vGPU on Proxmox"
https://gitlab.com/polloloco/vgpu-proxmox#nvidia-vgpu-on-proxmox

2. If you are interested in the link above, I think you will also be interested in this one:
https://gitea.publichub.eu/oscar.krause/fastapi-dls
 
As @_gabriel mentioned, what you plugged in is the VM ID, not the node name. You can see the names of your node(s) by running sudo pvesh get nodes.

Also, please attach some journal logs.
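
To make the node name point concrete, here is a small sketch that builds the pvesh call with a node name instead of a VM ID. It assumes the node name equals the host's short hostname, which is the usual Proxmox setup but is not confirmed for this server; `pci_listing_command` and the example name `pve1` are made up for illustration.

```python
import socket


def pci_listing_command(nodename=None):
    """Build the pvesh argument list for the PCI listing of one node."""
    if nodename is None:
        # Assumption: the node is named after the host's short hostname.
        nodename = socket.gethostname().split(".")[0]
    return [
        "pvesh", "get", f"/nodes/{nodename}/hardware/pci",
        "--pci-class-blacklist", "",
    ]


print(" ".join(pci_listing_command("pve1")))
```

The list form can be passed to `subprocess.run` as root on the Proxmox host; the empty string is the same empty blacklist value used in the command quoted above.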
 
As far as I understand, @vcasadei is running an HGX server with 4 physical GPUs in it. They are not trying to split one physical GPU into 4 vGPUs, so they should not enable lane replication.
 
I have the same problem now, on a Supermicro server with 4 A100s. I also followed the necessary steps. Here's the output of pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist "":
[Attachment: 1705662763694.png]
 
The iommugroup value of -1 indicates that the IOMMU (Intel VT-d / AMD-Vi) is not enabled. Did you enable it in your BIOS and on the kernel command line? See https://pve.proxmox.com/wiki/PCI(e)_Passthrough
 
What is the output of cat /proc/cmdline? Is intel_iommu=on active? What is the motherboard chipset and the CPU? Do they both support Intel VT-d and it is enabled in the motherboard BIOS?
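
The command-line question above can also be checked with a few lines of Python on the host. This is only a sketch: `has_kernel_param` is a made-up helper, and the sample string is a hypothetical command line, not output from this server.

```python
def read_cmdline(path="/proc/cmdline"):
    """Return the kernel command line of the running system."""
    with open(path) as handle:
        return handle.read().strip()


def has_kernel_param(cmdline, param):
    """True if param appears as a whole whitespace-separated token."""
    return param in cmdline.split()


# Hypothetical example; on the host, use read_cmdline() instead.
sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/mapper/pve-root ro quiet intel_iommu=on"
print(has_kernel_param(sample, "intel_iommu=on"))
```

Matching whole tokens avoids a false positive from a longer parameter that merely contains the same text.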
 
Output of /proc/cmdline: [Attachment: 1705729712019.png]
I'm not using passthrough to a VM; I just want LXCs to be able to use the GPUs, so I didn't enable IOMMU on my server. I'm sure Intel VT-d is enabled in the BIOS.
 
intel_iommu=on is not present and therefore IOMMU is not enabled in Proxmox, but you only need that for VMs.
For containers (LXC), you do not need IOMMU, so everything looks fine. I guess I just got the wrong impression from the -1 values in the screenshot you showed.
Maybe start a new thread because this one originally was about passthrough to VMs with IOMMU?
 
The same thing happened again when passing an H100 GPU through to a VM. Here's the output of the pvesh get /nodes/node/hardware/pci command:
[Attachment: IMG_0557.jpeg]
 
The error occurs after the NVIDIA driver is installed in the VM: it cannot connect to the driver, although lspci | grep -i nvidia does find the PCI device inside the VM.
 
What's the exact error in the VM?
 
On Windows, the host may freeze and crash into a reboot while the driver is being installed. On Ubuntu 20.04/22.04, nvidia-smi reports that it cannot communicate with the driver, even with the latest driver installed.
 
[Attachment: 1707223418035.png]
Feb 06 20:39:41 H100 kernel: vfio-pci 0000:98:00.0: vfio_bar_restore: reset recovery - restoring BARs
This log line appears when the VM freezes and the host reboots; 0000:98:00.0 is the H100 GPU's PCIe address.
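
When collecting more of these lines for the thread, it may help to filter the host's kernel journal (for example the text from `journalctl -k`) down to the vfio-pci messages for one device. A minimal sketch, assuming the default short journal format seen in the line above; `vfio_events` is a hypothetical helper:

```python
import re

# Matches kernel journal lines in the default short format, e.g.
# "Feb 06 20:39:41 H100 kernel: vfio-pci 0000:98:00.0: <message>".
VFIO_LINE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) kernel: "
    r"vfio-pci (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"(?P<msg>.*)$"
)


def vfio_events(journal_text, address=None):
    """Return (timestamp, PCI address, message) for each vfio-pci line."""
    events = []
    for line in journal_text.splitlines():
        match = VFIO_LINE.match(line.strip())
        if match and (address is None or match["addr"] == address):
            events.append((match["stamp"], match["addr"], match["msg"]))
    return events


LOG = (
    "Feb 06 20:39:41 H100 kernel: vfio-pci 0000:98:00.0: "
    "vfio_bar_restore: reset recovery - restoring BARs"
)
for event in vfio_events(LOG, address="0000:98:00.0"):
    print(event)
```

Keeping the timestamps makes it easier to line the host messages up with the moment the driver installation or nvidia-smi call happened inside the VM.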
 

[Attachment: 1707223425882.png]

Can you post the dmesg log from the Linux VM?
 
