Help: Passing multiple GPUs through to multiple Ubuntu VMs

moyy

New Member
Apr 21, 2024
+ SuperMicro 7048gr-tr
+ GPU: 4 * 2080Ti
+ VM OS: Ubuntu 22.04

I need the GPUs to run deep learning models, so I don't need the NVIDIA cards for VGA output.

When I use only one VM, I can start it and reach the desktop, whether I assign no graphics card or pass through a single graphics card.

But when I run two VMs at the same time, each with a different graphics card assigned, the entire PVE system stops showing any output as soon as the second VM starts.

===================================================

 
root@pve:~# for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done

IOMMU group 119 83:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
IOMMU group 120 83:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
IOMMU group 121 83:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
IOMMU group 122 83:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)

IOMMU group 123 84:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
IOMMU group 124 84:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
IOMMU group 125 84:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
IOMMU group 126 84:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)

IOMMU group 91 02:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
IOMMU group 92 02:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
IOMMU group 93 02:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
IOMMU group 94 02:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)

IOMMU group 95 03:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
IOMMU group 96 03:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
IOMMU group 97 03:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
IOMMU group 98 03:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)

IOMMU group 99 06:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 30)
 
root@pve:~# lspci -nnv | grep VGA -A 22

02:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1) (prog-if 00 [VGA controller])
Subsystem: LeadTek Research Inc. TU102 [GeForce RTX 2080 Ti Rev. A] [107d:1e07]
Physical Slot: 2
Flags: bus master, fast devsel, latency 0, IRQ 11, NUMA node 0, IOMMU group 91
Memory at ef000000 (32-bit, non-prefetchable) [size=16M]
Memory at c0000000 (64-bit, prefetchable) [size=256M]
Memory at d0000000 (64-bit, prefetchable) [size=32M]
I/O ports at 6000 [size=128]
Expansion ROM at 000c0000 [disabled] [size 128K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

03:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1) (prog-if 00 [VGA controller])
Subsystem: LeadTek Research Inc. TU102 [GeForce RTX 2080 Ti Rev. A] [107d:1e07]
Physical Slot: 4
Flags: fast devsel, IRQ 11, NUMA node 0, IOMMU group 95
Memory at ed000000 (32-bit, non-prefetchable) [disabled] [size=16M]
Memory at 383fe0000000 (64-bit, prefetchable) [disabled] [size=256M]
Memory at 383ff0000000 (64-bit, prefetchable) [disabled] [size=32M]
I/O ports at 5000 [disabled] [size=128]
Expansion ROM at ee000000 [disabled] [size 512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

83:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1) (prog-if 00 [VGA controller])
Subsystem: LeadTek Research Inc. TU102 [GeForce RTX 2080 Ti Rev. A] [107d:1e07]
Physical Slot: 8
Flags: fast devsel, IRQ 11, NUMA node 1, IOMMU group 119
Memory at fa000000 (32-bit, non-prefetchable) [disabled] [size=16M]
Memory at 387fe0000000 (64-bit, prefetchable) [disabled] [size=256M]
Memory at 387ff0000000 (64-bit, prefetchable) [disabled] [size=32M]
I/O ports at e000 [disabled] [size=128]
Expansion ROM at fb000000 [disabled] [size 512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

84:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1) (prog-if 00 [VGA controller])
Subsystem: LeadTek Research Inc. TU102 [GeForce RTX 2080 Ti Rev. A] [107d:1e07]
Physical Slot: 6
Flags: fast devsel, IRQ 11, NUMA node 1, IOMMU group 123
Memory at f8000000 (32-bit, non-prefetchable) [disabled] [size=16M]
Memory at 387fc0000000 (64-bit, prefetchable) [disabled] [size=256M]
Memory at 387fd0000000 (64-bit, prefetchable) [disabled] [size=32M]
I/O ports at d000 [disabled] [size=128]
Expansion ROM at f9000000 [disabled] [size 512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

06:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 30) (prog-if 00 [VGA controller])
DeviceName: ASPEED Video AST2400
Subsystem: Super Micro Computer Inc ASPEED Graphics Family [15d9:0852]
Flags: medium devsel, IRQ 11, NUMA node 0, IOMMU group 99
Memory at eb000000 (32-bit, non-prefetchable) [disabled] [size=16M]
Memory at ec000000 (32-bit, non-prefetchable) [disabled] [size=128K]
I/O ports at 4000 [disabled] [size=128]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/4 Maskable- 64bit+
Kernel modules: ast
 
I found the cause, and it has nothing to do with VGA or the GPU vendor ID.

The total memory of all VMs with GPU passthrough cannot exceed the actual physical memory.
For example, my physical memory is 128GB.

When 2 VMs with GPUs were allocated 100GB of memory, starting the second VM made the entire PVE host crash and restart.

But if the 2 VMs with GPUs are allocated 32GB of memory, both can run.

BTW: The sum of the memory allocated to other VMs without a GPU can exceed the actual physical memory.
 
This is because: the total memory of all VMs with GPU-passthrough cannot exceed the actual physical memory.

BTW: The sum of the memory allocated to other VMs without a GPU can exceed the actual physical memory.
This is to be expected. When using PCI(e) passthrough, all VM memory must be pinned into actual host RAM because of direct memory access (DMA) of the PCI(e) devices (to any part of the VM memory at any time).
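
If you want to check this before starting another VM, here is a rough sketch of how it could be done from the host shell. It adds up the configured memory of every VM config that has a hostpciN entry and compares the sum with MemTotal from /proc/meminfo. The function names passthrough_mem_mib and host_mem_mib are made up for this example. It assumes the configs live in /etc/pve/qemu-server, that memory is written as a plain MiB number ("memory: 32768"), and that a VM without a memory line gets the 512 MiB default. It counts every configured passthrough VM, not only the running ones, so treat the result as an upper bound.

```shell
#!/usr/bin/env bash
# Sum the configured memory (MiB) of every VM config in a directory that
# has at least one hostpciN entry, i.e. uses PCI(e) passthrough.
passthrough_mem_mib() {
    local dir=$1 total=0 conf mem
    for conf in "$dir"/*.conf; do
        [ -f "$conf" ] || continue
        # Read only the main section; snapshot sections start with "[name]".
        mem=$(awk '
            /^\[/             { exit }      # stop at first snapshot section
            /^hostpci[0-9]+:/ { pt = 1 }    # VM uses PCI(e) passthrough
            /^memory:/        { m = $2 }    # assumed to be a plain MiB value
            END { if (pt) print (m == "" ? 512 : m) }  # 512 MiB: assumed default
        ' "$conf")
        total=$(( total + ${mem:-0} ))
    done
    echo "$total"
}

# Physical RAM of the host in MiB (MemTotal is reported in kB).
host_mem_mib() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Only run the comparison on a host that actually has the PVE config dir.
if [ -d /etc/pve/qemu-server ]; then
    need=$(passthrough_mem_mib /etc/pve/qemu-server)
    have=$(host_mem_mib)
    echo "passthrough VMs: ${need} MiB configured, host RAM: ${have} MiB"
    [ "$need" -le "$have" ] || echo "WARNING: passthrough VMs need more pinned memory than the host has"
fi
```

The host itself also needs some RAM, so in practice you would want the sum to stay clearly below the physical total, not just equal to it.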
 
