Proxmox q35 VM with 512 GB RAM and 2x PCIe passthrough H100 GPUs - one PCIe device fails

regiswinet

New Member
May 26, 2024
I have a Dell PowerEdge XE8640 server with 2 TB of RAM, 4 Xeon processors with 48 cores each, and 4 Nvidia H100 GPUs. I created a VM on Proxmox using SeaBIOS and the q35 machine type, with 128 GB of RAM and 2 GPUs via PCIe passthrough, and it works normally. However, when I allocate 512 GB of RAM to the VM, it stops recognizing one of the two assigned GPUs. I tried adjusting q35-pcihost.pci-hole64-size, which some forums suggest as a solution, but that didn't solve the problem. Has anyone else experienced this and found a solution?
 
Is NUMA on for the VM? Which CPU(s) are the GPUs you are passing through attached to (use hwloc)? Have you tried pinning the CPUs?

Do you have sufficient memory (other VMs?) on that NUMA node to allocate to the H100?

Also, NVIDIA recommends using UEFI for both your host and VM (not BIOS or compatibility mode); those should not have been used for anything for the past decade or so.
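Besides hwloc, the Linux kernel exposes the NUMA node of each PCI device in sysfs, which gives a quick way to answer the "which CPU is the GPU attached to" question. The snippet below is a minimal sketch: the helper name `pci_numa_node` and the `sysfs_root` parameter are made up for illustration, and the two addresses are the ones passed through in this thread. A value of -1 means the platform did not report a node.

```python
from pathlib import Path


def pci_numa_node(address: str, sysfs_root: str = "/sys") -> int:
    """Return the NUMA node the kernel reports for a PCI device (-1 if unknown).

    `address` is the full domain:bus:device.function form, e.g. "0000:cb:00.0".
    `sysfs_root` is a hypothetical parameter so the function can be tried
    against a fake directory tree instead of the real /sys.
    """
    node_file = Path(sysfs_root) / "bus" / "pci" / "devices" / address / "numa_node"
    return int(node_file.read_text().strip())


if __name__ == "__main__":
    # Addresses taken from the hostpci0/hostpci1 lines in this thread.
    for addr in ("0000:cb:00.0", "0000:db:00.0"):
        try:
            print(addr, "-> NUMA node", pci_numa_node(addr))
        except OSError:
            print(addr, "-> not present on this machine")
```

Run on the Proxmox host as any user; the result should agree with the package under which lstopo lists the device.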
 
1. NUMA is on for the VM (numa=1).
2. The CPU list from hwloc is below.
3. Not pinned yet, but I will try.
4. Memory is sufficient: 2 TB of RAM in the host and 512 GB allocated to VM 114.
5. Using SeaBIOS; I had problems with OVMF and the GPUs.
6. The VM works with 2x H100 GPUs at 128 GB of RAM. After changing to 512 GB, only one GPU is recognized. The VM runs Ubuntu.


root@mandu2:~# cat /etc/pve/qemu-server/114.conf
agent: 1,fstrim_cloned_disks=1
args: -global q35-pcihost.pci-hole64-size=512G
bios: seabios
boot: order=scsi0;ide2;net0
cores: 16
cpu: host
hostpci0: 0000:db:00,pcie=1
hostpci1: 0000:cb:00,pcie=1
ide2: none,media=cdrom
machine: q35,viommu=intel
memory: 524288
meta: creation-qemu=9.2.0,ctime=1747858952
name: UBUNTU-FAST-15
net0: virtio=BC:24:11:B9:F6:12,bridge=vmbr0,firewall=1
numa: 1
ostype: l26
scsi0: ZFS2:vm-114-disk-0,iothread=1,size=2T
scsihw: virtio-scsi-single
smbios1: uuid=d0eb8da3-3e05-4548-aa85-681668d30c51
sockets: 2
vmgenid: d37ca27c-6ddc-4f40-93a9-7b1e6c3be8d4
root@mandu2:~# lstopo --no-legend -p
Machine (2015GB total)
Package P#0
NUMANode P#0 (1007GB)
L3 P#0 (300MB)
L2 P#0 (2048KB) + L1d P#0 (48KB) + L1i P#0 (32KB) + Core P#0
PU P#0
PU P#112
L2 P#26 (2048KB) + L1d P#26 (48KB) + L1i P#26 (32KB) + Core P#26
PU P#2
PU P#98
L2 P#16 (2048KB) + L1d P#16 (48KB) + L1i P#16 (32KB) + Core P#16
PU P#4
PU P#100
L2 P#42 (2048KB) + L1d P#42 (48KB) + L1i P#42 (32KB) + Core P#42
PU P#6
PU P#102
L2 P#7 (2048KB) + L1d P#7 (48KB) + L1i P#7 (32KB) + Core P#7
PU P#8
PU P#104
L2 P#28 (2048KB) + L1d P#28 (48KB) + L1i P#28 (32KB) + Core P#28
PU P#10
PU P#106
L2 P#17 (2048KB) + L1d P#17 (48KB) + L1i P#17 (32KB) + Core P#17
PU P#12
PU P#108
L2 P#45 (2048KB) + L1d P#45 (48KB) + L1i P#45 (32KB) + Core P#45
PU P#14
PU P#110
L2 P#3 (2048KB) + L1d P#3 (48KB) + L1i P#3 (32KB) + Core P#3
PU P#16
PU P#96
L2 P#30 (2048KB) + L1d P#30 (48KB) + L1i P#30 (32KB) + Core P#30
PU P#18
PU P#114
L2 P#14 (2048KB) + L1d P#14 (48KB) + L1i P#14 (32KB) + Core P#14
PU P#20
PU P#116
L2 P#37 (2048KB) + L1d P#37 (48KB) + L1i P#37 (32KB) + Core P#37
PU P#22
PU P#118
L2 P#4 (2048KB) + L1d P#4 (48KB) + L1i P#4 (32KB) + Core P#4
PU P#24
PU P#120
L2 P#27 (2048KB) + L1d P#27 (48KB) + L1i P#27 (32KB) + Core P#27
PU P#26
PU P#122
L2 P#19 (2048KB) + L1d P#19 (48KB) + L1i P#19 (32KB) + Core P#19
PU P#28
PU P#124
L2 P#40 (2048KB) + L1d P#40 (48KB) + L1i P#40 (32KB) + Core P#40
PU P#30
PU P#126
L2 P#8 (2048KB) + L1d P#8 (48KB) + L1i P#8 (32KB) + Core P#8
PU P#32
PU P#128
L2 P#29 (2048KB) + L1d P#29 (48KB) + L1i P#29 (32KB) + Core P#29
PU P#34
PU P#130
L2 P#23 (2048KB) + L1d P#23 (48KB) + L1i P#23 (32KB) + Core P#23
PU P#36
PU P#132
L2 P#41 (2048KB) + L1d P#41 (48KB) + L1i P#41 (32KB) + Core P#41
PU P#38
PU P#134
L2 P#9 (2048KB) + L1d P#9 (48KB) + L1i P#9 (32KB) + Core P#9
PU P#40
PU P#136
L2 P#31 (2048KB) + L1d P#31 (48KB) + L1i P#31 (32KB) + Core P#31
PU P#42
PU P#138
L2 P#13 (2048KB) + L1d P#13 (48KB) + L1i P#13 (32KB) + Core P#13
PU P#44
PU P#140
L2 P#44 (2048KB) + L1d P#44 (48KB) + L1i P#44 (32KB) + Core P#44
PU P#46
PU P#142
L2 P#11 (2048KB) + L1d P#11 (48KB) + L1i P#11 (32KB) + Core P#11
PU P#48
PU P#144
L2 P#33 (2048KB) + L1d P#33 (48KB) + L1i P#33 (32KB) + Core P#33
PU P#50
PU P#146
L2 P#15 (2048KB) + L1d P#15 (48KB) + L1i P#15 (32KB) + Core P#15
PU P#52
PU P#148
L2 P#47 (2048KB) + L1d P#47 (48KB) + L1i P#47 (32KB) + Core P#47
PU P#54
PU P#150
L2 P#1 (2048KB) + L1d P#1 (48KB) + L1i P#1 (32KB) + Core P#1
PU P#56
PU P#152
L2 P#24 (2048KB) + L1d P#24 (48KB) + L1i P#24 (32KB) + Core P#24
PU P#58
PU P#154
L2 P#22 (2048KB) + L1d P#22 (48KB) + L1i P#22 (32KB) + Core P#22
PU P#60
PU P#156
L2 P#36 (2048KB) + L1d P#36 (48KB) + L1i P#36 (32KB) + Core P#36
PU P#62
PU P#158
L2 P#5 (2048KB) + L1d P#5 (48KB) + L1i P#5 (32KB) + Core P#5
PU P#64
PU P#160
L2 P#34 (2048KB) + L1d P#34 (48KB) + L1i P#34 (32KB) + Core P#34
PU P#66
PU P#162
L2 P#12 (2048KB) + L1d P#12 (48KB) + L1i P#12 (32KB) + Core P#12
PU P#68
PU P#164
L2 P#39 (2048KB) + L1d P#39 (48KB) + L1i P#39 (32KB) + Core P#39
PU P#70
PU P#166
L2 P#10 (2048KB) + L1d P#10 (48KB) + L1i P#10 (32KB) + Core P#10
PU P#72
PU P#168
L2 P#25 (2048KB) + L1d P#25 (48KB) + L1i P#25 (32KB) + Core P#25
PU P#74
PU P#170
L2 P#18 (2048KB) + L1d P#18 (48KB) + L1i P#18 (32KB) + Core P#18
PU P#76
PU P#172
L2 P#43 (2048KB) + L1d P#43 (48KB) + L1i P#43 (32KB) + Core P#43
PU P#78
PU P#174
L2 P#6 (2048KB) + L1d P#6 (48KB) + L1i P#6 (32KB) + Core P#6
PU P#80
PU P#176
L2 P#32 (2048KB) + L1d P#32 (48KB) + L1i P#32 (32KB) + Core P#32
PU P#82
PU P#178
L2 P#21 (2048KB) + L1d P#21 (48KB) + L1i P#21 (32KB) + Core P#21
PU P#84
PU P#180
L2 P#46 (2048KB) + L1d P#46 (48KB) + L1i P#46 (32KB) + Core P#46
PU P#86
PU P#182
L2 P#2 (2048KB) + L1d P#2 (48KB) + L1i P#2 (32KB) + Core P#2
PU P#88
PU P#184
L2 P#35 (2048KB) + L1d P#35 (48KB) + L1i P#35 (32KB) + Core P#35
PU P#90
PU P#186
L2 P#20 (2048KB) + L1d P#20 (48KB) + L1i P#20 (32KB) + Core P#20
PU P#92
PU P#188
L2 P#38 (2048KB) + L1d P#38 (48KB) + L1i P#38 (32KB) + Core P#38
PU P#94
PU P#190
HostBridge
PCIBridge
PCI 01:00.0 (NVMExp)
Block(Disk) "nvme0c0n1"
PCIBridge
PCI 02:00.0 (Ethernet)
Net "eno8303"
PCI 02:00.1 (Ethernet)
Net "eno8403"
PCIBridge
PCIBridge
PCI 04:00.0 (VGA)
PCI 00:18.0 (SATA)
PCI 00:19.0 (SATA)
HostBridge
PCIBridge
PCI 27:00.0 (Ethernet)
Net "eno12399np0"
PCI 27:00.1 (Ethernet)
Net "eno12409np1"
PCI 27:00.2 (Ethernet)
Net "eno12419np2"
PCI 27:00.3 (Ethernet)
Net "eno12429np3"
HostBridge
PCIBridge
PCIBridge
PCIBridge
PCI 4e:00.0 (3D)
HostBridge
PCIBridge
PCIBridge
PCIBridge
PCI 5f:00.0 (3D)
PCIBridge
PCI 60:00.0 (SAS)
HostBridge
PCI 70:00.0 (Co-Processor)
HostBridge
PCI 72:00.0 (Co-Processor)
Package P#1
NUMANode P#1 (1008GB)
L3 P#1 (300MB)
L2 P#67 (2048KB) + L1d P#67 (48KB) + L1i P#67 (32KB) + Core P#3
PU P#1
PU P#97
L2 P#91 (2048KB) + L1d P#91 (48KB) + L1i P#91 (32KB) + Core P#27
PU P#3
PU P#99
L2 P#80 (2048KB) + L1d P#80 (48KB) + L1i P#80 (32KB) + Core P#16
PU P#5
PU P#101
L2 P#107 (2048KB) + L1d P#107 (48KB) + L1i P#107 (32KB) + Core P#43
PU P#7
PU P#103
L2 P#71 (2048KB) + L1d P#71 (48KB) + L1i P#71 (32KB) + Core P#7
PU P#9
PU P#105
L2 P#93 (2048KB) + L1d P#93 (48KB) + L1i P#93 (32KB) + Core P#29
PU P#11
PU P#107
L2 P#81 (2048KB) + L1d P#81 (48KB) + L1i P#81 (32KB) + Core P#17
PU P#13
PU P#109
L2 P#103 (2048KB) + L1d P#103 (48KB) + L1i P#103 (32KB) + Core P#39
PU P#15
PU P#111
L2 P#64 (2048KB) + L1d P#64 (48KB) + L1i P#64 (32KB) + Core P#0
PU P#17
PU P#113
L2 P#88 (2048KB) + L1d P#88 (48KB) + L1i P#88 (32KB) + Core P#24
PU P#19
PU P#115
L2 P#79 (2048KB) + L1d P#79 (48KB) + L1i P#79 (32KB) + Core P#15
PU P#21
PU P#117
L2 P#105 (2048KB) + L1d P#105 (48KB) + L1i P#105 (32KB) + Core P#41
PU P#23
PU P#119
L2 P#68 (2048KB) + L1d P#68 (48KB) + L1i P#68 (32KB) + Core P#4
PU P#25
PU P#121
L2 P#89 (2048KB) + L1d P#89 (48KB) + L1i P#89 (32KB) + Core P#25
PU P#27
PU P#123
L2 P#84 (2048KB) + L1d P#84 (48KB) + L1i P#84 (32KB) + Core P#20
PU P#29
PU P#125
L2 P#106 (2048KB) + L1d P#106 (48KB) + L1i P#106 (32KB) + Core P#42
PU P#31
PU P#127
L2 P#72 (2048KB) + L1d P#72 (48KB) + L1i P#72 (32KB) + Core P#8
PU P#33
PU P#129
L2 P#92 (2048KB) + L1d P#92 (48KB) + L1i P#92 (32KB) + Core P#28
PU P#35
PU P#131
L2 P#87 (2048KB) + L1d P#87 (48KB) + L1i P#87 (32KB) + Core P#23
PU P#37
PU P#133
L2 P#109 (2048KB) + L1d P#109 (48KB) + L1i P#109 (32KB) + Core P#45
PU P#39
PU P#135
L2 P#69 (2048KB) + L1d P#69 (48KB) + L1i P#69 (32KB) + Core P#5
PU P#41
PU P#137
L2 P#94 (2048KB) + L1d P#94 (48KB) + L1i P#94 (32KB) + Core P#30
PU P#43
PU P#139
L2 P#77 (2048KB) + L1d P#77 (48KB) + L1i P#77 (32KB) + Core P#13
PU P#45
PU P#141
L2 P#102 (2048KB) + L1d P#102 (48KB) + L1i P#102 (32KB) + Core P#38
PU P#47
PU P#143
L2 P#73 (2048KB) + L1d P#73 (48KB) + L1i P#73 (32KB) + Core P#9
PU P#49
PU P#145
L2 P#97 (2048KB) + L1d P#97 (48KB) + L1i P#97 (32KB) + Core P#33
PU P#51
PU P#147
L2 P#78 (2048KB) + L1d P#78 (48KB) + L1i P#78 (32KB) + Core P#14
PU P#53
PU P#149
L2 P#104 (2048KB) + L1d P#104 (48KB) + L1i P#104 (32KB) + Core P#40
PU P#55
PU P#151
L2 P#74 (2048KB) + L1d P#74 (48KB) + L1i P#74 (32KB) + Core P#10
PU P#57
PU P#153
L2 P#90 (2048KB) + L1d P#90 (48KB) + L1i P#90 (32KB) + Core P#26
PU P#59
PU P#155
L2 P#83 (2048KB) + L1d P#83 (48KB) + L1i P#83 (32KB) + Core P#19
PU P#61
PU P#157
L2 P#108 (2048KB) + L1d P#108 (48KB) + L1i P#108 (32KB) + Core P#44
PU P#63
PU P#159
L2 P#65 (2048KB) + L1d P#65 (48KB) + L1i P#65 (32KB) + Core P#1
PU P#65
PU P#161
L2 P#95 (2048KB) + L1d P#95 (48KB) + L1i P#95 (32KB) + Core P#31
PU P#67
PU P#163
L2 P#86 (2048KB) + L1d P#86 (48KB) + L1i P#86 (32KB) + Core P#22
PU P#69
PU P#165
L2 P#111 (2048KB) + L1d P#111 (48KB) + L1i P#111 (32KB) + Core P#47
PU P#71
PU P#167
L2 P#70 (2048KB) + L1d P#70 (48KB) + L1i P#70 (32KB) + Core P#6
PU P#73
PU P#169
L2 P#98 (2048KB) + L1d P#98 (48KB) + L1i P#98 (32KB) + Core P#34
PU P#75
PU P#171
L2 P#76 (2048KB) + L1d P#76 (48KB) + L1i P#76 (32KB) + Core P#12
PU P#77
PU P#173
L2 P#101 (2048KB) + L1d P#101 (48KB) + L1i P#101 (32KB) + Core P#37
PU P#79
PU P#175
L2 P#75 (2048KB) + L1d P#75 (48KB) + L1i P#75 (32KB) + Core P#11
PU P#81
PU P#177
L2 P#96 (2048KB) + L1d P#96 (48KB) + L1i P#96 (32KB) + Core P#32
PU P#83
PU P#179
L2 P#82 (2048KB) + L1d P#82 (48KB) + L1i P#82 (32KB) + Core P#18
PU P#85
PU P#181
L2 P#110 (2048KB) + L1d P#110 (48KB) + L1i P#110 (32KB) + Core P#46
PU P#87
PU P#183
L2 P#66 (2048KB) + L1d P#66 (48KB) + L1i P#66 (32KB) + Core P#2
PU P#89
PU P#185
L2 P#99 (2048KB) + L1d P#99 (48KB) + L1i P#99 (32KB) + Core P#35
PU P#91
PU P#187
L2 P#85 (2048KB) + L1d P#85 (48KB) + L1i P#85 (32KB) + Core P#21
PU P#93
PU P#189
L2 P#100 (2048KB) + L1d P#100 (48KB) + L1i P#100 (32KB) + Core P#36
PU P#95
PU P#191
HostBridge
PCIBridge
PCI 98:00.0 (RAID)
Block(Disk) "sda"
HostBridge
PCIBridge
PCIBridge
PCIBridge
PCI cb:00.0 (3D)
PCIBridge
PCI ce:00.0 (SAS)
HostBridge
PCIBridge
PCIBridge
PCIBridge
PCI db:00.0 (3D)
HostBridge
PCI ed:00.0 (Co-Processor)
HostBridge
PCI ef:00.0 (Co-Processor)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
root@mandu2:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180 182 184 186 188 190
node 0 size: 1031569 MB
node 0 free: 355268 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99 101 103 105 107 109 111 113 115 117 119 121 123 125 127 129 131 133 135 137 139 141 143 145 147 149 151 153 155 157 159 161 163 165 167 169 171 173 175 177 179 181 183 185 187 189 191
node 1 size: 1032151 MB
node 1 free: 902343 MB
node distances:
node 0 1
0: 10 21
1: 21 10
root@mandu2:~#
root@mandu2:~# lspci -vvv | grep NVIDIA
4e:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
Subsystem: NVIDIA Corporation GH100 [H100 SXM5 80GB]
5f:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
Subsystem: NVIDIA Corporation GH100 [H100 SXM5 80GB]
cb:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
Subsystem: NVIDIA Corporation GH100 [H100 SXM5 80GB]
db:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
Subsystem: NVIDIA Corporation GH100 [H100 SXM5 80GB]
root@mandu2:~#
 
The question is not whether you have enough memory globally; the question is whether you have sufficient memory on the node where your VM and GPU are running.

For example, your output shows node 0 free: 355268 MB. If you are trying to launch a VM that connects to a GPU on node 0, then you need 512 GB available in that NUMA node.

You also seem to only have 2 CPUs, not 4.

For all intents and purposes, when dealing with physical hardware, you should treat each NUMA node as a separate "computer".
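The per-node check described above can be sketched in a few lines of Python that read the "node N free" lines from `numactl --hardware`. This is only an illustration: the function names are hypothetical, the sample text is copied from the output posted earlier in the thread, and comparing against the free figure is a simplification (it ignores other VMs starting later, huge pages, and host overhead).

```python
import re


def free_mb_per_node(numactl_output: str) -> dict[int, int]:
    """Map NUMA node id -> free memory in MB, parsed from `numactl --hardware`."""
    pattern = re.compile(r"^node (\d+) free: (\d+) MB[ \t]*$", re.MULTILINE)
    return {int(m.group(1)): int(m.group(2)) for m in pattern.finditer(numactl_output)}


def nodes_that_fit(numactl_output: str, vm_memory_mb: int) -> list[int]:
    """Return the nodes whose free memory alone could hold the whole VM."""
    free = free_mb_per_node(numactl_output)
    return [node for node in sorted(free) if free[node] >= vm_memory_mb]


# Lines copied from the numactl output posted in this thread.
SAMPLE = """\
node 0 size: 1031569 MB
node 0 free: 355268 MB
node 1 size: 1032151 MB
node 1 free: 902343 MB
"""

if __name__ == "__main__":
    # 524288 MB is the `memory:` value from 114.conf.
    print(nodes_that_fit(SAMPLE, 524288))  # prints [1]
```

With the posted numbers, a 524288 MB VM fits only on node 1, while a 131072 MB (128 GB) VM fits on either node.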
 
Sorry! With 128 GB, both H100 GPUs load the NVIDIA drivers in Ubuntu. I changed VM 114 to 512 GB and didn't check the RAM limits on NUMA node 0. I also hadn't pinned the CPUs to a NUMA node. I'm going to reduce it to 256 GB on NUMA node 0 and pin the CPUs. The physical server is a Dell PowerEdge XE8640 with 4 sockets. The VM has only 2 sockets.
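For the pinning step, one possible approach is to take the CPU list of the chosen node from `numactl --hardware` and turn it into a core list string. As far as I know, recent Proxmox VE releases accept such a list in an `affinity:` line of the VM config (check the qm.conf documentation for your version before relying on it). The helper names below are hypothetical, and which node to pin to depends on where the passed-through GPUs sit.

```python
def parse_node_cpus(numactl_output: str, node: int) -> list[int]:
    """Return the CPU ids listed on the `node N cpus:` line, or [] if absent."""
    prefix = f"node {node} cpus:"
    for line in numactl_output.splitlines():
        if line.startswith(prefix):
            return [int(tok) for tok in line[len(prefix):].split()]
    return []


def to_cpu_list(cpus: list[int]) -> str:
    """Compress CPU ids into a list string, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    ids = sorted(set(cpus))
    parts = []
    i = 0
    while i < len(ids):
        j = i
        # Extend the run while the next id is consecutive.
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        parts.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(parts)


if __name__ == "__main__":
    # Shortened sample; the real line in this thread lists 96 CPUs for node 1.
    sample = "node 1 cpus: 1 3 5 7 9 11 13 15\n"
    vcpus = 4  # assumption for the demo; VM 114 uses 2 sockets x 16 cores
    chosen = parse_node_cpus(sample, 1)[:vcpus]
    print("affinity:", to_cpu_list(chosen))
```

On this host the CPU ids of a node are all odd or all even, so the list does not collapse into ranges; on hosts with contiguous numbering it would.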
 
The XE8640 has two 4th Generation Intel® Xeon® Scalable processors with up to 56 cores per processor. Your NUMA output shows 2 CPUs, which is what confused me.