9+ GPUs - VM Fails to Boot

Lukium

New Member
Nov 22, 2023
Hardware:

Motherboard: Asrock RomeD8-2T
CPU: EPYC 7C13 (64 Core - 128 Threads)
RAM: Crucial 512GB (8x64GB) LRDIMM PC4-21300 (2666 MHz) - Motherboard QVL
GPUs: 12xRTX3090 (6x Founders, 2x MSI Ventus, 2x EVGA FTW3, 2x Asus TUF)

Context/Why?: This is supposed to be a GPU server to provide GPU compute to clients.

Setup:
I followed all the steps in the Proxmox PCI Passthrough Tutorial.

Issue:
Everything works fine with up to 8 GPUs, so I don't think it's a setup issue. Once a 9th GPU is added, the VM simply won't boot at all. Even on the serial console (xterm.js), there seems to be no kernel activity whatsoever. I saw a post on the forum suggesting this could be a networking issue, where the NIC gets a different name past a certain number of GPUs. That does not seem to be the case here: I created a service to grab the NIC name at boot, but with 9+ GPUs there's no sign of the service ever running (which makes sense, since there seems to be no kernel activity on the serial terminal).
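For anyone who wants to repeat the NIC-name check, here is a minimal sketch of what such a boot-time script could look like. This is not the exact service used above: the function names and the log path are made up for illustration, and it assumes a Linux guest where interfaces show up under /sys/class/net.

```python
import os
import time


def list_nic_names(sysfs_net="/sys/class/net"):
    """Return the sorted interface names found under the given sysfs directory."""
    return sorted(os.listdir(sysfs_net))


def log_nic_names(log_path, sysfs_net="/sys/class/net"):
    """Append a timestamped line with the current interface names to log_path."""
    names = list_nic_names(sysfs_net)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as fh:
        fh.write(f"{stamp} {' '.join(names)}\n")
    return names


if __name__ == "__main__":
    # Print to stdout; if run from a systemd unit, the output lands in the journal.
    if os.path.isdir("/sys/class/net"):
        print(" ".join(list_nic_names()))
```

Calling log_nic_names("/var/log/nic-names.log") (hypothetical path) from a oneshot unit would leave one line per boot. If the file gains no new line after a boot attempt, the guest never got far enough to run the unit, which matches what the serial console shows with 9+ GPUs.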

VM Settings with 8 GPUs:
agent: 1
balloon: 0
boot: order=scsi0;net0
cores: 96
cpu: host
hostpci0: mapping=RTX3090_4_1,pcie=1
hostpci1: mapping=RTX3090_4_2,pcie=1
hostpci2: mapping=RTX3090_4_3,pcie=1
hostpci3: mapping=RTX3090_4_4,pcie=1
hostpci4: mapping=RTX3090_4_5,pcie=1
hostpci5: mapping=RTX3090_4_6,pcie=1
hostpci6: mapping=RTX3090_4_7,pcie=1
hostpci7: mapping=RTX3090_4_8,pcie=1
machine: q35
memory: 393216
meta: creation-qemu=8.1.5,ctime=1716044244
name: vastai-4
net0: virtio=bc:24:11:2e:d2:3f,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-145-disk-0,discard=on,iothread=1,size=32G,ssd=1
scsi1: vast-4:vm-145-disk-0,backup=0,discard=on,iothread=1,size=9665G,ssd=1
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=b011ed9a-fee0-4ef2-b215-c8fc37034ac3
sockets: 1
vga: virtio
vmgenid: a826c2eb-1d7b-4862-b801-4ee8d436d575

With 12 GPUs:
agent: 1
balloon: 0
boot: order=scsi0;net0
cores: 96
cpu: host
hostpci0: mapping=RTX3090_4_1,pcie=1
hostpci1: mapping=RTX3090_4_2,pcie=1
hostpci2: mapping=RTX3090_4_3,pcie=1
hostpci3: mapping=RTX3090_4_4,pcie=1
hostpci4: mapping=RTX3090_4_5,pcie=1
hostpci5: mapping=RTX3090_4_6,pcie=1
hostpci6: mapping=RTX3090_4_7,pcie=1
hostpci7: mapping=RTX3090_4_8,pcie=1
hostpci8: mapping=RTX3090_4_9,pcie=1
hostpci9: mapping=RTX3090_4_10,pcie=1
hostpci10: mapping=RTX3090_4_11,pcie=1
hostpci11: mapping=RTX3090_4_12,pcie=1
machine: q35
memory: 393216
meta: creation-qemu=8.1.5,ctime=1716044244
name: vastai-4
net0: virtio=bc:24:11:2e:d2:3f,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-145-disk-0,discard=on,iothread=1,size=32G,ssd=1
scsi1: vast-4:vm-145-disk-0,backup=0,discard=on,iothread=1,size=9665G,ssd=1
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=b011ed9a-fee0-4ef2-b215-c8fc37034ac3
sockets: 1
vga: virtio
vmgenid: a826c2eb-1d7b-4862-b801-4ee8d436d575
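To make it easier to confirm that nothing else changed between the two configs, here is a small sketch that compares two config dumps key by key. The helper names are made up, and the parser only handles the flat "key: value" lines shown above (it ignores things like snapshot sections in a real config file).

```python
def parse_vm_config(text):
    """Parse flat 'key: value' lines into a dict, splitting on the first colon."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def diff_configs(old_text, new_text):
    """Return (added, removed, changed) between two config dumps."""
    old, new = parse_vm_config(old_text), parse_vm_config(new_text)
    added = {k: new[k] for k in new if k not in old}
    removed = {k: old[k] for k in old if k not in new}
    changed = {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}
    return added, removed, changed


# Shortened excerpts of the two configs above, just to show the idea.
eight = """cores: 96
hostpci7: mapping=RTX3090_4_8,pcie=1
machine: q35"""
twelve = """cores: 96
hostpci7: mapping=RTX3090_4_8,pcie=1
hostpci8: mapping=RTX3090_4_9,pcie=1
machine: q35"""

added, removed, changed = diff_configs(eight, twelve)
print(sorted(added))  # ['hostpci8']
```

Going by the two listings as posted, the only difference is the four extra entries hostpci8 through hostpci11; every other line is identical.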
 
