Hardware:
Motherboard: Asrock RomeD8-2T
CPU: EPYC 7C13 (64 cores / 128 threads)
RAM: Crucial 512GB (8x64GB) LRDIMM PC4-21300 (2666 MHz), on the motherboard QVL
GPUs: 12x RTX 3090 (6x Founders, 2x MSI Ventus, 2x EVGA FTW3, 2x Asus TUF)
Context/Why?: This is supposed to be a GPU server to provide GPU compute to clients.
Setup:
I followed all the steps in the Proxmox PCI Passthrough tutorial.
Issue:
Everything works fine with up to 8 GPUs, so I don't think it's a setup issue. Once a 9th GPU is added, the VM won't boot at all: even over the serial console (xterm.js) there is no kernel output whatsoever. I saw a forum post suggesting this could be a networking issue, where the NIC gets a different name past a certain number of GPUs. That does not seem to be the case here: I created a service to record the NIC name at boot, and with 9+ GPUs there is no sign of the service ever running (which makes sense, since there is no kernel activity on the serial terminal).
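For anyone who wants to repeat the NIC-name check, here is a minimal sketch of what such a boot-time script could look like. It is not the actual service used here; the function names and the sysfs path default are assumptions for illustration. It lists the entries the kernel exposes under /sys/class/net and prints them with a timestamp, so a systemd unit running it would leave a line in the journal on every boot that gets far enough to start services.

```python
import os
from datetime import datetime, timezone


def list_nic_names(sysfs_dir="/sys/class/net"):
    """Return the sorted entry names under sysfs_dir (one per network interface).

    Returns an empty list if the directory does not exist.
    """
    if not os.path.isdir(sysfs_dir):
        return []
    return sorted(os.listdir(sysfs_dir))


def format_record(names, now=None):
    """Build one log line: an ISO 8601 UTC timestamp plus the interface names."""
    now = now or datetime.now(timezone.utc)
    return f"{now.isoformat()} nics={','.join(names)}"


if __name__ == "__main__":
    # Printing to stdout is enough when this runs from a systemd unit,
    # because the journal captures the output.
    print(format_record(list_nic_names()))
```

If no such line shows up after a boot attempt, the guest never reached the point of starting services, which matches the silent serial console.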
VM Settings with 8 GPUs:
agent: 1
balloon: 0
boot: order=scsi0;net0
cores: 96
cpu: host
hostpci0: mapping=RTX3090_4_1,pcie=1
hostpci1: mapping=RTX3090_4_2,pcie=1
hostpci2: mapping=RTX3090_4_3,pcie=1
hostpci3: mapping=RTX3090_4_4,pcie=1
hostpci4: mapping=RTX3090_4_5,pcie=1
hostpci5: mapping=RTX3090_4_6,pcie=1
hostpci6: mapping=RTX3090_4_7,pcie=1
hostpci7: mapping=RTX3090_4_8,pcie=1
machine: q35
memory: 393216
meta: creation-qemu=8.1.5,ctime=1716044244
name: vastai-4
net0: virtio=bc:24:11:2e:d2:3f,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-145-disk-0,discard=on,iothread=1,size=32G,ssd=1
scsi1: vast-4:vm-145-disk-0,backup=0,discard=on,iothread=1,size=9665G,ssd=1
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=b011ed9a-fee0-4ef2-b215-c8fc37034ac3
sockets: 1
vga: virtio
vmgenid: a826c2eb-1d7b-4862-b801-4ee8d436d575
VM Settings with 12 GPUs:
agent: 1
balloon: 0
boot: order=scsi0;net0
cores: 96
cpu: host
hostpci0: mapping=RTX3090_4_1,pcie=1
hostpci1: mapping=RTX3090_4_2,pcie=1
hostpci2: mapping=RTX3090_4_3,pcie=1
hostpci3: mapping=RTX3090_4_4,pcie=1
hostpci4: mapping=RTX3090_4_5,pcie=1
hostpci5: mapping=RTX3090_4_6,pcie=1
hostpci6: mapping=RTX3090_4_7,pcie=1
hostpci7: mapping=RTX3090_4_8,pcie=1
hostpci8: mapping=RTX3090_4_9,pcie=1
hostpci9: mapping=RTX3090_4_10,pcie=1
hostpci10: mapping=RTX3090_4_11,pcie=1
hostpci11: mapping=RTX3090_4_12,pcie=1
machine: q35
memory: 393216
meta: creation-qemu=8.1.5,ctime=1716044244
name: vastai-4
net0: virtio=bc:24:11:2e:d2:3f,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-145-disk-0,discard=on,iothread=1,size=32G,ssd=1
scsi1: vast-4:vm-145-disk-0,backup=0,discard=on,iothread=1,size=9665G,ssd=1
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=b011ed9a-fee0-4ef2-b215-c8fc37034ac3
sockets: 1
vga: virtio
vmgenid: a826c2eb-1d7b-4862-b801-4ee8d436d575
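To confirm that nothing else changed between the two configs, they can be diffed key by key. The snippet below is a rough sketch: the helper names parse_vm_config and diff_configs are made up for illustration, and the parser only handles the simple "key: value" lines shown above (it stops at the first "[...]" section header). The demo strings are shortened excerpts of the configs above.

```python
def parse_vm_config(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            break  # stop at the first section header; only the main config is read
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            config[key.strip()] = value.strip()
    return config


def diff_configs(old, new):
    """Return (added, removed, changed) between two parsed configs."""
    added = {k: new[k] for k in new if k not in old}
    removed = {k: old[k] for k in old if k not in new}
    changed = {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}
    return added, removed, changed


if __name__ == "__main__":
    eight = parse_vm_config(
        "cores: 96\n"
        "hostpci7: mapping=RTX3090_4_8,pcie=1\n"
        "machine: q35\n"
    )
    twelve = parse_vm_config(
        "cores: 96\n"
        "hostpci7: mapping=RTX3090_4_8,pcie=1\n"
        "hostpci8: mapping=RTX3090_4_9,pcie=1\n"
        "machine: q35\n"
    )
    added, removed, changed = diff_configs(eight, twelve)
    print("added:", added)
    print("removed:", removed)
    print("changed:", changed)
```

For the two full configs above, the only difference is the four extra entries hostpci8 through hostpci11; every other key is identical.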