Hello,
I have a Proxmox system with 11 GTX 1080 Ti GPUs, and I am trying to create a VM with all 11 GPUs passed through to the guest. I was excited to see that one of the features listed for Proxmox 6.1 was "PCI(e) passthrough supports up to 16 PCI(e) devices", but I have run into a strange issue: passing through 8 devices to one VM works fine, but passing through 9 devices causes the VM to fail to boot, and it only reaches the UEFI shell.
My only theory at this point is that adding the 9th device somehow causes Proxmox/QEMU/KVM to map the GPU to an address that the SCSI controller or some other critical device normally uses. In that case all 9 devices would be passed through successfully, but something on the PCI(e) bus that the guest needs in order to boot would be overwritten in the process. I don't see anything in the dmesg or journalctl logs that points to the issue. Does anyone have any ideas about what might be causing this, or how I can narrow it down?
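One way to narrow it down is to test different sets of 9 devices, for example 9 GPUs that leave out 85:00.0. If that also fails, the device count looks more likely to matter than the specific card. Below is a minimal Python sketch that writes the `hostpciN` lines in the same format as the configs further down. `hostpci_lines` is a hypothetical helper, not part of Proxmox.

```python
# Hypothetical helper (not part of Proxmox): build "hostpciN: <addr>,pcie=1"
# lines, in the same format as the configs in this post, for any chosen
# subset of host GPU addresses.
def hostpci_lines(addresses: list[str]) -> list[str]:
    return [f"hostpci{i}: {addr},pcie=1" for i, addr in enumerate(addresses)]


# Host addresses used by hostpci0..hostpci9 in the configs in this post.
GPUS = ["01:00.0", "02:00.0", "03:00.0", "04:00.0", "05:00.0",
        "82:00.0", "83:00.0", "84:00.0", "85:00.0", "86:00.0"]

if __name__ == "__main__":
    # Nine GPUs that skip 85:00.0; if this set also fails to boot, that
    # suggests the count matters more than the specific card.
    subset = [a for a in GPUS if a != "85:00.0"]
    print("\n".join(hostpci_lines(subset)))
```

The generated lines would replace the `hostpci` block in the VM config. Everything else in the config stays the same.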
Working VM config:
Code:
balloon: 0
bios: ovmf
bootdisk: scsi0
cores: 12
cpu: host,hidden=1,flags=+pcid;+pdpe1gb;+aes
efidisk0: local-zfs:vm-100-disk-1,size=1M
hostpci0: 01:00.0,pcie=1
hostpci1: 02:00.0,pcie=1
hostpci2: 03:00.0,pcie=1
hostpci3: 04:00.0,pcie=1
hostpci4: 05:00.0,pcie=1
hostpci5: 82:00.0,pcie=1
hostpci6: 83:00.0,pcie=1
hostpci7: 84:00.0,pcie=1
#hostpci8: 85:00.0,pcie=1
#hostpci9: 86:00.0,pcie=1
hugepages: 1024
ide2: none,media=cdrom
machine: q35
memory: 180224
name: test1
net0: virtio=92:B8:7A:DD:99:56,bridge=vmbr0,firewall=1
numa: 1
ostype: l26
scsi0: local-zfs:vm-100-disk-0,size=800G
scsihw: virtio-scsi-pci
smbios1: uuid=93e2c3de-f892-4538-87cf-d12171088ff9
sockets: 2
vmgenid: 69b03b2c-113b-47aa-aec2-a28bc30a7ee9
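To see which devices a given config actually passes through, only the uncommented `hostpciN` lines matter. The minimal Python sketch below extracts them from the config text, such as the output of `qm config 100` (the VM ID 100 is inferred from the disk names above). For the working config it returns 8 entries, hostpci0 to hostpci7.

```python
import re

# An active passthrough line such as "hostpci3: 04:00.0,pcie=1"; lines that
# start with "#" (commented out) do not match because of the "^" anchor.
HOSTPCI_RE = re.compile(r"^hostpci(\d+):\s*([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])")


def active_hostpci(config_text: str) -> dict[int, str]:
    """Return {index: host_address} for every uncommented hostpciN line."""
    entries = {}
    for line in config_text.splitlines():
        match = HOSTPCI_RE.match(line.strip())
        if match:
            entries[int(match.group(1))] = match.group(2)
    return entries


if __name__ == "__main__":
    excerpt = """hostpci7: 84:00.0,pcie=1
#hostpci8: 85:00.0,pcie=1
#hostpci9: 86:00.0,pcie=1
hugepages: 1024"""
    print(active_hostpci(excerpt))  # {7: '84:00.0'}
```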
Resulting guest 'lspci' output:
Code:
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:10.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:10.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:10.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:10.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1a.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1d.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
07:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
09:01.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
09:02.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
09:03.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
09:04.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
0a:05.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
0a:12.0 Ethernet controller: Red Hat, Inc. Virtio network device
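Counting the relevant lines shows that the working guest sees all 8 GPUs (buses 01 to 08) and 8 QEMU PCIe root ports (00:10.0-00:10.3 and 00:1c.0-00:1c.3). That is consistent with each GPU sitting behind its own root port. Below is a minimal Python sketch that matches on the description strings exactly as they appear in the output above.

```python
def summarize_lspci(text: str) -> dict[str, int]:
    """Count NVIDIA GPUs and QEMU PCIe root ports in guest lspci output."""
    gpus = ports = 0
    for line in text.splitlines():
        if "VGA compatible controller: NVIDIA" in line:
            gpus += 1
        elif "PCI bridge: Red Hat, Inc. QEMU PCIe Root port" in line:
            ports += 1
    return {"nvidia_gpus": gpus, "pcie_root_ports": ports}


if __name__ == "__main__":
    sample = """00:1c.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)"""
    print(summarize_lspci(sample))  # {'nvidia_gpus': 1, 'pcie_root_ports': 1}
```

The plain PCI-PCI bridges (09:01.0 to 09:04.0) and the Intel 00:1e.0 bridge are intentionally not counted as root ports.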
Non-working VM config (the only difference is that I uncommented the 'hostpci8' line):
Code:
balloon: 0
bios: ovmf
bootdisk: scsi0
cores: 12
cpu: host,hidden=1,flags=+pcid;+pdpe1gb;+aes
efidisk0: local-zfs:vm-100-disk-1,size=1M
hostpci0: 01:00.0,pcie=1
hostpci1: 02:00.0,pcie=1
hostpci2: 03:00.0,pcie=1
hostpci3: 04:00.0,pcie=1
hostpci4: 05:00.0,pcie=1
hostpci5: 82:00.0,pcie=1
hostpci6: 83:00.0,pcie=1
hostpci7: 84:00.0,pcie=1
hostpci8: 85:00.0,pcie=1
#hostpci9: 86:00.0,pcie=1
hugepages: 1024
ide2: none,media=cdrom
machine: q35
memory: 180224
name: test1
net0: virtio=92:B8:7A:DD:99:56,bridge=vmbr0,firewall=1
numa: 1
ostype: l26
scsi0: local-zfs:vm-100-disk-0,size=800G
scsihw: virtio-scsi-pci
smbios1: uuid=93e2c3de-f892-4538-87cf-d12171088ff9
sockets: 2
vmgenid: 69b03b2c-113b-47aa-aec2-a28bc30a7ee9
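A line diff of the two configs confirms that the only change is the `hostpci8` line. Below is a minimal sketch using Python's standard `difflib.ndiff`. Its `? ` hint lines are filtered out so that only removed and added lines remain.

```python
import difflib


def config_changes(old: str, new: str) -> list[str]:
    """Return lines removed ("- ...") or added ("+ ...") between two configs."""
    return [line for line in difflib.ndiff(old.splitlines(), new.splitlines())
            if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    working = "hostpci7: 84:00.0,pcie=1\n#hostpci8: 85:00.0,pcie=1\nhugepages: 1024"
    failing = "hostpci7: 84:00.0,pcie=1\nhostpci8: 85:00.0,pcie=1\nhugepages: 1024"
    for line in config_changes(working, failing):
        print(line)
```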
Output from VNC when booting fails:
Host 'lspci' VGA excerpt (the full output would exceed the 10k character post limit) for reference:
Code:
# lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
09:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
81:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
82:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
83:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
84:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
85:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
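The host lists 11 NVIDIA GPUs (01:00.0-05:00.0 and 81:00.0-86:00.0). However, `hostpci0` to `hostpci9` only reference 10 addresses, so 81:00.0 does not appear in either config. The minimal Python sketch below compares the host output with the addresses used in a config. It assumes the `lspci | grep VGA` format shown above.

```python
import re

# "<bus:dev.fn> VGA compatible controller: NVIDIA ..." as printed by lspci.
ADDR_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) VGA compatible controller: NVIDIA")


def nvidia_addresses(lspci_text: str) -> set[str]:
    """Host addresses of all NVIDIA VGA controllers in the lspci text."""
    return {m.group(1) for line in lspci_text.splitlines()
            if (m := ADDR_RE.match(line.strip()))}


def unassigned(host_lspci: str, config_addresses: set[str]) -> set[str]:
    """NVIDIA GPUs present on the host but not referenced by the config."""
    return nvidia_addresses(host_lspci) - config_addresses


if __name__ == "__main__":
    host = """09:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
81:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
82:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)"""
    print(unassigned(host, {"82:00.0"}))  # {'81:00.0'}
```

The onboard Matrox controller (09:01.0) is excluded because the pattern only matches NVIDIA devices.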