VM fails to boot with more than 8 PCI Express passthrough GPUs, only reaches the UEFI shell

tinfever

Member
Jun 30, 2019
Hello,

I have a Proxmox system with 11 x GTX 1080 Ti GPUs, and I am attempting to create a VM with all 11 GPUs passed through to the guest. I was excited to see that one of the listed features in Proxmox 6.1 was "PCI(e) passthrough supports up to 16 PCI(e) devices", but I have run into a strange issue: passing through 8 devices to one VM works fine, while passing through 9 devices causes the VM to fail to boot and only reach the UEFI shell.

My only thought at this point is that adding the 9th device somehow causes Proxmox/QEMU/KVM to map that GPU to an address that the SCSI controller or some other critical device normally uses, so that all 9 devices are passed through successfully but something on the PCI(e) bus critical for guest booting gets displaced in the process. I don't see anything in either the dmesg or journalctl logs to indicate the issue. Does anyone have any thoughts on what might be causing this, or on how I can narrow down the issue?
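In case anyone wants to see exactly what Proxmox hands to QEMU here, the generated command line can be dumped without starting the VM (VMID 100 is this VM); roughly like this:

Code:
# Print the full KVM/QEMU command Proxmox generates for VMID 100 and
# pull out the vfio-pci passthrough arguments to see the guest bus placement
qm showcmd 100 | tr ' ' '\n' | grep vfio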

Working VM config:

Code:
balloon: 0
bios: ovmf
bootdisk: scsi0
cores: 12
cpu: host,hidden=1,flags=+pcid;+pdpe1gb;+aes
efidisk0: local-zfs:vm-100-disk-1,size=1M
hostpci0: 01:00.0,pcie=1
hostpci1: 02:00.0,pcie=1
hostpci2: 03:00.0,pcie=1
hostpci3: 04:00.0,pcie=1
hostpci4: 05:00.0,pcie=1
hostpci5: 82:00.0,pcie=1
hostpci6: 83:00.0,pcie=1
hostpci7: 84:00.0,pcie=1
#hostpci8: 85:00.0,pcie=1
#hostpci9: 86:00.0,pcie=1
hugepages: 1024
ide2: none,media=cdrom
machine: q35
memory: 180224
name: test1
net0: virtio=92:B8:7A:DD:99:56,bridge=vmbr0,firewall=1
numa: 1
ostype: l26
scsi0: local-zfs:vm-100-disk-0,size=800G
scsihw: virtio-scsi-pci
smbios1: uuid=93e2c3de-f892-4538-87cf-d12171088ff9
sockets: 2
vmgenid: 69b03b2c-113b-47aa-aec2-a28bc30a7ee9

Resulting guest 'lspci' output:

Code:
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:10.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:10.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:10.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:10.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1a.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1c.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1d.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
07:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
09:01.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
09:02.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
09:03.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
09:04.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
0a:05.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
0a:12.0 Ethernet controller: Red Hat, Inc. Virtio network device

Non-working VM config (the only difference is that I uncommented the 'hostpci8' line):

Code:
balloon: 0
bios: ovmf
bootdisk: scsi0
cores: 12
cpu: host,hidden=1,flags=+pcid;+pdpe1gb;+aes
efidisk0: local-zfs:vm-100-disk-1,size=1M
hostpci0: 01:00.0,pcie=1
hostpci1: 02:00.0,pcie=1
hostpci2: 03:00.0,pcie=1
hostpci3: 04:00.0,pcie=1
hostpci4: 05:00.0,pcie=1
hostpci5: 82:00.0,pcie=1
hostpci6: 83:00.0,pcie=1
hostpci7: 84:00.0,pcie=1
hostpci8: 85:00.0,pcie=1
#hostpci9: 86:00.0,pcie=1
hugepages: 1024
ide2: none,media=cdrom
machine: q35
memory: 180224
name: test1
net0: virtio=92:B8:7A:DD:99:56,bridge=vmbr0,firewall=1
numa: 1
ostype: l26
scsi0: local-zfs:vm-100-disk-0,size=800G
scsihw: virtio-scsi-pci
smbios1: uuid=93e2c3de-f892-4538-87cf-d12171088ff9
sockets: 2
vmgenid: 69b03b2c-113b-47aa-aec2-a28bc30a7ee9

Output from VNC when booting fails:

Proxmox 9xPCIe devices boot failure uefi shell.JPG

Host 'lspci' VGA excerpt (the full output would exceed the 10k character post limit) for reference:

Code:
# lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
09:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
81:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
82:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
83:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
84:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
85:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
 
Well this is going to be tricky :/

When you are in the EFI shell, try to run the command pci. A list of all the PCI devices should show up. With ALT+SHIFT+Page UP/Down you can scroll up and down. Do all the GPUs show up?
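If scrolling that way is awkward, the standard UEFI shell should also accept -b to page the output one screen at a time (assuming OVMF's stock EDK2 shell):

Code:
Shell> pci -b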
 
I think Linus Tech Tips had this issue and had to enable "above 4G decoding" in the BIOS of the host machine...

 
Well this is going to be tricky :/

When you are in the EFI shell, try to run the command pci. A list of all the PCI devices should show up. With ALT+SHIFT+Page UP/Down you can scroll up and down. Do all the GPUs show up?

Finding/making tricky problems is a great talent of mine :)

Yes, with 9 GPUs passed through and running the 'pci' command from the EFI shell, all GPUs do show up with PCI addresses 01:00.0 - 09:00.0, like you'd expect. There are also 9 PCI bridges (presumably the PCIe Root Ports) that show up at 00:10.0 - 00:10.4 and 00:1c.0 - 00:1c.3, also as you'd expect.

In fact, I compared the working 8 GPU 'lspci' output with the 9 GPU EFI shell 'pci' output and these were the only differences:

Not shown in EFI 'pci' output:

Code:
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)

Additions/changes in output:

Code:
00:10.4 > new PCIe Root port for new GPU
09:00.0 > Newly added GPU
0a:01.0 > was 09:01.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
0a:02.0 > was 09:02.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
0a:03.0 > was 09:03.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
0a:04.0 > was 09:04.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
0b:05.0 > was 0a:05.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
0b:12.0 > was 0a:12.0 Ethernet controller: Red Hat, Inc. Virtio network device

One curious thing is that if I go into the boot menu, the SCSI boot device is not listed, only the UEFI QEMU DVD-ROM and the EFI internal shell. Since there is no media in the virtual DVD drive, it obviously is booting to the EFI shell.


Full screenshot of all PCI devices (duplicates due to multiple combined screenshots are indicated by the red lines):
Proxmox PCI Devices.png

So I tried reattaching the hard disk via IDE rather than SCSI and it actually shows up in the boot list. It doesn't actually boot though.

Then I tried running an Ubuntu live CD image in the CD drive, with the following results:

(All tests were done after editing the live CD kernel boot line to remove "quiet splash" and add "nomodeset modprobe.blacklist=nouveau", to keep the nouveau driver from blocking the QEMU VGA device from displaying the GUI.)
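For anyone repeating this, the GRUB edit itself is nothing special: press 'e' at the boot menu and change the linux line to roughly the following (exact paths depend on the Ubuntu image):

Code:
linux /casper/vmlinuz boot=casper nomodeset modprobe.blacklist=nouveau ---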

Ubuntu Live with disk attached via SCSI - stalls at "clocksource: switched to clocksource tsc", kernel panic or something
Proxmox scsi attached ubuntu live boot.JPG

Ubuntu Live with disk attached as IDE - Boots fine, network won't connect for some reason, all disks and PCI devices accounted for. Even though I can't boot to the IDE disk, I can definitely mount it from the Ubuntu live session.

proxmox ide attached lspci 1.JPG
proxmox ide attached lspci 2.JPG

I then cloned the VM but started from scratch with a new IDE disk, new NIC, and new EFI disk. I was able to install Ubuntu server successfully but something is definitely wrong with the NICs since it wasn't able to pull a DHCP address or get online during the installation process. However, after the install, the VM wouldn't boot either.

Proxmox Pcie IDE Ubuntu install no boot.JPG

When attaching the disk via IDE, it actually shows up in the boot menu, but it definitely won't boot. The screen just flashes black and then goes back to the boot menu.

Proxmox IDE working boot menu.JPG

tl;dr: With 9 GPUs passed through, all of the GPUs appear to be passed through properly. However, the SCSI disk disappears from the guest's point of view even though the controller still shows up on the PCI bus. The disk does appear to the guest when connected via IDE, but is still not bootable. Something is also definitely messed up with the NICs (tested both VirtIO and E1000), even when booting a CD image: I haven't been able to get networking to work either in the Ubuntu live session with the main disk attached via IDE or during the Ubuntu install process on the cloned test VM.

I think Linus Tech Tips had this issue and had to enable "above 4G decoding" in the BIOS of the host machine...


This is a Supermicro X9DRX motherboard, for those interested. I checked, and "Above 4G decoding" is already enabled. Although perhaps OVMF needs to have its own "Above 4G decoding" setting that doesn't currently exist?

X9DRX BIOS 4G Decoding.JPG
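The closest thing I have found to an OVMF-side knob is a fw_cfg option that is supposed to let you enlarge OVMF's 64-bit PCI MMIO window. I have not confirmed whether it is relevant to this problem, but if anyone wants to experiment, it can apparently be passed via an args: line, something like:

Code:
# untested: asks OVMF for a 64 GB 64-bit MMIO aperture (value is in MB)
# (would need to be merged with any existing args: line in the VM config)
args: -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536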
 
Did you also install 2 CPUs as specified by Supermicro?

See here https://www.supermicro.com/products/motherboard/Xeon/C600/X9DRX_-F.cfm

PCI-Express
  • 10 PCI-E 3.0 x8 slots
  • 1 PCI-E 2.0 x4 (in x8) slot

    (Both CPUs need to be installed for full access to PCI-E slots and onboard controllers. See manual block diagram for details.)

Yes. This is running 2 x Xeon E5-2667 CPUs. I don't believe this is a hardware problem.

In fact, for the time being I am running with 8 GPUs passed through using the usual hostpci0: 01:00.0,pcie=1 lines, and the last two GPUs are passed through by adding an args: line with two extra vfio-pci devices to the VM config file (shown below). This technically works, but I believe it places those two GPUs directly on the PCI bus without a root port, which is not recommended according to the QEMU PCI docs that I have seen.
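For reference, the relevant lines in the VM config currently look like this (the args entry is a single line in the file):

Code:
hostpci7: 84:00.0,pcie=1
args: -device 'vfio-pci,host=85:00.0,multifunction=on' -device 'vfio-pci,host=86:00.0,multifunction=on'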
 
this may well be a limitation of the pci spec / io space (see the document you linked for details)
the document states that only 10 pcie devices with io space requirements can be used on pcie root ports

so your solution of putting multiple devices in a single root port may be the only variant for now
but you can do this with our config (without args) by specifying multiple pcie devices on one line, like this:

hostpci0: 01:00;02:00,pcie=1....
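applied to your config, that could look e.g. like this (combining the two GPUs you currently have commented out behind one root port):

hostpci8: 85:00;86:00,pcie=1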
 
Thanks for the input. I would think that since each GPU is on its own root port, there wouldn't be any iospace limitations but I haven't really dug into it a ton.

I did try what you suggested regarding specifying multiple pcie devices on one line but unfortunately it appears the Nvidia driver had no idea what to do with that.

This PCIe configuration:
Code:
hostpci0: 01:00.0,pcie=1
hostpci1: 02:00.0,pcie=1
hostpci2: 03:00.0,pcie=1
hostpci3: 04:00.0,pcie=1
hostpci4: 05:00.0,pcie=1
hostpci5: 82:00.0,pcie=1
hostpci6: 83:00.0;84:00.0,pcie=1
#hostpci7: 84:00.0,pcie=1

Produced 8 GPU devices in the VM:

Code:
lspci -t -vvv
-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
           +-01.0  Device 1234:1111
           +-10.0-[01]----00.0  NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
           +-10.1-[02]----00.0  NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
           +-10.2-[03]--+-00.0  NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
           |            \-00.1  NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
           +-1a.0  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4
           +-1a.1  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5
           +-1a.2  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6
           +-1a.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2
           +-1b.0  Intel Corporation 82801I (ICH9 Family) HD Audio Controller
           +-1c.0-[04]----00.0  NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
           +-1c.1-[05]----00.0  NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
           +-1c.2-[06]----00.0  NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
           +-1c.3-[07]----00.0  NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
           +-1d.0  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
           +-1d.1  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
           +-1d.2  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
           +-1d.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
           +-1e.0-[08-0c]--+-01.0-[09]--+-05.0  Red Hat, Inc. Virtio SCSI
           |               |            \-12.0  Red Hat, Inc. Virtio network device
           |               +-02.0-[0a]--
           |               +-03.0-[0b]--
           |               \-04.0-[0c]--
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller

However, only 7 GPUs would appear in 'nvidia-smi', and I found the following in the kernel log:

Code:
[    3.214882] vgaarb: device changed decodes: PCI:0000:03:00.0,olddecodes=io+mem,decodes=none:owns=none
[    3.322641] vgaarb: device changed decodes: PCI:0000:03:00.1,olddecodes=io+mem,decodes=none:owns=none
[    3.327429] NVRM: GPU 0000:03:00.0: rm_init_private_state() failed!
[    3.329674] nvidia: probe of 0000:03:00.1 failed with error -1
[    3.330926] vgaarb: device changed decodes: PCI:0000:04:00.0,olddecodes=io+mem,decodes=none:owns=none
[    3.340299] tsc: Refined TSC clocksource calibration: 2900.028 MHz
[    3.341225] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x29cd5d5b3f6, max_idle_ns: 440795203668 ns
[    3.437231] vgaarb: device changed decodes: PCI:0000:05:00.0,olddecodes=io+mem,decodes=none:owns=none
[    3.544855] vgaarb: device changed decodes: PCI:0000:06:00.0,olddecodes=io+mem,decodes=none:owns=none
[    3.653360] vgaarb: device changed decodes: PCI:0000:07:00.0,olddecodes=io+mem,decodes=none:owns=none
[    3.764141] NVRM: The NVIDIA probe routine failed for 1 device(s).
[    3.765538] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.33.01  Wed Nov 13 00:00:22 UTC 2019
[    3.787850] [drm] Initialized drm 1.1.0 20060810
[    4.010054] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[    4.035962] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  440.33.01  Tue Nov 12 23:43:11 UTC 2019
[    4.044292] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    4.045828] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
[    4.047387] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[    4.049068] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[    4.050566] [drm] [nvidia-drm] [GPU ID 0x00000500] Loading driver
[    4.051990] [drm] [nvidia-drm] [GPU ID 0x00000600] Loading driver
[    4.053497] [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver

I might play around with it some more, but it looks like I may have to keep the last two GPUs configured with args directly on the PCI bus.
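If I get a chance, I also want to check from inside the guest whether the GPU the driver rejects actually received its legacy I/O BAR, since that would line up with the io space theory. Something along these lines should show it (03:00.0 being the device from the log above):

Code:
# show the BAR / resource assignments for the GPU the NVIDIA driver rejected
lspci -vv -s 03:00.0 | grep -E 'Region|I/O'
# the same information via sysfs
cat /sys/bus/pci/devices/0000:03:00.0/resource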
 
