Pass through error when using PCIe Switch Port

mnovi
Mar 9, 2021
Hello,

I am building a proof-of-concept server for some GPGPU calculations. The idea is to pass through all GPUs to VMs for easier testing of code and better isolation between the programs and the host: if anything goes wrong, in the best case only the VM has to be stopped, or at least the host does not hang and can be rebooted without having to press a physical reset button.

The base specs of machine:
- Motherboard: Supermicro X9DRi-LN4F+
- CPU: 2x Intel Xeon E5-2670
- RAM: 96GB DDR3
- GPU: 6x AMD RX570 4GB (Sapphire Nitro+)
- SSDs + ZFS + Kernel 5.4.101 (LTS) + VFIO modules + ACS patch

As the motherboard has only 6 PCIe slots, they are populated like this:
- CPU0, slot 1: GPU
- CPU0, slot 2: GPU
- CPU0, slot 3: GPU
- CPU1, slot 4: NVMe SSD
- CPU1, slot 5: NVMe SSD
- CPU1, slot 6: ASM1184e PCIe Switch Port (https://www.amazon.com/XT-XINTE-PCI-express-External-Adapter-Multiplier/dp/B07CWPWDF8)
-> port 1: GPU
-> port 2: GPU
-> port 3: GPU

IOMMU groups: https://pastebin.com/SvuWtGcz
GPU1 PCI details (same for GPU2-3 except the addresses): https://pastebin.com/6wh4Hz8v
GPU6 PCI details (slightly different from GPU1-3; same for GPU4-5 except the addresses): https://pastebin.com/aDRfiLXA

I have successfully set up VFIO passthrough for the GPUs in slots 1-3 (CPU0) with ...

/etc/modprobe.d/vfio-pci.conf:
options vfio-pci ids=1002:67df,1002:aaf0 disable_vga=1

/etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_acs_override=id:1b21:1184"

... and then everything works perfectly in VMs (Windows, Linux) with GPU1-3 assigned.
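
A quick way to double-check that vfio-pci actually claimed the cards is to query by the vendor:device IDs from the modprobe config above (generic commands, adjust if your IDs differ):

lspci -nnk -d 1002:67df   # each GPU function should show "Kernel driver in use: vfio-pci"
lspci -nnk -d 1002:aaf0   # same for the HDMI audio functions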

The problem appears when I also try to pass through GPU4-6 (CPU1), which sit behind the PCIe switch. It doesn't matter if I try to pass through only one of those GPUs, the result is the same. When I start the VM I see this line repeated multiple times in dmesg:
DMAR: DRHD: handling fault status reg 40

This line is also present once during boot, just before: DMAR-IR: Enabled IRQ remapping in x2apic mode

However, some time after the VM starts, I also begin to get kernel errors, which I pasted here: https://pastebin.com/zRVJ63yZ (the host hangs at that point and just keeps throwing such errors via ssh / dmesg -wH; they vary slightly, but I only caught what's in the paste).

I have tried a lot of different configurations, from changing intel_iommu options (igfx_off, sp_off, ...) to allowing unsafe interrupts and changing VM args/settings, and I can't figure out what is going wrong. I don't know enough about the kernel and its internals to understand what the thrown errors mean.

I found a partial solution, but unfortunately it only works for Linux, and I'd like Windows to work as well. The trick is to remove pcie_acs_override and add pci=nommconf. The PCIe switch devices then end up in a single IOMMU group, and when assigning them to a Linux VM there is no error at all; everything is recognized. In Windows, on the other hand, I always get an exclamation mark in Device Manager saying there are not enough resources for the device to work properly, and I have to disable the pcie option in the VM settings, boot, shut down and re-add the pcie option, because otherwise the VM doesn't boot at all.
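
For reference, that workaround boils down to a kernel command line roughly like this (a sketch of my original grub line with the ACS override dropped), followed by the usual update-grub and a reboot:

/etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pci=nommconf"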

I have now spent almost a week trying to figure out what's causing this, and it seems I don't have enough skill to find it out by myself. I would be very happy for any help or tips to get this resolved, if it is even possible, or at least for confirmation that there is no way this will work correctly. Passing through the complete PCIe switch with all the GPUs on it would also be fine (I already tried to do that, but the GPUs are enumerated at boot time and I can't unbind/remove them later, since the pcieport module claims the switch at boot).


Thank you very much!
 
My guesses are:

* there is some BIOS setting you missed for PCIe bifurcation, so that the devices behind the PCIe switch port can be properly separated
* the PCIe switch is not suited for this kind of use

Also, Windows is rather picky about the PCIe layout it needs in order to work properly.

Can you please post the output of 'pveversion -v' and the VM configs?
 
pveversion -v: https://pastebin.com/mTGsq5Tp
vm config: https://pastebin.com/z2Vfq3pG

For the VM config I also tried another q35 machine type (3.1), kernel irqchip on/off/split, removing pcie=1, and adding a romfile. For the PCIe switch there is no bifurcation (at least not in the form of splitting lanes in the BIOS); the chip on the card handles this (it is a cheap solution with a PCIe x1 uplink and low bandwidth), and this is probably the source of the problems. I can also post "screenshots" of the BIOS options, but with my knowledge of this I don't see anything special that could be changed (there are no options for ACS, IOMMU, ..., or at least not under those names).
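
To illustrate, the passthrough-related part of the VM config looks roughly like this (a sketch with placeholder PCI addresses, not my exact file - that one is in the pastebin above):

bios: ovmf
machine: q35
cpu: host
hostpci0: 0000:84:00,pcie=1,x-vga=1
hostpci1: 0000:85:00,pcie=1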

When I added pci=nommconf to the cmdline, the IOMMU group for the PCIe switch port ended up containing all the GPUs connected to it, and that allowed the VMs to boot, but Windows has problems with that (and the other 3 GPUs, which are not connected to the switch, stopped working). Here is the list of IOMMU groups with that option set: https://pastebin.com/5DU9khAR

If it's of any help, there are also IOMMU groups when no ACS patch for pcie switch port is being used: https://pastebin.com/MPZd3yJU
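
For anyone who wants to reproduce such a listing, a simple loop over sysfs like this should do (generic snippet, not Proxmox-specific):

for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    echo "  $(lspci -nns ${d##*/})"
  done
done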

I was also trying to pass through the complete PCIe switch port (as some people do with USB hubs), but I'm not sure that is possible (ideally, to my understanding, I would need to inject the pci-stub module before the kernel's pcieport driver claims the switch and enumerates all the devices under it, and then pass this device to the VM).
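
In case anyone wants to experiment with that early-binding idea, here is roughly what I mean (just a sketch; as far as I understand, VFIO can only be handed endpoint devices, not the bridge/switch ports themselves, so at best this pre-claims the GPUs behind the switch before any GPU driver touches them - the softdep on amdgpu is an assumption on my part):

/etc/modprobe.d/vfio-pci.conf:
softdep amdgpu pre: vfio-pci
options vfio-pci ids=1002:67df,1002:aaf0 disable_vga=1

Alternatively, the same IDs could be claimed even earlier via pci-stub.ids=1002:67df,1002:aaf0 on the kernel command line.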
 
Hi, mnovi

Did you find any solution for this? I have the same problem here.

Thanks
 
