Hello,
I am building a proof-of-concept server for performing GPGPU-related calculations. The idea is to pass all GPUs through to VMs for easier testing of code and better isolation between program and host (if anything goes wrong, in the best case only the VM has to be stopped, or at least the host does not hang and can easily be rebooted without having to press a physical reset button).
The base specs of the machine:
- Motherboard: Supermicro X9DRi-LN4F+
- CPU: 2x Intel Xeon E5-2670
- RAM: 96GB DDR3
- GPU: 6x AMD RX570 4GB (Sapphire Nitro+)
- SSDs + ZFS + Kernel 5.4.101 (LTS) + VFIO modules + ACS patch
As the motherboard has only 6 PCIe slots, they are populated like this:
- CPU0, slot 1: GPU
- CPU0, slot 2: GPU
- CPU0, slot 3: GPU
- CPU1, slot 4: NVMe SSD
- CPU1, slot 5: NVMe SSD
- CPU1, slot 6: ASM1184e PCIe Switch Port (https://www.amazon.com/XT-XINTE-PCI-express-External-Adapter-Multiplier/dp/B07CWPWDF8)
-> port 1: GPU
-> port 2: GPU
-> port 3: GPU
IOMMU groups: https://pastebin.com/SvuWtGcz
GPU1 PCI details (same for GPU2-3 except the addresses): https://pastebin.com/6wh4Hz8v
GPU6 PCI details (slightly differ from GPU1-3, same for GPU4-5 except the addresses): https://pastebin.com/aDRfiLXA
I have successfully set up VFIO passthrough for the GPUs in slots 1-3 (CPU0) with ...
/etc/modprobe.d/vfio-pci.conf:
options vfio-pci ids=1002:67df,1002:aaf0 disable_vga=1
/etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_acs_override=id:1b21:1184"
... and then a VM (Windows or Linux) with GPU1-3 assigned works perfectly.
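In case it helps, a quick way to double-check the binding and the grouping from the host is something like this (assuming the usual sysfs layout; the lspci filter just uses the RX 570 IDs from above):
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d#/sys/kernel/iommu_groups/}; g=${g%%/*}
    printf 'IOMMU group %s: ' "$g"
    lspci -nns "${d##*/}"
done
lspci -nnk -d 1002:67df    # shows which kernel driver each GPU is currently bound to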
The problem appears when I also try to pass through GPU4-6 (CPU1), which sit behind the PCIe switch. It doesn't matter whether I pass through only one of those GPUs or all of them, the result is the same. When I start the VM I see this repeated multiple times in dmesg:
DMAR: DRHD: handling fault status reg 40
This line also appears at machine boot, just before: DMAR-IR: Enabled IRQ remapping in x2apic mode
However, a while after the VM starts booting, I also start getting kernel errors, which I pasted here: https://pastebin.com/zRVJ63yZ (the host hangs at that point and just keeps throwing those errors via ssh/dmesg -wH; they vary slightly, but what's in the paste is all I managed to capture).
I have tried a lot of different configurations, from changing intel_iommu options (igfx_off, sp_off, ...), to allowing unsafe interrupts, to changing VM args/settings, and I can't figure out what is going wrong. I don't know enough about the kernel and its internals to understand what the errors being thrown mean.
I found a partial solution, which unfortunately works only for Linux guests, but I'd like Windows to work as well. The trick is to remove pcie_acs_override and add pci=nommconf. Then the devices behind the PCIe switch end up in a single IOMMU group, and when assigning them to a Linux VM there is no error at all and everything is recognized. In Windows, on the other hand, I always get an exclamation mark in Device Manager showing that there are not enough resources for the device to work properly, and I first have to disable the pcie option in the VM settings, boot, shut down, and add the pcie option again, because otherwise the VM doesn't boot at all.
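For completeness, the Linux-only variant of the kernel command line then looks roughly like this (same as above, just with the ACS override removed and nommconf added):
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pci=nommconf"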
I have now spent almost a week trying to figure out what's causing these issues, and it seems I don't have enough skill to work it out by myself. I would be very happy for any help/tip to get this resolved, if it's even possible, or at least confirmation that there is no way this will work correctly. Passing through the complete PCIe switch together with all the GPUs on it would also be fine (I already tried to do that, but the GPUs are enumerated at boot time and I can't unbind/remove them later on, as the pcieport module is already in use from boot).
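What I attempted was roughly this kind of manual rebind via sysfs (the address is just a placeholder, not one of my real ones), but the switch ports themselves remain claimed by pcieport from boot:
echo 0000:83:00.0 > /sys/bus/pci/devices/0000:83:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:83:00.0/driver_override
echo 0000:83:00.0 > /sys/bus/pci/drivers_probe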
Thank you very much!