Proxmox locks up when trying pci/pcie passthrough on Ryzen 7 3700X system

sumguy

Member
Apr 5, 2022
I have a remote datacenter system with an Asus Pro WS 565-ACE, a Ryzen 7 3700X and 128 GB of RAM. The BIOS they run in the datacenter is somewhat custom; I have requested that they flash the latest retail version, but they do not allow that.

For some reason the system always locks up when I try PCI/PCIe passthrough. I run TrueNAS virtualized at home and also at the datacenter. At home, on a different AMD machine, passthrough worked fine: I was able to pass the onboard SATA controllers to the TrueNAS VM and it has never given a single problem. The drives at the datacenter had to be attached individually to the TrueNAS VM, and later half of the disks became "degraded," so I wanted to try again to get passthrough working to see if that alleviates the issue.

I've done everything here: https://pve.proxmox.com/wiki/Pci_passthrough

and here: https://pve.proxmox.com/wiki/PCI(e)_Passthrough

I've tried different combinations. These are all the devices sharing the same interrupt I guess? I cross-referenced and copied/pasted the names for convenience.

/sys/kernel/iommu_groups/14/devices/0000:01:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 43ef
/sys/kernel/iommu_groups/14/devices/0000:01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43eb
/sys/kernel/iommu_groups/14/devices/0000:01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43e9
/sys/kernel/iommu_groups/14/devices/0000:02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43ea
/sys/kernel/iommu_groups/14/devices/0000:02:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43ea
/sys/kernel/iommu_groups/14/devices/0000:02:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43ea
/sys/kernel/iommu_groups/14/devices/0000:02:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43ea
/sys/kernel/iommu_groups/14/devices/0000:03:00.0 SATA controller: JMicron Technology Corp. JMB58x AHCI SATA controller
/sys/kernel/iommu_groups/14/devices/0000:04:00.0 SATA controller: JMicron Technology Corp. JMB58x AHCI SATA controller
/sys/kernel/iommu_groups/14/devices/0000:05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
/sys/kernel/iommu_groups/14/devices/0000:06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
/sys/kernel/iommu_groups/14/devices/0000:07:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)

I think the drives are attached to the JMicron adapters, though I also tried the AMD SATA controllers. I tried specifying the vfio-pci ids option, and I also tried blacklisting all the drivers associated with those devices - there were two, ahci and another one for USB or something. It seems no matter what I do, the moment I start the VM the system locks up, every time.
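For reference, this is roughly the shape of the config I'm describing - the device ID below is only a placeholder, not necessarily my exact hardware:

# /etc/modprobe.d/vfio.conf - bind the controller to vfio-pci at boot
# (197b:0585 is a placeholder vendor:device ID - take the real one from "lspci -nn")
options vfio-pci ids=197b:0585

# /etc/modprobe.d/pve-blacklist.conf - keep the host driver off it
# (note: blacklisting ahci affects every AHCI controller on the host)
blacklist ahci

# then rebuild the initramfs and reboot
update-initramfs -u -k all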
 
These are all the devices sharing the same interrupt I guess?
Not interrupt, but IOMMU group. You cannot pass through a single device of such a group and simultaneously use the others on the host; that leads to exactly the problem you described.
How the devices are grouped depends on the motherboard and the BIOS, so there is not really anything to configure here (sometimes you have to enable some BIOS switches though; this also depends on the actual hardware and BIOS version ...).
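If you want to double-check the grouping yourself, a small loop like this lists every device per IOMMU group:

for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    lspci -nns "${d##*/}"
  done
done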
 
This is what I was afraid of. Thank you so much for your answer. It's kind of annoying and surprising that all of these are in the same group. I need the Ethernet card for the host system, I'm not even sure what the PCI bridges are, and wow, the graphics controller is in there as well. How did they all wind up in the same group? Grrrr. Somehow I need to get the SATA controllers out.
 
That's probably because NICs, storage controllers, USB controllers and the BMC (with its VGA, NIC and sound card) are usually connected directly to the chipset and not to the CPU, and all of those devices then share the same few (4?) PCIe lanes between CPU and chipset.
Your chances that PCIe cards will work are usually much higher, especially when putting the card into a PCIe slot that is connected directly to the CPU and not to the chipset.
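A quick way to see which devices hang off the chipset and which off the CPU is the PCIe tree view:

lspci -tvnn

Devices nested under the chipset's upstream bridge (the 43e9/43ea devices in the listing above) share that CPU-chipset link, while CPU-attached slots show up directly under the root complex.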
 
I'm working with support to see if there is anything they can do to split some of these up a bit. Not exactly a virtualization-friendly configuration! I'll post what happens. My home lab server doesn't have this issue at all; everything is split up nicely across lots of different IOMMU groups.

Home lab server:

/sys/kernel/iommu_groups/23/devices/0000:0a:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
/sys/kernel/iommu_groups/23/devices/0000:02:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a4]

/sys/kernel/iommu_groups/22/devices/0000:09:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
/sys/kernel/iommu_groups/22/devices/0000:02:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a4]

Each of those IOMMU groups - 22 and 23 - contains only those two devices. I have had the SATA controllers passed through to the VM for months with no problem. I did not pass the PCI bridges over - is that even possible, or recommended?

If I did any more passing through, I believe I would run into that limit problem. I don't think that has been resolved - the 3 or 4 or 5 PCIe device passthrough limit?
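For context, the passthrough entries in my VM config look roughly like this (the addresses here are only illustrative):

# /etc/pve/qemu-server/<vmid>.conf (illustrative addresses)
machine: q35
hostpci0: 09:00.0,pcie=1
hostpci1: 0a:00.0,pcie=1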
 
Someone last week tried to pass through 12 GPUs, but it looked like Q35 can't handle more than 10 PCIe devices.

Are you sure those PCIe bridges weren't passed through? Until now I thought you always pass through all devices of an IOMMU group, no matter which member of the group you select for passthrough. Maybe @avw knows that better.
 
Maybe it passes them through automatically? How can I tell? I never selected them - that is for sure.

I'm pretty sure I ran into the limit on how much hardware can be passed to a single VM when I tried to pass through the SATA controllers and NVMe cards. I can try again - has there been any update on that?
 
I have not seen a 565 chipset for Ryzen before, but it does indeed look like it has the same big chipset IOMMU group as all Ryzen chipsets except for X570. In that case, you can only use the PCIe and M.2 slots whose PCIe lanes are connected directly to the CPU (20 lanes at most, often one x4 M.2 and one x16 or two x8 slots). All other PCIe slots go via the chipset.

PCI(e) bridges in an IOMMU group don't need to be explicitly passed to a VM. You also don't need to pass all devices in a group to a VM. It's just that you cannot share devices in the same group between VMs, or between a VM and the host. Just pass through the few devices that you want to a VM; the other devices from the same group(s) are automatically removed from the host to preserve group isolation (but don't need to go to the VM).
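One way to check what happened to a given device after starting the VM is to look at which driver it is bound to, for example:

lspci -nnk -s 03:00.0

If the device was claimed for passthrough, the "Kernel driver in use" line should read vfio-pci instead of ahci.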

You can try breaking the IOMMU group isolation with the pcie_acs_override kernel parameter (built into Proxmox), but this does not guarantee that the devices will work when passed to a VM, and it comes with a security issue: since PCI(e) devices can read memory and communicate with other devices in the same IOMMU group, they can pass information from one VM to another VM or to the host, and the other way around.
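If you do want to experiment with it despite those caveats, the parameter goes on the kernel command line; with GRUB that would be something like the line below, followed by update-grub (or proxmox-boot-tool refresh on systemd-boot setups) and a reboot. Keep whatever IOMMU options you already set per the wiki:

# /etc/default/grub - one commonly used value for the override
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_acs_override=downstream,multifunction"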

I don't know the limit for VMs, but I assume it can be increased by adding additional virtual PCIe root devices. That is not supported by Proxmox, though, and you would need to do it yourself by passing additional arguments to QEMU/KVM via the args: VM configuration setting. I believe Proxmox only supports 5 devices, which can be multi-function (like GPU+audio, which only takes up one hostpci setting).
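As an illustration of the multi-function point: passing a device address without the function number hands all of its functions to the VM through a single hostpci entry, e.g. (0b:00 is an illustrative address):

# one entry covering a multi-function device (e.g. GPU at 0b:00.0 plus its audio at 0b:00.1)
hostpci0: 0000:0b:00,pcie=1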
 