Problems with GPU Passthrough since 8.2

I was really happy because NVIDIA announced that GRID v16.9 is compatible with the 6.8 kernel, so I updated to be able to use my P4s. Now I see that, on that kernel, AMD and Supermicro broke PCI passthrough of my T1000 on my M11SDV-8C+-LN4F. Instead of flashing an old BIOS or dealing with motherboard firmware, I recommend simply pinning the latest 6.5 kernel, which is a lot more stable and doesn't have the buggy PCI passthrough:

proxmox-boot-tool kernel pin 6.5.13-6-pve
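
After rebooting, it is worth confirming that the pinned kernel is really the one running. A minimal sketch, assuming the version string pinned above; `check_pinned_kernel` is a hypothetical helper name, not part of any Proxmox tool:

```shell
#!/bin/sh
# Hypothetical helper: compare the running kernel with the version you pinned.
# Usage: check_pinned_kernel <expected-version> [actual-version]
# The second argument defaults to `uname -r`; it exists only to make testing easy.
check_pinned_kernel() {
    expected=$1
    actual=${2:-$(uname -r)}
    if [ "$actual" = "$expected" ]; then
        echo "OK: running pinned kernel $actual"
    else
        echo "MISMATCH: running $actual, expected $expected"
        return 1
    fi
}
```

Call it as `check_pinned_kernel 6.5.13-6-pve` after the reboot. `proxmox-boot-tool kernel list` should show the pinned entry, and `proxmox-boot-tool kernel unpin` removes the pin again.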
 
I have a Dell T340 running PVE 8.3.3 with kernel 6.8.12-8, an Intel CPU, and an H330 RAID card set to IT mode. When I try to pass the H330 through, I get this error: vfio-pci 0000:01:00.0: Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.

When I run dmesg | grep -e DMAR -e IOMMU -e AMD-Vi, it shows a firmware bug with the RMRR entry for the device that is having issues. What I don't know is whether the bug is in the H330 card firmware or in the motherboard/kernel firmware. When I list the IOMMU groups, it looks like everything connected directly to the CPU ends up in group 1. When I try to create a mapped device, PVE warns: "A selected device is not in a separated IOMMU group, Make sure this is intended".

I have an HBA330 on order, but I have a feeling it won't make a difference in the end, since this thread makes it sound like a BIOS issue. I have tried relax_rmrr, but that has not helped. I'm not sure I am brave enough to build my own kernel. Does anyone have any ideas on something else I can try?

Code:
dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
[    0.005935] ACPI: DMAR 0x000000006FFD2000 000090 (v01 DELL   PE_SC3   00000002 DELL 00000001)
[    0.005950] ACPI: Reserving DMAR table memory at [mem 0x6ffd2000-0x6ffd208f]
[    0.169706] DMAR: IOMMU enabled
[    0.169706] DMAR: Intel-IOMMU: assuming all RMRRs are relaxable. This can lead to instability or data loss
[    0.424046] DMAR: Host address width 39
[    0.424047] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[    0.424053] DMAR: dmar0: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
[    0.424057] DMAR: RMRR base: 0x000000507f6000 end: 0x000000587fdfff
[    0.424061] DMAR: RMRR base: 0x0000006b6ca000 end: 0x0000006b6e9fff
[    0.424064] DMAR-IR: IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 0
[    0.424066] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[    0.424067] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.427218] DMAR-IR: Enabled IRQ remapping in x2apic mode
[    0.641430] DMAR: [Firmware Bug]: RMRR entry for device 02:00.0 is broken - applying workaround
[    0.641434] DMAR: No ATSR found
[    0.641435] DMAR: No SATC found
[    0.641436] DMAR: dmar0: Using Queued invalidation
[    0.641717] DMAR: Intel(R) Virtualization Technology for Directed I/O
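
The `[Firmware Bug]` line in the log above names the device whose RMRR entry the kernel considers broken. A small sketch to pull those addresses out of `dmesg`; the function name is hypothetical, and the pattern assumes the exact message wording shown in this log:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print the PCI
# address of every device flagged with a broken RMRR entry.
rmrr_broken_devices() {
    sed -n 's/.*\[Firmware Bug\]: RMRR entry for device \([0-9a-fA-F:.]*\) is broken.*/\1/p'
}
```

Use it as `dmesg | rmrr_broken_devices`, then compare the result with the address in the vfio-pci error message.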


Code:
for d in $(find /sys/kernel/iommu_groups/ -type l | sort -n -k5 -t/); do
    n=${d#*/iommu_groups/*}; n=${n%%/*}
    printf 'IOMMU Group %s ' "$n"
    lspci -nns "${d##*/}"
done;
IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation Device [8086:3e31] (rev 0d)
IOMMU Group 1 00:01.0 PCI bridge [0604]: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 0d)
IOMMU Group 1 00:01.1 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) [8086:1905] (rev 0d)
IOMMU Group 1 01:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 [8086:1528] (rev 01)
IOMMU Group 1 01:00.1 Ethernet controller [0200]: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 [8086:1528] (rev 01)
IOMMU Group 1 02:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] [1000:005f] (rev 02)
IOMMU Group 2 00:08.0 System peripheral [0880]: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
IOMMU Group 3 00:12.0 Signal processing controller [1180]: Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379] (rev 10)
IOMMU Group 4 00:14.0 USB controller [0c03]: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d] (rev 10)
IOMMU Group 4 00:14.2 RAM memory [0500]: Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f] (rev 10)
IOMMU Group 5 00:16.0 Communication controller [0780]: Intel Corporation Cannon Lake PCH HECI Controller [8086:a360] (rev 10)
IOMMU Group 5 00:16.4 Communication controller [0780]: Intel Corporation Cannon Lake PCH HECI Controller #2 [8086:a364] (rev 10)
IOMMU Group 6 00:17.0 SATA controller [0106]: Intel Corporation Cannon Lake PCH SATA AHCI Controller [8086:a352] (rev 10)
IOMMU Group 7 00:1c.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 [8086:a338] (rev f0)
IOMMU Group 8 00:1c.1 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #2 [8086:a339] (rev f0)
IOMMU Group 9 00:1e.0 Communication controller [0780]: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller [8086:a328] (rev 10)
IOMMU Group 10 00:1f.0 ISA bridge [0601]: Intel Corporation Cannon Point-LP LPC Controller [8086:a309] (rev 10)
IOMMU Group 10 00:1f.4 SMBus [0c05]: Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323] (rev 10)
IOMMU Group 10 00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller [8086:a324] (rev 10)
IOMMU Group 11 03:00.0 PCI bridge [0604]: PLDA PCI Express Bridge [1556:be00] (rev 02)
IOMMU Group 11 04:00.0 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller [102b:0536] (rev 04)
IOMMU Group 12 05:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]
IOMMU Group 12 05:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]
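
To see only the devices that share a group with the card being passed through, the full listing can be narrowed to one device. A rough sketch, assuming the standard sysfs layout under /sys/kernel/iommu_groups; `iommu_group_peers` is a hypothetical name and the optional second argument exists only for testing:

```shell
#!/bin/sh
# Hypothetical helper: list every PCI device that shares an IOMMU group
# with the given device (full address including domain, e.g. 0000:02:00.0).
# Usage: iommu_group_peers <pci-address> [sysfs-root]
iommu_group_peers() {
    dev=$1
    root=${2:-/sys}
    for g in "$root"/kernel/iommu_groups/*; do
        if [ -e "$g/devices/$dev" ]; then
            echo "group ${g##*/}:"
            ls "$g/devices"
            return 0
        fi
    done
    echo "device $dev not found in any IOMMU group" >&2
    return 1
}
```

For the H330 in this listing that would be `iommu_group_peers 0000:02:00.0`. Devices in the same IOMMU group cannot be isolated from each other, which is what the PVE warning about a non-separated group refers to.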
 
Is there any news?
Does it make sense for us to contact Supermicro ourselves?
Is there a ticket number we could reference, or a "special" email address?
Happy holidays!
They have stopped responding to my mails.
 
Might help somebody:
I was also struggling with the "Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor." error while trying to pass through an integrated NIC on an AMD-based platform.
It turned out the solution was to disable RDMA in the BIOS. Error gone, passthrough works. Kernel: 6.11.11-2
 
Has anyone tested whether this issue is mitigated on PVE 9 with the 6.14 kernel?

Edit: I spent the whole morning trying to make kernels 6.8 and 6.14 work with my T1000 passthrough, without success. I tried the relaxed settings, various GRUB cmdline configs, and even involved ChatGPT. I updated my M11SDV-8C+-LN4F BIOS to the latest version, M11SDV_1.5_AS03.17.05_SUM2.14.0-p8, and nothing seems to fix the "Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1" message. So the current picture is: we cannot upgrade to PVE 9 with any guarantee, because the only fix is to downgrade to the 6.5 kernel and PVE 9 uses 6.14. In my view, keeping the 6.5 kernel across major PVE updates is not a reliable fix (but for the moment it is the only one we have).

My conclusion is that there is an AMD/Supermicro regression that prevents us from updating the kernel, so we now depend on the vendors to fix it. I hope some Proxmox staff members can help escalate the issue.

This was even reported in the "Opt-in Linux 6.8 Kernel for Proxmox VE 8" thread, about a year ago.


@Stoiko Ivanov Could someone on the team please look at this? Or at least help us debug it further to find a fix?

It seems that someone in that thread (@athurdent) even found the commit that broke passthrough: https://lore.kernel.org/lkml/8cc1d69e-f86d-fd04-7737-914d967dc0f5@intel.com/
 