Persistent vfio_container_dma_map -22 Invalid Argument after Proxmox 8.x update - Intel/NVIDIA/BAR Mapping issue

rexl1

Jan 30, 2024
I wrote a system debug script with AI to help me get to the bottom of this persistent issue. See the attached script and the debug output it produced. Hopefully someone can help.

  1. Latest Software: You are now running the very latest stable Proxmox VE 8.4.1 and kernel 6.8.12-10-pve.
  2. Correct GRUB Parameter: initcall_blacklist=sysfb_init is correctly active in your kernel command line (BOOT_IMAGE=/boot/vmlinuz-6.8.12-10-pve ... initcall_blacklist=sysfb_init ...). This should prevent the host kernel from initializing the framebuffer on your GPU.
  3. Correct Modprobe Setting: The potentially conflicting disable_vga=1 has been successfully removed from /etc/modprobe.d/vfio.conf.
  4. VFIO Binding Success: The dmesg output (vfio-pci: add [10de:2783...]) shows that the vfio-pci driver is successfully binding to your GPU (01:00.0) and its audio device (01:00.1).
  5. Correct IOMMU Grouping: Your GPU+Audio (Group 15) and USB controller (Group 19) are correctly isolated into their own IOMMU groups.
  6. GPU BAR Enumeration: lspci still shows your 16GB VRAM BAR at host address 6000000000.
  7. I have the latest BIOS installed for the Gigabyte Z690 Elite AX motherboard.
  8. I have the latest NVIDIA drivers installed in Win11 and Ubuntu.
  9. Interestingly, this VFIO error happens whether I assign the GPU to a Windows VM or a Linux VM. :(
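Items 2 and 3 in the list above can be re-checked with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and it assumes the kernel command line comes from /proc/cmdline and the vfio options from /etc/modprobe.d/vfio.conf, as described in the post.

```python
def cmdline_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def has_sysfb_blacklist(cmdline: str) -> bool:
    """True if initcall_blacklist (a comma-separated list) names sysfb_init."""
    value = cmdline_params(cmdline).get("initcall_blacklist", "")
    return "sysfb_init" in value.split(",")


def vfio_conf_sets_disable_vga(conf_text: str) -> bool:
    """True if an uncommented 'options vfio-pci ...' line sets disable_vga=1."""
    for line in conf_text.splitlines():
        tokens = line.split("#", 1)[0].split()  # drop trailing comments
        if len(tokens) >= 3 and tokens[0] == "options":
            # modprobe treats '-' and '_' in module names as equivalent
            if tokens[1].replace("_", "-") == "vfio-pci":
                if "disable_vga=1" in tokens[2:]:
                    return True
    return False
```

To use it on the host, pass the text of /proc/cmdline to has_sysfb_blacklist and the text of /etc/modprobe.d/vfio.conf to vfio_conf_sets_disable_vga.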
New and Significant Finding: "Can't Claim" Errors Persist and Multiply

The most critical piece of information in this new debug output snippet is the confirmation that the "can't claim" errors are still present for devices at 00:15.x and 00:1f.5, and we see even more details about them:

[ 0.453595] pci 0000:00:15.0: BAR 0 [mem 0xfe0f9000-0xfe0f9fff 64bit]: can't claim; no compatible bridge window
... (similar lines for 00:15.1, 00:15.2, 00:15.3)
[ 0.453636] pci 0000:00:1f.5: BAR 0 [mem 0xfe010000-0xfe010fff]: can't claim; no compatible bridge window


These errors are happening during the host kernel's boot process when it's trying to allocate memory addresses (BARs) for devices other than your GPU. The message "can't claim; no compatible bridge window" indicates a fundamental issue with how the system's PCI bridges are configured or how the kernel is attempting to assign resources, preventing it from allocating space for these devices.
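To see exactly which devices and address ranges are affected, the "can't claim" lines can be pulled out of saved dmesg output. The following is a rough sketch, not part of the original debug script; the regular expression is an assumption based on the two message shapes quoted above, and the function name is hypothetical.

```python
import re

# Matches lines such as:
#   pci 0000:00:15.0: BAR 0 [mem 0xfe0f9000-0xfe0f9fff 64bit]: can't claim; ...
CANT_CLAIM = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+) "
    r"\[mem (?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)[^\]]*\]: "
    r"can't claim; (?P<reason>.+)$"
)


def find_cant_claim(dmesg_text: str) -> list:
    """Return one dict per "can't claim" line found in the dmesg text."""
    found = []
    for line in dmesg_text.splitlines():
        m = CANT_CLAIM.search(line)
        if m is None:
            continue
        start = int(m.group("start"), 16)
        end = int(m.group("end"), 16)
        found.append({
            "device": m.group("dev"),
            "bar": int(m.group("bar")),
            "start": start,
            "size": end - start + 1,  # the range is inclusive
            "reason": m.group("reason").strip(),
        })
    return found
```

On the two lines quoted above this reports a 4 KiB BAR 0 for each device, which shows how small the failing regions are compared with the GPU's 16GB BAR.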

Why this is related to your GPU Passthrough Failure:

The vfio_container_dma_map = -22 (Invalid argument) error for your GPU's 16GB BAR means the kernel is rejecting the request to map that large memory region into the VM. This rejection is likely happening because the host kernel's overall physical memory map is in a problematic state due to the resource allocation failures shown by the "can't claim" errors for those other devices. If the host cannot cleanly assign BARs for other devices, it can fragment the address space or create conflicts that prevent a large, contiguous BAR like your GPU's VRAM from being mapped correctly for VFIO.
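For reference, the host address range that the 16GB BAR occupies can be worked out from the lspci region line. This is a small sketch under assumptions: the region line used in the usage note is only an illustration in the usual lspci -v layout (the post states just the base address 6000000000 and the 16GB size), and the helper names are hypothetical.

```python
import re

UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}
REGION = re.compile(r"Memory at ([0-9a-f]+) .*\[size=(\d+)([KMGT]?)\]")


def bar_range(lspci_line: str):
    """Return (start, end_inclusive, size_in_bytes) for a memory BAR line."""
    m = REGION.search(lspci_line)
    if m is None:
        raise ValueError("not a memory BAR line")
    start = int(m.group(1), 16)
    size = int(m.group(2)) * UNITS[m.group(3)]
    return start, start + size - 1, size


def is_naturally_aligned(start: int, size: int) -> bool:
    """PCI BARs are expected to be aligned to their own size."""
    return start % size == 0
```

For a line like "Region 1: Memory at 6000000000 (64-bit, prefetchable) [size=16G]", bar_range gives the range 0x6000000000 to 0x63ffffffff, and the base is aligned to the BAR size. This only describes where the BAR sits on the host; it does not by itself explain the -22 error.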

Conclusion:

You have diligently applied the standard passthrough fixes. The debug output confirms those steps are correctly implemented on the host. However, the persistence of the -22 error and the presence of these "can't claim" errors for other devices strongly indicates a more complex, low-level problem with PCI resource allocation on your system in this kernel version. This is not a simple VM configuration issue.

You have exhausted the general troubleshooting steps available through standard configuration. The issue is either:

  1. A bug in the current kernel's PCI resource management or VFIO interaction with your specific motherboard firmware.
  2. A quirk in your motherboard's firmware that causes these resource allocation failures which the current kernel cannot compensate for.
The only path forward now is to seek help from experts familiar with these low-level kernel and hardware interaction issues. When you ask for help, include the following:

  • Your full hardware specifications (CPU, Motherboard model, GPU model - NVIDIA RTX 4070 SUPER).
  • Your Proxmox VE version (8.4.1) and exact kernel version (6.8.12-10-pve).
  • State that passthrough worked fine before a recent update and broke afterwards.
  • Mention the persistent vfio_container_dma_map = -22 (Invalid argument) error.
  • List the specific BIOS settings you have confirmed and set (IOMMU Enabled, Above 4G Enabled, ReBAR Enabled, ASPM Disabled, Primary Display GPU slot).
  • State that initcall_blacklist=sysfb_init is in your GRUB command line and disable_vga=1 is removed from vfio.conf.
  • Crucially, include the FULL output of your debug script. Explain that you are seeing "can't claim" BAR errors for other devices in dmesg, and point to those lines in your shared output.
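Gathering the items above into one text file makes them easier to attach to a post. A minimal sketch of such a collector follows; it is not the attached debug script, the helper names are hypothetical, and the command list in the usage note is only a suggestion.

```python
import shutil
import subprocess


def run_or_note(argv: list) -> str:
    """Run a command and return its output, or a note if it is unavailable."""
    if shutil.which(argv[0]) is None:
        return f"(command not found: {argv[0]})"
    result = subprocess.run(argv, capture_output=True, text=True)
    # Fall back to stderr so failures are visible in the report
    return result.stdout if result.returncode == 0 else result.stderr


def build_report(commands: dict) -> str:
    """Build one text report from {section title: argv list}."""
    parts = []
    for title, argv in commands.items():
        parts.append(f"==== {title}: {' '.join(argv)} ====")
        parts.append(run_or_note(argv).rstrip())
    return "\n".join(parts)
```

A possible call on the Proxmox host, run as root, would be build_report({"Proxmox version": ["pveversion", "-v"], "Kernel": ["uname", "-r"], "PCI devices": ["lspci", "-nnk"], "Kernel log": ["dmesg"]}), with the result written to a file and attached.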

Let's address the specific devices from your lspci -nnk output that were generating "can't claim" errors in your dmesg:

  • 00:15.0, 00:15.1, 00:15.2, 00:15.3 (Intel Corporation Alder Lake-S PCH Serial IO I2C Controller): These are Intel's I2C (Inter-Integrated Circuit) controllers.[1][2][3][4] I2C is a serial bus commonly used for low-speed communication between components on the motherboard, such as sensors, audio codecs, or other integrated peripherals.[1][2]
    • Why "can't claim" might happen: The kernel is trying to assign memory addresses to these I2C controllers, but something is preventing it from doing so within the available PCI bridge windows. This could be a firmware issue, a conflict with another device's resource requests, or a bug in the kernel's resource allocation logic for this specific chipset/motherboard.
    • Impact: While I2C controllers are generally low-bandwidth, a failure to initialize them correctly might indicate a broader problem with the PCH (Platform Controller Hub - the chipset) initialization or resource management, which could indirectly affect other PCH-connected devices, potentially including how the system handles the large BAR of your GPU.
  • 00:1f.5 (Intel Corporation Alder Lake-S PCH SPI Controller): This is the Serial Peripheral Interface (SPI) controller on the PCH.[5][6][7][8] The SPI controller is typically used for communicating with the system's firmware (BIOS/UEFI) flash memory and potentially other devices like TPM (Trusted Platform Module).[5][6][7]
    • Why "can't claim" might happen: Similar to the I2C controllers, the kernel is having trouble assigning resources to the SPI controller. This could be related to how the firmware exposes this controller or a kernel issue.
    • Impact: A problem with the SPI controller could potentially indicate issues with how the kernel interacts with the system's firmware interface, which might have downstream effects on other hardware interactions.
These "can't claim" errors strongly reinforce the conclusion that the issue is likely a low-level problem with PCI resource allocation involving the PCH on your motherboard and the current Linux kernel.[9]