Persistent vfio_container_dma_map -22 Invalid Argument after Proxmox 8.x update - Intel/NVIDIA/BAR Mapping issue

rexl1

New Member
Jan 30, 2024
I wrote a system debug script with AI to help me get to the bottom of this persistent issue. See the attached script and the debug output it produced. Hopefully someone can help.

  1. Latest Software: You are now running the very latest stable Proxmox VE 8.4.1 and kernel 6.8.12-10-pve.
  2. Correct GRUB Parameter: initcall_blacklist=sysfb_init is correctly active in your kernel command line (BOOT_IMAGE=/boot/vmlinuz-6.8.12-10-pve ... initcall_blacklist=sysfb_init ...). This should prevent the host kernel from initializing the framebuffer on your GPU.
  3. Correct Modprobe Setting: The potentially conflicting disable_vga=1 has been successfully removed from /etc/modprobe.d/vfio.conf.
  4. VFIO Binding Success: The dmesg output (vfio-pci: add [10de:2783...]) shows that the vfio-pci driver is successfully binding to your GPU (01:00.0) and its audio device (01:00.1).
  5. Correct IOMMU Grouping: Your GPU+Audio (Group 15) and USB controller (Group 19) are correctly isolated into their own IOMMU groups.
  6. GPU BAR Enumeration: lspci still shows your 16GB VRAM BAR at host address 6000000000.
  7. I have the latest BIOS installed for the Gigabyte Z690 Elite AX motherboard.
  8. I have the latest NVIDIA drivers installed in Win11 and Ubuntu.
  9. Interestingly, this VFIO error happens whether I assign the GPU to the Windows VM or the Linux VM. :(
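To confirm points 2 and 3 on any given boot, the two relevant files can be checked directly (one possible check; these are the same paths the debug script reads):

  cat /proc/cmdline
  cat /etc/modprobe.d/vfio.conf
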
New and Significant Finding: "Can't Claim" Errors Persist and Multiply

The most critical piece of information in this new debug output snippet is the confirmation that the "can't claim" errors are still present for devices at 00:15.x and 00:1f.5, and we see even more details about them:

[ 0.453595] pci 0000:00:15.0: BAR 0 [mem 0xfe0f9000-0xfe0f9fff 64bit]: can't claim; no compatible bridge window
... (similar lines for 00:15.1, 00:15.2, 00:15.3)
[ 0.453636] pci 0000:00:1f.5: BAR 0 [mem 0xfe010000-0xfe010fff]: can't claim; no compatible bridge window


These errors are happening during the host kernel's boot process when it's trying to allocate memory addresses (BARs) for devices other than your GPU. The message "can't claim; no compatible bridge window" indicates a fundamental issue with how the system's PCI bridges are configured or how the kernel is attempting to assign resources, preventing it from allocating space for these devices.
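One quick way to pull just these allocation failures out of the current boot log (the same messages quoted above) is a simple dmesg filter:

  dmesg | grep -E "can't claim|no compatible bridge window"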

Why this is related to your GPU Passthrough Failure:

The vfio_container_dma_map = -22 (Invalid argument) error for your GPU's 16GB BAR means the kernel is rejecting the request to map that large memory region into the VM. This rejection is likely happening because the host kernel's overall physical memory map is in a problematic state due to the resource allocation failures shown by the "can't claim" errors for those other devices. If the host cannot cleanly assign BARs for other devices, it can fragment the address space or create conflicts that prevent a large, contiguous BAR like your GPU's VRAM from being mapped correctly for VFIO.

Conclusion:

You have diligently applied the standard passthrough fixes. The debug output confirms those steps are correctly implemented on the host. However, the persistence of the -22 error and the presence of these "can't claim" errors for other devices strongly indicate a more complex, low-level problem with PCI resource allocation on your system with this kernel version. This is not a simple VM configuration issue.

You have exhausted the general troubleshooting steps available through standard configuration. The issue is either:

  1. A bug in the current kernel's PCI resource management or VFIO interaction with your specific motherboard firmware.
  2. A quirk in your motherboard's firmware that causes these resource allocation failures which the current kernel cannot compensate for.
The only path forward now is to seek help from experts familiar with these low-level kernel and hardware interaction issues. When you post for help, include:

  • Your full hardware specifications (CPU, Motherboard model, GPU model - NVIDIA RTX 4070 SUPER).
  • Your Proxmox VE version (8.4.1) and exact kernel version (6.8.12-10-pve).
  • State that passthrough worked fine before a recent update and broke afterwards.
  • Mention the persistent vfio_container_dma_map = -22 (Invalid argument) error.
  • List the specific BIOS settings you have confirmed and set (IOMMU Enabled, Above 4G Enabled, ReBAR Enabled, ASPM Disabled, Primary Display GPU slot).
  • State that initcall_blacklist=sysfb_init is in your GRUB command line and disable_vga=1 is removed from vfio.conf.
  • Crucially, include the FULL output of your debug script. Explain that you are seeing "can't claim" BAR errors for other devices in dmesg, and point to those lines in your shared output.
 

Let's address the specific devices from your lspci -nnk output that were generating "can't claim" errors in your dmesg:

  • 00:15.0, 00:15.1, 00:15.2, 00:15.3 (Intel Corporation Alder Lake-S PCH Serial IO I2C Controller): These are Intel's I2C (Inter-Integrated Circuit) controllers.[1][2][3][4] I2C is a serial bus commonly used for low-speed communication between components on the motherboard, such as sensors, audio codecs, or other integrated peripherals.[1][2]
    • Why "can't claim" might happen: The kernel is trying to assign memory addresses to these I2C controllers, but something is preventing it from doing so within the available PCI bridge windows. This could be a firmware issue, a conflict with another device's resource requests, or a bug in the kernel's resource allocation logic for this specific chipset/motherboard.
    • Impact: While I2C controllers are generally low-bandwidth, a failure to initialize them correctly might indicate a broader problem with the PCH (Platform Controller Hub - the chipset) initialization or resource management, which could indirectly affect other PCH-connected devices, potentially including how the system handles the large BAR of your GPU.
  • 00:1f.5 (Intel Corporation Alder Lake-S PCH SPI Controller): This is the Serial Peripheral Interface (SPI) controller on the PCH.[5][6][7][8] The SPI controller is typically used for communicating with the system's firmware (BIOS/UEFI) flash memory and potentially other devices like TPM (Trusted Platform Module).[5][6][7]
    • Why "can't claim" might happen: Similar to the I2C controllers, the kernel is having trouble assigning resources to the SPI controller. This could be related to how the firmware exposes this controller or a kernel issue.
    • Impact: A problem with the SPI controller could potentially indicate issues with how the kernel interacts with the system's firmware interface, which might have downstream effects on other hardware interactions.
These "can't claim" errors strongly reinforce the conclusion that the issue is likely a low-level problem with PCI resource allocation involving the PCH on your motherboard and the current Linux kernel.[9]
 
Okay, this diagnostic output provides a lot of valuable information! Let's break it down and see what it tells us, especially regarding those "can't claim" errors and your VFIO issue.

Here's what the output shows and what we can infer:

  1. System Basics: You are indeed running Proxmox VE 8.4.1 with kernel 6.8.12-10-pve. Your kernel command line includes intel_iommu=on and initcall_blacklist=sysfb_init, which are correct for VFIO passthrough.
  2. PCI Device Information: lspci -vvvnnk lists all your devices and their kernel drivers.
    • Your RTX 4070 SUPER (01:00.0) and its audio device (01:00.1) are correctly listed.
    • The key observation here is that your GPU's BAR 1 is Memory at 6000000000 (64-bit, prefetchable) [size=256M]. This is the large BAR (256MB here, but the card supports 16GB which is the issue). The important part is the high memory address 6000000000.
    • The problematic I2C controllers (00:15.x) and the SPI controller (00:1f.5) are also listed. They are shown with Kernel driver in use: intel-lpss and Kernel driver in use: intel-spi respectively. This is important: it indicates that despite the initial "can't claim" errors, the kernel did manage to load drivers for them.
  3. Filtered Kernel Log (dmesg): This is the most revealing section.
    • It confirms the can't claim errors during the initial PCI probe phase:

      [Thu May 1 13:36:19 2025] pci 0000:00:15.0: BAR 0 [mem 0xfe0f9000-0xfe0f9fff 64bit]: can't claim; no compatible bridge window
      ... (similar for 00:15.1, 00:15.2, 00:15.3)
      [Thu May 1 13:36:19 2025] pci 0000:00:1f.5: BAR 0 [mem 0xfe010000-0xfe010fff]: can't claim; no compatible bridge window


      This confirms the initial resource conflict at the addresses the hardware reported.
    • However, look closely at the later dmesg lines:

      [Thu May 1 13:36:19 2025] pci 0000:00:15.0: BAR 0 [mem 0x400f000000-0x400f000fff 64bit]: assigned
      [Thu May 1 13:36:19 2025] pci 0000:00:15.1: BAR 0 [mem 0x400f001000-0x400f001fff 64bit]: assigned
      [Thu May 1 13:36:19 2025] pci 0000:00:15.2: BAR 0 [mem 0x400f002000-0x400f002fff 64bit]: assigned
      [Thu May 1 13:36:19 2025] pci 0000:00:15.3: BAR 0 [mem 0x400f003000-0x400f003fff 64bit]: assigned
      [Thu May 1 13:36:19 2025] pci 0000:00:1f.5: BAR 0 [mem 0x40800000-0x40800fff]: assigned


      This is the critical part: The kernel successfully re-assigned new BAR addresses in high memory (0x400f... and 0x4080...) to these devices shortly after the initial "can't claim" failed.
    • This confirms what we suspected: the kernel is able to work around the initial resource conflict for these smaller PCH devices through its resource re-allocation logic.
    • You can also see the GPU (01:00.0) BARs being listed with their initial addresses, including the large BAR at 6000000000.
    • The dmesg also shows the kernel correctly adding devices to IOMMU groups, including your GPU in group 15 and the I2C/SPI controllers in group 14.
    • vfio-pci 0000:01:00.0: add [10de:2783...] confirms vfio-pci is binding to the GPU.
    • There are no vfio_container_dma_map = -22 errors in this specific dmesg output. This doesn't mean they don't happen; it might mean the error occurs after this point in the boot process, perhaps when you start the VM. If you start the VM and it fails, you'd need to run the script again to capture that error.
  4. IOMMU Groups: The grouping is as expected. The I2C and SPI controllers are grouped with other PCH devices (Group 14), while your GPU is in its own group (Group 15). This grouping looks correct and is not the source of the "can't claim" errors or the VFIO mapping issue.
  5. VFIO-PCI Module Parameters: disable_vga is 'N', which is the desired state after removing the modprobe option.
  6. Modprobe Configuration: Your vfio.conf correctly binds the GPU PCI IDs (10de:2783,10de:22bc). Your blacklist.conf correctly blacklists nouveau, nvidia, etc. This configuration appears correct for VFIO passthrough.
  7. Loaded Kernel Modules: vfio_pci, vfio_iommu_type1, vfio, iommufd are loaded. intel-lpss and intel-spi are also loaded (as expected, since they were assigned resources). The native graphics drivers (nvidiafb, nouveau) are not loaded, confirming your blacklisting worked.
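To re-verify the IOMMU grouping described in point 4 at any time, a common one-liner lists every PCI device alongside its group number (a sketch; it only reads sysfs and calls lspci):

  for d in /sys/kernel/iommu_groups/*/devices/*; do
      g=$(basename "$(dirname "$(dirname "$d")")")
      echo "Group ${g}: $(lspci -nns "${d##*/}")"
  done | sort -V
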
Conclusion from this Output:

The "can't claim" errors for the I2C/SPI controllers are happening, but the kernel is successfully re-assigning them new BAR addresses. So, those devices are likely functioning within the host, and the "can't claim" is more of a warning about the initial failed attempt based on firmware-provided addresses.

The core problem of VFIO passthrough failing (specifically the -22 Invalid argument error for the large GPU BAR that you mentioned previously, even if not in this log) is not directly caused by the I2C/SPI controllers failing to be initialized (since they are initialized after re-allocation).

The issue almost certainly remains: the kernel is struggling to map the large 16GB BAR of your RTX 4070 SUPER into the address space available for guest DMA, likely due to the complex PCI resource map on your Z690 board with ReBAR/Above 4G enabled. The initial resource conflicts reported by the "can't claim" errors are likely a symptom of this complex resource landscape, not the cause of the GPU mapping failure.

How to Proceed:

  1. Focus on the GPU BAR and -22 Error: The I2C/SPI errors are distractions at this point, as the kernel works around them. The goal is to resolve the kernel's inability to map the GPU's BAR 1 (0x6000000000) for VFIO.
  2. Try pci=realloc Again: As discussed, the dmesg shows the kernel is already doing some reallocation. Adding pci=realloc to your GRUB command line explicitly tells the kernel to be more aggressive about resource allocation from the start. This might help it find a better layout that accommodates the large GPU BAR.
    • Edit /etc/default/grub.
    • Find the line starting with GRUB_CMDLINE_LINUX_DEFAULT.
    • Add pci=realloc inside the quotes. Example:

      GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_pstate=passive intel_iommu=on initcall_blacklist=sysfb_init usbcore.autosuspend=-1 split_lock_detect=off pci=realloc"


    • Run update-grub.
    • Reboot the Proxmox host.
    • After rebooting, run the diagnostic script again (./diagnose_proxmox.sh). Check the new output file: are the "can't claim" errors still there? Has the resource mapping for 00:15.x and 00:1f.5 changed? (This is less important, but interesting).
    • Crucially, try starting your VM with the GPU. If it fails, run the diagnostic script again immediately after the VM failed to capture any new dmesg errors, especially the -22 one, under the effect of pci=realloc.
  3. Re-verify BIOS Settings Meticulously: Go back into your BIOS settings.
    • Confirm Above 4G Decoding is Enabled.
    • Confirm Resizable BAR Support is Enabled.
    • Experiment with Primary Display: Try setting it to the iGPU (00:02.0). This changes which GPU the BIOS initializes first and might influence resource allocation.
    • Look for any other settings related to PCI resource allocation, memory mapping, or device initialization order. Sometimes there are non-obvious settings that can interact with ReBAR and large BARs. Consult your motherboard manual specific to your BIOS version.
  4. Test an Older Kernel: Since this reportedly broke after an update, booting into a previous kernel version via the Proxmox GRUB menu (if available) is a crucial test to see if a kernel regression is the cause. If it works on an older kernel, report the bug to the Proxmox kernel team with the working and non-working diagnostic outputs.
  5. Consider a Newer Kernel: Proxmox may release a newer 6.8 or even a 6.9 kernel soon. These might contain fixes for Alder Lake/Z690 resource management issues.
Do not try to manually blacklist or disable the I2C/SPI controllers using modprobe or more complex methods like DTO/ACPI overrides at this point. The output clearly shows the kernel is successfully initializing them after re-allocation. Disabling them is unlikely to fix the large GPU BAR mapping issue and could cause other system instability.

Your Debian/Proxmox OS installation itself is not "broken" in the sense of corrupted files. The kernel is loading and attempting to manage your hardware. The problem is likely a complex interaction bug or configuration challenge at the low level between the specific kernel version, your motherboard's firmware, and the presence of the large GPU BAR. Focusing on the GPU BAR mapping is the correct path forward.
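
As a quick sanity check after the pci=realloc reboot suggested in step 2, two commands confirm the parameter is active and show where the kernel placed the GPU's BAR 1 (a sketch; adjust the PCI address if your GPU is not at 01:00.0):

  grep -o 'pci=realloc' /proc/cmdline
  dmesg | grep -E 'pci 0000:01:00\.0: BAR 1'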
 

Okay, wow! This is excellent and confusing information all at once.

Your hookscript log confirms that the script successfully requested the BAR resize and the lspci command within the script immediately after the write showed BAR 1: current size: 16GB.

2025-05-01 15:23:00 [VM 102] [pre-start] - Writing 14 to /sys/bus/pci/devices/0000:01:00.0/resource1_resize
2025-05-01 15:23:00 [VM 102] [pre-start] - Successfully requested BAR resize. Verifying BAR size after write:
        BAR 1: current size: 16GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
...
2025-05-01 15:23:00 [VM 102] [pre-start] - Pre-start hookscript finished.


So, the hookscript did successfully use the resource*_resize interface to set BAR 1 to 16GB at that specific moment in time.

However, looking at the dmesg output from the same VM start attempt (May 01 15:23:xx):

May 01 15:23:00 pmox kernel: pci 0000:01:00.0: BAR 1 [mem 0x6000000000-0x63ffffffff 64bit pref]: releasing
...
May 01 15:23:00 pmox kernel: pcieport 0000:00:01.0: bridge window [mem 0x6000000000-0x6401ffffff 64bit pref]: releasing
...
May 01 15:23:00 pmox kernel: pcieport 0000:00:01.0: bridge window [mem 0x4200000000-0x47ffffffff 64bit pref]: assigned
May 01 15:23:00 pmox kernel: pci 0000:01:00.0: BAR 1 [mem 0x4400000000-0x47ffffffff 64bit pref]: assigned
...
May 01 15:23:05 pmox QEMU[22034]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:23:05 pmox QEMU[22034]: kvm: vfio_container_dma_map(0x5df578f94c80, 0x380000000000, 0x400000000, 0x723e40000000) = -22 (Invalid argument)


This log sequence is telling us:

  1. The kernel initially allocated BAR 1 at 0x6000000000 with a size of 16GB during the main boot process (confirmed by previous lspci and this log line).
  2. The hookscript runs, successfully writes to resource1_resize, and confirms the sysfs attribute now reflects 16GB.
  3. BUT, almost immediately after (still within the 15:23:00 timestamp), the kernel logs show BAR 1 is released from its 16GB address and re-assigned to a new address (0x4400000000) with a size of only 4GB (0x47ffffffff - 0x4400000000 + 1). This happens after the hookscript's successful write, but before VFIO/QEMU attempts to map the memory.
  4. QEMU/VFIO then tries to map a region of size 0x400000000 (16GB) for the VM's DMA, but finds that the actual allocated size for BAR 1 is only 4GB, leading to the -22 Invalid argument error due to the size mismatch.
Conclusion:

The hookscript's manual resize attempt is successful in setting the resource1_resize sysfs attribute to 16GB, but this setting is not persistently applied or is being overridden by the kernel's subsequent resource allocation logic during the transition to the VFIO state. The kernel is deciding (despite pci=realloc and the hookscript's request) that it can only reliably allocate a 4GB BAR for your GPU during this crucial phase of VM startup.

This means the problem isn't with the hookscript or the resource_resize interface. The problem is a complex interaction where the kernel is unable or unwilling to maintain a 16GB allocation for BAR 1 through the entire VFIO setup process on your specific hardware/BIOS/kernel combination.

Possible Paths Forward:

  1. Try a Smaller BAR Size (Most Promising Workaround): Since the kernel seems capable of allocating 4GB during the pre-start phase, and your lspci showed BAR 1 supports sizes like 8GB (bit 13), try modifying the hookscript to request 8GB instead of 16GB.
    • Edit /var/lib/vz/snippets/rebar-hookscript.sh (or your script's name).
    • Change TARGET_BAR_SIZE_BIT="14" to TARGET_BAR_SIZE_BIT="13".
    • Save the script.
    • Try starting the VM.
    • Check the hookscript log and dmesg again. If the write succeeds and dmesg shows BAR 1 being allocated at ~8GB size, the VM might start.
  2. Kernel Version Experimentation: This issue is deeply tied to the kernel's PCI and VFIO drivers. Trying an older (known working, if you had one) or potentially a newer kernel version (if available in Proxmox or from testing repositories) might have different allocation behaviors that solve this.
  3. BIOS Updates: While you have the latest, sometimes a beta BIOS or a very specific setting (if one exists related to PCI enumeration order or resource handling after initial boot) could make a difference. (Less likely, but possible).
  4. Report to Kernel/VFIO Developers: You have excellent logs showing the 16GB BAR being allocated, then released and re-allocated at 4GB right before VFIO fails with -22 expecting 16GB. This is a strong case to report to the Linux kernel mailing lists (like linux-pci@vger.kernel.org, iommu@lists.linux.dev, vfio-users@lists.collab.kernel.org). Include your hardware specs, kernel version, Proxmox version, lspci -vvvnnk output, and the detailed dmesg logs showing this exact sequence.
The most practical next step is trying a smaller BAR size (8GB) via the hookscript, as the kernel seems to struggle specifically with allocating the full 16GB during the VFIO transition.
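
If you want to try the smaller size by hand before touching the hookscript, the same sysfs interface can be exercised directly (a rough sketch of what the hookscript already does; the device must be unbound first, and bit 13 corresponds to 8GB per the supported-size list above):

  echo "0000:01:00.0" > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
  echo 13 > /sys/bus/pci/devices/0000:01:00.0/resource1_resize
  lspci -vvs 0000:01:00.0 | grep "current size"
  echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind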
 
Rebar-hookscript.sh

--------------------------

#!/bin/bash

# Proxmox VFIO Hookscript for NVIDIA RTX 4070 SUPER (01:00.0/01:00.1)
# Attempts to resize the ReBAR BAR (BAR 1) to 16GB before VM start.
# Requires Linux Kernel >= 6.1 for resource_resize sysfs support.
# Requires Above 4G Decoding, Resizable BAR, and CSM Disabled in BIOS.

# VMID is available as $1
# Phase (pre-start, post-start, pre-stop, post-stop) is available as $2
# Status (used in some phases, e.g., post-start: 'stopped' or 'running') is available as $3

VMID=$1
PHASE=$2
GPU_VGA_PCI_ID="0000:01:00.0" # Your NVIDIA 4070 SUPER VGA PCI ID confirmed by diagnostic
GPU_AUDIO_PCI_ID="0000:01:00.1" # Your NVIDIA 4070 SUPER Audio PCI ID confirmed by diagnostic
GPU_BAR_REBAR_INDEX="1" # BAR 1 is the large, ReBAR-capable BAR on this card confirmed by lspci capability
TARGET_BAR_SIZE_BIT="14" # 14 corresponds to 16GB (from the resource you found)

# Log file location - adjust if needed, must be writable by root
LOG_FILE="/var/log/nvidia-rebar-hookscript.log"

# Function to log messages with timestamp
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') [VM ${VMID}] [${PHASE}] - $1" >> "${LOG_FILE}"
}

# --- Main Logic ---

# Ensure the log directory exists (should exist for /var/log, but good practice)
mkdir -p "$(dirname "${LOG_FILE}")"

log_message "Hookscript invoked for VM ${VMID} phase ${PHASE}."

# Only perform actions during the pre-start phase
if [ "$PHASE" == "pre-start" ]; then
log_message "Starting pre-start actions for ${GPU_VGA_PCI_ID} and ${GPU_AUDIO_PCI_ID}."

# --- Unbind Devices ---

log_message "Attempting to unbind devices from current drivers..."

# Unbind VGA device (01:00.0)
# Use 'driver' symlink for more robust unbind
VGA_DRIVER_PATH="/sys/bus/pci/devices/${GPU_VGA_PCI_ID}/driver"
if [ -L "${VGA_DRIVER_PATH}" ]; then
CURRENT_DRIVER=$(basename "$(realpath "${VGA_DRIVER_PATH}")")
log_message "VGA device ${GPU_VGA_PCI_ID} is bound to ${CURRENT_DRIVER}. Attempting to unbind."
echo "${GPU_VGA_PCI_ID}" > "${VGA_DRIVER_PATH}/unbind" 2>> "${LOG_FILE}"
sleep 1 # Give kernel a moment
else
log_message "VGA device ${GPU_VGA_PCI_ID} has no kernel driver bound."
fi

# Unbind Audio device (01:00.1)
# Use 'driver' symlink for more robust unbind
AUDIO_DRIVER_PATH="/sys/bus/pci/devices/${GPU_AUDIO_PCI_ID}/driver"
if [ -L "${AUDIO_DRIVER_PATH}" ]; then
CURRENT_DRIVER=$(basename "$(realpath "${AUDIO_DRIVER_PATH}")")
log_message "Audio device ${GPU_AUDIO_PCI_ID} is bound to ${CURRENT_DRIVER}. Attempting to unbind."
echo "${GPU_AUDIO_PCI_ID}" > "${AUDIO_DRIVER_PATH}/unbind" 2>> "${LOG_FILE}"
sleep 1 # Give kernel a moment
else
log_message "Audio device ${GPU_AUDIO_PCI_ID} has no kernel driver bound."
fi


# Give a small delay after unbinding attempts
sleep 2

# --- Attempt ReBAR Resize ---

RESIZE_FILE="/sys/bus/pci/devices/${GPU_VGA_PCI_ID}/resource${GPU_BAR_REBAR_INDEX}_resize"

log_message "Attempting to resize BAR ${GPU_BAR_REBAR_INDEX} for ${GPU_VGA_PCI_ID} to ${TARGET_BAR_SIZE_BIT} (approx. 16GB)."

if [ -f "${RESIZE_FILE}" ]; then
log_message "Found resize file: ${RESIZE_FILE}." # Reading current value can sometimes fail if not unbound, skip for robustness
log_message "Writing ${TARGET_BAR_SIZE_BIT} to ${RESIZE_FILE}"
# Attempt to write the target size bit (14 for 16GB)
# Use > instead of >> for the echo command to get the specific error code from echo
echo "${TARGET_BAR_SIZE_BIT}" > "${RESIZE_FILE}" 2>> "${LOG_FILE}"

# Check the result of the write operation ($? is the exit status of the last command)
if [ $? -eq 0 ]; then
log_message "Successfully requested BAR resize. Verifying BAR size after write:"
# Verify with lspci (capture output to log)
lspci -vvvs "${GPU_VGA_PCI_ID}" | grep "BAR ${GPU_BAR_REBAR_INDEX}:" >> "${LOG_FILE}"
lspci -vvvs "${GPU_VGA_PCI_ID}" | grep "current size:" >> "${LOG_FILE}"
else
log_message "ERROR: Failed to write ${TARGET_BAR_SIZE_BIT} to ${RESIZE_FILE}. This usually means 'No space left on device'. Check dmesg immediately after this failed attempt for more details."
# Capture dmesg specific to the GPU if possible (might be noisy)
# dmesg -T | tail -n 50 | grep "${GPU_VGA_PCI_ID//0000:/}" >> "${LOG_FILE}" # Adjust grep pattern
log_message "Last 100 lines of dmesg:"
dmesg -T | tail -n 100 >> "${LOG_FILE}" # Capture last 100 lines to see context
fi
else
log_message "ERROR: Resize file not found: ${RESIZE_FILE}. Kernel may not expose this BAR for resizing for this device."
fi

# --- Re-bind Devices to vfio-pci ---

log_message "Attempting to re-bind devices to vfio-pci."

# Re-bind VGA device
echo "${GPU_VGA_PCI_ID}" > "/sys/bus/pci/drivers/vfio-pci/bind" 2>> "${LOG_FILE}"
if [ $? -eq 0 ]; then
log_message "Successfully re-bound ${GPU_VGA_PCI_ID} to vfio-pci."
else
log_message "ERROR: Failed to re-bind ${GPU_VGA_PCI_ID} to vfio-pci. Passthrough will likely fail. Check dmesg."
fi

# Re-bind Audio device
echo "${GPU_AUDIO_PCI_ID}" > "/sys/bus/pci/drivers/vfio-pci/bind" 2>> "${LOG_FILE}"
if [ $? -eq 0 ]; then
log_message "Successfully re-bound ${GPU_AUDIO_PCI_ID} to vfio-pci."
else
log_message "ERROR: Failed to re-bind ${GPU_AUDIO_PCI_ID} to vfio-pci. Passthrough will likely fail. Check dmesg."
fi


log_message "Pre-start hookscript finished."

fi # End of pre-start phase logic

# --- End Main Logic ---

# Always exit successfully so Proxmox doesn't stop the VM startup due to a hookscript error
exit 0

----------------------------
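
For reference, a hookscript like this is wired to the VM roughly as follows (a sketch assuming the default local storage snippets directory and VM 102, matching the config referenced later in the thread):

  chmod +x /var/lib/vz/snippets/rebar-hookscript.sh
  qm set 102 --hookscript local:snippets/rebar-hookscript.sh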
 
May 01 15:42:57 pmox systemd[1]: Started 102.scope.
May 01 15:42:57 pmox kernel: tap102i0: entered promiscuous mode
May 01 15:42:57 pmox kernel: vmbr0: port 2(fwpr102p0) entered blocking state
May 01 15:42:57 pmox kernel: vmbr0: port 2(fwpr102p0) entered disabled state
May 01 15:42:57 pmox kernel: fwpr102p0: entered allmulticast mode
May 01 15:42:57 pmox kernel: fwpr102p0: entered promiscuous mode
May 01 15:42:57 pmox kernel: vmbr0: port 2(fwpr102p0) entered blocking state
May 01 15:42:57 pmox kernel: vmbr0: port 2(fwpr102p0) entered forwarding state
May 01 15:42:57 pmox kernel: fwbr102i0: port 1(fwln102i0) entered blocking state
May 01 15:42:57 pmox kernel: fwbr102i0: port 1(fwln102i0) entered disabled state
May 01 15:42:57 pmox kernel: fwln102i0: entered allmulticast mode
May 01 15:42:57 pmox kernel: fwln102i0: entered promiscuous mode
May 01 15:42:57 pmox kernel: fwbr102i0: port 1(fwln102i0) entered blocking state
May 01 15:42:57 pmox kernel: fwbr102i0: port 1(fwln102i0) entered forwarding state
May 01 15:42:57 pmox kernel: fwbr102i0: port 2(tap102i0) entered blocking state
May 01 15:42:57 pmox kernel: fwbr102i0: port 2(tap102i0) entered disabled state
May 01 15:42:57 pmox kernel: tap102i0: entered allmulticast mode
May 01 15:42:57 pmox kernel: fwbr102i0: port 2(tap102i0) entered blocking state
May 01 15:42:57 pmox kernel: fwbr102i0: port 2(tap102i0) entered forwarding state
May 01 15:43:03 pmox systemd[2447]: Stopped target sound.target - Sound Card.
May 01 15:43:03 pmox systemd[1]: Stopped target sound.target - Sound Card.
May 01 15:43:03 pmox pvedaemon[35777]: VM 102 started with PID 35829.
May 01 15:43:03 pmox pvedaemon[1939]: root@pam end task UPID:pmox:00008BC1:000542E3:681309DC:qmstart:102:root@pam: OK
May 01 15:43:03 pmox pvestatd[1920]: status update time (5.346 seconds)
May 01 15:43:04 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:04 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380000000000, 0x80000000, 0x762380000000) = -22 (Invalid argument)
May 01 15:43:04 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:04 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380080000000, 0x2000000, 0x762430000000) = -22 (Invalid argument)
May 01 15:43:04 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:04 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380000000000, 0x80000000, 0x762380000000) = -22 (Invalid argument)
May 01 15:43:04 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:04 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380080000000, 0x2000000, 0x762430000000) = -22 (Invalid argument)
May 01 15:43:04 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:04 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380000000000, 0x80000000, 0x762380000000) = -22 (Invalid argument)
May 01 15:43:04 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:04 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380080000000, 0x2000000, 0x762430000000) = -22 (Invalid argument)
May 01 15:43:14 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:14 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380000000000, 0x80000000, 0x762380000000) = -22 (Invalid argument)
May 01 15:43:14 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:14 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380080000000, 0x2000000, 0x762430000000) = -22 (Invalid argument)
May 01 15:43:14 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:14 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380000000000, 0x80000000, 0x762380000000) = -22 (Invalid argument)
May 01 15:43:14 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:14 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380080000000, 0x2000000, 0x762430000000) = -22 (Invalid argument)
May 01 15:43:14 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:14 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380000000000, 0x80000000, 0x762380000000) = -22 (Invalid argument)
May 01 15:43:14 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:14 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380080000000, 0x2000000, 0x762430000000) = -22 (Invalid argument)
May 01 15:43:14 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:14 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380000000000, 0x80000000, 0x762380000000) = -22 (Invalid argument)
May 01 15:43:14 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:14 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380080000000, 0x2000000, 0x762430000000) = -22 (Invalid argument)
May 01 15:43:14 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:14 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380000000000, 0x80000000, 0x762380000000) = -22 (Invalid argument)
May 01 15:43:14 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:14 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380080000000, 0x2000000, 0x762430000000) = -22 (Invalid argument)
May 01 15:43:22 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:22 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380000000000, 0x80000000, 0x762380000000) = -22 (Invalid argument)
May 01 15:43:22 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:22 pmox QEMU[35829]: kvm: vfio_container_dma_map(0x608748d69c80, 0x380080000000, 0x2000000, 0x762430000000) = -22 (Invalid argument)
May 01 15:43:23 pmox kernel: usb 1-5: reset full-speed USB device number 2 using xhci_hcd
May 01 15:43:23 pmox kernel: usb 1-11.3: reset full-speed USB device number 8 using xhci_hcd
May 01 15:43:24 pmox kernel: usb 1-11.1: reset high-speed USB device number 5 using xhci_hcd
May 01 15:44:08 pmox pvedaemon[1939]: root@pam successful auth for user 'root@pam'

2025-05-01 15:42:52 [VM 102] [pre-start] - Hookscript invoked for VM 102 phase pre-start.
2025-05-01 15:42:52 [VM 102] [pre-start] - Starting pre-start actions for 0000:01:00.0 and 0000:01:00.1.
2025-05-01 15:42:52 [VM 102] [pre-start] - Attempting to unbind devices from current drivers...
2025-05-01 15:42:53 [VM 102] [pre-start] - VGA device 0000:01:00.0 is bound to vfio-pci. Attempting to unbind.
2025-05-01 15:42:54 [VM 102] [pre-start] - Audio device 0000:01:00.1 is bound to vfio-pci. Attempting to unbind.
2025-05-01 15:42:57 [VM 102] [pre-start] - Attempting to resize BAR 1 for 0000:01:00.0 to 11 (approx. 16GB).
2025-05-01 15:42:57 [VM 102] [pre-start] - Found resize file: /sys/bus/pci/devices/0000:01:00.0/resource1_resize.
2025-05-01 15:42:57 [VM 102] [pre-start] - Writing 11 to /sys/bus/pci/devices/0000:01:00.0/resource1_resize
2025-05-01 15:42:57 [VM 102] [pre-start] - Successfully requested BAR resize. Verifying BAR size after write:
BAR 1: current size: 2GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
BAR 0: current size: 16MB, supported: 16MB
BAR 1: current size: 2GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
BAR 3: current size: 32MB, supported: 32MB
2025-05-01 15:42:57 [VM 102] [pre-start] - Attempting to re-bind devices to vfio-pci.
2025-05-01 15:42:57 [VM 102] [pre-start] - Successfully re-bound 0000:01:00.0 to vfio-pci.
2025-05-01 15:42:57 [VM 102] [pre-start] - Successfully re-bound 0000:01:00.1 to vfio-pci.
2025-05-01 15:42:57 [VM 102] [pre-start] - Pre-start hookscript finished.
2025-05-01 15:43:03 [VM 102] [post-start] - Hookscript invoked for VM 102 phase post-start.
root@pmox:~#
 
Okay, the new log entries are very revealing!

Your hookscript log for the 15:42 attempt shows:

2025-05-01 15:42:57 [VM 102] [pre-start] - Writing 11 to /sys/bus/pci/devices/0000:01:00.0/resource1_resize
...
2025-05-01 15:42:57 [VM 102] [pre-start] - Successfully requested BAR resize. Verifying BAR size after write:
        BAR 1: current size: 2GB, supported: ... 16GB
...
2025-05-01 15:42:57 [VM 102] [pre-start] - Successfully re-bound 0000:01:00.0 to vfio-pci.


This confirms that setting TARGET_BAR_SIZE_BIT to 11 in the hookscript successfully requested and achieved a 2GB BAR size for BAR 1 via the resource*_resize interface, and the device was re-bound to vfio-pci.

Now look at the Proxmox syslog/dmesg from the same timeframe:

May 01 15:42:57 pmox kernel: pci 0000:01:00.0: BAR 1 [mem 0x4400000000-0x47ffffffff 64bit pref]: releasing  <-- Previous 4GB allocation released
... (other releases)
May 01 15:43:06 pmox kernel: pcieport 0000:00:01.0: bridge window [mem 0x4100000000-0x43ffffffff 64bit pref]: assigned  <-- Bridge window assigned (3GB total range)
May 01 15:43:06 pmox kernel: pci 0000:01:00.0: BAR 1 [mem 0x4200000000-0x43ffffffff 64bit pref]: assigned  <-- BAR 1 assigned to 2GB (0x43ffffffff - 0x4200000000 + 1 = 2GB)
... (other assignments)
May 01 15:43:04 pmox QEMU[35829]: kvm: VFIO_MAP_DMA failed: Invalid argument
May 01 15:43:04 pmox QEMU[35829]: kvm: vfio_container_dma_map(..., 0x380000000000, 0x80000000, ...) = -22 (Invalid argument)
May 01 15:43:04 pmox QEMU[35829]: kvm: vfio_container_dma_map(..., 0x380080000000, 0x2000000, ...) = -22 (Invalid argument)


Analysis:

  1. The hookscript successfully requested a 2GB BAR size (bit 11).
  2. The kernel, in its final allocation step during the VFIO handoff, also assigned BAR 1 at a size of 2GB (0x4200000000 to 0x43ffffffff).
  3. QEMU/VFIO then attempted to map the DMA region for the VM. The dma_map calls show it trying to map regions of size 0x80000000 (which is 2GB) and 0x2000000 (which is 32MB).
This is key: In this latest attempt, the size QEMU/VFIO is trying to map (0x80000000 = 2GB) matches the size that the kernel actually allocated for BAR 1 (0x4200000000 to 0x43ffffffff = 2GB). The 32MB size likely corresponds to another BAR on the GPU (like BAR 3 at 0x4100000000, which has a size of 32MB).

The size mismatch that caused the -22 Invalid argument error in previous attempts is resolved for the 2GB BAR 1!

New Problem:


Now that the sizes match (QEMU is trying to map 2GB to 2GB, and 32MB to 32MB), the -22 Invalid argument error is occurring for a different reason. This typically indicates a problem with the IOMMU mapping itself, not the BAR size allocation.

Possible causes for VFIO_MAP_DMA failed: Invalid argument when sizes match:

  1. IOMMU Conflict/State: There might be something about the specific host-side addresses (0x762380000000 and 0x762430000000 from the previous log; these are the QEMU process addresses backing the guest's view of the BARs) that the IOMMU/VFIO driver finds problematic for DMA mapping. This could be due to a conflict with a reserved region, a bug in the IOMMU driver's interaction with that memory area, or a state issue within VFIO.
  2. Reserved Memory Regions (RMRRs): While your RMRR (0x0000003e000000 end: 0x000000407fffff) doesn't seem to directly overlap the high memory addresses (0x4200..., 0x4100..., and 0x762...) being used, complex interactions are possible.
  3. Firmware/Kernel Interaction Bug: This specific error, even with correct BAR sizes, can point to deep bugs in how the kernel's IOMMU driver interacts with your motherboard's firmware and the GPU during the DMA mapping setup for the VM.
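For point 2, the ranges the IOMMU actually treats as reserved for the GPU's group (group 15 in the earlier output) can be read straight from sysfs; DMA mappings that overlap these ranges are refused:

  cat /sys/kernel/iommu_groups/15/reserved_regions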
 
I have carefully reviewed the latest logs you provided: proxmox_diag_20250502_093237.txt, vm102_passthrough_debug_20250502_093407.txt, and nvidia-rebar-hookscript.txt (from the 10:12:xx attempt). I understand you want a clear path forward based only on this current state, without repeating steps that haven't worked or are based on past (potentially outdated) information.

Let's break down the current confirmed situation based on these logs:

Current System State:

  1. Kernel: Proxmox VE 8.4.1, kernel 6.8.12-10-pve.
  2. GRUB Config: The command line used for this boot is quiet intel_iommu=on iommu=pt initcall_blacklist=sysfb_init usbcore.autosuspend=-1 split_lock_detect=off. pci=realloc and dyndbg parameters are not active in this boot's command line, as confirmed by cat /proc/cmdline within the diag and the dmesg output. intel_iommu=on and iommu=pt are active and enabling the IOMMU in passthrough mode.
  3. VM Config (102.conf from vm102_passthrough_debug_20250502_093407.txt):
    • hookscript: local:snippets/rebar-hookscript.sh is uncommented and active. The hookscript is running.
    • hostpci0: 0000:01:00,pcie=1,x-vga=1. The romfile parameter is missing from this line. bus=pcie.0 is also missing.
    • hugepages: 2 (16GB) and memory: 16384 (16GB) are set for the VM.
  4. Hookscript Execution (from nvidia-rebar-hookscript.txt, 10:12:xx attempt): The script successfully runs during pre-start.
    • It unbinds the GPU (01:00.0) and Audio (01:00.1) from vfio-pci.
    • It requests BAR 1 be resized to 16GB (TARGET_BAR_SIZE_BIT="14"). lspci within the script confirms BAR 1: current size: 16GB after the write.
    • It successfully performs the GPU reset (echo 1 > .../reset).
    • It successfully re-binds both GPU and Audio to vfio-pci (Successfully re-bound ... bind status 0). The "No such device" error seen in previous attempts is resolved in this configuration, likely due to the longer delay and/or GPU reset before binding.
  5. Kernel BAR Allocation (from dmesg, 10:12:xx timeframe): During the pre-start phase (around 10:12:25), the kernel performs reallocation:
    • BAR 1 is assigned a size of 8GB ([mem 0x4200000000-0x43ffffffff 64bit pref]: assigned). Note: This is 8GB, even though the hookscript requested 16GB and lspci in the script saw 16GB right after the resource_resize write. The kernel's final allocation here is 8GB.
    • BAR 3 is assigned a size of 32MB ([mem 0x4100000000-0x4101ffffff 64bit pref]: assigned).
  6. VFIO_MAP_DMA Errors (from syslog, 10:12:xx timeframe): The kvm: VFIO_MAP_DMA failed: Invalid argument errors still occur immediately after the VM starts (10:12:44). QEMU attempts to map the BARs:
    • BAR 1: Size 0x200000000 (8GB) at guest address 0x380000000000.
    • BAR 3: Size 0x2000000 (32MB) at guest address 0x380200000000.
    • The sizes QEMU is requesting (8GB and 32MB) match the sizes the kernel allocated.
  7. Invalid IOVA: The guest virtual addresses QEMU is attempting to map (0x380000000000 and 0x380200000000) are far outside the typical memory ranges and the IOMMU's valid IOVA space (0x4000000000-0x7fffffffff window logged earlier in dmesg). This is the cause of the "Invalid argument".
  8. PCIe Link Speed: lspci within the hookscript (10:12:37) confirms the link speed is running at 16GT/s. The slow link speed issue is resolved.
  9. "support inconsistent" in dmesg: As previously discussed, these DMAR messages are likely informational warnings about BIOS reporting features inconsistently and are not the root cause of the current mapping failure.
Summary of the Problem:

The issue is the VFIO_MAP_DMA failed: Invalid argument error, which happens because QEMU is trying to map the GPU's 8GB BAR (and 32MB BAR) to invalid guest virtual addresses (IOVAs) that are outside the range the IOMMU will accept. The sizes now match the kernel's allocation, and the devices are bound correctly.
 

Detailed Analysis of Invalid IOVA in QEMU with VFIO​


This section provides a comprehensive analysis of the issue involving an invalid Input/Output Virtual Address (IOVA) generated by QEMU, leading to a kernel error ("Invalid argument") during VFIO DMA mapping. The focus is on validating the user's proposed plan to use kprobe_events for diagnosis and exploring additional context to ensure a thorough understanding.


Background on IOVA and VFIO​


IOVA, or Input/Output Virtual Address, is a virtual address used by devices for Direct Memory Access (DMA) operations, managed by the Input/Output Memory Management Unit (IOMMU). VFIO (Virtual Function I/O) is a Linux kernel framework that enables user-space applications, such as QEMU, to directly access hardware devices, including their DMA capabilities. When QEMU passes through devices to a virtual machine (VM), it allocates IOVAs for DMA operations and maps them via the VFIO_MAP_DMA ioctl, which interacts with the kernel's IOMMU subsystem to translate IOVAs to physical addresses.


In this case, QEMU generates an IOVA of 0x380000000000 (roughly 56TB), which is rejected by the kernel with an "Invalid argument" error (error code -22) during the call to intel_iommu_map_pages. This suggests that the IOVA is outside the valid range or does not meet the IOMMU's constraints, such as alignment or address space limits.


Analysis of the Invalid IOVA​


The IOVA 0x380000000000 is notably high in the 64-bit address space, indicating it may stem from QEMU's internal memory layout calculations for the guest VM, particularly in configurations with large memory sizes or passed-through devices. Several factors could contribute to this issue:


  • QEMU's Memory Allocation: QEMU allocates IOVAs based on its virtual PCI topology and guest memory configuration. The high value suggests that QEMU might be assigning addresses from a range that exceeds the IOMMU's supported address space or violates alignment requirements. For instance, the IOMMU might require IOVAs to be aligned to a specific page size (e.g., 4KB or larger), and 0x380000000000 might not meet this criterion.
  • IOMMU Constraints: The IOMMU hardware has specific limitations on valid IOVA ranges, influenced by the architecture (e.g., Intel VT-d, AMD IOMMU) and kernel configuration. If the IOVA is beyond the IOMMU's addressable range or conflicts with other mappings, the kernel will reject it, resulting in the "Invalid argument" error.
  • VFIO and DMA Mapping Process: When QEMU calls VFIO_MAP_DMA, it passes the IOVA, physical address, size, and protection flags to the kernel. The kernel then invokes functions like intel_iommu_map_pages to perform the mapping. If any parameter is invalid, such as the IOVA being out of range or misaligned, the function returns an error, which is propagated back to QEMU.
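A simple host-side check that bears on this: compare the failing IOVA with the address width the IOMMU actually reports, since 0x380000000000 needs 46-bit addressing and many desktop VT-d implementations report less (a sketch; the exact dmesg wording can vary by kernel):

  dmesg | grep -i 'DMAR: Host address width'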
 
Patch


This patch and the surrounding conversation directly address VFIO_MAP_DMA: -22 (Invalid argument) errors and pinpoint a known cause: alignment issues when mapping memory regions through the IOMMU.

What this patch discussion tells us:

  1. The Error is Known: The patch explicitly starts by saying it "Fixes problems like: VFIO_MAP_DMA: -22 (Invalid argument)". This is exactly the error you are seeing.
  2. Alignment is Key: The core of the patch and the discussion is about ensuring the guest virtual address (IOVA) and the size being mapped are correctly aligned to the IOMMU page size (granularity).
  3. QEMU's Role: The discussion points out that QEMU's internal calculations (TARGET_PAGE_ALIGN) might not always align correctly to the actual minimum IOMMU page size (vfio_container_granularity).
  4. Fragmented Mappings: Alex Williamson and Pavel Fedin discuss how devices with complex BARs (like MSI-X BARs) or quirks that cause QEMU to split regions into smaller fragments can result in mapping attempts for regions that are not aligned to the IOMMU's requirements, leading to the "Invalid argument" error.
  5. Kernel Interaction: The patch shows that the vfio_listener_region_add function within QEMU's VFIO code is where this misalignment was being addressed. It then calls into the kernel's VFIO driver, which in turn calls the intel_iommu_map_pages function in the intel-iommu module.
How this connects to your specific problem:

  • Your consistent VFIO_MAP_DMA failed: Invalid argument is almost certainly due to an alignment issue detected by the intel-iommu driver when it receives the mapping request from VFIO/QEMU.
  • The specific IOVA 0x380000000000 (or similar high addresses) and the sizes (0x200000000 = 8GB, 0x2000000 = 32MB) in your QEMU error messages are the parameters of the mapping call that the IOMMU is finding invalid due to misalignment.
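If you want to sanity-check the alignment angle against your own failing calls, shell arithmetic is enough. This assumes a 4KB granule; the container's real minimum page size may be larger, which is exactly what the patch discussion is about. A printed 0 means the value is 4KB-aligned:

  # IOVA, BAR 1 size, BAR 3 size taken from the QEMU error lines above
  printf '%d %d %d\n' $((0x380000000000 % 4096)) $((0x200000000 % 4096)) $((0x2000000 % 4096))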
 
Did you ever resolve this? I have an AMD and an Nvidia GPU doing this, but they are still successfully passed through. I am able to game in the VMs, but I'm getting a ton of these errors.