Problems with GPU Passthrough since 8.2

merasil
Member
Mar 9, 2020
Hi,

I have had a problem with GPU passthrough since updating to 8.2. The VM to which the GPU is passed through no longer starts and fails with the following error message:

Code:
kvm: -device vfio-pci,host=0000:06:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,x-vga=on,multifunction=on: vfio 0000:06:00.0: failed to setup container for group 20: Failed to set group container: Invalid argument
TASK ERROR: start failed: QEMU exited with code 1

I have these logs on the host:

Code:
root@hvs:~# dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
[    0.512344] AMD-Vi: Using global IVHD EFR:0xf77ef22294ada, EFR2:0x0
[    1.044799] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    1.046587] AMD-Vi: Extended features (0xf77ef22294ada, 0x0): PPR NX GT IA GA PC GA_vAPIC
[    1.046600] AMD-Vi: Interrupt remapping enabled
[    1.046759] AMD-Vi: Virtual APIC enabled
[    1.046912] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[   59.575729] vfio-pci 0000:06:00.0: Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.

Code:
root@hvs:~# lsmod | grep vfio
vfio_pci               16384  0
vfio_pci_core          86016  1 vfio_pci
irqbypass              12288  2 vfio_pci_core,kvm
vfio_iommu_type1       49152  0
vfio                   69632  3 vfio_pci_core,vfio_iommu_type1,vfio_pci
iommufd                98304  1 vfio

Code:
root@hvs:~# lspci -k
06:00.0 VGA compatible controller: NVIDIA Corporation TU117GL [T400 4GB] (rev a1)
        Subsystem: Lenovo TU117GL [T400 4GB]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
06:00.1 Audio device: NVIDIA Corporation Device 10fa (rev a1)
        Subsystem: Lenovo Device 1613
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

I have not changed anything in the BIOS of the host. IOMMU also seems to be enabled. How can I debug the whole thing further?
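For reference, the group membership can also be checked directly in sysfs; a minimal sketch of those checks (assuming the same PCI address as above):

Code:
# which IOMMU group does the GPU belong to?
readlink /sys/bus/pci/devices/0000:06:00.0/iommu_group
# list every device that shares that group (replace 20 with the group from above)
ls /sys/kernel/iommu_groups/20/devices/
# show the vfio/IOMMU related messages around the rejection
dmesg | grep -i -e vfio -e 'iommu mapping'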
 
As this does not seem to be widespread, I guess it might be due to the fact that we are running AMD CPUs instead of Intel.
Perhaps the newer kernel code is not compatible with AMD.
 
I have the same problem. I was going to pass through my Nvidia GTX 1060 6GB to my Windows 11 VM and then I got the following error message:
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,host=0000:0b:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:0b:00.0: failed to setup container for group 32: Failed to set group container: Invalid argument
stopping swtpm instance (pid 3367) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1
CPU: Intel(R) Xeon(R) CPU E5530
GPU: Nvidia GeForce GTX 1060 6GB
RAM: 16GB ECC RDIMM
Boot Mode: Legacy BIOS (maybe that's the problem?)
Proxmox version: 8.2.2
Any help would be appreciated. I'm also quite new to Proxmox and Linux in general, so maybe I just misconfigured something?
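For reference, Legacy BIOS alone should not break passthrough, but on Intel the IOMMU has to be enabled in the firmware (VT-d) and, depending on the kernel, via intel_iommu=on on the kernel command line. Two quick checks, just as a sketch:

Code:
# confirm the IOMMU options were actually applied at boot
cat /proc/cmdline
# Intel hosts log DMAR/IOMMU messages rather than AMD-Vi
dmesg | grep -e DMAR -e IOMMU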
 
I'm using a 5700X. After enabling the IOMMU-related options in the BIOS (IOMMU, AES and ACR) and checking that my GPU has a separate group, the command dmesg | grep -e DMAR -e IOMMU -e AMD-Vi just returns nothing. Any idea?
 
I too am having issues with GPU passthrough since 8.2, getting the following error when starting the VM:
Code:
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
TASK ERROR: start failed: QEMU exited with code 1
This started with the latest update, and I hoped it would be fixed by the new update I just applied (kernel 6.8.4-3), but it is still having this issue.
I messed with the VM config to attempt a fix and am still getting the error.
Current VM config:

Code:
root@pve2:/etc/pve/nodes/pve3/qemu-server# cat 121.conf
agent: enabled=1,fstrim_cloned_disks=1,type=virtio
boot: c
bootdisk: scsi0
cipassword: 
cores: 8
cpu: host
hostpci0: 0000:81:00.0,pcie=1
ide2: c-vm:vm-121-cloudinit,media=cdrom,size=4M
ipconfig0: ip=dhcp
machine: q35
memory: 32768
meta: creation-qemu=8.1.5,ctime=1715103535
name: ollama2
net0: virtio=BC:24:11:79:2D:9B,bridge=vmbr0,tag=2
numa: 0
scsi0: c-vm:vm-121-disk-0,size=30G
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=bc4d4e3a-dfa0-40a7-95ac-d5d308227f2e
sockets: 1
sshkeys:
vga: serial0
vmgenid: f578cc50-250c-41ce-98c5-146cc416d79d

Any help would be appreciated; this issue has shut down my Ollama VM.
 
I also have a PCI passthrough problem with the latest kernel (I think).

Win11 VM with GeForce passthrough fails with:


Task viewer: VM 300 - Start
swtpm_setup: Not overwriting existing state file.
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
stopping swtpm instance (pid 7163) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1

Can anybody help with this?
 
I got the "Assertion '0 <= irq_num && irq_num < PCI_NUM_PINS' failed" with an RX570 and it was caused by amdgpu crashing and leaving the GPU in an unusable state. The work-around, for me, is to blacklist amdgpu (so I can still use the GPU as host console until the VM with passthrough starts).
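For anyone wanting to try the same work-around: on Proxmox/Debian this is a modprobe.d snippet plus an initramfs rebuild. A minimal sketch (the file name is arbitrary):

Code:
# keep the host from loading amdgpu
echo "blacklist amdgpu" > /etc/modprobe.d/blacklist-amdgpu.conf
# rebuild the initramfs so the blacklist is applied early, then reboot
update-initramfs -u -k all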
 
I got the "Assertion '0 <= irq_num && irq_num < PCI_NUM_PINS' failed" with an RX570 and it was caused by amdgpu crashing and leaving the GPU in an unusable state. The work-around, for me, is to blacklist amdgpu (so I can still use the GPU as host console until the VM with passthrough starts).
Cool, but mine fails right on reboot of the host too, and it is an Nvidia Tesla card. Also, it is blacklisted already and has been since it was set up. I appreciate the feedback; however, I do not think it is related to a GPU crash, since the only difference was the kernel update a few weeks ago, and it has not worked since. It really sucks because the Tesla GPU is expensive and I cannot continue testing Ollama. I cannot wait to test the graphics processing they implemented. Thanks again, I hope your suggestion helps others!
 
Still no fix from Proxmox. Gee golly, I really hate rolling back kernel versions over this!
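If rolling back does become necessary, it can at least be done without uninstalling anything, by pinning the previous kernel; a sketch (the version string has to match whatever kernel list prints):

Code:
# show the installed kernels
proxmox-boot-tool kernel list
# boot the previous kernel by default (example version, adjust to the list output)
proxmox-boot-tool kernel pin 6.8.4-2-pve
# remove the pin again once the issue is fixed
proxmox-boot-tool kernel unpin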
 
Can somebody else also confirm if the problem occurs with 6.8.4-3-pve but not with 6.8.4-2-pve? If yes, I'd try to reproduce and bisect it.
 
Can somebody else also confirm if the problem occurs with 6.8.4-3-pve but not with 6.8.4-2-pve? If yes, I'd try to reproduce and bisect it.
If you are referring to
Code:
Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.
this has been happening on all 6.8 kernels for me when trying to pass through a NIC, including the on-board ones of my M11SDV-8C-LN4F.
Kernel 6.5 is fine.
 
No, I was actually talking about the 'kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.' issue.


The 1:1 mapping seems to be a stricter enforcement of the IOMMU rules; I don't think it's a "bug" per se.
EDIT: can you maybe post your IOMMU groups?
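For example, something along these lines prints every group together with its devices (the usual sysfs loop, nothing Proxmox-specific):

Code:
# list each IOMMU group and the devices it contains
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
    done
done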
 
The 1:1 mapping seems to be a stricter enforcement of the IOMMU rules; I don't think it's a "bug" per se.
Thanks, can you think of a fix though? I already approached Supermicro and got their latest RC BIOS, no luck. They refuse to do anything else though, because Proxmox is not on the compatibility list for their board.
 
Thanks, can you think of a fix though? I already approached Supermicro and got their latest RC BIOS, no luck. They refuse to do anything else though, because Proxmox is not on the compatibility list for their board.
You could try to replicate the behaviour on Ubuntu 24.04 if that is supported by Supermicro. That should have the same kernel (though I'm not sure if they're on the exact same version).
 
You could try to replicate the behaviour on Ubuntu 24.04 if that is supported by Supermicro. That should have the same kernel (though I'm not sure if they're on the exact same version).
No, sadly that is not supported either.
Perhaps there's a "relax IOMMU" kernel boot option to pass? I tried almost everything remotely related to IOMMU and RMRR, no luck.
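For reference, in case anyone else wants to experiment: on Proxmox, kernel boot options go into /etc/default/grub (GRUB_CMDLINE_LINUX_DEFAULT) when booting via GRUB, or into /etc/kernel/cmdline when booting via systemd-boot. A sketch of applying a change (whether any option actually relaxes the new 1:1 mapping check is a separate question):

Code:
# GRUB: edit GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then regenerate the config
update-grub
# systemd-boot: edit the single line in /etc/kernel/cmdline, then refresh the ESPs
proxmox-boot-tool refresh
# after a reboot, verify the options actually took effect
cat /proc/cmdline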
 
