Problems with GPU Passthrough since 8.2

merasil
Member
Mar 9, 2020
Hi,

I have had a problem with GPU passthrough since updating to 8.2. The VM to which the GPU is passed through no longer starts and fails with the following error message:

Code:
kvm: -device vfio-pci,host=0000:06:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,x-vga=on,multifunction=on: vfio 0000:06:00.0: failed to setup container for group 20: Failed to set group container: Invalid argument
TASK ERROR: start failed: QEMU exited with code 1

I have these logs on the host:

Code:
root@hvs:~# dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
[    0.512344] AMD-Vi: Using global IVHD EFR:0xf77ef22294ada, EFR2:0x0
[    1.044799] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    1.046587] AMD-Vi: Extended features (0xf77ef22294ada, 0x0): PPR NX GT IA GA PC GA_vAPIC
[    1.046600] AMD-Vi: Interrupt remapping enabled
[    1.046759] AMD-Vi: Virtual APIC enabled
[    1.046912] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[   59.575729] vfio-pci 0000:06:00.0: Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.

Code:
root@hvs:~# lsmod | grep vfio
vfio_pci               16384  0
vfio_pci_core          86016  1 vfio_pci
irqbypass              12288  2 vfio_pci_core,kvm
vfio_iommu_type1       49152  0
vfio                   69632  3 vfio_pci_core,vfio_iommu_type1,vfio_pci
iommufd                98304  1 vfio

Code:
root@hvs:~# lspci -k
06:00.0 VGA compatible controller: NVIDIA Corporation TU117GL [T400 4GB] (rev a1)
        Subsystem: Lenovo TU117GL [T400 4GB]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
06:00.1 Audio device: NVIDIA Corporation Device 10fa (rev a1)
        Subsystem: Lenovo Device 1613
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

I have not changed anything in the BIOS of the host. IOMMU also seems to be enabled. How can I debug the whole thing further?
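For reference, the group membership can also be checked directly in sysfs; a minimal sketch of those checks (assuming the same PCI address as above):

Code:
# which IOMMU group does the GPU belong to?
readlink /sys/bus/pci/devices/0000:06:00.0/iommu_group
# list every device that shares that group (replace 20 with the group from above)
ls /sys/kernel/iommu_groups/20/devices/
# show the vfio/IOMMU related messages around the rejection
dmesg | grep -i -e vfio -e 'iommu mapping'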
 
As this does not seem to be widespread, I guess it might be due to the fact that we are running AMD CPUs instead of Intel.
Perhaps the newer kernel code is not compatible with AMD.
 
I have the same problem. I was going to pass through my Nvidia GTX 1060 6GB to my Windows 11 VM and then I got the following error message:
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,host=0000:0b:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:0b:00.0: failed to setup container for group 32: Failed to set group container: Invalid argument
stopping swtpm instance (pid 3367) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1
CPU: Intel(R) Xeon(R) CPU E5530
GPU: Nvidia GeForce GTX 1060 6GB
RAM: 16GB ECC RDIMM
Boot Mode: Legacy BIOS (maybe that's the problem?)
Proxmox version: 8.2.2
Any help would be appreciated. I'm also quite new to Proxmox and Linux in general, so maybe I just misconfigured something?
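For reference, Legacy BIOS alone should not break passthrough, but on Intel the IOMMU has to be enabled in the firmware (VT-d) and, depending on the kernel, via intel_iommu=on on the kernel command line. Two quick checks, just as a sketch:

Code:
# confirm the IOMMU options were actually applied at boot
cat /proc/cmdline
# Intel hosts log DMAR/IOMMU messages rather than AMD-Vi
dmesg | grep -e DMAR -e IOMMU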
 
I'm using a 5700X. After enabling the IOMMU-related options in the BIOS (IOMMU, AES and ACR) and checking that my GPU has a separate group, the command dmesg | grep -e DMAR -e IOMMU -e AMD-Vi just returns nothing. Any idea?
 
I too am having issues with GPU passthrough since 8.2, getting the following error when starting the VM:
Code:
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
TASK ERROR: start failed: QEMU exited with code 1
This started with the latest update, and I hoped it would be fixed by the new update I just applied (kernel 6.8.4-3), but it is still having this issue.
I messed with the VM config to attempt a fix and am still getting the error.
Current VM config:

Code:
root@pve2:/etc/pve/nodes/pve3/qemu-server# cat 121.conf
agent: enabled=1,fstrim_cloned_disks=1,type=virtio
boot: c
bootdisk: scsi0
cipassword: 
cores: 8
cpu: host
hostpci0: 0000:81:00.0,pcie=1
ide2: c-vm:vm-121-cloudinit,media=cdrom,size=4M
ipconfig0: ip=dhcp
machine: q35
memory: 32768
meta: creation-qemu=8.1.5,ctime=1715103535
name: ollama2
net0: virtio=BC:24:11:79:2D:9B,bridge=vmbr0,tag=2
numa: 0
scsi0: c-vm:vm-121-disk-0,size=30G
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=bc4d4e3a-dfa0-40a7-95ac-d5d308227f2e
sockets: 1
sshkeys:
vga: serial0
vmgenid: f578cc50-250c-41ce-98c5-146cc416d79d

Any help would be appreciated; this issue has shut down my Ollama VM.
 
I also have a PCI passthrough problem with the latest kernel (I think).

Win11 VM with GeForce passthrough fails with:


Task viewer: VM 300 - Start
swtpm_setup: Not overwriting existing state file.
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
stopping swtpm instance (pid 7163) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1

Can anybody help with this?
 
I got the "Assertion '0 <= irq_num && irq_num < PCI_NUM_PINS' failed" with an RX570 and it was caused by amdgpu crashing and leaving the GPU in an unusable state. The work-around, for me, is to blacklist amdgpu (so I can still use the GPU as host console until the VM with passthrough starts).
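For anyone wanting to try the same work-around: on Proxmox/Debian this is a modprobe.d snippet plus an initramfs rebuild. A minimal sketch (the file name is arbitrary):

Code:
# keep the host from loading amdgpu
echo "blacklist amdgpu" > /etc/modprobe.d/blacklist-amdgpu.conf
# rebuild the initramfs so the blacklist is applied early, then reboot
update-initramfs -u -k all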
 
I got the "Assertion '0 <= irq_num && irq_num < PCI_NUM_PINS' failed" with an RX570 and it was caused by amdgpu crashing and leaving the GPU in an unusable state. The work-around, for me, is to blacklist amdgpu (so I can still use the GPU as host console until the VM with passthrough starts).
Cool, but mine fails right on reboot of the host too, and it is an Nvidia Tesla card. Also, it is blacklisted already and has been since it was set up. I appreciate the feedback; however, I do not think it is related to a GPU crash, since the only difference was the kernel update a few weeks ago, and it has not worked since. It really sucks because the Tesla GPU is expensive and I cannot continue testing Ollama. I cannot wait to test the graphics processing they implemented. Thanks again, I hope your suggestion helps others!
 
Still no fix from Proxmox. Gee golly, I really hate rolling back kernel versions over this!
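If rolling back does become necessary, it can at least be done without uninstalling anything, by pinning the previous kernel; a sketch (the version string has to match whatever kernel list prints):

Code:
# show the installed kernels
proxmox-boot-tool kernel list
# boot the previous kernel by default (example version, adjust to the list output)
proxmox-boot-tool kernel pin 6.8.4-2-pve
# remove the pin again once the issue is fixed
proxmox-boot-tool kernel unpin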
 
Can somebody else also confirm if the problem occurs with 6.8.4-3-pve but not with 6.8.4-2-pve? If yes, I'd try to reproduce and bisect it.
 
Can somebody else also confirm if the problem occurs with 6.8.4-3-pve but not with 6.8.4-2-pve? If yes, I'd try to reproduce and bisect it.
If you are referring to
Code:
Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.
this has been happening on all 6.8 kernels for me when trying to pass through a NIC, including the on-board ones of my M11SDV-8C-LN4F.
Kernel 6.5 is fine.
 
No, I was actually talking about the 'kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.' issue.


The 1:1 mapping seems to be a stricter enforcement of the IOMMU rules; I don't think it's a "bug" per se.
EDIT: can you maybe post your IOMMU groups?
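For example, something along these lines prints every group together with its devices (the usual sysfs loop, nothing Proxmox-specific):

Code:
# list each IOMMU group and the devices it contains
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
    done
done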
 
The 1:1 mapping seems to be a stricter enforcement of the IOMMU rules; I don't think it's a "bug" per se.
Thanks, can you think of a fix though? I already approached Supermicro and got their latest RC BIOS, no luck. They refuse to do anything else though, because Proxmox is not on the compatibility list for their board.
 
Thanks, can you think of a fix though? I already approached Supermicro and got their latest RC BIOS, no luck. They refuse to do anything else though, because Proxmox is not on the compatibility list for their board.
You could try to replicate the behaviour on Ubuntu 24.04 if that is supported by Supermicro. That should have the same kernel (though I'm not sure if they're on the exact same version).
 
You could try to replicate the behaviour on Ubuntu 24.04 if that is supported by Supermicro. That should have the same kernel (though I'm not sure if they're on the exact same version).
No, sadly that is not supported either.
Perhaps there's a "relax IOMMU" kernel boot option to pass? I tried almost everything remotely related to IOMMU and RMRR, no luck.
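For reference, in case anyone else wants to experiment: on Proxmox, kernel boot options go into /etc/default/grub (GRUB_CMDLINE_LINUX_DEFAULT) when booting via GRUB, or into /etc/kernel/cmdline when booting via systemd-boot. A sketch of applying a change (whether any option actually relaxes the new 1:1 mapping check is a separate question):

Code:
# GRUB: edit GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then regenerate the config
update-grub
# systemd-boot: edit the single line in /etc/kernel/cmdline, then refresh the ESPs
proxmox-boot-tool refresh
# after a reboot, verify the options actually took effect
cat /proc/cmdline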
 
