GPU Passthrough ends up with PCI_NUM_PINS error

server_paul

Member
Jun 22, 2022
I'm running Proxmox on my server and recently added an AMD Radeon RX 5700 XT.
It is affected by the vendor-reset bug, but I solved that. I can now reboot the VMs without any errors.

The GPU has 1 HDMI and 3 DP ports. I initially connected it via HDMI to my monitor, which is connected to a smart power plug so I can turn it off completely when I'm not home. This worked well.

However, I have now switched to DisplayPort, and just "restarting" my monitor results in an error.
Code:
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
TASK ERROR: start failed: QEMU exited with code 1
The only way I was able to fix this is by rebooting Proxmox.
 
I solved it - I used to do
Code:
echo 0 > /sys/bus/pci/devices/.../d3cold_allowed
because I had some trouble with that in the past. Once I echoed 1 into it, the error didn't return.
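Note that an echo into /sys does not survive a reboot; if the value needs to stick, one way is a udev rule that writes the attribute when the device appears. This is only a sketch — the filename is arbitrary and the PCI address below is a placeholder that has to be replaced with the real one:
Code:
```text
# /etc/udev/rules.d/99-d3cold.rules  (filename arbitrary)
# 0000:0b:00.0 is a placeholder PCI address - substitute your GPU's address
ACTION=="add", SUBSYSTEM=="pci", KERNEL=="0000:0b:00.0", ATTR{d3cold_allowed}="1"
```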
The VM however, still runs into an "internal error". After a reboot of the VM everything worked fine again...
Can I somehow get a log of what happened inside the Debian VM?
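One general way to capture what happened inside the Debian guest is to make its systemd journal persistent, so the log of the crashed boot survives the reboot. This is the standard systemd mechanism, nothing Proxmox-specific:
Code:
```text
# /etc/systemd/journald.conf (inside the Debian VM)
[Journal]
Storage=persistent
```
After restarting systemd-journald (or rebooting once), the previous boot's messages can be read back with `journalctl -b -1 -p warning`.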
 
I'm having a similar issue with an RX 6400, though when checking d3cold_allowed it seems I have 1 by default.
 
I just started seeing this after running the latest updates this morning; it seems updating the kernel to 6.8.4-3 caused the problem. I'm using PCI passthrough for NVMe drives, though, not a GPU.
I fixed it for now by pinning the old kernel:
Code:
proxmox-boot-tool kernel list
note the version before the newest
Code:
proxmox-boot-tool kernel pin 6.8.4-2-pve
proxmox-boot-tool refresh
shutdown -r now
 
I've had the same problem since kernel 6.8 and worked around it by blacklisting amdgpu. Passthrough would only fail when amdgpu had been loaded first (which was not a problem before, thanks to vendor-reset). amdgpu also crashes when I bind it back to my RX570 after passthrough, which apparently then breaks passthrough with the error above; rebinding worked fine before and restored the host console.

EDIT: Other 6.8 kernels (non-Proxmox) also crash the RX570, which I tested inside the VM with passthrough (but most VMs still use older kernels, and work fine).
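For reference, the blacklist described above is usually done with a modprobe.d snippet (the filename is arbitrary):
Code:
```text
# /etc/modprobe.d/blacklist-amdgpu.conf
blacklist amdgpu
```
followed by `update-initramfs -u` and a reboot, so the driver is no longer loaded early from the initramfs.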
 
I've had the same issue.
At first my VM booted without error, but it took a very long time and I couldn't reach the desktop after it booted. After shutting down and starting it again, I got the same error as server_paul.
Booting Proxmox from an older kernel (6.8.2 in my case) fixes the issue.
 
Hi

I have the same problem with the latest kernel. I pass a GTX 1030 and one NVMe drive to a Win11 VM.

Code:
swtpm_setup: Not overwriting existing state file.
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
stopping swtpm instance (pid 16941) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1

It works only on proxmox-kernel-6.8.4-2.
6.8.4-3 and 6.8.8-1 give these errors.

Are there any solutions other than using the old kernel?
please :)
 
I also faced this problem; it took many hours to find a working workaround.

It's a large community, but I haven't seen any suggested fix, which is a bit frustrating.

root@pve:~# vi /etc/default/grub
-- add pcie_acs_override=downstream to the existing line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream nomodeset"

root@pve:~# update-grub
root@pve:~# reboot

This solution has worked for me with the latest kernel 6.8.8-1; my VM starts again without any annoying errors.
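As a quick sanity check that the new parameter actually reached the running kernel after `update-grub` and the reboot, one can grep the live command line. A small sketch; `has_param` is just a helper name made up here, and the file argument exists only so the check can be tried against a sample cmdline:

```shell
#!/bin/sh
# Return success if a kernel parameter appears on the boot command line.
# $1 = parameter name, $2 = cmdline file (defaults to /proc/cmdline)
has_param() {
    grep -qw "$1" "${2:-/proc/cmdline}"
}

if has_param pcie_acs_override; then
    echo "pcie_acs_override is active"
else
    echo "pcie_acs_override is NOT on the running cmdline"
fi
```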
 
I am running into the same issues with both 1080 Ti and W5500 (or any PCIe device passthrough really). The VM even crashes randomly after a few hours, not being able to start up again due to the PCI error.

I tried the solutions proposed in this thread, but only the kernel downgrade actually helped. If there are any other suggestions of what to try (or new kernel that actually works), I'd be happy to try them out.
 
(quoting the pcie_acs_override=downstream GRUB workaround above)
Unfortunately that doesn't seem to help in my case :(.

Also /sys/devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/d3cold_allowed (AMD RX 6600) was already set to 1.

I tried setting it to 0, but that doesn't really solve the issue.
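Before flipping the value by hand, it can help to rule out a stale setting on one of the functions: the GPU and its audio function (0b:00.0 and 0b:00.1 here) each carry their own d3cold_allowed attribute. A small sketch that lists the value for everything bound to vfio-pci; the base path is parameterised only so the loop can be exercised outside a live host:

```shell
#!/bin/sh
# Print d3cold_allowed for every PCI function bound to vfio-pci.
# $1 optionally overrides the sysfs driver directory (for testing).
list_d3cold() {
    base="${1:-/sys/bus/pci/drivers/vfio-pci}"
    for dev in "$base"/0000:*; do
        if [ -f "$dev/d3cold_allowed" ]; then
            printf '%s: %s\n' "$(basename "$dev")" "$(cat "$dev/d3cold_allowed")"
        fi
    done
}

list_d3cold
```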

dmesg says the following (with the default d3cold_allowed = 1):
Code:
[  565.645785] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  566.648134] pcieport 0000:0a:00.0: retraining failed
[  567.821193] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  568.822764] pcieport 0000:0a:00.0: retraining failed
[  568.822775] vfio-pci 0000:0b:00.0: not ready 1023ms after bus reset; waiting
[  569.868967] vfio-pci 0000:0b:00.0: not ready 2047ms after bus reset; waiting
[  571.981258] vfio-pci 0000:0b:00.0: not ready 4095ms after bus reset; waiting
[  576.332964] vfio-pci 0000:0b:00.0: not ready 8191ms after bus reset; waiting
[  585.036958] vfio-pci 0000:0b:00.0: not ready 16383ms after bus reset; waiting
[  601.933260] vfio-pci 0000:0b:00.0: not ready 32767ms after bus reset; waiting
[  636.749228] vfio-pci 0000:0b:00.0: not ready 65535ms after bus reset; giving up
[  637.022765] vfio-pci 0000:0b:00.1: Unable to change power state from D0 to D3hot, device inaccessible
[  637.022795] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  637.025694] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  638.055059] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  639.056771] pcieport 0000:0a:00.0: retraining failed
[  640.268955] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  641.270768] pcieport 0000:0a:00.0: retraining failed
[  641.270779] vfio-pci 0000:0b:00.0: not ready 1023ms after bus reset; waiting
[  642.316963] vfio-pci 0000:0b:00.0: not ready 2047ms after bus reset; waiting
[  644.428967] vfio-pci 0000:0b:00.0: not ready 4095ms after bus reset; waiting
[  649.036966] vfio-pci 0000:0b:00.0: not ready 8191ms after bus reset; waiting
[  657.741259] vfio-pci 0000:0b:00.0: not ready 16383ms after bus reset; waiting
[  674.637290] vfio-pci 0000:0b:00.0: not ready 32767ms after bus reset; waiting
[  710.476978] vfio-pci 0000:0b:00.0: not ready 65535ms after bus reset; giving up
[  710.480222] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  710.485090] vfio-pci 0000:0b:00.0: Unable to change power state from D0 to D3hot, device inaccessible

Then, after changing to d3cold_allowed = 0:
Code:
[  712.890762] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  713.892096] pcieport 0000:0a:00.0: retraining failed
[  715.085267] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  716.087084] pcieport 0000:0a:00.0: retraining failed
[  716.087095] vfio-pci 0000:0b:00.0: not ready 1023ms after bus reset; waiting
[  717.133264] vfio-pci 0000:0b:00.0: not ready 2047ms after bus reset; waiting
[  719.245058] vfio-pci 0000:0b:00.0: not ready 4095ms after bus reset; waiting
[  723.788955] vfio-pci 0000:0b:00.0: not ready 8191ms after bus reset; waiting
[  732.493257] vfio-pci 0000:0b:00.0: not ready 16383ms after bus reset; waiting
[  749.389271] vfio-pci 0000:0b:00.0: not ready 32767ms after bus reset; waiting
[  784.205262] vfio-pci 0000:0b:00.0: not ready 65535ms after bus reset; giving up
[  784.210458] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  784.210474] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  784.389632] clocksource: timekeeping watchdog on CPU30: hpet wd-wd read-back delay of 220349ns
[  784.389642] clocksource: wd-tsc-wd read-back delay of 215739ns, clock-skew test skipped!
[  810.187187] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  810.187207] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  810.187327] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  810.187341] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  829.010921] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  829.010939] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  829.017765] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.394702] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.394725] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.394803] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.406801] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.408754] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  830.408768] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.410030] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible

Not much difference. I'm not sure whether this is related to ASPM not working properly, or to amd_pstate.shared_mem=1 amd_pstate=guided.

EDIT 1: Removing these options from /etc/default/grub.d/powersave.cfg at least seems to keep the system from crashing as soon as I run any command:
Code:
cpufreq.default_governor=powersave initcall_blacklist=acpi_cpufreq_init amd_pstate.shared_mem=1 amd_pstate=guided
 
