GPU Passthrough ends up with PCI_NUM_PINS error

server_paul

Member
Jun 22, 2022
I'm running Proxmox on my server and recently added an AMD Radeon RX 5700 XT.
It is affected by the vendor-reset bug, but I solved that. I can now reboot the VMs without any errors.

The GPU has 1 HDMI and 3 DP ports. I initially connected it via HDMI to my monitor, which is connected to a smart power plug so I can turn it off completely when I'm not home. This worked well.

However, I have now switched to DisplayPort, and just "restarting" my monitor results in an error.
Code:
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
TASK ERROR: start failed: QEMU exited with code 1
The only way I was able to fix this is by rebooting Proxmox.
 
I solved it - I used to do
Code:
echo 0 > /sys/bus/pci/devices/.../d3cold_allowed
because I had some trouble with that in the past. Once I echoed 1 into it, the error didn't return.
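Note that an echo into /sys does not survive a reboot; if the value needs to stick, one way is a udev rule that writes the attribute when the device appears. This is only a sketch — the filename is arbitrary and the PCI address below is a placeholder that has to be replaced with the real one:
Code:
```text
# /etc/udev/rules.d/99-d3cold.rules  (filename arbitrary)
# 0000:0b:00.0 is a placeholder PCI address - substitute your GPU's address
ACTION=="add", SUBSYSTEM=="pci", KERNEL=="0000:0b:00.0", ATTR{d3cold_allowed}="1"
```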
The VM however, still runs into an "internal error". After a reboot of the VM everything worked fine again...
Can I somehow get a log of what happened inside the Debian VM?
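One general way to capture what happened inside the Debian guest is to make its systemd journal persistent, so the log of the crashed boot survives the reboot. This is the standard systemd mechanism, nothing Proxmox-specific:
Code:
```text
# /etc/systemd/journald.conf (inside the Debian VM)
[Journal]
Storage=persistent
```
After restarting systemd-journald (or rebooting once), the previous boot's messages can be read back with `journalctl -b -1 -p warning`.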
 
I'm having a similar issue with an RX 6400, though when checking d3cold_allowed it seems I have 1 by default.
 
I just started seeing this after running the latest updates this morning; it seems updating the kernel to 6.8.4-3 caused the problem. I'm using PCI passthrough for NVMe drives, though, not a GPU.
I fixed it for now by pinning the old kernel:
Code:
proxmox-boot-tool kernel list
note the version before the newest
Code:
proxmox-boot-tool kernel pin 6.8.4-2-pve
proxmox-boot-tool refresh
shutdown -r now
 
I've had the same problem since kernel 6.8 and worked around it by blacklisting amdgpu. Passthrough would only fail when amdgpu had been loaded first (which was not a problem before, thanks to vendor-reset). amdgpu also crashes when I bind it back to my RX570 after passthrough, which apparently then breaks passthrough with the error above; rebinding worked fine before and restored the host console.

EDIT: Other 6.8 kernels (non-Proxmox) also crash the RX570, which I tested inside the VM with passthrough (but most VMs still use older kernels, and work fine).
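For reference, the blacklist described above is usually done with a modprobe.d snippet (the filename is arbitrary):
Code:
```text
# /etc/modprobe.d/blacklist-amdgpu.conf
blacklist amdgpu
```
followed by `update-initramfs -u` and a reboot, so the driver is no longer loaded early from the initramfs.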
 
I've had the same issue.
At first my VM booted without error, but it took a very long time and I couldn't reach the desktop after it booted. After shutting down and starting it again, I got the same error as server_paul.
Booting Proxmox from an older kernel (6.8.2 in my case) fixes the issue.
 
Hi

I have the same problem with the latest kernel. I pass a GTX 1030 and one NVMe drive to a Win11 VM.

Code:
swtpm_setup: Not overwriting existing state file.
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
stopping swtpm instance (pid 16941) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1

It works only on proxmox-kernel-6.8.4-2.
6.8.4-3 and 6.8.8-1 give these errors.

Are there any solutions other than using the old kernel?
please :)
 
I also faced this problem; it took many hours to find a working workaround.

It's a large community, but I haven't seen any suggested fix, which is a bit frustrating.

root@pve:~# vi /etc/default/grub
-- add pcie_acs_override=downstream to the existing line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream nomodeset"

root@pve:~# update-grub
root@pve:~# reboot

This solution has worked for me with the latest kernel 6.8.8-1; my VM starts again without any annoying errors.
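As a quick sanity check that the new parameter actually reached the running kernel after `update-grub` and the reboot, one can grep the live command line. A small sketch; `has_param` is just a helper name made up here, and the file argument exists only so the check can be tried against a sample cmdline:

```shell
#!/bin/sh
# Return success if a kernel parameter appears on the boot command line.
# $1 = parameter name, $2 = cmdline file (defaults to /proc/cmdline)
has_param() {
    grep -qw "$1" "${2:-/proc/cmdline}"
}

if has_param pcie_acs_override; then
    echo "pcie_acs_override is active"
else
    echo "pcie_acs_override is NOT on the running cmdline"
fi
```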
 
I am running into the same issues with both 1080 Ti and W5500 (or any PCIe device passthrough really). The VM even crashes randomly after a few hours, not being able to start up again due to the PCI error.

I tried the solutions proposed in this thread, but only the kernel downgrade actually helped. If there are any other suggestions of what to try (or new kernel that actually works), I'd be happy to try them out.
 
(quoting the pcie_acs_override=downstream GRUB workaround above)
Unfortunately that doesn't seem to help in my case :(.

Also /sys/devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/d3cold_allowed (AMD RX 6600) was already set to 1.

I tried setting it to 0, but that doesn't really solve the issue.
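Before flipping the value by hand, it can help to rule out a stale setting on one of the functions: the GPU and its audio function (0b:00.0 and 0b:00.1 here) each carry their own d3cold_allowed attribute. A small sketch that lists the value for everything bound to vfio-pci; the base path is parameterised only so the loop can be exercised outside a live host:

```shell
#!/bin/sh
# Print d3cold_allowed for every PCI function bound to vfio-pci.
# $1 optionally overrides the sysfs driver directory (for testing).
list_d3cold() {
    base="${1:-/sys/bus/pci/drivers/vfio-pci}"
    for dev in "$base"/0000:*; do
        if [ -f "$dev/d3cold_allowed" ]; then
            printf '%s: %s\n' "$(basename "$dev")" "$(cat "$dev/d3cold_allowed")"
        fi
    done
}

list_d3cold
```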

dmesg says the following (with the default d3cold_allowed = 1):
Code:
[  565.645785] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  566.648134] pcieport 0000:0a:00.0: retraining failed
[  567.821193] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  568.822764] pcieport 0000:0a:00.0: retraining failed
[  568.822775] vfio-pci 0000:0b:00.0: not ready 1023ms after bus reset; waiting
[  569.868967] vfio-pci 0000:0b:00.0: not ready 2047ms after bus reset; waiting
[  571.981258] vfio-pci 0000:0b:00.0: not ready 4095ms after bus reset; waiting
[  576.332964] vfio-pci 0000:0b:00.0: not ready 8191ms after bus reset; waiting
[  585.036958] vfio-pci 0000:0b:00.0: not ready 16383ms after bus reset; waiting
[  601.933260] vfio-pci 0000:0b:00.0: not ready 32767ms after bus reset; waiting
[  636.749228] vfio-pci 0000:0b:00.0: not ready 65535ms after bus reset; giving up
[  637.022765] vfio-pci 0000:0b:00.1: Unable to change power state from D0 to D3hot, device inaccessible
[  637.022795] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  637.025694] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  638.055059] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  639.056771] pcieport 0000:0a:00.0: retraining failed
[  640.268955] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  641.270768] pcieport 0000:0a:00.0: retraining failed
[  641.270779] vfio-pci 0000:0b:00.0: not ready 1023ms after bus reset; waiting
[  642.316963] vfio-pci 0000:0b:00.0: not ready 2047ms after bus reset; waiting
[  644.428967] vfio-pci 0000:0b:00.0: not ready 4095ms after bus reset; waiting
[  649.036966] vfio-pci 0000:0b:00.0: not ready 8191ms after bus reset; waiting
[  657.741259] vfio-pci 0000:0b:00.0: not ready 16383ms after bus reset; waiting
[  674.637290] vfio-pci 0000:0b:00.0: not ready 32767ms after bus reset; waiting
[  710.476978] vfio-pci 0000:0b:00.0: not ready 65535ms after bus reset; giving up
[  710.480222] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  710.485090] vfio-pci 0000:0b:00.0: Unable to change power state from D0 to D3hot, device inaccessible

Then, after changing to d3cold_allowed = 0:
Code:
[  712.890762] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  713.892096] pcieport 0000:0a:00.0: retraining failed
[  715.085267] pcieport 0000:0a:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  716.087084] pcieport 0000:0a:00.0: retraining failed
[  716.087095] vfio-pci 0000:0b:00.0: not ready 1023ms after bus reset; waiting
[  717.133264] vfio-pci 0000:0b:00.0: not ready 2047ms after bus reset; waiting
[  719.245058] vfio-pci 0000:0b:00.0: not ready 4095ms after bus reset; waiting
[  723.788955] vfio-pci 0000:0b:00.0: not ready 8191ms after bus reset; waiting
[  732.493257] vfio-pci 0000:0b:00.0: not ready 16383ms after bus reset; waiting
[  749.389271] vfio-pci 0000:0b:00.0: not ready 32767ms after bus reset; waiting
[  784.205262] vfio-pci 0000:0b:00.0: not ready 65535ms after bus reset; giving up
[  784.210458] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  784.210474] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  784.389632] clocksource: timekeeping watchdog on CPU30: hpet wd-wd read-back delay of 220349ns
[  784.389642] clocksource: wd-tsc-wd read-back delay of 215739ns, clock-skew test skipped!
[  810.187187] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  810.187207] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  810.187327] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  810.187341] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  829.010921] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  829.010939] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  829.017765] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.394702] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.394725] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.394803] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.406801] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.408754] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible
[  830.408768] vfio-pci 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  830.410030] vfio-pci 0000:0b:00.1: Unable to change power state from D3cold to D0, device inaccessible

Not much difference. I'm not sure whether this is related to ASPM not working properly, or to amd_pstate.shared_mem=1 amd_pstate=guided.

EDIT 1: Removing these options from /etc/default/grub.d/powersave.cfg at least seems to keep the system from crashing as soon as I run any command:
Code:
cpufreq.default_governor=powersave initcall_blacklist=acpi_cpufreq_init amd_pstate.shared_mem=1 amd_pstate=guided
 
