PCI Passthrough with RTX 4060 Ti: kvm: vfio: Unable to power on device, stuck in D3

Nov 14, 2024
Hey guys,
we are doing PCI passthrough of an RTX 4060 Ti on a B650D4U-2L2T/BCM mainboard with an AMD Ryzen 9 7900X CPU.
I got the passthrough working once yesterday, after a CMOS reset followed by reapplying these BIOS options (I verified the working passthrough using nvidia-smi in the VM), but after powering off the server and putting it back into the rack, it doesn't work anymore. Every time I start the VM, it shows this error (I rebooted ~5 times to check that it occurs every time):

Code:
kvm: vfio: Unable to power on device, stuck in D3
kvm: vfio: Unable to power on device, stuck in D3
TASK ERROR: start failed: QEMU exited with code -1

I am trying to figure out what causes this, because I did not change anything between yesterday, before the server was placed in the rack, and afterwards. The only things I changed when putting it into the rack were the power cable and the LAN cables.
I also don't know what to troubleshoot next. We already replaced the mainboard and the GPU a couple of weeks ago; maybe someone has a good idea what to test next?
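Before swapping more hardware, it can help to check what power state the kernel currently reports for the device. A minimal sketch (the sysfs path comes from the lspci output below; `pci_power_state` is just an illustrative helper name):

```shell
# Illustrative helper: print the runtime power state (D0..D3hot/D3cold)
# that the kernel reports for a PCI device via its sysfs directory.
pci_power_state() {
    cat "$1/power_state"   # $1 = e.g. /sys/bus/pci/devices/0000:01:00.0
}
```

On the host: `pci_power_state /sys/bus/pci/devices/0000:01:00.0` — a healthy device should report D0 while the VM is running.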

lspci -nnk
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti] [10de:2803] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] AD106 [GeForce RTX 4060 Ti] [1462:5174]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
01:00.1 Audio device [0403]: NVIDIA Corporation AD106M High Definition Audio Controller [10de:22bd] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] AD106M High Definition Audio Controller [1462:5174]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel

root@proxmox9:~# ls -l /sys/kernel/iommu_groups/12/devices/
total 0
lrwxrwxrwx 1 root root 0 Feb 27 09:44 0000:01:00.0 -> ../../../../devices/pci0000:00/0000:00:01.1/0000:01:00.0
lrwxrwxrwx 1 root root 0 Feb 27 09:44 0000:01:00.1 -> ../../../../devices/pci0000:00/0000:00:01.1/0000:01:00.1
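Group 12 containing only the two GPU functions is what you want for passthrough. For reference, a small script that dumps every IOMMU group and its devices (the base directory is parameterized only so the logic can be exercised outside real sysfs):

```shell
# List each IOMMU group and the PCI devices inside it.
list_iommu_groups() {
    base=${1:-/sys/kernel/iommu_groups}
    for dev in "$base"/*/devices/*; do
        [ -e "$dev" ] || continue
        group=${dev#"$base"/}; group=${group%%/*}   # group number from the path
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done | sort -V
}
```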

root@proxmox9:~# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:2803,10de:22bd disable_vga=1 disable_idle_d3=1
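One thing worth double-checking (not necessarily the cause here, since lspci above already shows vfio-pci in use): the Proxmox PCI passthrough wiki also recommends `softdep` lines so vfio-pci is loaded before the native drivers. A sketch, with the module names taken from the "Kernel modules" lines in the lspci output above:

```
# /etc/modprobe.d/vfio.conf (sketch)
options vfio-pci ids=10de:2803,10de:22bd disable_vga=1 disable_idle_d3=1
softdep nouveau pre: vfio-pci
softdep nvidiafb pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
```

Changes under /etc/modprobe.d/ only take effect after `update-initramfs -u -k all` and a reboot.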

VM > Hardware > PCI Device:
1740645746130.png

root@proxmox9:~# cat /etc/default/grub
[...]
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_aspm=off vfio-pci.disable_idle_d3=1 pci=realloc,reset pcie_port_pm=off"
[...]
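As a side note, `reset` is not a valid argument to the `pci=` kernel parameter — the journalctl output later in this thread logs ``PCI: Unknown option `reset'`` — so it can be dropped. A sketch of the cleanup (the helper name is illustrative):

```shell
# Drop the invalid ",reset" from the pci= option in a grub default line.
strip_pci_reset() {
    printf '%s\n' "$1" | sed 's/pci=realloc,reset/pci=realloc/'
}
# After editing /etc/default/grub, apply with:
#   update-grub && reboot
```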


root@proxmox9:~# dmesg | grep vfio
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-8-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt pcie_aspm=off vfio-pci.disable_idle_d3=1 pci=realloc,reset pcie_port_pm=off
[ 0.049791] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-8-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt pcie_aspm=off vfio-pci.disable_idle_d3=1 pci=realloc,reset pcie_port_pm=off
[ 3.326677] vfio-pci 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[ 3.326760] vfio_pci: add [10de:2803[ffffffff:ffffffff]] class 0x000000/00000000
[ 3.326819] vfio_pci: add [10de:22bd[ffffffff:ffffffff]] class 0x000000/00000000
[ 16.847494] vfio-pci 0000:01:00.1: enabling device (0000 -> 0002)
[ 20.033227] vfio-pci 0000:01:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 20.037272] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 20.758299] vfio-pci 0000:01:00.0: timed out waiting for pending transaction; performing function level reset anyway
[ 21.950364] vfio-pci 0000:01:00.0: not ready 1023ms after FLR; waiting
[ 23.038217] vfio-pci 0000:01:00.0: not ready 2047ms after FLR; waiting
[ 25.150502] vfio-pci 0000:01:00.0: not ready 4095ms after FLR; waiting
[ 29.502524] vfio-pci 0000:01:00.0: not ready 8191ms after FLR; waiting
[ 38.206361] vfio-pci 0000:01:00.0: not ready 16383ms after FLR; waiting
[ 55.102743] vfio-pci 0000:01:00.0: not ready 32767ms after FLR; waiting
[ 91.454980] vfio-pci 0000:01:00.0: not ready 65535ms after FLR; giving up
[ 91.553262] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 91.553648] vfio-pci 0000:01:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 100.580322] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 100.580357] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1149.620773] vfio-pci 0000:01:00.0: timed out waiting for pending transaction; performing function level reset anyway
[ 1150.796899] vfio-pci 0000:01:00.0: not ready 1023ms after FLR; waiting
[ 1151.884942] vfio-pci 0000:01:00.0: not ready 2047ms after FLR; waiting
[ 1153.996774] vfio-pci 0000:01:00.0: not ready 4095ms after FLR; waiting
[ 1158.476981] vfio-pci 0000:01:00.0: not ready 8191ms after FLR; waiting
[ 1167.181084] vfio-pci 0000:01:00.0: not ready 16383ms after FLR; waiting
[ 1184.077151] vfio-pci 0000:01:00.0: not ready 32767ms after FLR; waiting
[ 1217.869729] vfio-pci 0000:01:00.0: not ready 65535ms after FLR; giving up
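Since the FLR itself is what keeps timing out, it may be worth asking the kernel which reset mechanisms it believes the device supports. On kernel 6.8 this is exposed per device in sysfs; a sketch (the helper name is illustrative):

```shell
# Show the reset methods (e.g. "flr pm bus") that the kernel reports
# as available for a PCI device.
show_reset_methods() {
    cat "$1/reset_method"   # $1 = sysfs device dir
}
# Host usage: show_reset_methods /sys/bus/pci/devices/0000:01:00.0
# Writing a method name back (e.g. "bus") restricts which reset the
# kernel will try, which can help work around a broken FLR.
```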
root@proxmox9:~# dmesg | grep -e DMAR -e IOMMU
[ 0.442148] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.444095] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).

BIOS Settings:
1740646109837.png
1740646146945.png
1740646509416.png

BIOS Firmware Version 20.07 (latest)

root@proxmox9:~# apt update
Hit:1 http://security.debian.org bookworm-security InRelease
Hit:2 http://ftp.de.debian.org/debian bookworm InRelease
Hit:3 http://ftp.de.debian.org/debian bookworm-updates InRelease
Hit:4 https://enterprise.proxmox.com/debian/pve bookworm InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.

root@proxmox9:~# uname -a
Linux proxmox9 6.8.12-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-8 (2025-01-24T12:32Z) x86_64 GNU/Linux
 
Hello peterge-misoft! Can you please post the following:
  1. Output of pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""
  2. The full VM configuration (output of qm config <VMID>).
 
1. root@proxmox9:~# pvesh get /nodes/proxmox9/hardware/pci --pci-class-blacklist ""
The formatting gets mangled when I copy and paste the output, and it exceeds the character limit. Here are two screenshots of the output:
1740723547031.png
1740723569721.png


2. root@proxmox9:~# qm config 100
agent: enabled=1
boot: order=ide2;scsi0
cores: 20
cpu: host
description: <!--- The Location is only required on hypervisors%0A**Location%3A**%0A%0AThe physical location of the hypervisor, either Bonn or Cologne.%0A%0A**Special Hardware%3A**%0A%0ADefault%3A none, can contain a GPU, USB dongles or other hardware.%0A-->%0A**Key Player%3A**%0A%0AAndreas Behrendt%0A%0A**Role/Function%3A**%0A%0ATest von px9 PCI Passthrough, Klon von Vollstreckungsbescheid KI%0A%0A**Application%3A**%0A%0A[VollstreckungsbescheidKI](https%3A//gitlab.misoft.local/kunden/vc/vollstreckungsbescheid-ki-backend)%0A%0A**Dependencies%3A**%0A%0Aroot PW%3A Standard ohne !%0A%0A**Status%3A**%0A%0AKopie von VollstreckungsbescheidKI
hostpci0: 0000:01:00
ide2: none,media=cdrom
kvm: 1
memory: 53248
meta: creation-qemu=7.1.0,ctime=1679486323
name: gpu-test
numa: 0
onboot: 1
ostype: l26
scsi0: VM-Storage1:vm-100-disk-0,discard=on,iothread=1,size=400G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=318e3526-fe14-4076-b9d6-a4a2093162ca
sockets: 1
tags: linux;service;gpu;ki
vmgenid: b4881c20-0666-4091-a8ed-e1181b0e3c78
 
Thanks for the information! Could you please try the following:
  1. Since your GPU supports UEFI, please try disabling 'Legacy boot' or CSM in the BIOS of the server.
  2. Change your VM to use OVMF (UEFI) instead of SeaBIOS.
  3. Try using q35 as your machine type in the VM settings. After changing this, open the GPU settings of the VM - you should now also be able to enable PCI Express.
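Those three VM changes can also be applied from the CLI with `qm set`; a sketch for VMID 100 (the storage name is taken from the scsi0 line in the config above — adjust to your setup):

```
qm set 100 --machine q35
qm set 100 --bios ovmf --efidisk0 VM-Storage1:1,efitype=4m
qm set 100 --hostpci0 0000:01:00,pcie=1
```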
 

1. CSM was disabled the whole time:
1740733688760.png

2. I switched from SeaBIOS to OVMF and added an EFI disk:
1740733979930.png
1740734204412.png
After starting the VM the start task still shows:
Code:
kvm: vfio: Unable to power on device, stuck in D3
kvm: vfio: Unable to power on device, stuck in D3
~60 seconds after starting the VM the console displays this:
1740735113227.png

3. I switched the VM back to SeaBIOS and removed the EFI disk.
Then I changed from i440fx to q35:
1740735553480.png
1740735413504.png
And I enabled PCI Express in the GPU setting:
1740735470808.png
but the output of the start task still shows the same error:
Code:
kvm: vfio: Unable to power on device, stuck in D3
kvm: vfio: Unable to power on device, stuck in D3
TASK OK
running nvidia-smi still does not work :(



IMHO the VM is not the reason this isn't working; we have 4 other Proxmox hosts where we use PCI passthrough, and this VM is a clone of one of the VMs on those hosts...
 
Just wondering:
  1. Which PSU do you have? I'm wondering whether the GPU isn't getting enough power, which would explain why it cannot power on.
  2. Are you 100% sure that the power cable from the PSU to the GPU is properly plugged in?
  3. Can you also post the full output of journalctl --boot and dmesg?
 
Sorry for not responding @l.leahu-vladucu, today is my first day back in the office with access to the server room.

1. The server uses a be quiet! Straight Power 11 750W ATX 2.4 PSU.
2. Yep, 100% sure that the PSU's PCIe 8-pin cable is connected to the GPU (the cable clip is seated correctly) and properly plugged in: https://imgur.com/a/YJ0v95q
3. root@proxmox9:~# journalctl --boot
Mar 04 08:33:18 proxmox9 kernel: Linux version 6.8.12-8-pve (build@proxmox) (gcc (Debian 12.2>
Mar 04 08:33:18 proxmox9 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-8-pve root=/de>
Mar 04 08:33:18 proxmox9 kernel: KERNEL supported cpus:
Mar 04 08:33:18 proxmox9 kernel: Intel GenuineIntel
Mar 04 08:33:18 proxmox9 kernel: AMD AuthenticAMD
Mar 04 08:33:18 proxmox9 kernel: Hygon HygonGenuine
Mar 04 08:33:18 proxmox9 kernel: Centaur CentaurHauls
Mar 04 08:33:18 proxmox9 kernel: zhaoxin Shanghai
Mar 04 08:33:18 proxmox9 kernel: BIOS-provided physical RAM map:
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009afefff] usable
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x0000000009aff000-0x0000000009ffffff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a211fff] ACPI >
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000000a212000-0x000000000affffff] usable
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000000b000000-0x000000000b020fff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000000b021000-0x000000008857efff] usable
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000008857f000-0x000000008e57efff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000008e57f000-0x000000008e67efff] ACPI >
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000008e67f000-0x000000009067efff] ACPI >
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000009067f000-0x000000009867efff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000009867f000-0x00000000987fefff] type >
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x00000000987ff000-0x0000000099ff8fff] usable
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x0000000099ff9000-0x0000000099ffbfff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x0000000099ffc000-0x0000000099ffffff] usable
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000009a000000-0x000000009bffffff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000009d7f3000-0x000000009fffffff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x00000000fd000000-0x00000000ffffffff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x0000000100000000-0x000000103de7ffff] usable
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000103eec0000-0x00000010801fffff] reser>
Mar 04 08:33:18 proxmox9 kernel: BIOS-e820: [mem 0x000000fd00000000-0x000000ffffffffff] reser>
Mar 04 08:33:18 proxmox9 kernel: PCI: Unknown option `reset'
Mar 04 08:33:18 proxmox9 kernel: NX (Execute Disable) protection: active
Mar 04 08:33:18 proxmox9 kernel: APIC: Static calls initialized
Mar 04 08:33:18 proxmox9 kernel: efi: EFI v2.9 by American Megatrends
Mar 04 08:33:18 proxmox9 kernel: efi: ACPI=0x90665000 ACPI 2.0=0x90665014 TPMFinalLog=0x90632>
Mar 04 08:33:18 proxmox9 kernel: efi: Remove mem105: MMIO range=[0xe0000000-0xefffffff] (256M>
Mar 04 08:33:18 proxmox9 kernel: e820: remove [mem 0xe0000000-0xefffffff] reserved
Mar 04 08:33:18 proxmox9 kernel: efi: Remove mem106: MMIO range=[0xfd000000-0xfedfffff] (30MB>
Mar 04 08:33:18 proxmox9 kernel: e820: remove [mem 0xfd000000-0xfedfffff] reserved
Mar 04 08:33:18 proxmox9 kernel: efi: Not removing mem107: MMIO range=[0xfee00000-0xfee00fff]>
Mar 04 08:33:18 proxmox9 kernel: efi: Remove mem108: MMIO range=[0xfee01000-0xffffffff] (17MB>
Mar 04 08:33:18 proxmox9 kernel: e820: remove [mem 0xfee01000-0xffffffff] reserved
Mar 04 08:33:18 proxmox9 kernel: efi: Remove mem110: MMIO range=[0x1060000000-0x10801fffff] (>
Mar 04 08:33:18 proxmox9 kernel: e820: remove [mem 0x1060000000-0x10801fffff] reserved
Mar 04 08:33:18 proxmox9 kernel: secureboot: Secure boot disabled
Mar 04 08:33:18 proxmox9 kernel: SMBIOS 3.7.0 present.

root@proxmox9:~# dmesg
The 16,000-character limit is exceeded, so: https://pastebin.com/fjgcbaYM
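Rather than pastebinning the full dmesg, filtering for the relevant lines keeps posts under the limit. A small sketch (the function name is illustrative):

```shell
# Pull only the lines relevant to this issue (vfio messages, D3/FLR power
# events, and anything mentioning the GPU at 01:00) out of a saved log.
filter_gpu_log() {
    grep -E 'vfio|01:00\.[01]|D3|FLR' "$1"
}
# Host usage: dmesg > /tmp/dmesg.txt && filter_gpu_log /tmp/dmesg.txt
```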
 
Could you please try reseating the GPU on the motherboard? Simply unplug it, plug it back in, and make sure it is well seated in the PCIe slot.

As far as I can see, the GPU either has some connection issues of some sort (either power or PCIe connection), or the PSU has issues delivering enough power to the GPU. Or it is a motherboard issue.

I got the passthrough working yesterday once after a CMOS reset followed by reapplying these bios options (I verified working passthrough using nvidia-smi in the VM), but after powering off the server and putting it back into the rack, it doesn't work anymore. Every time I start the VM, it does show this error (rebooted ~5 times to test if it occurs every time):
Out of curiosity, if you try again exactly with the previous setup, does it work?
 
We ordered a new 750W PSU, replaced it, and passthrough worked without issues twice on the workbench. After putting the server back into the rack, right after the first boot the "kvm: vfio: Unable to power on device, stuck in D3" issue was back.

We are unsure whether the issue is caused by the power circuits in our office, but it wasn't there before the problems started on this server...

A motherboard issue is a valid option, but the current motherboard is new, it was delivered in 12/24...

Edit: After unplugging it from the rack and putting it back on the workbench, it gave the D3 error again. I would guess this might be a heat issue related to the mainboard?
Another thing I noticed on the workbench at the beginning of the day was that network connectivity sometimes dropped without any changes being made to the network cables. So again, my best guess is the mainboard...

All of this is really annoying
 
Thanks for the update! Hmm, this still sounds like a power issue of some kind. The following comes to mind:
  1. Either some cables are loose and no longer work properly when putting the server back into the rack. You can also try changing the cables (and additionally check that everything is properly connected), just in case any of them is damaged.
  2. Or, the rack has some power issues. In this case it would make sense to check whether the server is actually able to get as much power as it needs.
  3. Try to reseat the GPU, just in case it is not properly connected to the motherboard.
Since you tried it twice with different power supplies, and since it works outside of the server rack, I would say it's rather unlikely to be a motherboard or power supply issue.

Let me know if you find out what the issue is ;)
 
I did not find out what the issue is, but let me describe how I "fixed" the problem. Hopefully it will help others.

0. We bought the initial proxmox9 hardware in December 2023.
1. We ordered a new B650D4U-2L2T/BCM mainboard for the proxmox9 server in December 2024, because we thought it could be a mainboard issue. The issue continued on the new board...
2. We ordered a new PSU in March 2025, as described in my post above.
3. We ordered a new machine for another customer, named proxmox14, without PSU & mainboard, so just CPU, GPU, RAM and disks.
4. The GPU was ordered wrong (8 GB instead of 16 GB), so I built proxmox14 with the GPU from proxmox9.
5. I built that machine with the mainboard from 2023 and the old PSU (because the new parts from 1. & 2. were built into proxmox9), checked the BIOS settings, and it worked instantly.
6. I installed the same BMC and BIOS versions on proxmox9 as the old mainboard in proxmox14 has.
7. The 16 GB GPU arrived and I placed it into proxmox9. I took the CPU cooler off to check the thermal paste, but it was fine, so I checked the BIOS, booted it, and it started working with our test?!?

The BIOS settings I took care of are the three in this guide: https://pve.proxmox.com/wiki/PCI_Passthrough#BIOS_options

The Firmware versions on proxmox9 are these:
BMC Firmware Version 6.01.00
BIOS Firmware Version 10.18

I am not completely sure what caused all this trouble, because everything worked today. I am pretty sure the board is flaky, but as of right now both machines are working. Weird things are still happening, though: proxmox9 booted without network connectivity twice today, it takes a long time to power up, and once it spun only one of the two GPU fans while at full load and ~70 °C.
So I am pretty sure the board is not working properly, as confirmed by @MarkusKo too. But with this firmware version and these settings it seems to do its job...

Thanks, @l.leahu-vladucu for your time!
 
Oh, and one more thing:
When I wrote to ASRock Support about this problem, I got this answer:
Dear Sir,

Thank you for contacting ASRock Rack support.

If the board POSTs to the BIOS, you can try using this socflash version to flash the BMC.

Please check the website for the latest BIOS & BMC firmware:

https://www.asrockrack.com/general/productdetail.asp?Model=B650D4U-2L2T/BCM#Download

Q: How to flash BMC FW in DOS?

Preparation

A bootable USB drive

(P.S. you can build a bootable USB with https://rufus.akeo.ie/ )

Update BMC batch files folder
Before using the Socflash tool to update, please enter the BIOS, press "F9" to load defaults, then press "F10" to save and exit the BIOS, and make sure [CSM] is set to [Enabled].

https://www.dropbox.com/s/eosedwhx333nh9w/socflash_v12207.zip?dl=0

Step:

1. Copy your BMC image to the "Socflash" folder.

2. Use a bootable USB drive to boot to DOS.

3. Execute the command "Socflash.exe if={update image name}" to update the BMC.

Ex: Socflash.exe if=EPC612D8L1.10 option=r


4. After the update finishes, please power off/on and check the BMC version in the BIOS.



P.S. If you have a GPU/PCIe card installed and the update fails, please manually enable the onboard VGA in the BIOS. If this fails,

retry the procedure without the GPU/PCIe card installed.


Regards,

ASRockRack TSD
But I did not flash it using socflash today; I just used the Firmware Update & BIOS update options under Maintenance in the BMC WebUI.