PCI Passthrough with RTX 4060 Ti: kvm: vfio: Unable to power on device, stuck in D3

Nov 14, 2024
Hey guys,
we are passing through an RTX 4060 Ti on a B650D4U-2L2T/BCM mainboard with an AMD Ryzen 9 7900X CPU.
I got the passthrough working once yesterday, after a CMOS reset followed by reapplying the BIOS options shown below (I verified it with nvidia-smi in the VM). After powering off the server and putting it back into the rack, it no longer works. Every time I start the VM, it shows this error (I rebooted ~5 times to check that it occurs every time):

Code:
kvm: vfio: Unable to power on device, stuck in D3
kvm: vfio: Unable to power on device, stuck in D3
TASK ERROR: start failed: QEMU exited with code -1

I am trying to figure out what causes this, because I changed nothing between yesterday's working state and putting the server back into the rack. The only things that changed were the power cable and the LAN cables.
I also do not know what to troubleshoot next. We already replaced the mainboard and the GPU a couple of weeks ago. Does anyone have a good idea what to test next?

lspci -nnk
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti] [10de:2803] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] AD106 [GeForce RTX 4060 Ti] [1462:5174]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
01:00.1 Audio device [0403]: NVIDIA Corporation AD106M High Definition Audio Controller [10de:22bd] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] AD106M High Definition Audio Controller [1462:5174]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel

root@proxmox9:~# ls -l /sys/kernel/iommu_groups/12/devices/
total 0
lrwxrwxrwx 1 root root 0 Feb 27 09:44 0000:01:00.0 -> ../../../../devices/pci0000:00/0000:00:01.1/0000:01:00.0
lrwxrwxrwx 1 root root 0 Feb 27 09:44 0000:01:00.1 -> ../../../../devices/pci0000:00/0000:00:01.1/0000:01:00.1

root@proxmox9:~# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:2803,10de:22bd disable_vga=1 disable_idle_d3=1

VM > Hardware > PCI Device:
1740645746130.png

root@proxmox9:~# cat /etc/default/grub
[...]
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_aspm=off vfio-pci.disable_idle_d3=1 pci=realloc,reset pcie_port_pm=off"
[...]


root@proxmox9:~# dmesg | grep vfio
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-8-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt pcie_aspm=off vfio-pci.disable_idle_d3=1 pci=realloc,reset pcie_port_pm=off
[ 0.049791] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-8-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt pcie_aspm=off vfio-pci.disable_idle_d3=1 pci=realloc,reset pcie_port_pm=off
[ 3.326677] vfio-pci 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[ 3.326760] vfio_pci: add [10de:2803[ffffffff:ffffffff]] class 0x000000/00000000
[ 3.326819] vfio_pci: add [10de:22bd[ffffffff:ffffffff]] class 0x000000/00000000
[ 16.847494] vfio-pci 0000:01:00.1: enabling device (0000 -> 0002)
[ 20.033227] vfio-pci 0000:01:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 20.037272] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 20.758299] vfio-pci 0000:01:00.0: timed out waiting for pending transaction; performing function level reset anyway
[ 21.950364] vfio-pci 0000:01:00.0: not ready 1023ms after FLR; waiting
[ 23.038217] vfio-pci 0000:01:00.0: not ready 2047ms after FLR; waiting
[ 25.150502] vfio-pci 0000:01:00.0: not ready 4095ms after FLR; waiting
[ 29.502524] vfio-pci 0000:01:00.0: not ready 8191ms after FLR; waiting
[ 38.206361] vfio-pci 0000:01:00.0: not ready 16383ms after FLR; waiting
[ 55.102743] vfio-pci 0000:01:00.0: not ready 32767ms after FLR; waiting
[ 91.454980] vfio-pci 0000:01:00.0: not ready 65535ms after FLR; giving up
[ 91.553262] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 91.553648] vfio-pci 0000:01:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 100.580322] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 100.580357] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1149.620773] vfio-pci 0000:01:00.0: timed out waiting for pending transaction; performing function level reset anyway
[ 1150.796899] vfio-pci 0000:01:00.0: not ready 1023ms after FLR; waiting
[ 1151.884942] vfio-pci 0000:01:00.0: not ready 2047ms after FLR; waiting
[ 1153.996774] vfio-pci 0000:01:00.0: not ready 4095ms after FLR; waiting
[ 1158.476981] vfio-pci 0000:01:00.0: not ready 8191ms after FLR; waiting
[ 1167.181084] vfio-pci 0000:01:00.0: not ready 16383ms after FLR; waiting
[ 1184.077151] vfio-pci 0000:01:00.0: not ready 32767ms after FLR; waiting
[ 1217.869729] vfio-pci 0000:01:00.0: not ready 65535ms after FLR; giving up
root@proxmox9:~# dmesg | grep -e DMAR -e IOMMU
[ 0.442148] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.444095] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).

BIOS Settings:
1740646109837.png
1740646146945.png
1740646509416.png

BIOS firmware version: 20.07 (latest)

root@proxmox9:~# apt update
Hit:1 http://security.debian.org bookworm-security InRelease
Hit:2 http://ftp.de.debian.org/debian bookworm InRelease
Hit:3 http://ftp.de.debian.org/debian bookworm-updates InRelease
Hit:4 https://enterprise.proxmox.com/debian/pve bookworm InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.

root@proxmox9:~# uname -a
Linux proxmox9 6.8.12-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-8 (2025-01-24T12:32Z) x86_64 GNU/Linux
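
For anyone hitting the same message: the power state that vfio complains about can also be read on the host before the VM is started. Below is a minimal sketch, assuming the kernel exposes the `power_state` attribute for PCI devices in sysfs (recent kernels should). The function name `pci_power_state` and its optional sysfs-root argument are made up for illustration.

```shell
#!/bin/sh
# Print the PCI power state (e.g. D0, D3hot, D3cold) that sysfs reports for a device.
# Usage: pci_power_state <domain:bus:dev.fn> [sysfs_root]
# The sysfs_root argument is hypothetical; it defaults to /sys and only
# exists so the function can be tried against a fake directory tree.
pci_power_state() {
    dev="$1"
    root="${2:-/sys}"
    f="$root/bus/pci/devices/$dev/power_state"
    if [ -r "$f" ]; then
        cat "$f"
    else
        # Attribute missing: device not present or kernel too old.
        echo "unknown"
        return 1
    fi
}
```

Calling `pci_power_state 0000:01:00.0` and `pci_power_state 0000:00:01.1` would show the state of the GPU and of the root port it sits behind (addresses taken from the IOMMU group listing above). With disable_idle_d3=1 the GPU would presumably be expected to stay in D0 while idle.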
 
Hello peterge-misoft! Can you please post the following:
  1. Output of pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""
  2. The full VM configuration (output of qm config <VMID>).
 
1. root@proxmox9:~# pvesh get /nodes/proxmox9/hardware/pci --pci-class-blacklist ""
The formatting breaks when I copy and paste the output, and it exceeds the character limit. Here are two screenshots of the output:
1740723547031.png
1740723569721.png


2. root@proxmox9:~# qm config 100
agent: enabled=1
boot: order=ide2;scsi0
cores: 20
cpu: host
description: <!--- The Location is only required on hypervisors%0A**Location%3A**%0A%0AThe physical location of the hypervisor, either Bonn or Cologne.%0A%0A**Special Hardware%3A**%0A%0ADefault%3A none, can contain a GPU, USB dongles or other hardware.%0A-->%0A**Key Player%3A**%0A%0AAndreas Behrendt%0A%0A**Role/Function%3A**%0A%0ATest von px9 PCI Passthrough, Klon von Vollstreckungsbescheid KI%0A%0A**Application%3A**%0A%0A[VollstreckungsbescheidKI](https%3A//gitlab.misoft.local/kunden/vc/vollstreckungsbescheid-ki-backend)%0A%0A**Dependencies%3A**%0A%0Aroot PW%3A Standard ohne !%0A%0A**Status%3A**%0A%0AKopie von VollstreckungsbescheidKI
hostpci0: 0000:01:00
ide2: none,media=cdrom
kvm: 1
memory: 53248
meta: creation-qemu=7.1.0,ctime=1679486323
name: gpu-test
numa: 0
onboot: 1
ostype: l26
scsi0: VM-Storage1:vm-100-disk-0,discard=on,iothread=1,size=400G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=318e3526-fe14-4076-b9d6-a4a2093162ca
sockets: 1
tags: linux;service;gpu;ki
vmgenid: b4881c20-0666-4091-a8ed-e1181b0e3c78
 
Thanks for the information! Could you please try the following:
  1. Since your GPU supports UEFI, please try disabling 'Legacy boot' or CSM in the BIOS of the server.
  2. Change your VM to use OVMF (UEFI) instead of SeaBIOS.
  3. Try using q35 as your machine type in the VM settings. After changing this, open the GPU settings of the VM - you should now also be able to enable PCI Express.
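
The VM-side suggestions (2 and 3) can also be applied from the command line instead of the GUI. The sketch below only prints the commands so they can be reviewed first. It assumes the `--bios`, `--efidisk0`, `--machine` and `--hostpci0` options of `qm set` as described in the qm man page; please check `man qm` on your version before running anything. The function name `print_passthrough_cmds` is hypothetical.

```shell
#!/bin/sh
# Print (do not run) qm commands for OVMF, q35 and PCIe passthrough.
# Arguments: <vmid> <storage for the EFI disk> <PCI address of the GPU>
# Nothing on the host is changed; the output is meant to be reviewed first.
print_passthrough_cmds() {
    vmid="$1"
    storage="$2"
    pci="$3"
    # Suggestion 2: switch from SeaBIOS to OVMF and add an EFI disk.
    echo "qm set $vmid --bios ovmf --efidisk0 $storage:1,efitype=4m"
    # Suggestion 3: use the q35 machine type ...
    echo "qm set $vmid --machine q35"
    # ... and mark the passed-through device as PCI Express.
    echo "qm set $vmid --hostpci0 $pci,pcie=1"
}
```

For the VM in this thread that would be `print_passthrough_cmds 100 VM-Storage1 0000:01:00`. The VM has to be stopped and started again for the changes to take effect.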
 

1. CSM was disabled the whole time:
1740733688760.png

2. I switched from SeaBios to OVMF and added an EFI disk:
1740733979930.png
1740734204412.png
After starting the VM the start task still shows:
Code:
kvm: vfio: Unable to power on device, stuck in D3
kvm: vfio: Unable to power on device, stuck in D3
~60 seconds after starting the VM the console displays this:
1740735113227.png

3. I switched the VM back to SeaBIOS and removed the EFI disk.
Then I changed from i440fx to q35:
1740735553480.png
1740735413504.png
And I enabled PCI Express in the GPU setting:
1740735470808.png
but the output of the start task is still showing the same error:
Code:
kvm: vfio: Unable to power on device, stuck in D3
kvm: vfio: Unable to power on device, stuck in D3
TASK OK
Running nvidia-smi still does not work :(



IMHO the VM is not the reason why this isn't working. We have 4 other Proxmox hosts where we use PCI passthrough, and this VM is a clone of one of the VMs on those hosts...
 
Just wondering:
  1. Which PSU do you have? I'm wondering whether the GPU doesn't get enough power, explaining why it cannot power on.
  2. Are you 100% sure that the power cable from the PSU to the GPU is properly plugged in?
  3. Can you also post the full output of journalctl --boot and dmesg?
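
Since full logs tend to exceed the post limit, they are best saved to files and attached, e.g. `journalctl --boot > journal.txt` and `dmesg > dmesg.txt`. For a quick first look, a small filter like the sketch below may help. The function name `filter_gpu_log` and the pattern list are only a guess at what is relevant here, not a complete set.

```shell
#!/bin/sh
# Keep only log lines that are likely related to this passthrough problem.
# Reads log text on stdin and writes the matching lines to stdout.
# The patterns (vfio, the GPU's PCI address, pcieport, AER) are assumptions;
# extend them as needed.
filter_gpu_log() {
    grep -E -i 'vfio|0000:01:00|pcieport|AER'
}
```

Usage would be `dmesg | filter_gpu_log` or `journalctl --boot | filter_gpu_log`. The unfiltered files should still be attached, since the filter can hide relevant lines.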