RTX 6000 Ada stuck in D3cold under vfio-pci: function .0 disappears from PCI tree after remove/rescan, comes back after a reboot but the problem recurs

mariokazela

May 25, 2026
Hi guys, I need your help with the following.

Setup:

  • Proxmox VE 8 (kernel 6.x PVE)
  • 4× NVIDIA RTX 6000 Ada (AD102) for VM passthrough
  • All 8 functions (4× GPU + 4× HDMI audio) bound to vfio-pci
  • IOMMU on, intel_iommu=on iommu=pt, ACS working, groups clean
  • q35 VMs, pcie=1 on all hostpci entries
The problem:

After a VM that owned one of the GPUs shut down, one card (0000:94:00.0) ended up stuck in D3cold while its audio sibling 0000:94:00.1 stayed in D0:

Code:
0000:16:00.0 driver=vfio-pci power=D0
0000:16:00.1 driver=vfio-pci power=D0
0000:40:00.0 driver=vfio-pci power=D0
0000:40:00.1 driver=vfio-pci power=D0
0000:6a:00.0 driver=vfio-pci power=D0
0000:6a:00.1 driver=vfio-pci power=D0
0000:94:00.0 driver=vfio-pci power=D3cold <-- stuck
0000:94:00.1 driver=vfio-pci power=D0
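
For anyone who wants to reproduce a listing like the one above, here is a minimal sketch that reads the bound driver and the power state of each function from sysfs. The function name pci_power_report and the SYSFS_ROOT variable are my own (SYSFS_ROOT only exists so the loop can be pointed at a test directory); the power_state attribute is assumed to be present, which depends on the kernel version.

```shell
#!/bin/sh
# Print driver and power state for each PCI function given as an argument.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

pci_power_report() {
    for bdf in "$@"; do
        dev="$SYSFS_ROOT/bus/pci/devices/$bdf"
        # A function that has dropped off the bus has no sysfs directory.
        if [ ! -d "$dev" ]; then
            echo "$bdf missing"
            continue
        fi
        # "driver" is a symlink to the bound driver; it is absent if unbound.
        if [ -L "$dev/driver" ]; then
            drv=$(basename "$(readlink "$dev/driver")")
        else
            drv=none
        fi
        # Fall back to "unknown" if the kernel does not expose power_state.
        state=$(cat "$dev/power_state" 2>/dev/null || echo unknown)
        echo "$bdf driver=$drv power=$state"
    done
}
```

Usage would be something like pci_power_report 0000:94:00.0 0000:94:00.1, run as root on the host.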

What I tried (in order):

  1. echo on > .../power/control on 94:00.0 - no change
  2. echo 1 > .../remove then echo 1 > /sys/bus/pci/rescan - .0 does not re-enumerate, only .1 comes back
  3. Rescan from the parent bridge 0000:93:01.0 - same result, only .1 reappears
  4. PCIe link retrain via setpci -s 0000:93:01.0 CAP_EXP+10.w=0020:0020 - no change
  5. Secondary Bus Reset via setpci BRIDGE_CONTROL (0x03 → 0x43 → 0x03 with 500 ms hold) - bridge accepts the writes, link should retrain, but .0 still does not re-enumerate after rescan
  6. echo 1 > /sys/bus/pci/devices/0000:93:01.0/reset - Permission denied (kernel guards bridge resets)
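
Steps 2 and 5 combined look roughly like the sketch below. It is a dry run by default and only prints the commands; RUN, run and sbr_recover are names made up for this sketch. It differs from the literal 0x03/0x43 writes in step 5 by using setpci's value:mask form, so only bit 6 (Secondary Bus Reset) of BRIDGE_CONTROL is touched. The 1 s settle time is an assumption, chosen conservatively.

```shell
#!/bin/sh
# Sketch: remove the function, pulse Secondary Bus Reset on the parent
# bridge, then rescan. Commands are only printed unless RUN=1
# (RUN is a hypothetical switch for this sketch).
RUN="${RUN:-0}"

run() {
    # Always show the command; execute it only when RUN=1.
    echo "+ $*"
    if [ "$RUN" = "1" ]; then
        sh -c "$*"
    fi
}

sbr_recover() {
    bridge="$1"   # parent bridge, e.g. 0000:93:01.0
    dev="$2"      # stuck function, e.g. 0000:94:00.0
    run "echo 1 > /sys/bus/pci/devices/$dev/remove"
    # Set bit 6 (0x40) of BRIDGE_CONTROL; the mask leaves other bits alone.
    run "setpci -s $bridge BRIDGE_CONTROL.w=0040:0040"
    # Hold the reset (fractional sleep needs GNU or busybox sleep).
    run "sleep 0.5"
    # Clear bit 6 again to release the secondary bus.
    run "setpci -s $bridge BRIDGE_CONTROL.w=0000:0040"
    # Assumed settle time before touching config space again.
    run "sleep 1"
    run "echo 1 > /sys/bus/pci/rescan"
}
```

Calling sbr_recover 0000:93:01.0 0000:94:00.0 prints the six commands; with RUN=1 set, and as root, it would execute them. Note that a bus reset also hits the audio function 0000:94:00.1 behind the same bridge.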
Can anyone help me solve this issue?
 
This "smells" more like a PCIe/BIOS power management problem than a Proxmox config error. A quick test would be to swap the RTX card in this slot with one of the others and check whether the error stays with the slot or follows the card.

What you could check before and after the VM starts and stops:

Code:
lspci -vv -s 94:00.0

lspci -vv -s 93:01.0

cat /sys/bus/pci/devices/0000:94:00.0/power_state

cat /sys/bus/pci/devices/0000:94:00.0/power/control

dmesg | grep -Ei '94:00|vfio|D3|pcie|AER|reset'
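
To make the before/after comparison easier, the state checks above could be captured into one file per run and then diffed. This is only a sketch: pci_snapshot, section and the GPU, BRIDGE and OUT_DIR variables are names I picked, with the addresses taken from this thread. The dmesg line is left out on purpose, since the kernel log only grows and is better read separately.

```shell
#!/bin/sh
# Capture the suggested checks into one file per label so the "before"
# and "after" states can be compared with diff.
# GPU, BRIDGE and OUT_DIR are assumptions for this sketch.
GPU="${GPU:-0000:94:00.0}"
BRIDGE="${BRIDGE:-0000:93:01.0}"
OUT_DIR="${OUT_DIR:-/tmp/pci-snap}"

section() {
    # Print a header, then the command output; never abort on failure
    # (for example when the device has vanished from sysfs).
    echo "=== $* ==="
    "$@" 2>&1 || true
}

pci_snapshot() {
    label="$1"
    mkdir -p "$OUT_DIR"
    {
        section lspci -vv -s "$GPU"
        section lspci -vv -s "$BRIDGE"
        section cat "/sys/bus/pci/devices/$GPU/power_state"
        section cat "/sys/bus/pci/devices/$GPU/power/control"
    } > "$OUT_DIR/$label.txt"
    # Report where the snapshot was written.
    echo "$OUT_DIR/$label.txt"
}
```

Typical use would be pci_snapshot before, then start and stop the VM, then pci_snapshot after, followed by diff -u /tmp/pci-snap/before.txt /tmp/pci-snap/after.txt.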