Hi guys, I need your help with a passthrough GPU that is stuck in D3cold.
Setup:
- Proxmox VE 8 (kernel 6.x PVE)
- 4× NVIDIA RTX 6000 Ada (AD102) for VM passthrough
- All 8 functions (4× GPU + 4× HDMI audio) bound to vfio-pci
- IOMMU on, intel_iommu=on iommu=pt, ACS working, groups clean
- q35 VMs, pcie=1 on all hostpci entries
After a VM that owned one of the GPUs shut down, one card (0000:94:00.0) ended up stuck in D3cold while its audio sibling 0000:94:00.1 stayed in D0:
Code:
0000:16:00.0 driver=vfio-pci power=D0
0000:16:00.1 driver=vfio-pci power=D0
0000:40:00.0 driver=vfio-pci power=D0
0000:40:00.1 driver=vfio-pci power=D0
0000:6a:00.0 driver=vfio-pci power=D0
0000:6a:00.1 driver=vfio-pci power=D0
0000:94:00.0 driver=vfio-pci power=D3cold <-- stuck
0000:94:00.1 driver=vfio-pci power=D0
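For anyone who wants to check their own host, a listing like the one above can be produced with something along these lines. This is only a sketch: it assumes the kernel exposes the power_state attribute under each device in sysfs, and the function name pci_power_report and the SYSFS_PCI override are made up for illustration (the override exists only so the function can be pointed at a test tree).

```shell
#!/bin/sh
# Sketch: print "BDF driver=... power=..." for every PCI function bound
# to a given driver. SYSFS_PCI is a hypothetical override for testing.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

pci_power_report() {
    want=$1                      # driver name to filter on, e.g. vfio-pci
    for dev in "$SYSFS_PCI"/*; do
        # Skip the literal pattern if the glob matched nothing
        if [ ! -d "$dev" ]; then
            continue
        fi
        bdf=$(basename "$dev")
        # "driver" is a symlink to the bound driver; absent if unbound
        drv=""
        if [ -e "$dev/driver" ]; then
            drv=$(basename "$(readlink -f "$dev/driver")")
        fi
        if [ "$drv" != "$want" ]; then
            continue
        fi
        # power_state is assumed to be present; fall back to "unknown"
        state="unknown"
        if [ -r "$dev/power_state" ]; then
            state=$(cat "$dev/power_state")
        fi
        echo "$bdf driver=$drv power=$state"
    done
}
```

Running pci_power_report vfio-pci as root on the host should give one line per passed-through function, which makes a function that is not in D0 easy to spot.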
What I tried (in order):
- echo on > .../power/control on 94:00.0 - no change
- echo 1 > .../remove then echo 1 > /sys/bus/pci/rescan - .0 does not re-enumerate, only .1 comes back
- Rescan from the parent bridge 0000:93:01.0 - same result, only .1 reappears
- PCIe link retrain via setpci -s 0000:93:01.0 CAP_EXP+10.w=0020:0020 - no change
- Secondary Bus Reset via setpci BRIDGE_CONTROL (0x03 → 0x43 → 0x03 with 500 ms hold) - bridge accepts the writes, link should retrain, but .0 still does not re-enumerate after rescan
- echo 1 > /sys/bus/pci/devices/0000:93:01.0/reset - Permission denied (kernel guards bridge resets)
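To make the remove / Secondary Bus Reset / rescan steps above easier to reproduce, here is a minimal sketch of that sequence. It is not a fix (the same steps did not bring .0 back here), and it differs from what is described above in one respect: it uses setpci's value:mask form to toggle only bit 6 of BRIDGE_CONTROL instead of writing absolute values. The names run, sbr_cycle and DRY_RUN are hypothetical, the 1 s settle time is an assumption, and by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: remove the functions below a bridge, pulse Secondary Bus Reset
# (bit 6 of BRIDGE_CONTROL), then rescan. Prints commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it via sh -c
    if [ "$DRY_RUN" = "1" ]; then
        echo "$1"
    else
        sh -c "$1"
    fi
}

sbr_cycle() {
    bridge=$1                   # parent bridge, e.g. 0000:93:01.0
    shift
    for fn in "$@"; do          # functions to remove, e.g. 0000:94:00.0
        run "echo 1 > /sys/bus/pci/devices/$fn/remove"
    done
    # value:mask form changes only bit 6 and leaves the other bits alone
    run "setpci -s $bridge BRIDGE_CONTROL.w=0040:0040"
    run "sleep 0.5"             # hold reset asserted for 500 ms
    run "setpci -s $bridge BRIDGE_CONTROL.w=0000:0040"
    run "sleep 1"               # assumed settle time before rescanning
    run "echo 1 > /sys/bus/pci/rescan"
}
```

Calling sbr_cycle 0000:93:01.0 0000:94:00.0 0000:94:00.1 prints the seven commands in order; setting DRY_RUN=0 as root would actually execute them. If a function has already disappeared from sysfs, its remove step will simply fail.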