Hello Proxmox Community and Team,
I am experiencing a complete failure of GPU passthrough with my new hardware setup and have exhausted all standard troubleshooting steps. The issue is clearly isolated to the PCIe passthrough itself, and an additional critical observation (section 5 below) makes me suspect a kernel-level bug.
1. Proxmox Version:
```plaintext
proxmox-ve: 9.0.0 (running kernel: 6.14.8-2-pve)
pve-manager: 9.0.3 (running version: 9.0.3/025864202ebb6109)
pve-qemu-kvm: 10.0.2-4
qemu-server: 9.0.16
```
2. Host Hardware:
- CPU: AMD Ryzen 9 9950X (Granite Ridge, Zen 5, AM5)
- GPU: NVIDIA GeForce RTX 5070 Ti (ASUS TUF model)
- Motherboard: ASUS TUF GAMING B650-PLUS WIFI
- RAM: 32 GB DDR5
3. Symptom:
When the GPU (01:00.0 and its audio function 01:00.1) is added to the VM, the VM starts but becomes completely unresponsive. It does not boot far enough to bring up the network interface, so SSH fails with `ssh: connect to host [...] port 22: Connection timed out`. The noVNC console also fails to initialize, throwing `TASK ERROR: Failed to run vncproxy / qmp command 'set_password' failed`.
Crucially: The same VM works flawlessly via SSH when the GPU is not passed through. The problem is 100% tied to the PCI passthrough configuration.
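For anyone who wants to reproduce the host-side state during the hang, a minimal check (VM ID 100 is a placeholder for my actual ID):

```bash
# Is the QEMU process for the VM actually running? (100 is a placeholder VM ID)
qm status 100

# Any vfio/IOMMU-related kernel messages from the current boot?
dmesg | grep -iE 'vfio|iommu|amd-vi'

# Are both GPU functions bound to vfio-pci, as intended?
lspci -nnks 01:00.0
lspci -nnks 01:00.1
```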
4. Steps Already Taken (All unsuccessful):
- IOMMU enabled in GRUB: `amd_iommu=on iommu=pt`
- vfio-pci bound to the GPU IDs via `/etc/modprobe.d/vfio.conf` (sketched below)
- Added `video=efifb:off` to the kernel parameters
- VM configured with machine type `q35` and BIOS `OVMF` (UEFI)
- PCI device flags: `pcie=1,rombar=0,x-vga=1`
- Used a known-good VBIOS dump via `romfile=/path/to/rom.rom`
- Configured a permanent EFI disk
- Hypervisor concealment: `args: -cpu 'host,hv_vendor_id=null,-hypervisor'` in the VM config (full config sketched below)
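To make the above concrete, here is a minimal sketch of the two relevant files. The PCI IDs, VM ID, storage name, and ROM filename are placeholders for my actual values; as far as I understand, Proxmox resolves `romfile=` names relative to /usr/share/kvm/.

```plaintext
# /etc/modprobe.d/vfio.conf -- 10de:xxxx / 10de:yyyy stand in for the real
# vendor:device IDs reported by `lspci -nns 01:00.0` and `lspci -nns 01:00.1`
options vfio-pci ids=10de:xxxx,10de:yyyy
```

```plaintext
# /etc/pve/qemu-server/100.conf (excerpt; 100 and vbios.rom are placeholders)
bios: ovmf
machine: q35
efidisk0: local-lvm:vm-100-disk-0,efitype=4m,size=4M
hostpci0: 0000:01:00,pcie=1,rombar=0,x-vga=1,romfile=vbios.rom
args: -cpu 'host,hv_vendor_id=null,-hypervisor'
```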
5. Critical Observation (Corrupted IOMMU Output):
The standard command for checking IOMMU group isolation returns corrupted, nonsensical output on my system. This suggests a potential underlying issue with the IOMMU subsystem on this specific kernel/hardware combination.
Command executed:
```bash
for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU Group %s ' "$n"; lspci -nns "${d##*/}"; done
```
Example of the corrupted output: it is long and repetitive, shows implausibly high group numbers (Group 40 and above), and keeps repeating bridge devices instead of listing the actual hardware such as the GPU at 01:00.0.
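As a cross-check that avoids iterating over all groups, the group of a single device can be read straight from sysfs; a minimal sketch for the GPU at 0000:01:00.0:

```bash
# Resolve the IOMMU group of the GPU via its sysfs symlink
g=$(readlink -f /sys/bus/pci/devices/0000:01:00.0/iommu_group)
echo "GPU 01:00.0 is in IOMMU group ${g##*/}"

# List every device sharing that group (they must all be passed through together)
ls "/sys/kernel/iommu_groups/${g##*/}/devices/"
```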
6. Request for Help:
Has anyone successfully achieved passthrough on a similarly new AM5/Granite Ridge and RTX 50xx setup? Is this failure mode with the corrupted IOMMU group listing known to the developers? Could this be a kernel bug introduced in the 6.14.x series?
Any guidance or insight would be greatly appreciated. Thank you for your time and help.
Best regards,
Julian