GPU Passthrough hang/lockup after kernel update — RTX 5070 Ti (GB203/Blackwell) — 7.0.6-2-pve regression

Splice Here

Jun 10, 2026
Environment:
  • Proxmox VE 9.2.3, qemu-server 9.1.16
  • Host CPU: AMD (600 Series chipset, PCIe switch confirmed in lspci)
  • GPU: NVIDIA GeForce RTX 5070 Ti — GB203 — PCI ID 10de:2c05 (MSI variant 1462:5310)
  • GPU Audio: 10de:22e9
  • IOMMU Group 13 — GPU and audio isolated, no other devices sharing the group
  • VM hostpci config: hostpci0: 0000:01:00,pcie=1,x-vga=1
  • Single GPU in host (no iGPU, no secondary display adapter)
Problem:
After a Proxmox OS update followed by a clean OS shutdown and restart, starting a Windows VM with the RTX 5070 Ti in PCI passthrough causes the Proxmox host node to become completely unreachable. Web UI (port 8006), SSH, and Datacenter Manager all stop responding simultaneously. The host hardware remains powered on (IPMI/BMC accessible) but the OS is locked up. Removing the hostpci0 line from the VM config and restarting the VM immediately restores normal node operation. Re-adding the PCI device and restarting reproduces the hang every time.

This configuration had been running stably for approximately 6 months prior to the Proxmox update.

Diagnosis:
VFIO binding was confirmed correct before investigating further — both GPU functions were already bound to vfio-pci at host boot:

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GB203 [GeForce RTX 5070 Ti] [10de:2c05] (rev a1)
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nova_core
01:00.1 Audio device [0403]: NVIDIA Corporation GB203 High Definition Audio [10de:22e9] (rev a1)
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
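
For anyone wanting to repeat this check, here is a minimal sketch that reduces lspci -nnk output to one "slot driver" line per function. The helper name drivers_in_use is made up for illustration, and the 01:00 slot address is simply the one from this post; adjust it for your own host.

```shell
#!/bin/sh
# Hypothetical helper: print "<slot> <driver>" for every PCI function
# found in `lspci -nnk` text read from stdin.
drivers_in_use() {
    awk '
        /^[0-9a-f]/ { slot = $1 }                     # device header, e.g. "01:00.0 VGA ..."
        /Kernel driver in use:/ { print slot, $NF }   # last field is the driver name
    '
}

# On the host (assumes the GPU sits at 01:00, as in this post):
# lspci -nnk -s 01:00 | drivers_in_use
```

Both functions should report vfio-pci before the VM is started. A function with no bound driver prints nothing, which is worth noticing too.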

Checking installed kernels revealed the update had pulled in 7.0.6-2-pve (currently running) as well as the 6.17.x series, with 6.14.8-2-pve still installed.

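The key indicator is nova_core appearing in the kernel modules list on 7.0.6-2-pve. This module, the core of the new Rust-based Nova driver for recent NVIDIA GPUs (Blackwell included), ships with newer kernels and is absent on 6.14.8-2-pve. Note that lspci lists modules that are able to claim the device, not modules that are actually loaded, so its presence there is a hint rather than proof. Our working hypothesis is that an interaction between nova_core and the VFIO subsystem at VM start time causes a PCIe bus hang that locks up the host OS entirely.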

Fix:
Rolling the boot kernel back to 6.14.8-2-pve fully resolves the issue:

proxmox-boot-tool kernel pin 6.14.8-2-pve
proxmox-boot-tool refresh
reboot

After reboot, nova_core no longer appears in the GPU's kernel modules list, and GPU passthrough functions normally. The Proxmox management stack, QEMU, and all other packages remain on their updated versions — only the kernel changes.
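
A small sketch for verifying the pin after the reboot and for dropping it later. The helper name running_kernel_is is hypothetical; the proxmox-boot-tool subcommands in the comments are the standard ones, but check them against your installed version before relying on them.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the running kernel release
# matches the string passed as the first argument.
running_kernel_is() {
    [ "$(uname -r)" = "$1" ]
}

# After the reboot:
# proxmox-boot-tool kernel list     # shows selectable kernels and any pinned one
# running_kernel_is 6.14.8-2-pve && echo "pinned kernel is running"
#
# Once a fixed kernel is available, the pin can be removed again:
# proxmox-boot-tool kernel unpin
# proxmox-boot-tool refresh
```

Keeping the pin in place means kernel updates will still install but will not be booted, so it is worth remembering to unpin once the regression is resolved.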

# Confirmed working state:
uname -r → 6.14.8-2-pve
lspci kernel modules for 10de:2c05 → nvidiafb, nouveau (no nova_core)
Kernel driver in use → vfio-pci (unchanged, correct)

Request:
Can anyone confirm whether this regression is present on 6.17.x kernels as well, or only 7.0.x?
We have not tested the 6.17 series.
Also interested to know if this affects other Blackwell SKUs (RTX 5090, in particular) or only GB203.
 
Jonas — thanks for the pointer to the other thread and for the follow-up.

Result: kernel 6.17.13-13-pve works, 7.0.6-2-pve does not.

We tested 6.17.13-13-pve on our RTX 5070 Ti (GB203, 10de:2c05) node and GPU passthrough is fully functional. The host stays reachable throughout VM startup and operation. This matches what SunBlack found on the RTX 5090 (GB202).

One additional finding from our testing: on kernel 6.17.13-13-pve, the GPU audio function (10de:22e9) was not automatically bound to vfio-pci at boot; snd_hda_intel claimed it instead. This was not a problem on 6.14.x. We resolved it by explicitly adding both PCI IDs to /etc/modprobe.d/vfio.conf:

options vfio-pci ids=10de:2c05,10de:22e9

Followed by update-initramfs -u and a reboot. After that, both 01:00.0 and 01:00.1 correctly show Kernel driver in use: vfio-pci and passthrough is stable.
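
The same fix as a small script, in case it helps someone apply it repeatably. The function name write_vfio_conf is made up for this sketch. The softdep line is an extra assumption on my part and was not part of what we tested: it asks modprobe to load vfio-pci before snd_hda_intel, which may help if the audio driver still wins the race. Remove it if you want exactly the configuration described above.

```shell
#!/bin/sh
# Hypothetical helper: write a vfio-pci options file.
# $1 is the target path; the remaining arguments are vendor:device IDs.
write_vfio_conf() {
    conf=$1
    shift
    ids=$(IFS=,; echo "$*")    # join the IDs with commas
    {
        echo "options vfio-pci ids=$ids"
        # Assumption, not from the tested setup: load vfio-pci before snd_hda_intel
        echo "softdep snd_hda_intel pre: vfio-pci"
    } > "$conf"
}

# On the host (IDs are the ones from this post):
# write_vfio_conf /etc/modprobe.d/vfio.conf 10de:2c05 10de:22e9
# update-initramfs -u
# reboot
```

Note that the function overwrites the target file, so merge by hand if your vfio.conf already contains other options.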

Confirmed working state on 6.17.13-13-pve:

01:00.0 VGA ... NVIDIA Corporation GB203 [GeForce RTX 5070 Ti] [10de:2c05]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau <-- nova_core absent
01:00.1 Audio ... GB203 High Definition Audio [10de:22e9]
Kernel driver in use: vfio-pci <-- correctly bound after vfio.conf fix

dmesg during VM start (clean reset cycle, no errors):

vfio-pci 0000:01:00.0: resetting
vfio-pci 0000:01:00.0: reset done
vfio-pci 0000:01:00.1: resetting
vfio-pci 0000:01:00.1: reset done
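
If you want to compare your own logs against this, here is a rough sketch that checks whether every vfio-pci "resetting" line has a matching "reset done" for the same device. The helper name vfio_resets_complete is hypothetical, and it assumes the message format shown above. Reading the previous boot with journalctl only works if the journal is persistent on your host.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and fail if any
# vfio-pci device logged "resetting" without a later "reset done".
vfio_resets_complete() {
    awk '
        /vfio-pci .*: resetting$/  { d = $(NF-1); sub(/:$/, "", d); pending[d] = 1 }
        /vfio-pci .*: reset done$/ { d = $(NF-2); sub(/:$/, "", d); delete pending[d] }
        END {
            for (d in pending) { print "no reset done for " d; bad = 1 }
            exit bad + 0
        }
    '
}

# Current boot:
# dmesg | vfio_resets_complete && echo "all resets completed"
# Previous boot, e.g. after a hang (needs a persistent journal):
# journalctl -k -b -1 | vfio_resets_complete
```

A reset that starts but never finishes right before the host stops logging would be a useful detail to include in a bug report.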

In summary: kernel 7.0.x introduces a regression in GPU passthrough for the NVIDIA Blackwell (RTX 50-series) cards tested so far, namely the RTX 5070 Ti (GB203) here and the RTX 5090 (GB202) in SunBlack's report. Other Blackwell SKUs are likely affected but unconfirmed. Fixing vfio-pci binding alone does not resolve it: SunBlack confirmed the host still crashes on 7.0.x even with nova_core blacklisted and binding forced correctly. The current workaround is kernel 6.17.13-13-pve with the audio device vfio.conf fix applied.

Happy to provide any additional diagnostic output if useful for tracking this down upstream.
 