3 Minute Delay Starting VM with GPU Passthrough (vfio-pci reset issue)

damarges

Member
Jul 25, 2022
Bad Kreuznach, Germany
Hello everyone,

I'm having an issue with my GPU passthrough setup. The passthrough itself works perfectly, and the GPU is available in my Windows 11 VM. However, every time I start the VM, there is a very long delay of about 2.5 to 3 minutes.

After analyzing the system logs, it seems the problem is not the VM startup itself, but the shutdown process (post-stop hook). The GPU is not being cleanly returned to the host, which causes a long series of PCI resets on the next VM start.
System Information:

  • Proxmox VE Version: 9.0.11
  • Linux Kernel: 6.14.11-4-pve
  • Motherboard: Supermicro X10DAi
  • CPU: 2x Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
  • GPU: Nvidia RTX 4060 Ti (PCI IDs: 02:00.0 and 02:00.1)
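For completeness, here is how I sanity-check the IOMMU grouping of the two GPU functions (a small helper I wrote for this post; both functions should land in the same group):

```shell
#!/bin/bash
# Print the IOMMU group of each GPU function; on my host both
# 02:00.0 (GPU) and 02:00.1 (HDMI audio) should share one group.
check_group() {
    local dev="$1" link
    link="/sys/bus/pci/devices/$dev/iommu_group"
    if [ -e "$link" ]; then
        echo "$dev -> group $(basename "$(readlink -f "$link")")"
    else
        echo "$dev not present on this host"
    fi
}

for d in 0000:02:00.0 0000:02:00.1; do
    check_group "$d"
done
```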

VM Configuration (/etc/pve/qemu-server/100.conf):
Code:
affinity: 0-15
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;ide0;net0
cores: 16
cpu: x86-64-v3
efidisk0: vm-data:100/vm-100-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hookscript: local:snippets/gpu-handoff.sh
hostpci0: 0000:02:00.0,pcie=1,x-vga=1
hostpci1: 0000:02:00.1,pcie=1
machine: pc-q35-10.0+pve1
memory: 65536
name: win11
net0: virtio=BC:24:11:F1:4A:55,bridge=vmbr0,firewall=1
numa: 1
ostype: win11
scsi0: vm-data:100/vm-100-disk-1.raw,cache=writeback,discard=on,size=700G
scsihw: virtio-scsi-single
smbios1: uuid=aa6c53ad-5443-497c-9cca-0d671ca8e9b8
sockets: 1
tpmstate0: vm-data:100/vm-100-disk-2.raw,size=4M,version=v2.0
vga: none
vmgenid: c22cf681-d3dd-4fab-826a-efdc0a8afc57

Current Hookscript (/var/lib/vz/snippets/gpu-handoff.sh):

Bash:
#!/bin/bash
set -euo pipefail
VMID="$1"
PHASE="$2"

GPU="0000:02:00.0"
AUDIO="0000:02:00.1"

log(){ logger -t gpu-handoff "VM $VMID: [$PHASE] $*"; }

case "$PHASE" in
  pre-start)
    log "Handing off GPU to VM..."
    fuser -k /dev/nvidia* 2>/dev/null || true
    sleep 1
    modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia 2>/dev/null || true
    echo "$GPU" > /sys/bus/pci/devices/$GPU/driver/unbind 2>/dev/null || true
    echo "$AUDIO" > /sys/bus/pci/devices/$AUDIO/driver/unbind 2>/dev/null || true
    modprobe vfio-pci
    echo "$GPU" > /sys/bus/pci/drivers/vfio-pci/bind
    echo "$AUDIO" > /sys/bus/pci/drivers/vfio-pci/bind
    log "GPU successfully bound to vfio-pci."
    ;;

  post-stop)
    log "Returning GPU to host..."
    echo "$GPU" > /sys/bus/pci/drivers/vfio-pci/unbind
    echo "$AUDIO" > /sys/bus/pci/drivers/vfio-pci/unbind
    sleep 1
    echo 1 > /sys/bus/pci/devices/$GPU/reset 2>/dev/null || true
    echo 1 > /sys/bus/pci/devices/$AUDIO/reset 2>/dev/null || true
    modprobe nvidia_drm
    modprobe nvidia_modeset
    modprobe nvidia_uvm
    modprobe nvidia
    modprobe snd_hda_intel
    sleep 2
    echo "$GPU" > /sys/bus/pci/drivers/nvidia/bind
    echo "$AUDIO" > /sys/bus/pci/drivers/snd_hda_intel/bind
    nvidia-smi >/dev/null 2>&1
    log "GPU successfully returned to host."
    ;;
esac

exit 0

When I shut down the VM, the post-stop script fails with "Device or resource busy":

Code:
# The VM is shut down...
Oct 30 10:55:02 pve gpu-handoff[1682882]: VM 100: [post-stop] Returning GPU to host...
# ...
Oct 30 10:55:07 pve qmeventd[1682871]: /var/lib/vz/snippets/gpu-handoff.sh: line 45: echo: write error: Device or resource busy
Oct 30 10:55:07 pve qmeventd[1682871]: hookscript error for 100 on post-stop: command '/var/lib/vz/snippets/gpu-handoff.sh 100 post-stop' failed: exit code 1
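To narrow down what "Device or resource busy" actually means here, I put together this diagnostic sketch (the helper name is my own; 0000:02:00.0 matches my GPU). It checks whether any process still has the device's VFIO group file open, which would block a clean unbind:

```shell
#!/bin/bash
# Diagnostic sketch: report whether the VFIO group of a PCI device
# is still held open by some process (which blocks unbinding).
holders_of() {
    local dev="$1" link group
    link="/sys/bus/pci/devices/$dev/iommu_group"
    if [ ! -e "$link" ]; then
        echo "device $dev not present"
        return 0
    fi
    group=$(basename "$(readlink -f "$link")")
    # fuser returns non-zero when nothing has the group file open
    if command -v fuser >/dev/null 2>&1 && fuser -s "/dev/vfio/$group" 2>/dev/null; then
        echo "group $group still in use:"
        fuser -v "/dev/vfio/$group" 2>&1
    else
        echo "group $group appears free"
    fi
}

holders_of 0000:02:00.0
```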

I assume that, because of this failure, the next VM start hangs for almost 3 minutes while the kernel repeatedly tries to reset the PCI device:

Code:
# The VM is started again...
Oct 30 10:55:30 pve kernel: # Network setup is quick
# --- HUGE GAP of ~2m 57s ---
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.0: resetting
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.0: reset done
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.0: resetting
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.1: resetting
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.0: reset done
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.1: reset done
Oct 30 10:58:28 pve qm[1683332]: VM 100 started with PID 1683370.

I believe if I can fix the post-stop script so it no longer fails, the startup delay will be gone.
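One alternative I am considering for the post-stop phase is an untested sketch that waits (with a bound) for the VFIO group to be released, then detaches both functions from the PCI tree and rescans the bus, instead of unbinding and manually triggering resets. The function names are my own; the `$2` phase argument matches how Proxmox invokes hookscripts:

```shell
#!/bin/bash
# Untested sketch: bounded wait for the VFIO group to be free, then
# remove + rescan so the kernel re-probes the card with host drivers.
GPU="0000:02:00.0"
AUDIO="0000:02:00.1"

wait_until_free() {
    # Poll up to ~10 s for the device's VFIO group file to be released.
    local dev="$1" link group tries=0
    link="/sys/bus/pci/devices/$dev/iommu_group"
    [ -e "$link" ] || return 0
    group=$(basename "$(readlink -f "$link")")
    while fuser -s "/dev/vfio/$group" 2>/dev/null; do
        tries=$((tries + 1))
        [ "$tries" -ge 10 ] && break
        sleep 1
    done
}

post_stop() {
    wait_until_free "$GPU"
    # Detach both functions from the PCI tree, then rescan; the kernel
    # re-probes them and should bind nvidia / snd_hda_intel if loaded.
    echo 1 > "/sys/bus/pci/devices/$AUDIO/remove" 2>/dev/null || true
    echo 1 > "/sys/bus/pci/devices/$GPU/remove" 2>/dev/null || true
    sleep 1
    echo 1 > /sys/bus/pci/rescan
}

# Hookscripts receive the VMID as $1 and the phase as $2.
if [ "${2:-}" = "post-stop" ]; then
    post_stop
fi
```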

Does anyone have experience with this kind of behavior, perhaps with this specific GPU? Is there a better way to structure the script to ensure a clean return of the GPU to the host?

Thank you in advance for any help!
 
Update: with this script, start times are reduced from about 150 seconds to 60 seconds. Not perfect, but better:


Code:
#!/bin/bash
set -euo pipefail
VMID="$1"
PHASE="$2"

GPU="0000:02:00.0"
AUDIO="0000:02:00.1"

log(){ logger -t gpu-handoff "VM $VMID: [$PHASE] $*"; }

case "$PHASE" in
  pre-start)
    log "Phase: pre-start. Preparing GPU for the VM."

    # Unbind the drivers from the host
    echo "$GPU" > /sys/bus/pci/devices/$GPU/driver/unbind || true
    echo "$AUDIO" > /sys/bus/pci/devices/$AUDIO/driver/unbind || true

    # Unload the Nvidia drivers to release the GPU completely
    modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia || true

    # Bind to vfio-pci
    echo "$GPU" > /sys/bus/pci/drivers/vfio-pci/bind || true
    echo "$AUDIO" > /sys/bus/pci/drivers/vfio-pci/bind || true

    log "GPU successfully bound to vfio-pci."
    ;;

  post-stop)
    log "Phase: post-stop. Returning GPU to the host."

    # Unbind from vfio-pci (exactly as in the manual steps)
    echo "$GPU" > /sys/bus/pci/drivers/vfio-pci/unbind || true
    echo "$AUDIO" > /sys/bus/pci/drivers/vfio-pci/unbind || true

    # Load all required host drivers (exactly as in the manual steps)
    modprobe nvidia nvidia_uvm nvidia_modeset nvidia_drm
    modprobe snd_hda_intel

    # Bind to the host drivers (exactly as in the manual steps)
    echo "$GPU" > /sys/bus/pci/drivers/nvidia/bind || true
    echo "$AUDIO" > /sys/bus/pci/drivers/snd_hda_intel/bind || true

    sleep 1 # short pause

    # Initialize the GPU on the host (exactly as in the manual steps)
    nvidia-smi || true

    log "GPU successfully returned to host."
    ;;
esac

exit 0

There is no vendor-reset module for NVIDIA GPUs like there is for AMD, is there?
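For what it's worth, I also checked which reset mechanisms the kernel reports for the card. The `reset_method` sysfs attribute exists on reasonably recent kernels; the helper below is just a small wrapper I use:

```shell
#!/bin/bash
# Show which reset mechanisms the kernel offers for a PCI device
# (e.g. "flr", "bus"); absent on older kernels or unknown devices.
show_reset_method() {
    local dev="$1" path
    path="/sys/bus/pci/devices/$dev/reset_method"
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo "reset_method not exposed for $dev"
    fi
}

show_reset_method 0000:02:00.0
```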