Hello everyone,
I'm having an issue with my GPU passthrough setup. The passthrough itself works perfectly, and the GPU is available in my Windows 11 VM. However, every time I start the VM, there is a very long delay of about 2.5 to 3 minutes.
After analyzing the system logs, it seems the problem is not the VM startup itself, but the shutdown process (post-stop hook). The GPU is not being cleanly returned to the host, which causes a long series of PCI resets on the next VM start.
System Information:
- Proxmox VE Version: 9.0.11
- Linux Kernel: 6.14.11-4-pve
- Motherboard: Supermicro X10DAi
- CPU: 2x Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
- GPU: Nvidia RTX 4060 Ti (PCI addresses: 02:00.0 for the GPU and 02:00.1 for its audio function)
VM Configuration (/etc/pve/qemu-server/100.conf):
Code:
affinity: 0-15
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;ide0;net0
cores: 16
cpu: x86-64-v3
efidisk0: vm-data:100/vm-100-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hookscript: local:snippets/gpu-handoff.sh
hostpci0: 0000:02:00.0,pcie=1,x-vga=1
hostpci1: 0000:02:00.1,pcie=1
machine: pc-q35-10.0+pve1
memory: 65536
name: win11
net0: virtio=BC:24:11:F1:4A:55,bridge=vmbr0,firewall=1
numa: 1
ostype: win11
scsi0: vm-data:100/vm-100-disk-1.raw,cache=writeback,discard=on,size=700G
scsihw: virtio-scsi-single
smbios1: uuid=aa6c53ad-5443-497c-9cca-0d671ca8e9b8
sockets: 1
tpmstate0: vm-data:100/vm-100-disk-2.raw,size=4M,version=v2.0
vga: none
vmgenid: c22cf681-d3dd-4fab-826a-efdc0a8afc57
Current Hookscript (/var/lib/vz/snippets/gpu-handoff.sh):
Bash:
#!/bin/bash
set -euo pipefail
VMID="$1"
PHASE="$2"
GPU="0000:02:00.0"
AUDIO="0000:02:00.1"
log(){ logger -t gpu-handoff "VM $VMID: [$PHASE] $*"; }
case "$PHASE" in
  pre-start)
    log "Handing off GPU to VM..."
    fuser -k /dev/nvidia* 2>/dev/null || true
    sleep 1
    modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia 2>/dev/null || true
    echo "$GPU" > /sys/bus/pci/devices/$GPU/driver/unbind 2>/dev/null || true
    echo "$AUDIO" > /sys/bus/pci/devices/$AUDIO/driver/unbind 2>/dev/null || true
    modprobe vfio-pci
    echo "$GPU" > /sys/bus/pci/drivers/vfio-pci/bind
    echo "$AUDIO" > /sys/bus/pci/drivers/vfio-pci/bind
    log "GPU successfully bound to vfio-pci."
    ;;
  post-stop)
    log "Returning GPU to host..."
    echo "$GPU" > /sys/bus/pci/drivers/vfio-pci/unbind
    echo "$AUDIO" > /sys/bus/pci/drivers/vfio-pci/unbind
    sleep 1
    echo 1 > /sys/bus/pci/devices/$GPU/reset 2>/dev/null || true
    echo 1 > /sys/bus/pci/devices/$AUDIO/reset 2>/dev/null || true
    modprobe nvidia_drm
    modprobe nvidia_modeset
    modprobe nvidia_uvm
    modprobe nvidia
    modprobe snd_hda_intel
    sleep 2
    echo "$GPU" > /sys/bus/pci/drivers/nvidia/bind
    echo "$AUDIO" > /sys/bus/pci/drivers/snd_hda_intel/bind
    nvidia-smi >/dev/null 2>&1
    log "GPU successfully returned to host."
    ;;
esac
exit 0
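Because the script runs under `set -euo pipefail`, the first failing write in post-stop that is not followed by `|| true` aborts every later step, and the log only reports a line number. Below is a minimal sketch of one way to see which step fails while still running the remaining ones. `run_step`, `write_attr` and `FAILED_STEPS` are hypothetical names that are not part of the script above.

```shell
#!/bin/bash
# Sketch only: run_step, write_attr and FAILED_STEPS are hypothetical helpers.
FAILED_STEPS=0

# Run one command, report its outcome, and never abort the caller under `set -e`.
run_step() {
    local desc="$1"; shift
    local rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "ok: $desc"
    else
        echo "FAILED (exit $rc): $desc" >&2
        FAILED_STEPS=$((FAILED_STEPS + 1))
    fi
    return 0
}

# Write a value to a sysfs attribute; a function so that run_step can call it.
write_attr() {
    local value="$1" path="$2"
    echo "$value" > "$path"
}
```

In post-stop this could be used as `run_step "unbind GPU from vfio-pci" write_attr "$GPU" /sys/bus/pci/drivers/vfio-pci/unbind`. The hook would then exit non-zero at the end if `FAILED_STEPS` is greater than 0. The `echo` calls could be swapped for the existing `log` function.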
When I shut down the VM, the post-stop script fails with "Device or resource busy":
Code:
# The VM is shut down...
Oct 30 10:55:02 pve gpu-handoff[1682882]: VM 100: [post-stop] Returning GPU to host...
# ...
Oct 30 10:55:07 pve qmeventd[1682871]: /var/lib/vz/snippets/gpu-handoff.sh: line 45: echo: write error: Device or resource busy
Oct 30 10:55:07 pve qmeventd[1682871]: hookscript error for 100 on post-stop: command '/var/lib/vz/snippets/gpu-handoff.sh 100 post-stop' failed: exit code 1
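The log does not show which command sits on line 45. One possible explanation, which is an assumption and not something the log confirms, is that a device already has a driver when the explicit `bind` write runs, for example because loading the `nvidia` module probes the unbound GPU on its own. In that case the write may be rejected as busy. Below is a minimal sketch that checks the current binding first. `SYSFS_ROOT`, `current_driver` and `bind_if_unbound` are hypothetical names, and `SYSFS_ROOT` exists only so the helpers can be tried against a fake directory tree.

```shell
#!/bin/bash
# Sketch only: SYSFS_ROOT, current_driver and bind_if_unbound are hypothetical names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the name of the driver currently bound to a PCI device, or nothing.
current_driver() {
    local dev="$1"
    local link="$SYSFS_ROOT/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    fi
}

# Bind a device to a driver, skipping the write if it is already bound to it.
bind_if_unbound() {
    local dev="$1" drv="$2" cur
    cur="$(current_driver "$dev")"
    if [ "$cur" = "$drv" ]; then
        echo "$dev already bound to $drv, skipping"
        return 0
    fi
    if [ -n "$cur" ]; then
        echo "$dev is bound to $cur, not $drv" >&2
        return 1
    fi
    echo "$dev" > "$SYSFS_ROOT/bus/pci/drivers/$drv/bind"
}
```

With these helpers, the last two bind lines of post-stop would become `bind_if_unbound "$GPU" nvidia` and `bind_if_unbound "$AUDIO" snd_hda_intel`. Running `current_driver "$GPU"` right after the VM stops would also show whether the GPU is still attached to `vfio-pci`.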
I assume that this failure is the reason the next VM start hangs for almost 3 minutes before the kernel logs a series of PCI device resets:
Code:
# The VM is started again...
Oct 30 10:55:30 pve kernel: # Network setup is quick
# --- HUGE GAP of ~2m 57s ---
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.0: resetting
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.0: reset done
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.0: resetting
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.1: resetting
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.0: reset done
Oct 30 10:58:27 pve kernel: vfio-pci 0000:02:00.1: reset done
Oct 30 10:58:28 pve qm[1683332]: VM 100 started with PID 1683370.
I believe if I can fix the post-stop script so it no longer fails, the startup delay will be gone.
Does anyone have experience with this kind of behavior, perhaps with this specific GPU? Is there a better way to structure the script to ensure a clean return of the GPU to the host?
Thank you in advance for any help.