VMs hang and can't be killed when passing in GPU

Redmumba

May 9, 2025
I rebooted my server this morning (2 LXCs, one VM), and the VM will not come up with the PCIe passthrough GPU. In this case, the VM ID is 104. I turned off "start on boot" so that I could at least get to a "stable" state, made sure the host is up to date (`apt update` / `apt upgrade`), and rebuilt the NVIDIA drivers on both the host and in the VM (by temporarily disabling passthrough). Nothing works, and as far as I know nothing has changed in the config (the last reboot was about a week ago).
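For context, the update/driver check looked roughly like this (a minimal sketch, assuming the host driver is managed through DKMS, which may not match your setup):

Code:
# bring the host packages up to date
apt update && apt full-upgrade
# confirm the NVIDIA module was rebuilt against the running kernel
dkms status
uname -r
# sanity-check the driver on the host
nvidia-smi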

What the heck is happening??

I'm able to view the card from `nvidia-smi` on the host, so I know it "works":


Code:
root@proxmox:~# nvidia-smi
Fri May  9 11:49:13 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.144                Driver Version: 570.144        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    Off |   00000000:27:00.0 Off |                  N/A |
|  0%   44C    P8             12W /  125W |       0MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Any interaction from the command line results in it locking up; e.g., if I run `qm start 104`, it can't be interrupted, suspended, etc. Trying to kill the running process does nothing either:

Code:
root@proxmox:~# lsof /var/lock/qemu-server/lock-104.conf
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
task\x20U 4055 root    5wW  REG   0,28        0   77 /run/lock/qemu-server/lock-104.conf
root@proxmox:~# ps aux | grep 4055
root        4055 27.3  0.3 229756 118796 pts/0   R+   11:39   3:06 task UPID:proxmox:00000FD7:00001001:681E4BD0:qmstart:104:root@pam:
root        7950  0.0  0.0   6336  2048 pts/1    S+   11:50   0:00 grep 4055
root@proxmox:~# pstree 4055
task UPID:proxm
root@proxmox:~#
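If it's useful for diagnosis, the stuck worker can still be inspected like this (a sketch; `/proc/<pid>/stack` is root-only and not available on every kernel build):

Code:
# inspect the stuck `qm start` worker (PID 4055 in my case)
ps -o pid,stat,wchan:32,cmd -p 4055   # a STAT of "D" would mean uninterruptible sleep in the kernel
cat /proc/4055/stack                  # if available: where in the kernel it is waiting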

Code:
root@proxmox:~# cat /etc/pve/qemu-server/104.conf
[...]
agent: enabled=1
args: -object memory-backend-memfd,id=mem,size=8192M,share=on
bios: ovmf
boot: order=scsi0
cores: 6
cpu: EPYC-IBPB
efidisk0: local-lvm:vm-104-disk-0,efitype=4m,size=4M
hostpci0: 0000:27:00.0
localtime: 1
memory: 16384
meta: creation-qemu=9.2.0,ctime=1745094291
name: docker
net0: virtio=02:FF:E6:52:C1:29,bridge=vmbr0
numa: 1
onboot: 0
ostype: l26
scsi0: local-lvm:vm-104-disk-1,discard=on,size=200G,ssd=1
scsi1: local-lvm:vm-104-disk-2,backup=0,cache=writethrough,size=256G
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=5bbe3e4d-cebe-4269-a2aa-e4fb2a2acb64
sockets: 2
tablet: 0
tags: community-script,debian12,docker
usb0: host=8-3
vga: none
vmgenid: ecf2b3c6-4c7a-4c48-9f15-97da478ac861
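Since `hostpci0` points at 0000:27:00.0, it's also worth checking which host driver actually has the card before the VM starts (a sketch, using stock `pciutils`):

Code:
# show the kernel driver currently claiming the GPU
lspci -nnk -s 0000:27:00.0
# "Kernel driver in use: nvidia" means the host driver still owns the card;
# for passthrough it needs to be handed over to vfio-pci when the VM starts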

Logs:
* `dmesg -T`: https://paste.debian.net/hidden/629d3d58/
 
For anybody else who runs into this, I eventually found the issue; specifically, it was this line in the `dmesg` output:

Code:
[Fri May  9 11:39:14 2025] NVRM: Attempting to remove device 0000:27:00.0 with non-zero usage count!

It turns out that Netdata and Beszel both run `nvidia-smi` in persistent mode to monitor the GPU temperature, which kept the device's usage count non-zero and prevented the kernel from detaching it for passthrough. I uninstalled the NVIDIA drivers from the host (I don't believe they're needed there), but I'm leaving this here in case anybody runs into a similar issue.
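If you'd rather keep the host drivers installed, you can check what is still holding the card before starting the VM (a sketch; `fuser` comes from the psmisc package, and the `netdata` unit name is an assumption for my install):

Code:
# find processes keeping the NVIDIA device nodes open (e.g. nvidia-smi launched by a monitoring agent)
fuser -v /dev/nvidia*
# check the module use count ("Used by" column) before attempting passthrough
lsmod | grep nvidia
# stop the monitoring agents first, then start the VM
systemctl stop netdata   # plus whatever unit runs the Beszel agent on your host
qm start 104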