NVIDIA GPU on host unusable after use in VM

damarges · Nov 27, 2023

After a fresh reboot I can see and use my NVIDIA GeForce RTX 3060 GPU on the host and in Docker containers:

Code:

root@pve:~# nvidia-smi
Mon Nov 27 18:47:18 2023      
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   60C    P0              43W / 170W |      1MiB / 12288MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                       
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

As soon as I use this GPU in my Windows 10 VM, I can no longer use it on the host after shutting Windows down (either via the Windows start menu or with qm stop 100).

This is what my VM config looks like:

Code:

agent: 1
bios: ovmf
boot: order=ide0;ide2;net0;sata0
cores: 5
cpu: host
efidisk0: fastn:101/vm-101-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:01:00,pcie=1,x-vga=1
ide0: fastn:101/vm-101-disk-1.qcow2,cache=writethrough,size=350G,ssd=1
machine: pc-q35-8.0
memory: 24576
meta: creation-qemu=8.0.2,ctime=1691479506
name: win10ssd
net0: e1000=06:4F:B9:A3:7B:90,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-single
smbios1: uuid=53296502-4346-43a7-aed6-84333ee24a4f
sockets: 2
unused0: fastn:101/vm-101-disk-2.raw
usb0: host=04e8:3301,usb3=1
usb1: host=3302:29c7
usb2: host=093a:2510
usb3: host=1c4f:0015

After the VM shuts down, Docker containers can no longer see or use the GPU, and neither can host processes like nvidia-smi:

Code:

root@pve:~# nvidia-smi
Failed to initialize NVML: Unknown Error

Does anyone have an idea what could be causing this behaviour?

Contents of my /etc/default/grub:

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt video=vesafb:off video=efifb:off initcall_blacklist=sysfb_init"

Only a complete reboot of Proxmox helps.

leesteken · Nov 27, 2023

Proxmox can unbind a PCI(e) device from its host driver for passthrough, but it does not do the reverse when the VM shuts down. You can do this manually by unbinding the various functions from vfio-pci and binding them to their original drivers. You can use a hookscript to automate this after VM shutdown. There are some snippets for this somewhere on this forum, but the details depend of course on your specific device and its IDs.
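A minimal sketch of such a hookscript, assuming the PCI address 0000:01:00.0 and the nvidia host driver from this thread. Proxmox calls a hookscript with the VM ID and a phase (pre-start, post-start, pre-stop, post-stop). The function names and the SYSFS, GPU_ADDR and HOST_DRIVER variables are made up for illustration, and the nvidia module is assumed to be loaded on the host already:

```shell
#!/usr/bin/env bash
# Hookscript sketch. Proxmox calls it as: <script> <vmid> <phase>
# GPU_ADDR and HOST_DRIVER are assumptions taken from this thread; adjust them.
GPU_ADDR="${GPU_ADDR:-0000:01:00.0}"
HOST_DRIVER="${HOST_DRIVER:-nvidia}"
SYSFS="${SYSFS:-/sys}"   # overridable only for dry runs against a scratch directory

# Move one PCI function from vfio-pci back to its host driver.
return_to_host() {
    local addr="$1" driver="$2"
    local vfio="$SYSFS/bus/pci/drivers/vfio-pci"
    # Only unbind if vfio-pci actually holds the device right now.
    if [ -e "$vfio/$addr" ]; then
        echo "$addr" > "$vfio/unbind"
    fi
    echo "$addr" > "$SYSFS/bus/pci/drivers/$driver/bind"
}

# Decide what to do for a given phase.
handle_phase() {
    local vmid="$1" phase="$2"
    case "$phase" in
        post-stop)
            return_to_host "$GPU_ADDR" "$HOST_DRIVER"
            ;;
        *)
            # pre-start, post-start, pre-stop: nothing to do in this sketch
            ;;
    esac
}

# Run only when invoked with both arguments, as Proxmox does.
if [ "$#" -ge 2 ]; then
    handle_phase "$1" "$2"
fi
```

To try it, the script would have to be executable, stored on a storage with the snippets content type enabled, and attached with something like qm set 100 --hookscript local:snippets/gpu-return.sh (the file name is hypothetical).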

damarges · Nov 27, 2023

thanks for the reply.
I found some threads on the board now. Will try and get back/update this thread.

LnxBil · Nov 27, 2023

look here.

leesteken · Nov 27, 2023

damarges said:
thanks for the reply.
I found some threads on the board now. Will try and get back/update this thread.

I found this: https://forum.proxmox.com/threads/vms-dont-release-passed-through-gpu.127892/#post-559382

LnxBil · Nov 27, 2023

damarges said:
thanks for the reply.
I found some threads on the board now. Will try and get back/update this thread.

You changed your whole comment? Sadly, I did not quote it in my reply, so my comment no longer makes any sense.

damarges · Nov 27, 2023

The original comment was "I could not find threads about that on the board". 60 seconds later I found threads, so I edited it to "I found threads and will walk through them". Sorry if you had already begun writing an answer to my post about not finding anything. It was not my intention to waste your time.

damarges · Jan 26, 2024

What finally helped and got it working is this shell script, which first removes the vfio modules, runs modprobe for the nvidia driver, and then unbinds and rebinds the driver to my hardware (in this case the NVIDIA RTX 3060):

Bash:

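# Unload the vfio modules
rmmod vfio_pci
rmmod vfio_iommu_type1   # the kernel module is named vfio_iommu_type1
rmmod vfio
# Load the host driver
modprobe nvidia
# Release the GPU from its current driver, then bind it to nvidia
echo "0000:01:00.0" > "/sys/bus/pci/devices/0000:01:00.0/driver/unbind" && echo "0000:01:00.0" > "/sys/bus/pci/drivers/nvidia/bind"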

leesteken · Jan 27, 2024

damarges said:
rmmod vfio_pci
rmmod vfio_iommu_type1
rmmod vfio

This is not necessary (and may be impossible or problematic when other VMs with passthrough are still running). Better to use:

echo "0000:01:00.0" > "/sys/bus/pci/drivers/vfio-pci/unbind"
echo "0000:01:00.1" > "/sys/bus/pci/drivers/vfio-pci/unbind"

and so on for all functions of 0000:01:00, to unbind vfio-pci from the VGA function, the audio function and possibly other functions like USB (I don't know your 3060).
But then also bind the right drivers for those functions:

echo "0000:01:00.0" > "/sys/bus/pci/drivers/nvidia/bind"
echo "0000:01:00.1" > "/sys/bus/pci/drivers/snd_hda_intel/bind"

and so on for all functions.
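To see which functions exist and which driver currently holds each of them, something like the following sketch may help. It only reads sysfs and changes nothing. The function name list_functions and the SYSFS variable are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: print every function of one PCI slot with the driver currently bound to it.
SYSFS="${SYSFS:-/sys}"   # overridable only for dry runs against a scratch directory

# list_functions <slot>, e.g. list_functions 0000:01:00
list_functions() {
    local slot="$1" dev addr driver
    for dev in "$SYSFS"/bus/pci/devices/"$slot".*; do
        [ -e "$dev" ] || continue          # the glob matched nothing
        addr="$(basename "$dev")"
        if [ -L "$dev/driver" ]; then
            # The driver entry is a symlink to .../bus/pci/drivers/<name>
            driver="$(basename "$(readlink "$dev/driver")")"
        else
            driver="(none)"
        fi
        echo "$addr $driver"
    done
}
```

Calling list_functions 0000:01:00 after boot (before starting the VM) shows the original drivers to bind back to, and calling it again after VM shutdown shows what is still held by vfio-pci. lspci -nnk -s 01:00 reports the same information under "Kernel driver in use".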

damarges · Jan 27, 2024

Thanks, so basically the last line of my script should be enough. The RTX 3060 also has an audio interface at function .1, but I can neither unbind nor bind it. As I don't use it and have no other issues, this works for me. Thank you for your help.

Ah wait, might the sound function of my GPU use an Intel driver instead of an NVIDIA driver?
