NVIDIA GPU on host unusable after use in VM

damarges

After a fresh reboot I can see and use my NVIDIA GeForce RTX 3060 GPU on the host and in Docker containers:
Code:
root@pve:~# nvidia-smi
Mon Nov 27 18:47:18 2023      
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   60C    P0              43W / 170W |      1MiB / 12288MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                       
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

As soon as I use this GPU in my Win10 VM, I can no longer use it on the host machine after shutting Windows down (either via the Windows Start menu shutdown or qm stop 100).

This is what my VM config looks like:

Code:
agent: 1
bios: ovmf
boot: order=ide0;ide2;net0;sata0
cores: 5
cpu: host
efidisk0: fastn:101/vm-101-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:01:00,pcie=1,x-vga=1
ide0: fastn:101/vm-101-disk-1.qcow2,cache=writethrough,size=350G,ssd=1
machine: pc-q35-8.0
memory: 24576
meta: creation-qemu=8.0.2,ctime=1691479506
name: win10ssd
net0: e1000=06:4F:B9:A3:7B:90,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-single
smbios1: uuid=53296502-4346-43a7-aed6-84333ee24a4f
sockets: 2
unused0: fastn:101/vm-101-disk-2.raw
usb0: host=04e8:3301,usb3=1
usb1: host=3302:29c7
usb2: host=093a:2510
usb3: host=1c4f:0015

After that VM shutdown, Docker containers can no longer see and use the GPU, and neither can host processes like nvidia-smi:

Code:
root@pve:~# nvidia-smi
Failed to initialize NVML: Unknown Error

Anyone got an idea what could be causing this behaviour?

Contents of my /etc/default/grub:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt video=vesafb:off video=efifb:off initcall_blacklist=sysfb_init"


Only a complete reboot of Proxmox helps.
 
Proxmox can unbind a PCI(e) device for passthrough, but it does not do the reverse when the VM shuts down. You can do this manually by unbinding the various functions from vfio-pci and binding them back to their original drivers. You can use a hookscript to automate this after VM shutdown. There are some snippets for this somewhere on this forum, but the details of course depend on your specific device and its IDs.
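For reference, a minimal hookscript sketch (untested; it assumes the GPU sits at 0000:01:00 with the VGA function on .0 and the HDMI audio function on .1, and that the host drivers are nvidia and snd_hda_intel; adjust the addresses and drivers to your card). Proxmox calls the hookscript with the VM ID and a phase name, so the rebinding can run in the post-stop phase:

Bash:
#!/bin/bash
# gpu-rebind.sh (hypothetical name) - called by Proxmox for every VM lifecycle phase
vmid="$1"
phase="$2"

if [ "$phase" = "post-stop" ]; then
    # Release both GPU functions from vfio-pci (ignore errors if already unbound)
    echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind 2>/dev/null || true
    echo "0000:01:00.1" > /sys/bus/pci/drivers/vfio-pci/unbind 2>/dev/null || true
    # Make sure the host drivers are loaded, then hand the functions back to them
    modprobe nvidia
    modprobe snd_hda_intel
    echo "0000:01:00.0" > /sys/bus/pci/drivers/nvidia/bind
    echo "0000:01:00.1" > /sys/bus/pci/drivers/snd_hda_intel/bind
fi
exit 0

Place the script in a snippets storage, make it executable, and attach it with e.g. qm set <vmid> --hookscript local:snippets/gpu-rebind.sh.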
 
The original comment was "I could not find threads about that on the board". 60 seconds later I did find threads, so I edited it to "I found threads and will walk through them". Sorry if you had already started writing an answer to the earlier version of my post that said I couldn't find anything; it was not my intention to waste your time.
 
What finally helped and got things working for me is this shell script, which first removes the vfio modules, does a modprobe of the nvidia driver, and then unbinds and rebinds the driver to my hardware (in this case the NVIDIA RTX 3060):


Bash:
# Unload the VFIO modules so the GPU is released from vfio-pci
rmmod vfio_pci
rmmod vfio_iommu_type1
rmmod vfio
# Load the NVIDIA driver again
modprobe nvidia
# Detach the GPU from whatever driver still claims it, then bind it to nvidia
echo "0000:01:00.0" > "/sys/bus/pci/devices/0000:01:00.0/driver/unbind" && echo "0000:01:00.0" > "/sys/bus/pci/drivers/nvidia/bind"
 
rmmod vfio_pci
rmmod vfio_iommu_type1
rmmod vfio
This is not necessary (and may be impossible or problematic when other VMs with passthrough are still running). Better to use:
echo "0000:01:00.0" > "/sys/bus/pci/drivers/vfio-pci/unbind" echo "0000:01:00.1" > "/sys/bus/pci/drivers/vfio-pci/unbind"
and so on for all functions of 0000:01:00, to unbind vfio-pci from the VGA and the audio and possibly other functions like USB. (I don't know your 3060).
But then also bind the right drivers for those functions:
echo "0000:01:00.0" > "/sys/bus/pci/drivers/nvidia/bind" echo "0000:01:00.1" > "/sys/bus/pci/drivers/snd_hda_intel/bind"
and so on for all functions.
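
As a sketch, that per-function unbind/rebind could also be looped. The address/driver mapping below is an assumption for a 3060 with only a VGA and an audio function; extend it if lspci shows more functions:

Bash:
# Hypothetical mapping of each PCI function to its host driver
declare -A drivers=(
  ["0000:01:00.0"]="nvidia"
  ["0000:01:00.1"]="snd_hda_intel"
)
for dev in "${!drivers[@]}"; do
    # Release the function from vfio-pci, then bind it to its host driver
    echo "$dev" > /sys/bus/pci/drivers/vfio-pci/unbind
    echo "$dev" > "/sys/bus/pci/drivers/${drivers[$dev]}/bind"
done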
 
Thanks, so basically the last line of my script should be enough. The RTX 3060 also has an audio interface on the .1 function, but I can neither unbind nor bind it at all. As I don't use it and don't have any other issues, this works for me. Thank you for your help.

Ah wait, might the sound function of my GPU use an Intel driver instead of an NVIDIA driver?
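
If in doubt, one way to check which kernel driver each function of the card uses (assuming it sits at bus address 01:00) is shown below; on GeForce cards the HDMI audio function is normally handled by the generic snd_hda_intel HDA driver rather than by nvidia:

Bash:
# Show all functions of the card, their [vendor:device] IDs and the kernel driver in use
lspci -nnk -s 01:00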
 
