Recover GPU from VM after it shuts down

promoxer

Member
Apr 21, 2023
1. I have 2 VMs that can each take over the 1 GPU in my system.
2. I can pass the GPU between the 2 VMs without a hitch, i.e. shut down A, start up B, shut down B, start up A.
3. Based on this, what commands can I type in Proxmox to take back control of the GPU?
 
I believe these steps can help:

Identify the VMs that are currently using the GPU by navigating to the "Hardware" tab of each VM and checking if the GPU device is present.
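
The same check can probably be done from the host shell instead of the web interface. A minimal sketch, assuming the VM configs live in the usual /etc/pve/qemu-server/ directory and that passed-through devices show up as hostpciN lines; the function name vms_with_hostpci and the CONF_DIR variable are made up for this example:

```shell
# List every VM config line that passes a PCI device through.
# CONF_DIR is a hypothetical override so the function can be tried
# against a copy of the configs; it defaults to the Proxmox location.
vms_with_hostpci() (
    conf_dir="${CONF_DIR:-/etc/pve/qemu-server}"
    # -H prefixes each match with the file name, which contains the VM ID.
    grep -H '^hostpci[0-9]*:' "$conf_dir"/*.conf 2>/dev/null
)
```

Running vms_with_hostpci prints one line per match, prefixed with the config file name. Lines inside snapshot sections of a config would match too, so the output may need a second look.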

Shut down both VMs that are using the GPU to release the device.

Open the Proxmox shell or terminal and type the following command to unbind the GPU device from the virtual machine:

#qm device del <vmid> <deviceid>
Replace <vmid> with the ID of the virtual machine you want to remove the GPU device from, and <deviceid> with the ID of the GPU device.
For example, if the ID of the virtual machine is 100 and the ID of the GPU device is 0, the command will be:
#qm device del 100 0

Once the GPU device is unbound from both VMs, you can bind it back to the Proxmox host by running the following command:
#echo 1 > /sys/class/vtconsole/vtcon0/bind
This command binds the first console driver (vtcon0) back to the virtual terminals of the Proxmox host; it does not bind the GPU to a kernel driver.

If you have multiple GPUs or want to bind the GPU to a different virtual console, you can replace /sys/class/vtconsole/vtcon0 with the appropriate device path.
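
Before writing to a bind file it may help to see which console drivers exist and which one is currently bound. A small sketch, assuming each vtconN directory exposes the name and bind attributes; list_vtconsoles and SYSFS_ROOT are names invented for this example:

```shell
# Print each virtual-console driver with its description and bind state.
# SYSFS_ROOT is a hypothetical override for trying this outside a real host.
list_vtconsoles() (
    root="${SYSFS_ROOT:-/sys}"
    for con in "$root"/class/vtconsole/vtcon*; do
        [ -d "$con" ] || continue
        # "name" describes the driver, "bind" is 1 when it owns the terminals.
        printf '%s: %s (bound=%s)\n' "${con##*/}" \
            "$(cat "$con/name")" "$(cat "$con/bind")"
    done
)
```

On many systems vtcon0 is a dummy or VGA text console and the framebuffer console is vtcon1, so the number to use in the echo command may differ from host to host.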

Finally, you can start up the VMs again and assign the GPU device to the VM that needs it using the Proxmox web interface
 
I haven't tried qm device del, but after the VM is shut down, unbind the current driver (usually vfio-pci) from the devices and bind the actual drivers. For example:
echo "0000:01:00.0" > "/sys/bus/pci/devices/0000:01:00.0/driver/unbind" && echo "0000:01:00.0" > "/sys/bus/pci/drivers/amdgpu/bind"
echo "0000:01:00.1" > "/sys/bus/pci/devices/0000:01:00.1/driver/unbind" && echo "0000:01:00.1" > "/sys/bus/pci/drivers/snd_hda_intel/bind"
Your PCI ID (0000:01:00.0 etc.) and drivers (amdgpu etc.) might be different.
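
The two one-liners above follow the same pattern, so they could be wrapped in a small helper. This is only a sketch under assumptions: rebind_pci and SYSFS_ROOT are hypothetical names, and the extra checks are additions that the commands above do not perform:

```shell
# Move one PCI function from its current driver (if any) to another.
# SYSFS_ROOT is a hypothetical override used only for dry runs.
rebind_pci() (
    addr="$1"    # full PCI address, e.g. 0000:01:00.0
    drv="$2"     # target driver name, e.g. amdgpu
    root="${SYSFS_ROOT:-/sys}"
    dev="$root/bus/pci/devices/$addr"
    target="$root/bus/pci/drivers/$drv"

    [ -d "$dev" ] || { echo "no such device: $addr" >&2; exit 1; }
    # The driver directory only exists once the driver is registered.
    [ -d "$target" ] || { echo "driver $drv is not loaded" >&2; exit 1; }

    # Unbind only if a driver is currently attached.
    if [ -e "$dev/driver" ]; then
        echo "$addr" > "$dev/driver/unbind" || exit 1
    fi
    echo "$addr" > "$target/bind"
)
```

With the example IDs from this post, the calls would be rebind_pci 0000:01:00.0 amdgpu and rebind_pci 0000:01:00.1 snd_hda_intel, run as root.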
 
1. Since the VMs are shutdown, do I still need to unbind the GPU from the VMs first? Because VM B was able to grab GPU from A without A unbinding it.

2. What is the equivalent of `/sys/bus/pci/drivers/amdgpu/bind` for Nvidia? I looked in /sys/bus/pci/drivers and did not find amdgpu or anything that sounds like nvidia. I do find vfio-pci though.

Thank you.
 
1. Since the VMs are shutdown, do I still need to unbind the GPU from the VMs first? Because VM B was able to grab GPU from A without A unbinding it.
Yes, because for VMs the driver needs to be vfio-pci for passthrough, but for the Proxmox host you need the actual driver for the device to use it.
2. What is the equivalent of `/sys/bus/pci/drivers/amdgpu/bind` for Nvidia? I looked in /sys/bus/pci/drivers and did not find amdgpu or anything that sounds like nvidia. I do find vfio-pci though.
What driver is loaded for your GPU when you did not use passthrough? What does lspci -ks YOUR_PCI_ID show as possible drivers? It's probably nouveau.
 
Code:
root@pve:~# lspci -ks 0000:08:00.0
08:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 745] (rev a2)
        Subsystem: Hewlett-Packard Company GM107 [GeForce GTX 745]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
root@pve:~# locate
locate: no pattern to search for specified
root@pve:~# locate nvidiafb
/usr/lib/modules/5.15.104-1-pve/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/usr/lib/modules/5.15.74-1-pve/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
root@pve:~# locate nouveau
/usr/lib/modules/5.15.104-1-pve/kernel/drivers/gpu/drm/nouveau
/usr/lib/modules/5.15.104-1-pve/kernel/drivers/gpu/drm/nouveau/nouveau.ko
/usr/lib/modules/5.15.74-1-pve/kernel/drivers/gpu/drm/nouveau
/usr/lib/modules/5.15.74-1-pve/kernel/drivers/gpu/drm/nouveau/nouveau.ko
root@pve:~#

This is what I have. So should I just use
Code:
/sys/bus/pci/drivers/nvidiafb
? I did locate the module files on disk, though.
 
Or nouveau. Which one is loaded (in use) after a reboot of the host when not using passthrough and you do have a working host console?
 
Yes, I have a working console. Let me check with a reboot. I have to disable autostart on my Windows VM first. :)

One moment please.
 
Yes, I have a text console on my monitor when Windows VM was not booted up. How do I tell which driver is being used? I ran `lspci -ks 08:00.0` and it showed both nouveau and nvidiafb like earlier logs.
 
I did a `ls -R` in my `/sys/bus/pci/drivers` and could not find any driver with `08:00.0`. What does this mean?
 
Yes, I have a working console. Let me check with a reboot. I have to disable autostart on my Windows VM first. :)

One moment please.
Don't worry, this is not a chat and I'm not waiting for your reply.
Yes, I have a text console on my monitor when Windows VM was not booted up. How do I tell which driver is being used? I ran `lspci -ks 08:00.0` and it showed both nouveau and nvidiafb like earlier logs.
What is the (exact) output of lspci -ks 08:00.0? What is the output of ls -l /sys/bus/pci/devices/0000:08:00.0/driver?
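
Both questions come down to where the driver symlink of the device points. A minimal sketch of that lookup, where current_pci_driver and SYSFS_ROOT are hypothetical names chosen for this example:

```shell
# Print the name of the driver bound to a PCI function, or "none".
# SYSFS_ROOT is a hypothetical override for testing without real hardware.
current_pci_driver() (
    addr="$1"    # full PCI address, e.g. 0000:08:00.0
    root="${SYSFS_ROOT:-/sys}"
    link="$root/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        target=$(readlink "$link")
        echo "${target##*/}"
    else
        echo "none"
    fi
)
```

A result of "none" corresponds to the "No such file or directory" case, i.e. no kernel driver is bound to the device at that moment.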
 
Bash:
root@pve:~# lspci -ks 08:00.0
08:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 745] (rev a2)
        Subsystem: Hewlett-Packard Company GM107 [GeForce GTX 745]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
 
It is already bound to vfio-pci. I need to see this when you are not doing passthrough and the host console is still working (because that is the situation you want to restore).
 
This is what I get after a reboot, proxmox login prompt being displayed on my GPU's monitor, before I start Windows.

Hope it helps.

Bash:
root@pve:~# ls -l /sys/bus/pci/devices/0000:08:00.0/driver
ls: cannot access '/sys/bus/pci/devices/0000:08:00.0/driver': No such file or directory
root@pve:~# lspci -ks 08:00.0
08:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 745] (rev a2)
        Subsystem: Hewlett-Packard Company GM107 [GeForce GTX 745]
        Kernel modules: nvidiafb, nouveau
root@pve:~# ls -l /sys/bus/pci/devices/0000:08:00.0/
total 0
-r--r--r-- 1 root root      4096 Apr 26 22:53 ari_enabled
-r--r--r-- 1 root root      4096 Apr 26 22:53 boot_vga
-rw-r--r-- 1 root root      4096 Apr 26 22:53 broken_parity_status
-r--r--r-- 1 root root      4096 Apr 26 22:53 class
-rw-r--r-- 1 root root      4096 Apr 26 22:53 config
-r--r--r-- 1 root root      4096 Apr 26 22:53 consistent_dma_mask_bits
lrwxrwxrwx 1 root root         0 Apr 26 22:53 consumer:pci:0000:08:00.1 -> ../../../virtual/devlink/pci:0000:08:00.0--pci:0000:08:00.1
-r--r--r-- 1 root root      4096 Apr 26 22:53 current_link_speed
-r--r--r-- 1 root root      4096 Apr 26 22:53 current_link_width
-rw-r--r-- 1 root root      4096 Apr 26 22:53 d3cold_allowed
-r--r--r-- 1 root root      4096 Apr 26 22:53 device
-r--r--r-- 1 root root      4096 Apr 26 22:53 dma_mask_bits
-rw-r--r-- 1 root root      4096 Apr 26 22:53 driver_override
-rw-r--r-- 1 root root      4096 Apr 26 22:53 enable
lrwxrwxrwx 1 root root         0 Apr 26 22:53 firmware_node -> ../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:24/device:25
lrwxrwxrwx 1 root root         0 Apr 26 22:53 iommu -> ../../0000:00:00.2/iommu/ivhd0
lrwxrwxrwx 1 root root         0 Apr 26 22:53 iommu_group -> ../../../../kernel/iommu_groups/26
-r--r--r-- 1 root root      4096 Apr 26 22:53 irq
drwxr-xr-x 2 root root         0 Apr 26 22:53 link
-r--r--r-- 1 root root      4096 Apr 26 22:53 local_cpulist
-r--r--r-- 1 root root      4096 Apr 26 22:53 local_cpus
-r--r--r-- 1 root root      4096 Apr 26 22:53 max_link_speed
-r--r--r-- 1 root root      4096 Apr 26 22:53 max_link_width
-r--r--r-- 1 root root      4096 Apr 26 22:53 modalias
-rw-r--r-- 1 root root      4096 Apr 26 22:53 msi_bus
-rw-r--r-- 1 root root      4096 Apr 26 22:53 numa_node
drwxr-xr-x 2 root root         0 Apr 26 22:53 power
-r--r--r-- 1 root root      4096 Apr 26 22:53 power_state
--w--w---- 1 root root      4096 Apr 26 22:53 remove
--w------- 1 root root      4096 Apr 26 22:53 rescan
--w------- 1 root root      4096 Apr 26 22:53 reset
-rw-r--r-- 1 root root      4096 Apr 26 22:53 reset_method
-r--r--r-- 1 root root      4096 Apr 26 22:53 resource
-rw------- 1 root root  16777216 Apr 26 22:53 resource0
-rw------- 1 root root 268435456 Apr 26 22:53 resource1
-rw------- 1 root root 268435456 Apr 26 22:53 resource1_wc
-rw------- 1 root root  33554432 Apr 26 22:53 resource3
-rw------- 1 root root  33554432 Apr 26 22:53 resource3_wc
-rw------- 1 root root       128 Apr 26 22:53 resource5
-r--r--r-- 1 root root      4096 Apr 26 22:53 revision
-rw------- 1 root root    131072 Apr 26 22:53 rom
lrwxrwxrwx 1 root root         0 Apr 26 22:53 subsystem -> ../../../../bus/pci
-r--r--r-- 1 root root      4096 Apr 26 22:53 subsystem_device
-r--r--r-- 1 root root      4096 Apr 26 22:54 subsystem_vendor
-rw-r--r-- 1 root root      4096 Apr 26 22:53 uevent
-r--r--r-- 1 root root      4096 Apr 26 22:53 vendor
-r--r--r-- 1 root root      4096 Apr 26 22:53 waiting_for_supplier
root@pve:~#
 
Hmm, there is no driver in use. Probably simplefb took over from UEFI/BIOS. Try binding nvidiafb or nouveau and see how it goes when you restart the console:
Once the GPU device is unbound from both VMs, you can bind it back to the Proxmox host by running the following command:
#echo 1 > /sys/class/vtconsole/vtcon0/bind
This command binds the first console driver (vtcon0) back to the virtual terminals of the Proxmox host; it does not bind the GPU to a kernel driver.

If you have multiple GPUs or want to bind the GPU to a different virtual console, you can replace /sys/class/vtconsole/vtcon0 with the appropriate device path.
 
Ok thanks, mind sharing how I should bind nvidiafb/nouveau?

I assume "restart the console" means this command echo 1 > /sys/class/vtconsole/vtcon0/bind?
 
Ok thanks, mind sharing how I should bind nvidiafb/nouveau?
It's the second part after the && (the first part is the unbind of vfio-pci):
echo "0000:01:00.0" > "/sys/bus/pci/devices/0000:01:00.0/driver/unbind" && echo "0000:01:00.0" > "/sys/bus/pci/drivers/amdgpu/bind"
echo "0000:01:00.1" > "/sys/bus/pci/devices/0000:01:00.1/driver/unbind" && echo "0000:01:00.1" > "/sys/bus/pci/drivers/snd_hda_intel/bind"
Your PCI ID (0000:01:00.0 etc.) and drivers (amdgpu etc.) might be different.
 
unbind works, but after that, I get this

Code:
root@pve:~# echo "0000:08:00.0" > /sys/bus/pci/drivers/nouveau/bind
-bash: /sys/bus/pci/drivers/nouveau/bind: No such file or directory
root@pve:~# echo "0000:08:00.0" > /sys/bus/pci/drivers/nvidiafb/bind
-bash: /sys/bus/pci/drivers/nvidiafb/bind: No such file or directory
root@pve:~#
 
Maybe the drivers need to be loaded first (because they were not used before)? Try running modprobe nouveau before binding.
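
The "No such file or directory" error fits a driver that has not registered itself under /sys/bus/pci/drivers yet. A sketch of checking for that before loading the module, assuming the module and the driver directory share a name (ensure_pci_driver and SYSFS_ROOT are hypothetical names):

```shell
# Load a kernel module only if its PCI driver directory is missing.
# SYSFS_ROOT is a hypothetical override for testing without real hardware.
ensure_pci_driver() (
    drv="$1"    # driver/module name, e.g. nouveau
    root="${SYSFS_ROOT:-/sys}"
    if [ -d "$root/bus/pci/drivers/$drv" ]; then
        echo "$drv already registered"
    else
        # Needs root; the bind file should exist once the driver registers.
        modprobe "$drv" && echo "$drv loaded"
    fi
)
```

After ensure_pci_driver nouveau succeeds, the echo to /sys/bus/pci/drivers/nouveau/bind can be retried.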
 
I had to reinstall PVE because of ZFS corruption. To my surprise, GPU passthrough worked without any blacklisting of drivers or similar tweaks.
I also got to see that my GPU uses the `nouveau` driver at boot.

I should also mention that PVE is now installed in UEFI mode; previously it wasn't. I'm not sure whether this explains why GPU passthrough needed zero configuration.
 