[SOLVED] GPU Passthrough Issues After Upgrade to 7.2

echo 'device_specific' >"/sys/bus/pci/devices/0000:1c:00.0/reset_method" only works (and is only necessary for) kernel 5.15. If you are using 5.13, you can ignore this line and/or this error.
Your error about echo "0000:1c:00.0" > "/sys/bus/pci/devices/0000:1c:00.0/driver/unbind" indicates that amdgpu is not loaded for the GPU, which is essential for my solution. This means that you are not doing the same thing as I do. You can check whether amdgpu is loaded for the GPU using lspci -ks 1c:00.0.

Please note that my setup fixed a different error than Failed to mmap 0000:1c:00.0 BAR 0. Performance may be slow. For me, it fixed BAR 0 cannot reserve memory (or something), which only occurs with kernel 5.15. Maybe I misunderstood your question and/or original problem, and my solution won't help you at all. Or you need to switch to the latest pve-kernel-5.15 and make sure amdgpu is used before running the script (and vendor-reset needs to be installed).

amdgpu seems to be loading and I am on kernel 5.15. See below:

Code:
root@pve:~# lspci -ks 1c:00.0
1c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Radeon RX 580 Armor 4G OC
        Kernel modules: amdgpu
root@pve:~# sh /var/lib/vz/snippets/gpu-hookscript.sh
/var/lib/vz/snippets/gpu-hookscript.sh: 4: echo: echo: I/O error
/var/lib/vz/snippets/gpu-hookscript.sh: 6: cannot create /sys/bus/pci/devices/0000:1c:00.0/driver/unbind: Directory nonexistent
root@pve:~# uname -a
Linux pve 5.15.35-1-pve #1 SMP PVE 5.15.35-3 (Wed, 11 May 2022 07:57:51 +0200) x86_64 GNU/Linux
root@pve:~#

I guess it's different for the RX580. I can revert the changes and go back to 5.13 if necessary. I wonder if they will fix this for 5.15+.
 
It's not a difference in the GPU. The driver is not actually loaded. It shows that it is possible to use amdgpu but if it was loaded and used for this GPU there would be a line with: Kernel driver in use: amdgpu. I expect the errors to disappear because the directories and files should be there when a driver is loaded.
EDIT: I guess there is still a blacklist amdgpu somewhere in /etc/modprobe.d/.
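
Both checks can be scripted. This is only a minimal sketch, assuming the GPU address 0000:1c:00.0 from the posts above. The function names driver_in_use and find_module_rules are made up for illustration and are not part of any Proxmox tooling; the optional root-directory arguments exist only so the functions can be tried against a scratch directory.

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI device, or "none".
# $1: PCI address, e.g. 0000:1c:00.0
# $2: sysfs root (assumed default /sys; overridable for dry runs)
driver_in_use() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink whose last component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# List modprobe config lines that blacklist, override or configure a module.
# Commented-out lines such as "#blacklist radeon" are not reported.
# $1: module name, e.g. amdgpu
# $2: config directory (assumed default /etc/modprobe.d)
find_module_rules() {
    grep -rnE "^[[:space:]]*(blacklist|install|options)[[:space:]]+$1([[:space:]]|\$)" \
        "${2:-/etc/modprobe.d}" 2>/dev/null
}

# Example:
#   driver_in_use 0000:1c:00.0    # prints "amdgpu" once the driver is bound
#   find_module_rules amdgpu      # no output: no matching rule in that directory
```

An empty result from find_module_rules only covers that one directory. The module can also be kept from loading elsewhere (for example on the kernel command line, see /proc/cmdline), so treat it as a first check rather than proof.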
 

Hmm I can't see where it is blacklisting the radeon driver though:

Code:
root@pve:~# cat /etc/modprobe.d/*.conf
#blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist snd_hda_intel
# modprobe information used for DKMS modules
#
# This is a stub file, should be edited when needed,
# used by default by DKMS.
options vfio_iommu_type1 allow_unsafe_interrupts=1
options kvm-amd nested=1
options kvm ignore_msrs=1 report_ignored_msrs=0
# This file contains a list of modules which are not supported by Proxmox VE

# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
options vfio-pci ids=1b73:1100,10de:1c82,10de:0fb9,1002:aaf0 disable_vga=1
root@pve:~#

Either way, if amdgpu is not loaded, shouldn't the BAR 0 error not occur when I start the VM? And when I stop the VM and then run the script, shouldn't it work, since the module is loaded by then? In that case, echo 'device_specific' >"/sys/bus/pci/devices/0000:1c:00.0/reset_method" gives an "I/O error"; is that also why the GPU is not resetting?
 
I needed amdgpu to take over from the boot framebuffer (BOOTFB in /proc/iomem) to prevent BAR 0 cannot reserve errors (which is different from your error) and I needed vendor-reset to make sure the device reset actually works. Vendor-reset does not do anything in kernel 5.15 unless you set the reset_method to device_specific (the command does not reset the device, it prepares it for vendor-reset, which does the reset when vfio-pci takes over).
The whole problem is that the boot framebuffer does not release BAR 0 (which maybe it should, but it just does not do that in kernel 5.15) unless amdgpu is loaded.
Unless you can make sure that amdgpu is the kernel driver in use, you are not doing what I do (and which works for my system that had a different problem than yours).
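
The two preconditions described above can be checked before a VM start. The following is a rough sketch, not the actual hookscript from this thread: has_bootfb and prepare_vendor_reset are hypothetical helper names, and the file and sysfs-root arguments are assumptions added so the functions can be tested outside a real host.

```shell
#!/bin/sh
# Return success if the boot framebuffer still claims a memory region.
# $1: iomem file (assumed default /proc/iomem; overridable for dry runs)
has_bootfb() {
    grep -q 'BOOTFB' "${1:-/proc/iomem}"
}

# Select the device_specific reset method so that vendor-reset is used
# when vfio-pci later resets the device. This does not reset anything itself.
# $1: PCI address, e.g. 0000:1c:00.0
# $2: sysfs root (assumed default /sys)
prepare_vendor_reset() {
    attr="${2:-/sys}/bus/pci/devices/$1/reset_method"
    if [ ! -e "$attr" ]; then
        # Per the thread, this step only applies on kernel 5.15; skip otherwise.
        echo "no reset_method attribute for $1, skipping" >&2
        return 0
    fi
    echo 'device_specific' > "$attr"
}

# Example:
#   if has_bootfb; then echo "BOOTFB still present, amdgpu has not taken over" >&2; fi
#   prepare_vendor_reset 0000:1c:00.0 || echo "could not set reset_method" >&2
```

If the write fails (as with the "I/O error" reported earlier), prepare_vendor_reset returns non-zero, which a hookscript can use to stop before the VM is started.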
 
Thanks for the explanation and help here. I really appreciate it. I'm not sure what to do to force the driver to load. I also have an Nvidia card installed; maybe that is where Linux is looking when booting? I'll do some testing and post back if I find a solution.
 
ARRRRRG! updated the host, and lost the ability to GPU pass through to my HiveOS VM.

(Nvidia3070FE / Ryzen 5950x / X570 MSI procarbonwifi)
 
Hello. Do I understand correctly that your solution is only for AMD video cards and is not suitable for NVIDIA?
 
As far as I know, it only works with GPUs that use the amdgpu kernel driver.
Maybe you could do the same thing with the nouveau driver (and you then would not need vendor-reset) but I have not read anyone having success with that. You could try and see if loading nouveau removes the BOOTFB from /proc/iomem. And whether the switch from nouveau to vfio-pci (which Proxmox does when starting the VM) does not give any (additional) errors. I would love to hear whether this works for (some) NVidia cards but it might be a waste of your time.
 
I will test it on a test machine. One strange thing I want to note: when I first updated Proxmox and could not boot Windows 10 with the video card passed through (Windows just froze and crashed with a blue screen), my local disk space shrank sharply. This went on very quickly until I turned off the virtual machine, and the space was not freed afterwards. I assume a huge number of logs is to blame?
 
Take a look with the journalctl command (scroll to around that time) to confirm this. You can configure tighter limits on the journal.
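
As a sketch of what "tighter limits" could look like: journald reads drop-in files from /etc/systemd/journald.conf.d, and SystemMaxUse= caps the disk space used by the persistent journal. The helper name write_journal_limit, the file name size-limit.conf and the 500M value are arbitrary choices for illustration, not recommendations.

```shell
#!/bin/sh
# Write a journald drop-in that caps the size of the persistent journal.
# $1: size limit, e.g. 500M
# $2: drop-in directory (assumed default /etc/systemd/journald.conf.d)
write_journal_limit() {
    dir="${2:-/etc/systemd/journald.conf.d}"
    mkdir -p "$dir" || return 1
    printf '[Journal]\nSystemMaxUse=%s\n' "$1" > "$dir/size-limit.conf"
}

# Possible sequence on the host (as root); the timestamps are placeholders:
#   journalctl --disk-usage                # space the journal uses right now
#   journalctl --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM"
#   write_journal_limit 500M
#   systemctl restart systemd-journald     # pick up the new limit
#   journalctl --vacuum-size=500M          # trim what is already on disk
```

Checking journalctl --disk-usage first shows whether the journal is really what filled the disk before any limit is changed.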
 
So same issue as you guys.......

Updated host, and GPU pass through stopped working. I am running a single RTX3070FE, Ryzen 5950x, MSI MPG x570 procarbonwifi MB.

I ran the command:
proxmox-boot-tool kernel pin 5.13.19-6-pve
and after a reboot I see the host summary confirms I am running the older version again
Linux 5.13.19-6-pve #1 SMP PVE 5.13.19-15 (Tue, 29 Mar 2022 15:59:50 +0200)

I still cannot get GPU pass through to work, and I see the following error when I run dmesg:
vfio-pci 0000:2d:00.0: BAR 1: can't reserve [mem 0x7fe0000000-0x7fefffffff 64bit pref]

this was my grub modification:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off"

Do you guys know what I should try to resolve this issue? It was working fine before the update.
I sure do hope this becomes a proxmox supported feature in the future.

Thanks in advance.

Puma
 
I was under the impression that people were either using the sketchy workaround, or they were pinning the version to Linux 5.13.19-6-pve #1
Am I misunderstanding? I guess I can remove the version pin and try the work around if that's what I need to do.
For some, using amdgpu instead of blacklisting it works fine on recent 5.15 kernels. For NVidia, only the work-around of disconnecting and rescanning the PCI bus seems to work. Eventually we'll all need to move forward to newer kernels and I believe PCI passthrough will always be trial and error (and not guaranteed by Proxmox although they do put effort into it when they can), so I guess your assessment of the situation is correct.
 
This is strange af; I upgraded my system today.
GTX 1060 was running absolutely fine on PVE 7.2 @ kernel 5.15, no issues at all.
Special grub flags (regarding the EFI framebuffer etc.) have not been necessary since PVE 7.1; it basically works out of the box, I only had to enable vfio and iommu.

Now I switched to the AMD 6950XT and stuff is kinda broken.
I am able to boot into the VM and the GPU gives me a picture if I leave amdgpu enabled on PVE.
BUT the resolution is garbage and the Device Manager screams "code 43"; effectively it is not working.

I tried basically every combination mentioned in this thread, and none of them gets it working.

vendor-reset seems to be broken in its current state?
echo 'device_specific' >"/sys/bus/pci/devices/0000:09:00.0/reset_method" leads to "bash: echo: write error: Invalid argument".

I will try older repository states of vendor-reset in hopes to get it working properly and then maybe it will just work.
Do you guys have any other ideas yet?
 
Now I switched to the AMD 6950XT and stuff is kinda broken.
vendor-reset seems to be broken in its current state?
echo 'device_specific' >"/sys/bus/pci/devices/0000:09:00.0/reset_method" leads to "bash: echo: write error: Invalid argument".
Vendor-reset intentionally does not support the 6000-series because those GPUs don't have the reset issues of older generations.
Do you guys have any other ideas yet?
What worked for me (RX 580) on 5.15 is to not blacklist amdgpu and to not early-bind the VGA function of the GPU to vfio-pci, so that amdgpu takes over the GPU (this avoids the simplefb BOOTFB issue; verify with lspci -ks 09:00.0). Then execute echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null before starting the VM.
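
For use in a script, the same console unbind can be written as a small loop. This is only a sketch of the command above: unbind_vtconsoles is a made-up name, and the directory argument is an assumption added so the loop can be tried on a scratch directory instead of the real /sys/class/vtconsole.

```shell
#!/bin/sh
# Detach every virtual console from its framebuffer by writing 0 to "bind".
# $1: vtconsole class directory (assumed default /sys/class/vtconsole)
unbind_vtconsoles() {
    for f in "${1:-/sys/class/vtconsole}"/vtcon*/bind; do
        [ -e "$f" ] || continue    # glob matched nothing
        echo 0 > "$f"
    done
    return 0
}

# Example (as root, after amdgpu has taken over the GPU and before the VM starts):
#   unbind_vtconsoles
```

The effect is the same as the one-line tee command; the loop form just makes it easier to add logging or error handling per console.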
 
Hey guys, just to point my situation here.

I have two Proxmox servers, both with GPU passthrough enabled: a GT-710, which didn't need any tricky procedure to work on Proxmox 7.2 with the 5.15 kernel, and a GTX-970 on another system, which didn't work with any method posted here.

I just made a script:
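#!/bin/bash
# Remove the GPU from the PCI bus...
echo 1 > /sys/bus/pci/devices/0000\:08\:00.0/remove
# ...then rescan the bus so the kernel enumerates the device again.
echo 1 > /sys/bus/pci/rescan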

Which is executed by cron at boot:

@reboot /scripts/gpu/gpu_reset.sh

That's my GRUB CMDLINE:

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 video=efifb:off video=vesafb:off mpt3sas.max_queue_depth=10000"

If I execute it once the system has booted, it reserves the GPU properly; if I enable the cron execution, the system just freezes a few seconds after booting. Regardless, my Windows 11 machine, which has that GPU passed through, doesn't recognise the GPU properly in either case.

Those issues are solved (as for other users) by pinning the 5.13 kernel...

Cheers.
 
Vendor-reset intentionally does not support the 6000-series because those GPUs don't have the reset issues of older generations.
Ahh, this is great to know. Thank you!

I removed the blacklisting and the vfio early binds.
I added that vtcon command as an @reboot cron job, as I intend to (later) enable autoboot for my VM.

It hangs at the VM's boot screen now.

Funfact: On Kernel 5.13 it is not working either. o_O
 
I added that vtcon command as an @reboot cron job, as I intend to (later) enable autoboot for my VM.
It needs to be executed after amdgpu has taken control of the GPU. I'm not sure when @reboot happens exactly.
It hangs at the VM's boot screen now.
Are there error messages in journalctl -b 0 as soon as you start the VM? Is BOOTFB still present in cat /proc/iomem?
You could try unbinding the GPU and binding it to vfio-pci manually (before starting the VM) to see if it gives errors. It should be something like this:
echo "0000:09:00.0" >/sys/bus/pci/drivers/amdgpu/unbind
echo "0000:09:00.0" >/sys/bus/pci/drivers/vfio-pci/bind
Funfact: On Kernel 5.13 it is not working either. o_O
I also needed a much different approach on 5.13 (video=efifb:off and early binding to vfio-pci). I guess that my approach for 5.15 does not work for your (much newer) GPU.
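
Since it is unclear when @reboot jobs run relative to amdgpu loading, one option is to wait until the driver is actually bound before touching the consoles. This is a hedged sketch and has not been tested on the hardware in this thread; wait_for_driver is a hypothetical helper, and the timeout and sysfs-root parameters are assumptions.

```shell
#!/bin/sh
# Wait until a PCI device is bound to the expected kernel driver.
# $1: PCI address, e.g. 0000:09:00.0
# $2: driver name, e.g. amdgpu
# $3: timeout in seconds (assumed default 60)
# $4: sysfs root (assumed default /sys; overridable for dry runs)
# Returns 0 once the driver is bound, 1 if the timeout expires first.
wait_for_driver() {
    link="${4:-/sys}/bus/pci/devices/$1/driver"
    remaining="${3:-60}"
    while :; do
        if [ -L "$link" ] && [ "$(basename "$(readlink "$link")")" = "$2" ]; then
            return 0
        fi
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Example for an @reboot job:
#   wait_for_driver 0000:09:00.0 amdgpu 60 &&
#       echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
```

If the function times out, the console unbind is skipped, which at least makes the ordering problem visible instead of silently running too early.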
 
