[SOLVED] amdgpu breaks passthrough since pve-kernel-5.11.22-7, but works again with 5.15.19-1-pve

leesteken

Famous Member
May 31, 2020
1,425
271
83
For years, I could boot with my AMD GPU (seeing all Proxmox boot messages and console) and then pass it through to a (Linux) VM later on. With the help of vendor-reset and a Proxmox hook script, I could even pass the GPU back to the amdgpu driver after the VM shut down. The amdgpu and vfio-pci drivers worked together nicely. This worked up to pve-kernel-5.11.22-5.

Today the enterprise repository updated to PVE 7.1 and I got pve-kernel-5.11.22-7, and this stopped working. The amdgpu driver no longer releases the GPU and the VM will not start. Blacklisting the amdgpu driver causes the VM to freeze when starting and the syslog to fill with error messages about BAR 0. pve-kernel-5.13.19-1 behaves the same way.
I found a work-around by adding video=efifb:off to the kernel parameters (fixing the BAR 0 issue) and binding the GPU early to vfio-pci (which I prefer to blacklisting the driver).
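
For anyone wanting to try the early binding described above, here is a minimal sketch. The helper name vfio_early_bind_conf and the target file /etc/modprobe.d/vfio.conf are assumptions, and the example IDs are the RX 580 IDs quoted further down in this thread, not necessarily yours (check with lspci -nn). The video=efifb:off parameter itself goes on the kernel command line, for example in GRUB_CMDLINE_LINUX_DEFAULT.

```shell
#!/bin/sh
# Sketch only: print modprobe.d lines that let vfio-pci claim a GPU before amdgpu.
# vfio_early_bind_conf is a hypothetical helper name, not a Proxmox tool.
vfio_early_bind_conf() {
    ids="$1"    # comma-separated vendor:device IDs, as shown by `lspci -nn`
    # Tell vfio-pci which devices to claim when it loads
    printf 'options vfio-pci ids=%s\n' "$ids"
    # Ask modprobe to load vfio-pci before amdgpu, so amdgpu never grabs the GPU
    printf 'softdep amdgpu pre: vfio-pci\n'
}

# Example use as root (file name is an assumption), followed by
# `update-initramfs -u -k all` and a reboot:
#   vfio_early_bind_conf 1002:67df,1002:aaf0 > /etc/modprobe.d/vfio.conf
```

The softdep line is what makes this an alternative to blacklisting: amdgpu can still load, it just loads after vfio-pci has taken the listed devices.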

This means that my Proxmox host is now practically headless and I can no longer return a GPU to the host. I can only do passthrough with the first and second x16 slots (both in use by VMs) and the system will only show boot messages and the console on the GPU in the first slot. Therefore, I expect no improvement from a third GPU for the Proxmox host. Note that the USB controller passthrough still works fine, switching between the xhci_hcd and vfio-pci drivers.

Have more people experienced this with the newer kernels? Is this fixable or is it new behavior of the amdgpu and not related to Proxmox or VFIO?

EDIT: This looks a lot like this problem with an AMD GPU, pve-kernel-5.11.22-7 and a Mac VM, because I get the same error: Cannot bind 0000:0b:00.0 to vfio (when amdgpu is the driver in use).
 

Duanra08

New Member
Nov 13, 2021
7
1
3
42
Hello,
I added comments in the other discussion about the similar problem...
I have an AMD RX 590 8 GB GPU.

I tried to reboot my server on the old kernel 5.11.22-5, but it failed... my server is down... I need to connect a screen to see where the problem is and restore the boot.

If I want to start my VM with kernel 5.13.19-1, I must remove the GPU PCI passthrough.
 

leesteken

The commit messages are:
pve-kernel (5.11.22-7) bullseye; urgency=medium
  * cherry-pick fixes for CVE-2021-3656 and CVE-2021-3653
pve-kernel (5.11.22-6) bullseye; urgency=medium
  * io_uring: don't block level reissue off completion path
Which don't give a hint about changes to behavior of amdgpu, unfortunately. I'll try 5.11.22-6 to see if it behaves like -5 or like -7.
 

leesteken

I found a better work-around: blacklist amdgpu (as it breaks passthrough since 5.11.22-7) but don't use video=efifb:off. Instead, unbind efifb using echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind (using Perl in a hook script). This allows me to see all boot messages and still start the VM that uses the same GPU. This reddit post gave me the idea, and it appears to work with all current PVE kernel versions.
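
A rough shell equivalent of that hook-script step is sketched below (the post above used Perl; this is not that script). It assumes the usual Proxmox hook-script calling convention of VM ID as the first argument and phase as the second. The names hook and unbind_efifb, and the SYSFS override that allows a dry run against a scratch directory, are made up for illustration.

```shell
#!/bin/sh
# Sketch of a hook script that releases the EFI framebuffer before the VM starts.
# SYSFS is a hypothetical override for dry runs; on a real host it stays /sys.
SYSFS="${SYSFS:-/sys}"

unbind_efifb() {
    drv="$SYSFS/bus/platform/drivers/efi-framebuffer"
    # Only write to unbind while efifb still owns the device;
    # writing when nothing is bound would just produce an error.
    if [ -e "$drv/efi-framebuffer.0" ]; then
        echo efi-framebuffer.0 > "$drv/unbind"
    fi
}

hook() {
    vmid="$1"     # VM ID passed by Proxmox (unused in this sketch)
    phase="$2"    # pre-start, post-start, pre-stop or post-stop
    case "$phase" in
        pre-start) unbind_efifb ;;
    esac
}

# In an actual hook script the last line would be:
#   hook "$@"
```

Such a script would be attached to the VM with something like qm set <vmid> --hookscript <storage>:snippets/<file>, where the storage and file name depend on your setup.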

Unfortunately, rebinding the efifb does not fully work. It shows these successful messages in the Proxmox node Syslog:
efifb: probing for efifb
efifb: No BGRT, not showing boot graphics
efifb: framebuffer at 0x7fd0000000, using 10800k, total 10800k
efifb: mode is 2560x1080x32, linelength=10240, pages=1
efifb: scrolling: redraw
efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
Console: switching to colour frame buffer device 320x67
fb0: EFI VGA frame buffer device
But it does not display anything on the monitor (no signal). I also see a successful vendor-reset of my AMD GPU, which makes me think that that is not the problem.

Does anyone have any idea how to get the (EFI) console to display again after GPU passthrough? Loading the driver with modprobe amdgpu works, but then it breaks passthrough again...
 

leesteken

Sounds like this behavior changed for nouveau as well. PVE 7.0 did not require blacklisting or early binding to vfio-pci, but PVE 7.1 does need that because unbinding the driver crashes.
 

Duanra08

Dear @avw ,
Thank you for your instructions:

I added amdgpu to pve-blacklist.conf, and now everything works!
 

leesteken

I can't get passthrough to work with the AMD GPU again, and I cannot get boot messages to display at all with kernel 5.15.17-1-pve.
I don't know how to make the system boot 5.15.12-1-pve automatically, so I'm reverting to 5.13.19-4-pve, which works, as does 5.15.12-1-pve.
 

Duanra08

Hello avw,
I'm still on 5.13.19-4-pve.
Passthrough works for me too.
Now I know I shouldn't upgrade to 5.15.
 

leesteken

Not using any framebuffer (video=simplefb:off video=efifb:off video=vesafb:off) on kernel 5.15.19-1-pve and letting amdgpu load normally gives me most boot messages and a host console. Unbinding the consoles and unloading amdgpu before passthrough fixes all passthrough problems for me: echo 0 | tee /sys/class/vtconsole/vtcon*/bind; sleep 3; rmmod amdgpu.
By unbinding vfio-pci and rebinding amdgpu, I can even get a console after shutting down the VM, just like before with kernel 5.11.22-5-pve.
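
The two halves of that cycle could be sketched in shell roughly as follows. This is an illustration under assumptions, not the actual hook script: release_gpu and return_gpu are hypothetical names, SYSFS is a made-up override for dry runs, the default PCI address is simply the one from the error message in the first post, and reloading the driver with modprobe amdgpu is one assumed way of "rebinding amdgpu" after it was removed with rmmod.

```shell
#!/bin/sh
# Sketch: hand the GPU to a VM and take it back afterwards (run as root on the host).
SYSFS="${SYSFS:-/sys}"         # hypothetical override for dry runs
GPU="${GPU:-0000:0b:00.0}"     # assumed PCI address; adjust to your own GPU

release_gpu() {                # intended for the pre-start phase
    # Detach every virtual console so nothing keeps the amdgpu framebuffer busy
    for f in "$SYSFS"/class/vtconsole/vtcon*/bind; do
        if [ -e "$f" ]; then
            echo 0 > "$f"
        fi
    done
    sleep 3                    # same pause as in the one-liner above
    rmmod amdgpu
}

return_gpu() {                 # intended for the post-stop phase
    drv="$SYSFS/bus/pci/drivers/vfio-pci"
    # Unbind from vfio-pci only if the device is still bound to it
    if [ -e "$drv/$GPU" ]; then
        echo "$GPU" > "$drv/unbind"
    fi
    # Assumption: loading amdgpu again lets it claim the now driverless GPU
    modprobe amdgpu
}
```

Writing to each bind file in a loop has the same effect as the tee with a glob, but it skips the case where no vtconsole entry exists.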
 
Oct 15, 2021
3
0
1
44
Hi leesteken,

I've been following your posts around the forum a bit, but this looked like the best one to respond to. It seems like we have similar setups: I have an RX 580 (the Dooku version) that I'm trying to pass through to macOS (Monterey). With everything up to 5.13.19-6-pve I had it all working using just vfio_pci.ids=1002:67df,1002:aaf0 video=efifb:off. I upgraded to 7.2 with the 5.15 kernel today and everything broke, with boatloads of vfio-pci 0000:18:00.0: BAR 0: can't reserve errors and macOS just crashing when trying to boot.

I followed your advice across multiple threads. I took everything out of GRUB_CMDLINE_LINUX_DEFAULT, unblacklisted the amdgpu module and installed vendor-reset. Even using the 5.15 kernel fix (device_specific), the best I can get is macOS booting so that it sees the GPU but can't use it, almost like when there is no monitor attached. I tried this with an HDMI monitor attached and with my standard DisplayPort dongle, with the same result.

Can you tell me a little more about what you have going? I'd really appreciate any help here. I've been beating on this for several hours today.
 

leesteken

I don't know what to tell you other than to reiterate this and this.
If the device is passed through but not working, then either macOS is the problem or vendor-reset is not active. If you still get BAR can't reserve errors, then I think amdgpu did not load for your GPU and you should check /proc/iomem for BOOTFB.
My setup has no need for kernel parameters (because amd_iommu is on by default) nor anything in /etc/modprobe.d/ and I only blacklist snd_hda_intel for convenience. I do run echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null and echo 'device_specific' >"/sys/bus/pci/devices/0000:18:00.0/reset_method" before starting the VM.
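
The reset-method step and the /proc/iomem check from this post might look like this in a script. It is only a sketch: the function names and the SYSFS and IOMEM overrides (useful for a dry run) are hypothetical, it assumes the boot framebuffer region is labelled BOOTFB in /proc/iomem, and the reset_method file only exists on kernels that provide it (5.15 in this thread).

```shell
#!/bin/sh
# Sketch: select the device_specific reset and look for a leftover boot framebuffer.
SYSFS="${SYSFS:-/sys}"           # hypothetical override for dry runs
IOMEM="${IOMEM:-/proc/iomem}"    # hypothetical override for dry runs

use_device_specific_reset() {
    dev="$1"                     # PCI address, e.g. 0000:18:00.0
    f="$SYSFS/bus/pci/devices/$dev/reset_method"
    if [ -w "$f" ]; then
        echo device_specific > "$f"
    else
        echo "no writable reset_method for $dev" >&2
        return 1
    fi
}

bootfb_present() {
    # Succeeds when a boot framebuffer still reserves memory (assumed label: BOOTFB)
    grep -q BOOTFB "$IOMEM"
}

# Example before starting the VM:
#   use_device_specific_reset 0000:18:00.0
#   bootfb_present && echo "boot framebuffer still holds memory" >&2
```

If bootfb_present succeeds while the BAR errors appear, the framebuffer is a likely suspect for the reserved region.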

As this thread is marked solved, I think we should continue your problem in one of the threads about "PVE 7.2 broke my passthrough" (also because my fix appears not to be working for you at all).
 
