GPU Passthrough broken after reboot

turborierer

New Member
Jun 25, 2022
Dear all,
I have quite a few machines running with GPU passthrough, but all of a sudden one system no longer comes up, and I have no clue what changed.

This is the GRUB entry that normally works for a system with two identical 1080Ti cards:

Code:
#HPZ820
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on pcie_acs_override=downstream,multifunction initcall_blacklist=sysfb_init video=vesa:off vfio-pci.ids=10de:1b81,10de:10f0 pcie_aspm=off vfio_iommu_type1.allow_unsafe_interrupts=1 kvm.ignore_msrs=1 modprobe.blacklist=radeon,nouveau,nvidia,nvidiafb,nvidia-gpu"

All with the usual vfio modules etc. As I said, everything was working. On this particular machine I had to update to 5.19 because of this error:

BAR 0: can't reserve [mem 0xe0000000-0xefffffff 64bit pref]

I also needed the script from here
Code:
echo 1 > /sys/bus/pci/devices/0000\:09\:00.0/remove
echo 1 > /sys/bus/pci/rescan

https://forum.proxmox.com/threads/problem-with-gpu-passthrough.55918/page-3#post-484395

But now, the system got stuck at the very beginning:
Code:
Loading initial ramdisk ...

Then I tried removing the individual parts of the GRUB entry one by one. If I removed initcall_blacklist=sysfb_init, the system booted at least until here:
IMG_1412.JPG

If I remove everything else as well, the system boots again. My problem is that I do not get any error messages. I also tried different kernel versions:
IMG_1413.JPG

All without success. Finally, I set up a completely new PC with a similar GPU, this time an HP820 workstation. After upgrading to 5.19.17 and applying my working GRUB entry from above...

Again, no boot at all, everything got stuck here:
Code:
Loading initial ramdisk ...

My other cluster nodes with 5.15.74 work with exactly this grub line. What is going on here?
 
initcall_blacklist=sysfb_init should fix the "BAR can't reserve mem" issue, so you should not need to remove the GPU from the PCIe bus and rescan. Check the actually active kernel parameters with cat /proc/cmdline, and make sure to run update-initramfs/update-grub to apply changes.
Please don't use kernel 5.19, as it has not received updates or security fixes for a while now; switch to the latest optional kernel version, 6.2, instead.
Also, video=vesa:off does nothing, and you should not need to blacklist radeon, nvidia or nvidia-gpu, as they are not applicable or not installed on the host.
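To make the /proc/cmdline check less error-prone, something like the following could be run on the node after a reboot. It is only a minimal sketch: check_params is a hypothetical helper (not a Proxmox tool), and the two parameters passed at the bottom are just examples taken from this thread.

```shell
#!/bin/sh
# Sketch: report which expected kernel parameters are present in a
# kernel command line. check_params is a hypothetical helper name.
check_params() {
    cmdline="$1"
    shift
    for param in "$@"; do
        # Pad with spaces so only whole parameters match.
        case " $cmdline " in
            *" $param "*) echo "active:  $param" ;;
            *)            echo "missing: $param" ;;
        esac
    done
}

# Example call against the running kernel (skipped if /proc/cmdline
# is not available on this system).
if [ -r /proc/cmdline ]; then
    check_params "$(cat /proc/cmdline)" intel_iommu=on initcall_blacklist=sysfb_init
fi
```

If a parameter shows up as missing even though it is in /etc/default/grub, the change was most likely not applied with update-grub before the reboot.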
"My other cluster nodes with 5.15.74 work with exactly this grub line. What is going on here?"
I don't know, sorry. Double-check /proc/cmdline, /etc/modules and all .conf files in /etc/modprobe.d/ for differences between the nodes, and maybe also /etc/rc.local and crontab. Also check the BIOS settings (Above 4G decoding, Resizable BAR, IOMMU, ACS, etc.).
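One way to compare the nodes is to dump the relevant files into a single text file per node and diff the results. The sketch below assumes the standard paths mentioned above; collect_passthrough_config is a hypothetical helper, and the optional prefix argument only exists so it can be tried against a copy of the files.

```shell
#!/bin/sh
# Sketch: print the passthrough-related configuration of one node so
# that the output of two nodes can be diffed.
# collect_passthrough_config is a hypothetical helper name.
collect_passthrough_config() {
    root="${1:-}"   # optional path prefix; empty means the live system
    for f in "$root/proc/cmdline" "$root/etc/modules" "$root"/etc/modprobe.d/*.conf; do
        # Skip files that do not exist (including an unmatched glob).
        [ -r "$f" ] || continue
        rel="${f#"$root"}"
        echo "=== $rel ==="
        cat "$f"
    done
}

# Possible usage on each node:
#   collect_passthrough_config > "/tmp/$(hostname)-passthrough.txt"
# Then copy both files to one machine and compare them:
#   diff working-node-passthrough.txt broken-node-passthrough.txt
```

The BIOS settings are not covered by this and still have to be compared by hand.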
 
Thank you for the hints, @leesteken. I did some more testing with the following results:

1. I reinstalled a fresh Proxmox with kernel 5.15.74 because I knew that other nodes were running successfully with this kernel.

2. I changed GRUB_CMDLINE_LINUX_DEFAULT one step at a time and always ran update-initramfs/update-grub after each change.

Results:

Code:
#No boot at all
#GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on pcie_acs_override=downstream,multifunction initcall_blacklist=sysfb_init video=vesa:off vfio-pci.ids=10de:1b06,10de:10ef,10de:1b06,10de:10ef pcie_aspm=off vfio>

#Boot but got stuck
#GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on pcie_acs_override=downstream,multifunction video=vesa:off vfio-pci.ids=10de:1b06,10de:10ef,10de:1b06,10de:10ef pcie_aspm=off vfio_iommu_type1.allow_unsafe_inte>

#Boot but got stuck at vfio-pci .... (see picture)
#GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on pcie_acs_override=downstream,multifunction vfio-pci.ids=10de:1b06,10de:10ef"

#Booted
#GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"

#No boot at all
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on initcall_blacklist=sysfb_init video=vesa:off vfio-pci.ids=10de:1b06,10de:10ef,10de:1b06,10de:10ef pcie_aspm=off vfio_iommu_type1.allow_unsafe_interrupts=1 kvm.i>

So to conclude: whenever initcall_blacklist=sysfb_init was included, it got stuck at the very beginning at "Loading initial ramdisk ..." without any further output.

Whenever the pcie_acs_override=downstream,multifunction vfio-pci.ids=10de:1b06,10de:10ef part was included, it got stuck but at least showed some output.

My problem is that in the past it always booted and I could get some error messages from dmesg | grep -e DMAR -e IOMMU. This time I have to press "e" in GRUB every time and delete all the parameters so that the system boots and I can edit the grub file, update everything and try again.

How can I see the logs when everything gets stuck at "Loading initial ramdisk ..."? Please note that the "quiet" parameter was already removed.

Thx so much.
 

Attachments

  • IMG_1417.JPG
Note that initcall_blacklist=sysfb_init intentionally prevents output, because it is meant for passing through the boot (or only) GPU. Check the logs from the other computer that you use to manage Proxmox (either via the web GUI or SSH).
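For the SSH route, a rough sketch of how the kernel log could be filtered from the management machine is shown below. passthrough_log_lines is a hypothetical helper and root@pve-node is a placeholder for the real node; the journalctl variant for the previous boot only works if the journal is stored persistently on that node, which is an assumption here.

```shell
#!/bin/sh
# Sketch: keep only the kernel log lines that usually matter for
# passthrough debugging. Reads log text on stdin.
# passthrough_log_lines is a hypothetical helper name.
passthrough_log_lines() {
    grep -i -e DMAR -e IOMMU -e vfio -e 'BAR [0-9]'
}

# Possible usage from the management machine while the node appears
# to hang on its local screen ("root@pve-node" is a placeholder):
#   ssh root@pve-node dmesg | passthrough_log_lines
# Kernel messages of the previous boot (needs a persistent journal):
#   ssh root@pve-node journalctl -k -b -1 | passthrough_log_lines
```

If the node answers over SSH while the screen still shows "Loading initial ramdisk ...", the boot itself has succeeded and only the console output is missing.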
If you want to pass through a second GPU (which is not used for boot), just use early binding to vfio-pci and make sure it is applied before the actual driver loads by using a softdep.
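As an illustration of early binding with a softdep, the sketch below writes a modprobe.d fragment. write_vfio_conf is a hypothetical helper, the file name vfio.conf is an arbitrary choice, and the device IDs are the ones from the test lines earlier in this thread; it also assumes nouveau is the only host driver that would otherwise claim the card.

```shell
#!/bin/sh
# Sketch: write a modprobe.d fragment that binds the given PCI IDs to
# vfio-pci and makes nouveau wait until vfio-pci has been loaded.
# write_vfio_conf is a hypothetical helper name.
write_vfio_conf() {
    target="$1"   # file to write, e.g. /etc/modprobe.d/vfio.conf
    ids="$2"      # comma-separated vendor:device IDs
    printf 'options vfio-pci ids=%s\nsoftdep nouveau pre: vfio-pci\n' "$ids" > "$target"
}

# Possible usage on the host (as root), followed by a rebuild of the
# initramfs and a reboot:
#   write_vfio_conf /etc/modprobe.d/vfio.conf 10de:1b06,10de:10ef
#   update-initramfs -u -k all
```

With this in place, the vfio-pci.ids= entry on the kernel command line should no longer be needed, since the IDs come from the options line instead.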
 
