Opt-in Linux Kernel 5.15 for Proxmox VE 7.x available

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
5,204
1,513
164
South Tyrol/Italy
shop.proxmox.com
Just updated to kernel 5.15.17-1-pve and efifb no longer shows boot messages
Hmm, fwiw, I stumbled upon a patch in the upcoming 5.15.19 point release (not yet available for PVE) that sounds related:
https://git.proxmox.com/?p=mirror_u...it;h=de44ce64091566c6413bfcbff271c9420e0ab1af

GPU passthrough worked well enough in 5.15-12-1-pve, to which I have reverted for now. It feels like passthrough with amdgpu is progressively getting worse since 5.11.22-5.
Sorry if I overlooked it, but what gpu model and cpu/mainboard is in use there?
 

leesteken

Famous Member
May 31, 2020
1,515
297
88
Sorry if I overlooked it, but what gpu model and cpu/mainboard is in use there?
AMD RX570 (POLARIS10 according to vendor-reset) with Ryzen 2700X on X470. I have blacklisted amdgpu since 5.11.22-7-pve because it would no longer unload for passthrough, and used efifb instead to show boot messages. I'm not sure whether that patch is about efifb or about amdgpu. I would prefer to use amdgpu but haven't since 5.11.22-5-pve.
EDIT: Using earlycon=efifb does display boot messages but takes minutes to display the first second or so and I did not have the patience.
 

photonate

New Member
Jan 21, 2021
16
3
3
31
I am having BAR Reservation issues with 5.15-17-1 that I wasn't having on 5.13-19-3 (but am having on 5.13-19-4). Has anyone else seen this? I see some reports of GPU and LSI passthrough working in 5.15, and am wondering if there is anything else I can try.
 

blackangus

New Member
Dec 30, 2020
11
0
3
34
I can confirm that 5.15-17-1 breaks PCIe GPU passthrough due to Resizable BAR issues. Dmesg is filled with a massive number of errors about not being able to reserve addresses. I'd share the logs, but I don't have them at the moment; I had to go back to 5.13 to get my stuff up and running ASAP.

Note that this didn't happen for me in 5.15-12-1.
 
Apr 13, 2021
6
0
1
I was having issues with a kernel oops followed by process hangs on 5.13.19-4-pve (although my node running 5.13.19-3-pve seems to be stable - touch wood). Error attached. The issue was reproducible on two systems and was triggered when my IPMI removed its USB device as I closed the remote KVM console.

I initially tried 5.15.17-1-pve, but couldn't boot the system after installing that version (hangs immediately after displaying "EFI stub: Loaded initrd from command line option"). Downgrading to 5.15.12-1-pve seems to have resolved the problem.

There seem to be various comments on GPU passthrough on this thread, so for what it's worth on 5.15.12-1 I am able to pass through my Nvidia GPU to a Windows 10 VM. Passing through the i915 on both 5.15.12-1 and 5.13.19-3 seems to work initially but intermittently causes the whole pve node to hang requiring a power cycle on two of my three nodes (all nodes were stable with i915 passthrough on PVE 7.0 prior to upgrading to 7.1). Interestingly, the node that has the Nvidia GPU is the one that's stable passing through the i915, so hard to say whether the other two nodes that are unstable with the i915 would be stable or not passing through the Nvidia GPU.

Generally I'm in a better place having upgraded to kernel 5.15, though still not as stable as PVE 7.0 was.
 

Attachments

  • kernel-5.13.19-4-pve-Oops.txt
    5.8 KB · Views: 2

leesteken

I initially tried 5.15.17-1-pve, but couldn't boot the system after installing that version (hangs immediately after displaying "EFI stub: Loaded initrd from command line option"). Downgrading to 5.15.12-1-pve seems to have resolved the problem.
This is almost what happened to me, except that the system did continue to boot without displaying anything and the GUI was reachable and all but one VM did start.
 
Apr 13, 2021
6
0
1
This is almost what happened to me, except that the system did continue to boot without displaying anything and the GUI was reachable and all but one VM did start.

Interesting! My root ZFS pool is encrypted and requires that I enter the passphrase on the console before the machine will boot, so it is certainly possible that this was happening... without any output on the console I wouldn't be able to tell the difference between an outright hang versus running with no console and waiting for the passphrase.
 

leesteken

pve-kernel-5.15.19-1-pve from pvetest does not resolve the no EFI framebuffer/no host console/no boot messages issue for me. It still only displays EFI stub: Loaded initrd from command line option.
 
Apr 13, 2021
6
0
1
pve-kernel-5.15.19-1-pve from pvetest does not resolve the no EFI framebuffer/no host console/no boot messages issue for me. It still only displays EFI stub: Loaded initrd from command line option.
Likewise for me, no console output after EFI stub: Loaded initrd from command line option, like I saw with pve-kernel-5.15.17-1. This time I typed my encryption passphrase in blindly and the machine booted normally. I'll have to test my i915 passthrough later.
 

k.jings

New Member
Dec 4, 2021
5
0
1
60
Just installed 5.15.17-1-pve. The issues noted below are solved for me.
Previously, in 5.13.x (sorry, I forget the revision #):
  • Guest with USB passthrough:
    • Code:
      TASK ERROR: Failed to run vncproxy.
  • Server reboot hangs:
    • Code:
      Synchronizing SCSI cache
  • lsusb would hang indefinitely.
 

kriansa

New Member
Mar 24, 2020
8
2
3
29
I also don't get any output past the `Loading initial ramdisk ...` message, although it seems to boot fine if I connect to it using SSH.

A new (since 5.13) and suspicious message, probably the culprit, is also showing up:

Code:
[   12.307308] mgag200 0000:0e:01.0: vgaarb: deactivate vga console
[   12.307502] mgag200 0000:0e:01.0: [drm] *ERROR* can't reserve VRAM
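
For anyone who wants to compare saved logs between kernel versions, here is a minimal sketch of how lines like the two above could be filtered out of `dmesg` output. The pattern list and the function name `find_fb_errors` are assumptions made for illustration; the patterns are taken only from messages quoted in this thread and are certainly not an exhaustive list of what the kernel can print.

```python
import re

# Patterns based on messages quoted in this thread (illustrative, not exhaustive).
PATTERNS = [
    re.compile(r"can't reserve", re.IGNORECASE),
    re.compile(r"vgaarb: deactivate vga console"),
    re.compile(r"BAR \d+: assigned to efifb"),
]


def find_fb_errors(log_text):
    """Return the log lines that match at least one of the patterns above."""
    hits = []
    for line in log_text.splitlines():
        if any(pattern.search(line) for pattern in PATTERNS):
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    # The two lines quoted in the post above, used as sample input.
    sample = (
        "[   12.307308] mgag200 0000:0e:01.0: vgaarb: deactivate vga console\n"
        "[   12.307502] mgag200 0000:0e:01.0: [drm] *ERROR* can't reserve VRAM\n"
    )
    for hit in find_fb_errors(sample):
        print(hit)
```

To use it on a real system, one could save the output of `dmesg` to a file under each kernel and pass the file contents to `find_fb_errors`, then compare the two results.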
 

Polyphemus

New Member
Nov 18, 2021
23
3
3
41
I'm trying this kernel because, on the current 'main' kernel, I'm getting DID_BAD_TARGET on my new WD SN550 NVMe SSD when I reboot, and the system freezes.

That problem is resolved, but now on this kernel, when I start a VM that has a Coral TPU passed through to it, all USB and SSD drives get disconnected.

Using Ryzen 5 5600G on a Gigabyte A520M H.
 

leesteken

That problem is resolved, but now on this kernel, when I start a VM that has a Coral TPU passed through to it, all USB and SSD drives get disconnected.
Sounds like the IOMMU groups have changed between the kernel versions. Did/do you use pcie_acs_override? Can you check your groups before and after?
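
One possible way to do that comparison, as a rough sketch: the kernel exposes the groups under `/sys/kernel/iommu_groups/<group>/devices/`, so a small script can snapshot them under each kernel and diff the results. The function names `list_iommu_groups` and `diff_groups` are made up for this example.

```python
import os


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled or sysfs path not present
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(devices_dir):
            groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups


def diff_groups(before, after):
    """Return {device: (old_group, new_group)} for devices whose group changed."""
    def by_device(groups):
        return {dev: grp for grp, devs in groups.items() for dev in devs}

    old, new = by_device(before), by_device(after)
    return {
        dev: (old.get(dev), new.get(dev))
        for dev in sorted(set(old) | set(new))
        if old.get(dev) != new.get(dev)
    }


if __name__ == "__main__":
    # Print the groups of the currently running kernel.
    for group, devices in sorted(list_iommu_groups().items()):
        print(f"group {group}: {' '.join(devices)}")
```

Running it once under each kernel and saving the output should make any regrouping visible; `diff_groups` does the same comparison on two snapshots held in memory.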
 

Polyphemus

Sounds like the IOMMU groups have changed between the kernel versions. Did/do you use pcie_acs_override? Can you check your groups before and after?
No I did not use pcie_acs_override, should I?

I've now removed the NVMe SSD from the M.2 slot and placed my Coral in there. The Coral was previously mounted in a mini-PCIe to x1 adapter. I'm just reinstalling Proxmox to test whether it is a conflict between the x1 adapter and the NVMe slot. If that is not the case, I will reproduce the problem and test whether the IOMMU groups have changed.
 

leesteken

No I did not use pcie_acs_override, should I?

I've now removed the NVMe SSD from the M.2 slot and placed my Coral in there. The Coral was previously mounted in a mini-PCIe to x1 adapter. I'm just reinstalling Proxmox to test whether it is a conflict between the x1 adapter and the NVMe slot. If that is not the case, I will reproduce the problem and test whether the IOMMU groups have changed.
The M.2 slot is provided by the CPU and I expect it to be in its own IOMMU group, which you want for passthrough. The PCIe x1 slots are probably part of the A520 chipset and in an IOMMU group together with SATA, most USB and more. I would not have thought that this worked in an earlier kernel version unless you used pcie_acs_override.
 

Polyphemus

The M.2 slot is provided by the CPU and I expect it to be in its own IOMMU group, which you want for passthrough. The PCIe x1 slots are probably part of the A520 chipset and in an IOMMU group together with SATA, most USB and more. I would not have thought that this worked in an earlier kernel version unless you used pcie_acs_override.
Thank you for the clarification, I will report back :)
 

Polyphemus

The M.2 slot is provided by the CPU and I expect it to be in its own IOMMU group, which you want for passthrough. The PCIe x1 slots are probably part of the A520 chipset and in an IOMMU group together with SATA, most USB and more. I would not have thought that this worked in an earlier kernel version unless you used pcie_acs_override.
Maybe I should open a new topic, but with or without pcie_acs_override (does not make a difference), the Coral is in its own IOMMU group (04), and all USB/SSD/LAN devices are in a separate group, like 02 and 05.

But when I start the VM with the Coral in group 04 attached to it, the devices in groups 02 and 05 get disconnected...
 

t.lamprecht

AMD RX570 (POLARIS10 according to vendor-reset) with Ryzen 2700X on X470. I have blacklisted amdgpu since 5.11.22-7-pve because it would no longer unload for passthrough, and used efifb instead to show boot messages. I'm not sure whether that patch is about efifb or about amdgpu. I would prefer to use amdgpu but haven't since 5.11.22-5-pve.
EDIT: Using earlycon=efifb does display boot messages but takes minutes to display the first second or so and I did not have the patience.
I checked a bit more closely and noticed that 5.15.17-1-pve was the first build in which the SYSFB kernel config option is enabled by default, which we picked up because we had no explicit override in place.

Quoting the whole Kconfig option part, the note at the end is probably most relevant for your setup:

Code:
config SYSFB
        bool
        default y
        depends on X86 || EFI

config SYSFB_SIMPLEFB
        bool "Mark VGA/VBE/EFI FB as generic system framebuffer"
        depends on SYSFB
        help
          Firmwares often provide initial graphics framebuffers so the BIOS,
          bootloader or kernel can show basic video-output during boot for
          user-guidance and debugging. Historically, x86 used the VESA BIOS
          Extensions and EFI-framebuffers for this, which are mostly limited
          to x86 BIOS or EFI systems.
          This option, if enabled, marks VGA/VBE/EFI framebuffers as generic
          framebuffers so the new generic system-framebuffer drivers can be
          used instead. If the framebuffer is not compatible with the generic
          modes, it is advertised as fallback platform framebuffer so legacy
          drivers like efifb, vesafb and uvesafb can pick it up.
          If this option is not selected, all system framebuffers are always
          marked as fallback platform framebuffers as usual.

          Note: Legacy fbdev drivers, including vesafb, efifb, uvesafb, will
          not be able to pick up generic system framebuffers if this option
          is selected. You are highly encouraged to enable simplefb as
          replacement if you select this option. simplefb can correctly deal
          with generic system framebuffers. But you should still keep vesafb
          and others enabled as fallback if a system framebuffer is
          incompatible with simplefb.

          If unsure, say Y.

So it would seem that you need to use simplefb instead of the legacy efifb.
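
To see which of these options a given kernel build actually has, a small sketch like the following could parse the kernel config text. It assumes the config of the running kernel is available at `/boot/config-<release>`, as is usual on Debian-based systems; the function names are made up for this example, and `CONFIG_FB_SIMPLE` and `CONFIG_FB_EFI` are the upstream symbols for the simplefb and efifb drivers.

```python
# Symbols relevant to the framebuffer handover discussed above.
FB_SYMBOLS = (
    "CONFIG_SYSFB",
    "CONFIG_SYSFB_SIMPLEFB",
    "CONFIG_FB_SIMPLE",
    "CONFIG_FB_EFI",
)


def parse_kernel_config(text):
    """Parse kernel config text into {symbol: value}; '... is not set' maps to 'n'."""
    suffix = " is not set"
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            config[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            config[key] = value
    return config


def framebuffer_report(text):
    """Return the value of each framebuffer-related symbol, or 'absent'."""
    config = parse_kernel_config(text)
    return {symbol: config.get(symbol, "absent") for symbol in FB_SYMBOLS}


if __name__ == "__main__":
    import os
    import platform

    # Assumed location of the running kernel's config file.
    path = f"/boot/config-{platform.release()}"
    if os.path.exists(path):
        with open(path) as handle:
            for symbol, value in framebuffer_report(handle.read()).items():
                print(f"{symbol}={value}")
```

If `CONFIG_SYSFB_SIMPLEFB` is `y` while `CONFIG_FB_SIMPLE` is `n` or absent, that would match the situation the Kconfig note warns about, but this is only a quick check and not a diagnosis.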
 

leesteken

So it would seem that you need to use simplefb instead of the legacy efifb.
Thanks for looking into this. Please excuse my ignorance, but how do I use simplefb?
I did not choose efifb explicitly, it was what the kernel happened to use to display the host console (when amdgpu is blacklisted). My kernel command line is simply: root=ZFS=rpool/ROOT/pve-1 boot=zfs kvm_amd.avic=1. And it just stopped displaying messages and console.
I will install the kernel again and see if there are errors about simplefb in the logs. Is there a way to (force a) fallback to efifb in the meantime?

EDIT: I found these in the syslog: "kernel: fbcon: Taking over console" and "kernel: pci 0000:0b:00.0: BAR 0: assigned to efifb". I do believe that last one interferes with the passthrough. I don't see any messages about simplefb, only fewer messages about efifb and no creation of fb0 when running the newer kernel. Even video=efifb:off does not get rid of that second log message.
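
For checking which framebuffer-related parameters are actually active, here is a minimal sketch that splits a kernel command line (for example the contents of `/proc/cmdline`) into parameters. The function name `parse_cmdline` is made up for this example, and the parser is deliberately simple: it does not handle quoted values, and a repeated parameter keeps only its last value.

```python
import os


def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: value}, or None for bare flags."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so 'root=ZFS=rpool/ROOT/pve-1' stays intact.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    if os.path.exists("/proc/cmdline"):
        with open("/proc/cmdline") as handle:
            params = parse_cmdline(handle.read())
        # Show only the parameters that came up in this discussion.
        for key in ("video", "earlycon"):
            if key in params:
                print(f"{key}={params[key]}")
```

With the command line quoted in this post, `parse_cmdline` yields `root`, `boot` and `kvm_amd.avic` and no `video` or `earlycon` entry, which fits the statement that efifb was never chosen explicitly.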
 