Opt-in Linux Kernel 5.15 for Proxmox VE 7.x available

t.lamprecht · Feb 3, 2022

avw said:
Just updated to kernel 5.15.17-1-pve and efifb no longer shows boot messages

Hmm, fwiw, I stumbled upon a patch in the upcoming 5.15.19 point release (not yet available for PVE) that sounds related:
https://git.proxmox.com/?p=mirror_u...it;h=de44ce64091566c6413bfcbff271c9420e0ab1af

avw said:
GPU passthrough worked well enough in 5.15-12-1-pve, to which I have reverted for now. It feels like passthrough with amdgpu is progressively getting worse since 5.11.22-5.

Sorry if I overlooked it, but what gpu model and cpu/mainboard is in use there?

leesteken · Feb 3, 2022

t.lamprecht said:
Sorry if I overlooked it, but what gpu model and cpu/mainboard is in use there?

AMD RX570 (POLARIS10 according to vendor-reset) with Ryzen 2700X on X470. I have blacklisted amdgpu since 5.11.22-7-pve because it would no longer unload for passthrough, and used efifb instead to show boot messages. I'm not sure whether that patch is about efifb or amdgpu. I would prefer to use amdgpu but haven't since 5.11.22-5-pve.
EDIT: Using earlycon=efifb does display boot messages but takes minutes to display the first second or so and I did not have the patience.

photonate · Feb 4, 2022

I am having BAR Reservation issues with 5.15-17-1 that I wasn't having on 5.13-19-3 (but am having on 5.13-19-4). Has anyone else seen this? I see some reports of GPU and LSI passthrough working in 5.15, and am wondering if there is anything else I can try.

blackangus · Feb 4, 2022

I can confirm 5.15-17-1 breaks PCIe GPU passthrough due to Resizable BAR issues. dmesg is filled with a massive number of errors about not being able to reserve addresses. I'd share the logs but I don't have them at the moment; I had to go back to 5.13 in order to get my stuff up and running ASAP.

Note that this didn't happen for me in 5.15-12-1.

joed · Feb 4, 2022

I was having issues with a kernel oops followed by process hangs on 5.13.19-4-pve (although my node running 5.13.19-3-pve seems to be stable - touch wood). Error attached. The issue was reproducible on two systems and was triggered when my IPMI removed its USB device as I closed the remote KVM console.

I initially tried 5.15.17-1-pve, but couldn't boot the system after installing that version (hangs immediately after displaying "EFI stub: Loaded initrd from command line option"). Downgrading to 5.15.12-1-pve seems to have resolved the problem.

There seem to be various comments on GPU passthrough on this thread, so for what it's worth on 5.15.12-1 I am able to pass through my Nvidia GPU to a Windows 10 VM. Passing through the i915 on both 5.15.12-1 and 5.13.19-3 seems to work initially but intermittently causes the whole pve node to hang requiring a power cycle on two of my three nodes (all nodes were stable with i915 passthrough on PVE 7.0 prior to upgrading to 7.1). Interestingly, the node that has the Nvidia GPU is the one that's stable passing through the i915, so hard to say whether the other two nodes that are unstable with the i915 would be stable or not passing through the Nvidia GPU.

Generally I'm in a better place having upgraded to kernel 5.15, though still not as stable as PVE 7.0 was.

leesteken · Feb 4, 2022

joed said:
I initially tried 5.15.17-1-pve, but couldn't boot the system after installing that version (hangs immediately after displaying "EFI stub: Loaded initrd from command line option"). Downgrading to 5.15.12-1-pve seems to have resolved the problem.

This is almost what happened to me, except that the system did continue to boot without displaying anything and the GUI was reachable and all but one VM did start.

joed · Feb 4, 2022

avw said:
This is almost what happened to me, except that the system did continue to boot without displaying anything and the GUI was reachable and all but one VM did start.

Interesting! My root ZFS pool is encrypted and requires that I enter the passphrase on the console before the machine will boot, so it is certainly possible that this was happening... without any output on the console I wouldn't be able to tell the difference between an outright hang versus running with no console and waiting for the passphrase.

t.lamprecht · Feb 4, 2022

FYI, there's pve-kernel-5.15.19-1-pve available on pvetest, if you want to try that one out.

leesteken · Feb 4, 2022

pve-kernel-5.15.19-1-pve from pvetest does not resolve the no EFI framebuffer/no host console/no boot messages issue for me. It still only displays EFI stub: Loaded initrd from command line option.

joed · Feb 4, 2022

avw said:
pve-kernel-5.15.19-1-pve from pvetest does not resolve the no EFI framebuffer/no host console/no boot messages issue for me. It still only displays EFI stub: Loaded initrd from command line option.

Likewise for me, no console output after EFI stub: Loaded initrd from command line option, like I saw with pve-kernel-5.15.17-1. This time I typed my encryption passphrase in blindly and the machine booted normally. I'll have to test my i915 passthrough later.

k.jings · Feb 4, 2022

Just installed 5.15.17-1-pve. The issues below, which I had on 5.13.x (sorry, I forget the revision #), are solved for me.

Guest with USB passthrough:
Code:
```
TASK ERROR: Failed to run vncproxy.
```
Server reboot hangs at:
Code:
```
synchronizing scsi cache
```
lsusb would hang indefinitely.

kriansa · Feb 6, 2022

I also don't get any output past the `Loading initial ramdisk ...` message, although the system seems to boot fine: I can connect to it using SSH.

A suspicious message that is new since 5.13, and probably the culprit, also shows up:

Code:

[   12.307308] mgag200 0000:0e:01.0: vgaarb: deactivate vga console
[   12.307502] mgag200 0000:0e:01.0: [drm] *ERROR* can't reserve VRAM
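
For anyone comparing kernels, a small filter can make lines like these easier to spot in a saved `dmesg` dump. This is only a rough sketch: the function name `find_reserve_errors` and the list of substrings are my own choice, taken from the two messages above, and are certainly not an exhaustive list of relevant kernel messages.

```python
import re

# Optional "[ seconds.micros]" timestamp prefix as printed by dmesg.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*")

# Substrings taken from the two messages quoted above (assumption: extend
# this tuple with whatever your own hardware prints).
PATTERNS = ("can't reserve", "deactivate vga console")


def find_reserve_errors(dmesg_text):
    """Return (timestamp, message) tuples for lines containing a known pattern.

    The timestamp is None when a line carries no dmesg-style prefix.
    """
    hits = []
    for line in dmesg_text.splitlines():
        match = TIMESTAMP.match(line)
        timestamp = float(match.group(1)) if match else None
        message = line[match.end():] if match else line
        if any(pattern in message for pattern in PATTERNS):
            hits.append((timestamp, message))
    return hits
```

Save the output of `dmesg` under both kernels, pass each file's text to the function, and compare the two result lists.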

Polyphemus · Feb 6, 2022

I'm trying this kernel, because on the current 'main' kernel, I'm getting DID_BAD_TARGET on my new WD SN550 NVME SSD when I reboot and the system freezes.

That problem is resolved, but now on this kernel, all USB devices and SSDs get disconnected as soon as I start a VM that has a Coral TPU passed through to it.

Using Ryzen 5 5600G on a Gigabyte A520M H.

leesteken · Feb 6, 2022

Polyphemus said:
That problem is resolved, but now on this kernel, all USB devices and SSDs get disconnected as soon as I start a VM that has a Coral TPU passed through to it.

Sounds like the IOMMU groups have changed between the kernel versions. Did/do you use pcie_acs_override? Can you check your groups before and after?
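
One way to do that comparison is to snapshot the groups under each kernel and diff them. A minimal sketch, assuming the usual sysfs layout `/sys/kernel/iommu_groups/<group>/devices/<PCI address>`; the function names are made up for illustration. Group numbers can be renumbered between kernels, so the diff compares which devices share a group rather than the numbers themselves.

```python
from pathlib import Path


def list_iommu_groups(base="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    root = Path(base)
    if not root.is_dir():
        # Directory missing: the IOMMU is probably disabled or unsupported.
        return groups
    for group_dir in root.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


def group_mates(groups):
    """Map each device to the frozenset of devices sharing its group."""
    mates = {}
    for devices in groups.values():
        members = frozenset(devices)
        for device in devices:
            mates[device] = members
    return mates


def changed_devices(before, after):
    """Return devices whose set of group mates differs between two snapshots."""
    old, new = group_mates(before), group_mates(after)
    return sorted(d for d in old.keys() | new.keys() if old.get(d) != new.get(d))
```

Run `list_iommu_groups()` once under each kernel, keep the two results (for example as JSON), and feed both to `changed_devices`. An empty list means the grouping itself did not change.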

Polyphemus · Feb 6, 2022

avw said:
Sounds like the IOMMU groups have changed between the kernel versions. Did/do you use pcie_acs_override? Can you check your groups before and after?

No I did not use pcie_acs_override, should I?

I've now removed the NVMe SSD from the M.2 slot and placed my Coral in there. The Coral was previously mounted in a mini-PCIe to x1 adapter. I'm just reinstalling Proxmox to test whether it is a conflict between the x1 adapter and the NVMe slot. If that is not the case, I will reproduce the issue and check whether the IOMMU groups have changed.

leesteken · Feb 6, 2022

Polyphemus said:
No I did not use pcie_acs_override, should I?

I've now removed the NVMe SSD from the M.2 slot and placed my Coral in there. The Coral was previously mounted in a mini-PCIe to x1 adapter. I'm just reinstalling Proxmox to test whether it is a conflict between the x1 adapter and the NVMe slot. If that is not the case, I will reproduce the issue and check whether the IOMMU groups have changed.

The M.2 slot is provided by the CPU and I expect it to be in its own IOMMU group, which is what you want for passthrough. The PCIe x1 slots are probably part of the A520 chipset and in an IOMMU group together with SATA, most USB and more. I would not have thought that this worked in an earlier kernel version unless you used pcie_acs_override.

Polyphemus · Feb 6, 2022

avw said:
The M.2 slot is provided by the CPU and I expect it to be in its own IOMMU group, which is what you want for passthrough. The PCIe x1 slots are probably part of the A520 chipset and in an IOMMU group together with SATA, most USB and more. I would not have thought that this worked in an earlier kernel version unless you used pcie_acs_override.

Thank you for the clarification, I will report back

Polyphemus · Feb 6, 2022

avw said:
The M.2 slot is provided by the CPU and I expect it to be in its own IOMMU group, which is what you want for passthrough. The PCIe x1 slots are probably part of the A520 chipset and in an IOMMU group together with SATA, most USB and more. I would not have thought that this worked in an earlier kernel version unless you used pcie_acs_override.

Maybe I should open a new topic, but with or without pcie_acs_override (it does not make a difference), the Coral is in its own IOMMU group (04), and all USB/SSD/LAN devices are in separate groups, like 02 and 05.

But when I start the VM with the Coral in group 04 attached, the devices in groups 02 and 05 get disconnected...

t.lamprecht · Feb 7, 2022

avw said:
AMD RX570 (POLARIS10 according to vendor-reset) with Ryzen 2700X on X470. I have blacklisted amdgpu since 5.11.22-7-pve because it would no longer unload for passthrough, and used efifb instead to show boot messages. I'm not sure whether that patch is about efifb or amdgpu. I would prefer to use amdgpu but haven't since 5.11.22-5-pve.
EDIT: Using earlycon=efifb does display boot messages but takes minutes to display the first second or so and I did not have the patience.

I checked a bit more closely and noticed that 5.15.17-1-pve was the first build in which the SYSFB kernel config option is enabled by default, which we picked up because we had no explicit override in place.

Quoting the whole Kconfig option part, the note at the end is probably most relevant for your setup:

Code:

config SYSFB
        bool
        default y
        depends on X86 || EFI

config SYSFB_SIMPLEFB
        bool "Mark VGA/VBE/EFI FB as generic system framebuffer"
        depends on SYSFB
        help
          Firmwares often provide initial graphics framebuffers so the BIOS,
          bootloader or kernel can show basic video-output during boot for
          user-guidance and debugging. Historically, x86 used the VESA BIOS
          Extensions and EFI-framebuffers for this, which are mostly limited
          to x86 BIOS or EFI systems.
          This option, if enabled, marks VGA/VBE/EFI framebuffers as generic
          framebuffers so the new generic system-framebuffer drivers can be
          used instead. If the framebuffer is not compatible with the generic
          modes, it is advertised as fallback platform framebuffer so legacy
          drivers like efifb, vesafb and uvesafb can pick it up.
          If this option is not selected, all system framebuffers are always
          marked as fallback platform framebuffers as usual.

          Note: Legacy fbdev drivers, including vesafb, efifb, uvesafb, will
          not be able to pick up generic system framebuffers if this option
          is selected. You are highly encouraged to enable simplefb as
          replacement if you select this option. simplefb can correctly deal
          with generic system framebuffers. But you should still keep vesafb
          and others enabled as fallback if a system framebuffer is
          incompatible with simplefb.

          If unsure, say Y.

So it would seem that you need to use simplefb instead of the legacy efifb.
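
To see what a given kernel build was actually compiled with, the shipped config file can be checked. A rough sketch, assuming the config is installed as `/boot/config-<kernel release>` as on Debian-based systems; the helper names are illustrative, and `CONFIG_FB_SIMPLE` and `CONFIG_FB_EFI` are the upstream symbols for the simplefb and efifb drivers.

```python
import platform
import re

# Options relevant to the framebuffer handover discussed above.
OPTIONS = ("CONFIG_SYSFB", "CONFIG_SYSFB_SIMPLEFB", "CONFIG_FB_SIMPLE", "CONFIG_FB_EFI")

SET_LINE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_LINE = re.compile(r"^# (CONFIG_\w+) is not set$")


def read_options(config_text, names=OPTIONS):
    """Return {option: value}; "n" if explicitly unset, None if not mentioned."""
    found = {name: None for name in names}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        match = SET_LINE.match(line)
        if match and match.group(1) in found:
            found[match.group(1)] = match.group(2)
            continue
        match = UNSET_LINE.match(line)
        if match and match.group(1) in found:
            found[match.group(1)] = "n"
    return found


def read_running_kernel_options():
    """Read the options for the running kernel (path layout is an assumption)."""
    path = f"/boot/config-{platform.release()}"
    with open(path) as handle:
        return read_options(handle.read())
```

A value of `y` or `m` means built in or built as a module, `n` means explicitly disabled, and `None` means the symbol does not appear in that config at all.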

leesteken · Feb 7, 2022

t.lamprecht said:
So it would seem that you need to use simplefb instead of the legacy efifb.

Thanks for looking into this. Please excuse my ignorance, but how do I use simplefb?
I did not choose efifb explicitly, it was what the kernel happened to use to display the host console (when amdgpu is blacklisted). My kernel command line is simply: root=ZFS=rpool/ROOT/pve-1 boot=zfs kvm_amd.avic=1. And it just stopped displaying messages and console.
I will install the kernel again and see if there are errors about simplefb in the logs. Is there a way to (force a) fallback to efifb in the meantime?

EDIT: I found this in the syslog:

kernel: fbcon: Taking over console
kernel: pci 0000:0b:00.0: BAR 0: assigned to efifb

I do believe that last one interferes with the passthrough. I don't see any messages about simplefb, only fewer messages about efifb, and fb0 is not created when running the newer kernel. Even video=efifb:off does not get rid of that second log message.
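
For reference, this is roughly how the two kernels can be compared without a working console, for example over SSH. It is only a sketch with made-up helper names: it parses the text of `/proc/fb` (one "<number> <driver id>" line per registered framebuffer) and of `/proc/cmdline`, and it splits the command line on whitespace only, so quoted values containing spaces are not handled.

```python
def parse_proc_fb(text):
    """Map framebuffer number to driver id, given the content of /proc/fb."""
    devices = {}
    for line in text.splitlines():
        number, _, name = line.strip().partition(" ")
        if number.isdigit():
            devices[int(number)] = name
    return devices


def parse_cmdline(text):
    """Map each kernel parameter to the list of values it was given.

    Parameters without "=" get a single None entry. Splitting is on
    whitespace only (simplification, see the note above).
    """
    params = {}
    for token in text.split():
        key, separator, value = token.partition("=")
        params.setdefault(key, []).append(value if separator else None)
    return params


def read_text(path):
    """Small helper to read one of the /proc files as text."""
    with open(path) as handle:
        return handle.read()
```

Calling `parse_proc_fb(read_text("/proc/fb"))` under each kernel shows whether fb0 exists and which driver owns it, and `parse_cmdline(read_text("/proc/cmdline")).get("video")` confirms whether an option such as video=efifb:off actually reached the kernel.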
