Opt-in Linux Kernel 5.15 for Proxmox VE 7.x available

Please update the microcode. I also had PSODs on ESXi 7 U3 with an old BIOS on Epyc Rome and Milan. After the updates everything runs fine.
Have you activated nested virtualization?
Because all the other machines work fine

However even after the update I have the same problem.

After about a day of operation the VM with nested virtualization begins to freeze.
 
The next Proxmox VE point release 7.2 (~ Q2/2022) will presumably use the 5.15 based Linux kernel.
You can test this kernel now by installing the meta-package that pulls in the latest 5.15 kernel:

Code:
apt update && apt install pve-kernel-5.15

It's not required to enable the pvetest repository, the opt-in kernel package is available on all repositories.

We invite you to test your hardware with this kernel, and we are thankful for receiving your feedback.

Please note that while we are trying to provide a stable experience with the Opt-in kernel 5.15, updates for this kernel may appear less frequently until Proxmox projects actually switch to it as their new default.
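After installing the meta-package and rebooting, it can be worth confirming that the host really runs a 5.15 kernel. The following is only a minimal sketch; the helper name is_515_kernel is made up for illustration and is not part of any Proxmox tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given kernel release string
# (as printed by `uname -r`) belongs to the 5.15 series.
is_515_kernel() {
    case "$1" in
        5.15.*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Compare against the kernel that is actually running right now.
if is_515_kernel "$(uname -r)"; then
    echo "running a 5.15 kernel: $(uname -r)"
else
    echo "currently on $(uname -r); reboot or check the boot entries"
fi
```

Installing the package alone does not switch kernels; the new one is only used after a reboot.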
Thanks, you saved me (after about 10 hours of struggling with six Dell PowerEdge servers)! This fixed my IOMMU issue. Thanks a lot!
Do I have to fear that doing an apt dist-upgrade could be a bad idea in general?
 
No, apt update plus apt dist-upgrade (or the alias apt full-upgrade) is actually the recommended way to upgrade within a major release. Just mentioning it to avoid possible confusion: for upgrades between different major releases (6.x to 7.x), always check our respective upgrade how-to.
 
Glad you clarified it. So is the full-upgrade (aka dist-upgrade) only needed when the major release number goes from 6 to 7, 7 to 8, and so forth? Just to be sure!

Btw, thx for your quick answer ;)
 
@t.lamprecht will confirm, but apt update plus apt dist-upgrade (or the alias apt full-upgrade) is how you should always upgrade Proxmox within a major release, e.g. going from 6.3->6.4, 7.1.2->7.1.3 or 7.1->7.2. So really never use just apt update plus apt upgrade.

Going from 6->7 or 7->8 may require more than a simple apt update plus apt dist-upgrade (or the alias apt full-upgrade), and you should check the Proxmox upgrade how-to for that new release.
 
I have issues with a Dell PowerEdge 140 ("Cannot import rpool"). I struggled with this problem for many hours; the current tips (manually import from the initramfs, set the boot waiting time to 5 sec, disable IOMMU) did not work for me.
I cannot provide more information at the moment, because I have no active system here.
The problem occurs on all tested 5.15 versions; the PVE 5.13 kernel works as expected.
 
What's your ZFS pool disk layout (zpool status on the working 5.13)? Do you use a RAID controller (even if in something like HBA mode) for the ZFS disks? What's the proxmox-boot-tool status output (a failure there doesn't necessarily mean something is wrong, just out of interest)?
 
Yeah, it seems that did not get magically fixed, and support for hardware that's more than a decade old (>= 15 years in this case) tends to have a higher chance of breaking over time. FWIW, the 5.13 kernel worked here on a platform with an Intel Q6600 (~ same era); I did not check 5.15 yet, as that machine only gets powered on for specific tests (it eats way too much power otherwise). So this still seems Dell specific, and I'd recommend contacting their support (if it still exists for that product) or keeping ACPI disabled.

I don't think Dell will offer any support for a Dell PE2950 ;-)

If you want to document this Dell PE2950 problem with PVE 7.1 and later somewhere, I've found 2 workarounds:

1> Disabling ACPI

Disable ACPI at the kernel level:
$ grep acpi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="acpi=off"

Then run update-grub after the modification

Problem: shutdown -h may not work anymore

2> Disabling APIC (better solution)

Switch the "Demand-Based Power Management" parameter to "Disabled" in the BIOS.

Disable APIC at the kernel level:
$ grep apic /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="noapic"

Then run update-grub after the modification

I haven't found any side effects yet with the second workaround.
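Both workarounds come down to adding one parameter to GRUB_CMDLINE_LINUX_DEFAULT and regenerating the GRUB config. A rough sketch of doing that without hand-editing follows; add_kernel_param is a hypothetical helper, it relies on GNU sed (as shipped with Debian), and it assumes the parameter contains no "/" or regex metacharacters. It only applies to hosts that boot via GRUB.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT
# in the given file (default: /etc/default/grub) unless it is already there.
add_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"

    # Nothing to do if the parameter is already on the line.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi

    # First expression handles an empty value, second appends to existing ones.
    sed -i -E \
        -e "s/^(GRUB_CMDLINE_LINUX_DEFAULT=\")\"/\1${param}\"/;t" \
        -e "s/^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"/\1 ${param}\"/" \
        "$file"
}

# Typical use (as root), followed by regenerating the GRUB config:
#   add_kernel_param noapic && update-grub
```

Keeping a backup copy of /etc/default/grub before modifying it is a sensible precaution, since a bad kernel command line can leave the host unbootable.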
 
Have you activated nested virtualization?
Because all the other machines work fine

However even after the update I have the same problem.

After about a day of operation the VM with nested virtualization begins to freeze.
This is less related to this kernel and more to AMD. In fact, 5.15 works even better on AMD than 5.13 with nested virtualization. There seems to be some bug in certain operations in Windows 11 as well as in Windows Server 2022 running Hyper-V; however, running Windows Server 2022 as a Core installation (without graphical user interface) works very well with Hyper-V in it. Mine has now been running for 2 months without any issue, so perhaps you can try it with Server Core, or you would be better off switching to Intel.
 
In fact, 5.15 works even better on AMD than 5.13 with nested virtualization.
I have this problem on 5.13.

There seems to be some bug in certain operations in Windows 11 as well as in Windows Server 2022 running Hyper-V; however, running Windows Server 2022 as a Core installation (without graphical user interface) works very well with Hyper-V in it.
This confirms that the problem lies in the proxmox - AMD interaction.

Mine has now been running for 2 months without any issue, so perhaps you can try it with Server Core, or you would be better off switching to Intel.
Both are out of the question.
 
This confirms that the problem lies in the proxmox - AMD interaction.
No, on the contrary. As nested Proxmox VE instances and other OSes generally work fine, it rather seems like an issue with Hyper-V in some operation modes in nested AMD environments. So I'd ask the Hyper-V/AMD support channels.
 
So the problem could be limited to nested Hyper-V (on AMD environments)?

I have updated to the latest version of the BIOS and I still have the same problem.
After almost exactly one day the machine freezes.
I also tried it with a Proxmox installation from scratch, configuring only that VM.
 
Do you have the latest microcode updates and BIOS/firmware updates installed?
Now yes, latest BIOS and latest microcode, following these instructions:
https://wiki.debian.org/Microcode#Debian_10_.22Buster.22_.28stable.29

What's the level 1 hypervisor OS?
Windows 11 Pro; I will try Windows Server 2022.

The VM config would be interesting too. Did you check both the level 0 and level 1 hypervisor logs for errors?
Please tell me where to find the Proxmox logs so maybe I can be more helpful in identifying the problem.
 
On a few of my Dell servers I can't get any of the kernels to work properly. I can't pass through my PERC HBA controller to TrueNAS Core (or Scale). It hangs in different ways depending on the options I try.
Basically, I'm playing with 3 servers atm:
r830 using PERC H730p in HBA mode => NOTHING WORKS (grub or uefi)
r730 using PERC H730p in HBA mode => WORKS (grub)
r730xd using PERC H730 mini in HBA mode => NOTHING WORKS (grub or uefi)

I think I've re-installed Proxmox more than 20 times on these servers over the last few days.

Fun fact: it worked on my r730, which boots via GRUB. Nothing I tried with the 2 other servers in BIOS/GRUB mode worked, so the r730xd and r830 are currently set up with UEFI (with the usual settings, i.e. /etc/kernel/cmdline and pve-efiboot-tool refresh).

My "VM use-case" to test the PCI pass-through is the latest TrueNAS Core. For my r730 with GRUB, nothing special: I've added a PCI device, which is my PERC H730p.
Code:
root@antares:~# lspci -vmmnn |grep RAID
Class:    RAID bus controller [0104]
Device:    MegaRAID SAS-3 3108 [Invader] [005d]

So far, all 3 of my servers are up to date according to Dell's BIOS and firmware catalog (using the Lifecycle Controller). I've verified that ALL my BIOS and device settings are the same as on the r730, but the r730xd and the r830 don't work. They hang the whole server when I start TrueNAS with the PCI pass-through added. Without pass-through there is no problem (but no drives).

On all 3 machines, everything seems to be enabled:
'find /sys/kernel/iommu_groups/ -type l' outputs the list accordingly
IOMMU is enabled, etc. etc.
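Beyond checking that the groups exist, it can help to see which devices share a group with the controller, since pass-through generally works per IOMMU group. A small sketch that prints one line per device; list_iommu_groups is a made-up name, and the optional root argument exists only so the function can be tried against a test directory instead of sysfs.

```shell
#!/bin/sh
# Hypothetical helper: print "group <n>: <pci-address>" for every device
# found under the IOMMU groups directory (default: the real sysfs path).
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        # Skip the unexpanded glob when no groups exist.
        [ -e "$dev" ] || continue
        group="${dev#"$root"/}"   # strip the root prefix -> "<n>/devices/<addr>"
        group="${group%%/*}"      # keep only the group number
        echo "group $group: ${dev##*/}"
    done
}

# On the host, the controller from this post would be looked up like this:
#   list_iommu_groups | grep '03:00.0'
```

If the controller shares its group with other devices, that alone could explain differences between otherwise identical machines, though nothing in the thread confirms this is the cause here.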

Here are the kernels I've tried so far:
apt update && apt install pve-kernel-5.15
apt update && apt install pve-kernel-5.15.7-1-pve
apt update && apt install pve-kernel-5.15.19-1-pve
apt update && apt install pve-kernel-5.15.30-1-pve

I can't believe I've spent so much time on this. At one point I stumbled over a typo (intel=iommu=on instead of the correct intel_iommu=on), but that was just a typo in my notes. I've found very little documentation for my situation.
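A typo like intel=iommu=on is easy to miss, so one option is to check the command line the running kernel actually booted with. This is a sketch only; has_intel_iommu_on is a hypothetical helper that looks for the exact token in a command-line string.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the exact token "intel_iommu=on"
# appears as a separate word in the given kernel command line.
has_intel_iommu_on() {
    case " $1 " in
        *" intel_iommu=on "*) return 0 ;;
        *)                    return 1 ;;
    esac
}

# Check what the running kernel was booted with, if /proc is available.
if [ -r /proc/cmdline ]; then
    if has_intel_iommu_on "$(cat /proc/cmdline)"; then
        echo "intel_iommu=on is active"
    else
        echo "intel_iommu=on not found on the kernel command line"
    fi
fi
```

This only checks the spelling of the parameter; whether the IOMMU is really in use still has to be confirmed from the kernel log and the IOMMU groups.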

To conclude, here is the lspci output from the Proxmox console, followed by the Dell iDRAC inventory report, for the 3 controllers:

R730
03:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
PERC H730P Mini 25.5.9.0001

R830
03:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
PERC H730P Adapter 25.5.9.0001

R730xd
03:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
PERC H730 Mini 25.5.9.0001

Do we need to wait for a fix? I feel like I've hit a dead end here.

Is there a log I can share to help debug this? I would like to use tail -f /var/log/syslog, but the system hangs too early. The only thing I grabbed a few days ago was from the iDRAC remote console window (see attached img).
 

Attachments

  • Screenshot from 2022-04-01 18-29-16.png
  • Screenshot from 2022-04-02 14-49-59.png
On a few of my Dell servers I can't get any of the kernels to work properly. I can't pass through my PERC HBA controller to TrueNAS Core (or Scale). It hangs in different ways depending on the options I try.
To be clear, with working you mean that PCI(e) pass-through is working or not working, but not that the PVE kernel doesn't even boot or there are other errors not related to PCI pass-through? Also, did it work with an older kernel (e.g., 5.13 or 5.11 series)?
 
Has anyone gotten any NVIDIA vGPU driver to work with this kernel? I'm getting endless build errors about vfio frame sizes and mdev_device errors.
NVIDIA's officially supported KVM hypervisors (Red Hat, SUSE Enterprise Server) are all running some 4.x kernel...
I've tried different driver versions but no dice... the 5.15 patch also fails with multiple errors...

Bash:
/var/lib/dkms/nvidia/470.82/build/nvidia/nv-mmap.c:324:9: note: here
  324 |         default:
      |         ^~~~~~~
  CC [M]  /var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/nvidia-vgpu-vfio.o
  CC [M]  /var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/vgpu-devices.o
  CC [M]  /var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/nv-pci-table.o
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/nvidia-vgpu-vfio.c:207:6: error: 'struct mdev_parent_ops' has no member named 'open'
  207 |     .open             = nv_vgpu_vfio_open,
      |      ^~~~
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/nvidia-vgpu-vfio.c:207:25: error: initialization of 'long int (*)(struct mdev_device *, unsigned int,  long unsigned int)' from incompa>
  207 |     .open             = nv_vgpu_vfio_open,
      |                         ^~~~~~~~~~~~~~~~~
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/nvidia-vgpu-vfio.c:207:25: note: (near initialization for 'vgpu_fops.ioctl')
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/nvidia-vgpu-vfio.c:208:6: error: 'struct mdev_parent_ops' has no member named 'release'
  208 |     .release          = nv_vgpu_vfio_close,
      |      ^~~~~~~
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/nvidia-vgpu-vfio.c:208:25: error: initialization of 'int (*)(struct mdev_device *, struct vm_area_struct *)' from incompatible pointer >
  208 |     .release          = nv_vgpu_vfio_close,
      |                         ^~~~~~~~~~~~~~~~~~
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/nvidia-vgpu-vfio.c:208:25: note: (near initialization for 'vgpu_fops.mmap')
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:285: /var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/nvidia-vgpu-vfio.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/vgpu-devices.c: In function 'nv_vfio_vgpu_get_attach_device':
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/vgpu-devices.c:729:1: warning: the frame size of 1040 bytes is larger than 1024 bytes [-Wframe-larger-than=]
  729 | }
      | ^
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/vgpu-devices.c: In function 'nv_vgpu_dev_ioctl':
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/vgpu-devices.c:356:1: warning: the frame size of 1120 bytes is larger than 1024 bytes [-Wframe-larger-than=]
  356 | }
      | ^
make[1]: *** [Makefile:1875: /var/lib/dkms/nvidia/470.82/build] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.15.30-1-pve'

(Same errors on the 510.47 drivers)

I really don't want to be forced to esxi or Citrix Hypervisor...(or Kernel 5.4...)

Edit: Well, I'll be damned. I tried the Ubuntu .deb and it just worked on kernel 5.13.
Still, as kernel 5.15 is around the corner with 7.2, it would be nice if somebody gets it running.
 
To be clear, with working you mean that PCI(e) pass-through is working or not working, but not that the PVE kernel doesn't even boot or there are other errors not related to PCI pass-through? Also, did it work with an older kernel (e.g., 5.13 or 5.11 series)?
I've figured it out: there is no way I can host Proxmox on a disk that is attached to the HBA controller (PERC RAID in HBA mode). I found a way to install a SATA SSD inside the server, and then everything seems to be working fine.

The thing is that with kernel 5.15 and up, I could at least add the PCI pass-through for my controller, but both TrueNAS Scale and Core crashed the whole OS (=> hard reboot).

Hopefully some day the kernel will be more flexible and make it possible to run everything off the PERC controller in HBA mode.

My other servers were already using a SATA SSD for Proxmox. That's pretty much it.
In the end, I don't think I've contributed much to this thread ...
 
/var/lib/dkms/nvidia/470.82/build/nvidia-vgpu-vfio/nvidia-vgpu-vfio.c:207:6: error: 'struct mdev_parent_ops' has no member named 'open' 207 | .open = nv_vgpu_vfio_open, | ^~~~
The nvidia driver, or at least that version is not compatible with newer kernels.

For example, in 5.13 that referenced struct indeed still has an open member:
https://elixir.bootlin.com/linux/v5.13/source/include/linux/mdev.h#L104
But it was seemingly restructured between that and 5.15:
https://elixir.bootlin.com/linux/v5.15/source/include/linux/mdev.h#L100
Which is fine for the kernel to do: this is internal Linux kernel stuff that only out-of-tree modules depend on, not userspace programs; otherwise kernel development would come to a complete standstill.

Anyhow, check if there's a newer driver version that can cope with newer kernels or holler at NVIDIA, there's nothing any other entity can do...
 
Has anyone had any luck with PCIe passthrough in the 5.15 kernel, specifically passing through an NVIDIA GPU? I am still on the 5.13.19 kernel, but am interested in returning to the 5.15 kernel.

My last attempts resulted in my LSI card (flashed into IT mode) and/or the NVIDIA GPU causing issues on both the host and the guest OSes.

Thanks!