Opt-in Linux 6.14 Kernel for Proxmox VE 8 available on test & no-subscription

So did I understand correctly: if 6.14 shows no problems in the upcoming weeks, 6.11 will be replaced by 6.14 and I have to opt in to 6.14?
 
proxmox-kernel-6.11 is running without issues; proxmox-kernel-6.14 is not able to boot a VM with a passed-through controller (TrueNAS).

If I can provide logs, please tell me what you need so I can help find this possible bug.
The complete journal of a failing boot (`journalctl -b` after you tried to start the VM and it failed) would help as a starting point for debugging this.
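A hedged sketch of how one might capture and pre-filter such a journal before posting it; the journal content below is made up purely for illustration, and the filename is a placeholder:

```shell
# Illustrative only: on the host you would first capture the failing boot with
#   journalctl -b > failing-boot.journal.txt
# The sample below stands in for that file, just to show a useful pre-filter.
cat > failing-boot.journal.txt <<'EOF'
Apr 05 00:03:06 host kernel: vfio-pci 0000:01:00.0: enabling device (0400 -> 0402)
Apr 05 00:03:06 host kernel: usb 1-1: new high-speed USB device number 2
Apr 05 00:03:07 host kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0]
EOF
# Count the lines most relevant to PCI passthrough problems (vfio/IOMMU):
grep -icE 'vfio|iommu|amd-vi|dmar' failing-boot.journal.txt   # prints 2
```

Posting the full unfiltered journal is still preferable; the filter is just a quick way to check whether the failure left passthrough-related traces at all.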
 
So did I understand correctly: if 6.14 shows no problems in the upcoming weeks, 6.11 will be replaced by 6.14 and I have to opt in to 6.14?

Yes, to continue getting updates you will need to move from 6.11 to 6.14 sooner or later.
While we might release another 6.11 update, there is no active plan for that currently.

And for the record, the 6.8 kernel will continue to be supported for the lifetime of PVE 8, so if one uses that and sees no need for a newer kernel they can continue to use it just fine.
 
Yes, to continue getting updates you will need to move from 6.11 to 6.14 sooner or later.
While we might release another 6.11 update, there is no active plan for that currently.

And for the record, the 6.8 kernel will continue to be supported for the lifetime of PVE 8, so if one uses that and sees no need for a newer kernel they can continue to use it just fine.
Thanks! If a kernel update becomes available, will it be posted in this thread?
 
The complete journal of a failing boot (`journalctl -b` after you tried to start the VM and it failed) would help as a starting point for debugging this.
Sorry, in the meantime I've found a workaround.

My controller was previously passed through to the VM as a "Raw Device" (every option works 1:1 with 6.8 and 6.11). Now I've created a mapped device and pass that mapped device through to the VM instead, and the VM starts normally.
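For anyone hitting the same thing, the resulting difference in the VM config is roughly the following; the VM ID, PCI address and mapping name are examples, not taken from my setup:

```
# /etc/pve/qemu-server/<vmid>.conf
# raw PCI passthrough (what failed to boot here on 6.14):
hostpci0: 0000:01:00.0,pcie=1
# after creating a mapping under Datacenter -> Resource Mappings:
hostpci0: mapping=truenas-hba,pcie=1
```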
 
I will keep looking, but so far it appears that, similarly to 6.8, changes in vfio (and likely other files) have broken compatibility, so we need a patch for 16.9/17.X to restore it. So far there is no discussion of it on the PolloLoco - NVIDIA vGPU Guide page, and I don't see anything from the author of the 6.8 patch here: GreenDam

I will keep searching; hopefully they release something soon.
The latest driver supports 6.11 for 16.9/17.4; there are no plans that I know of to patch for 6.14. Maybe the new 18.x branch supports it, but given their changes I doubt it. You'd need to look at the NVIDIA site for release notes; they tend to only support Ubuntu LTS kernels.
 
Sorry, in the meantime I've found a workaround.

My controller was previously passed through to the VM as a "Raw Device" (every option works 1:1 with 6.8 and 6.11). Now I've created a mapped device and pass that mapped device through to the VM instead, and the VM starts normally.
I will switch from ESXi to Proxmox for a TrueNAS VM (with ESXi, passthrough of the LSI SAS controller works without issues).
How do you pass through your SAS controller to TrueNAS with Proxmox?
Thanks for any advice.
 
I have an EPYC Genoa (EPYC 9554) system that won't boot with the 6.14 kernel. The system is only a few months old, and has run without issue on both the 6.8 and 6.11 kernels. Looking through the `journalctl -b` logs after a failed boot, I see this:

Code:
Apr 05 00:03:06 proxmox-host kernel: BERT: Error records from previous boot:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: event severity: fatal
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:  Error 0, type: fatal
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:  fru_text: ProcessorError
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:   Local APIC_ID: 0x1c
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:   CPUID Info:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:   00000000: 00a10f11 00000000 1c800800 00000000
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:   00000010: 76fa320b 00000000 178bfbff 00000000
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:   Error Information Structure 0:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    Error Structure Type: cache error
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    Check Information: 0x000000000602001f
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:     Transaction Type: 2, Generic
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:     Operation: 0, generic error
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:     Level: 0
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:     Processor Context Corrupt: true
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:     Uncorrected: true
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:   Context Information Structure 0:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    Register Array Size: 0x0050
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    MSR Address: 0xc0002051
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:   Context Information Structure 1:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    Register Context Type: Unclassified Data
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    Register Array Size: 0x0030
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    Register Array:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    00000000: 00000010 00000000 1c3010c0 fffffffe
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    00000010: 00000011 00000000 cb300024 00000000
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]:    00000020: 00000017 00000000 cb300024 00000000
Apr 05 00:03:06 proxmox-host kernel: BERT: Total records found: 1
Apr 05 00:03:06 proxmox-host kernel: mce: [Hardware Error]: Machine check events logged
Apr 05 00:03:06 proxmox-host kernel: PM:   Magic number: 5:744:8
Apr 05 00:03:06 proxmox-host kernel: mce: [Hardware Error]: CPU 54: Machine Check: 0 Bank 5: aea0000000000108
Apr 05 00:03:06 proxmox-host kernel: clockevents clockevent82: hash matches
Apr 05 00:03:06 proxmox-host kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffffc04dc0ee MISC d0140ff600000000 PPIN 2b0c17e59888012 SYND 4d000000 IPID 500>
Apr 05 00:03:06 proxmox-host kernel: memory memory52: hash matches
Apr 05 00:03:06 proxmox-host kernel: mce: [Hardware Error]: PROCESSOR 2:a10f11 TIME 1743825773 SOCKET 0 APIC 1c microcode a101148
Apr 05 00:03:06 proxmox-host kernel: RAS: Correctable Errors collector initialized.

I'm skeptical about this being a hardware issue with the CPU. After I saw these errors, I ran stress-ng while booted on the 6.11 kernel for 6+ hours on all cores without any problems. Here are some additional stats about my system that may be relevant:

Proxmox VE 8.3.5
Mirrored ZFS root on 2x SAMSUNG MZ7LH960
CPU: AMD EPYC 9554
Motherboard: ASRock Rack GENOAD8X-2T/BCM
Memory: 8x64GB MICRON DDR5 RDIMM
PCIe: LSI 9300-8I SAS3008, 2x RTX 4090
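For reference, the top (architectural) bits of that Bank 5 status value can be decoded with a small shell loop; the value below is copied from the journal above, and the bit names/positions are the standard x86 `MCi_STATUS` ones:

```shell
# Decode the architectural status bits of the MCA value logged above.
# VAL/OVER/UC/EN/MISCV/ADDRV/PCC are the standard MCi_STATUS bit positions.
status=$((0xaea0000000000108))
for field in 63:VAL 62:OVER 61:UC 60:EN 59:MISCV 58:ADDRV 57:PCC; do
  bit=${field%%:*}
  name=${field##*:}
  printf '%s=%s\n' "$name" "$(( (status >> bit) & 1 ))"
done
# UC=1 and PCC=1 match the "Uncorrected: true" and
# "Processor Context Corrupt: true" lines in the BERT record.
```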
 
I will keep looking, but so far it appears that, similarly to 6.8, changes in vfio (and likely other files) have broken compatibility, so we need a patch for 16.9/17.X to restore it. So far there is no discussion of it on the PolloLoco - NVIDIA vGPU Guide page, and I don't see anything from the author of the 6.8 patch here: GreenDam

I will keep searching; hopefully they release something soon.
I made a patch for kernel 6.14.0-1-pve:
https://gitlab.com/GreenDamTan/vgpu....12_also_6.14/550.144.02.patch?ref_type=heads
It seems to work now.
If it works on other Proxmox servers, I will make a pull request to the PolloLoco repo.
(screenshot attached: 6.14.0-1-pve.png)
 


BERT: Error records from previous boot:

BERT is an ACPI table and the kernel only reads it.
That said, the newer kernel could use instructions or code paths that trigger issues that were not present on the older kernel, so it might correlate.

I'd check for firmware updates and potentially talk with your system vendor; if it is indeed the hardware, it might not be something that gets triggered by pure load like stress-ng but something more subtle.

FWIW, I have an EPYC 9475F (Turin) based test system here that shows no BERT records triggered by booting 6.14. While it's a different CPU generation and so definitely not 1:1 comparable, it at least suggests this is not something that happens on recent EPYC generations in general.
 
BERT is an ACPI table and the kernel only reads it.
That said, the newer kernel could use instructions or code paths that trigger issues that were not present on the older kernel, so it might correlate.

I'd check for firmware updates and potentially talk with your system vendor; if it is indeed the hardware, it might not be something that gets triggered by pure load like stress-ng but something more subtle.

FWIW, I have an EPYC 9475F (Turin) based test system here that shows no BERT records triggered by booting 6.14. While it's a different CPU generation and so definitely not 1:1 comparable, it at least suggests this is not something that happens on recent EPYC generations in general.
I appreciate the added context. I was able to successfully boot with 6.14 using the `mce=off` kernel command line parameter, but this isn't ideal, so I've reverted to 6.11 for now. I'll reach out to ASRock Rack this week and see if there might be a newer BIOS update available that isn't listed on their website.
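For completeness, in case anyone wants to reproduce the `mce=off` test (diagnostic only, not recommended for production, since it disables machine-check reporting): the parameter goes on the kernel command line, and where you set it depends on the bootloader. The paths below are the standard PVE 8 locations; the ZFS root dataset name is an example:

```
# GRUB: edit /etc/default/grub, then run update-grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet mce=off"

# systemd-boot (typical for ZFS-on-root installs): edit /etc/kernel/cmdline,
# then run proxmox-boot-tool refresh
root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet mce=off
```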