The complete journal of a failing boot (`journalctl -b` after you tried to start the VM and it fails) would help to begin debugging this.

proxmox-kernel-6.11 runs without issues; proxmox-kernel-6.14 is not able to boot a VM with a passed-through controller (TrueNAS).
If I can provide logs, please tell me what you need so I can help track down this possible bug.
So, did I understand correctly: if 6.14 shows no problems over the coming weeks, 6.11 will be replaced by 6.14, and I will have to opt in to 6.14?
Thanks! If an update of the kernel is available, will it be posted in this thread?

Yes, to continue getting updates you will need to move from 6.11 to 6.14 sooner or later.
While we might release another 6.11 update, there is no active plan for that currently.
And for the record, the 6.8 kernel will continue to be supported for the lifetime of PVE 8, so if one uses that and sees no need for a newer kernel they can continue to use it just fine.
The latest driver supports 6.11 for 16.9/17.4, and there are no plans that I know of to patch for 6.14. Maybe the new 18.x branch supports it; as for getting around their changes, I suspect not. You'd need to look at the NVIDIA site for release notes; they tend to only support Ubuntu LTS kernels.

I will keep looking, but so far it appears that, similarly to 6.8, changes in vfio (and likely other files) have broken compatibility, so we need a patch for 16.9/17.X to restore it. So far there is no discussion of it on the PolloLoco - NVIDIA vGPU Guide page, and I do not see anything from the producer of the 6.8 patch here: GreenDam
I will keep searching; hopefully they release something soon.
I will switch from ESXi to Proxmox for a true (with ESXi, passthrough of the LSI SAS works without issues).

Sorry, in the meantime I've found a workaround.
My controller was previously passed through to the VM only as a "Raw Device"; every option works 1:1 with 6.8 and 6.11. Now I've created a mapped device and pass this mapped device through to the VM, and the VM starts normally.
In the `journalctl -b` logs after a failed boot, I see this:

Apr 05 00:03:06 proxmox-host kernel: BERT: Error records from previous boot:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: event severity: fatal
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Error 0, type: fatal
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: fru_text: ProcessorError
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: section_type: IA32/X64 processor error
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Local APIC_ID: 0x1c
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: CPUID Info:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: 00000000: 00a10f11 00000000 1c800800 00000000
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: 00000010: 76fa320b 00000000 178bfbff 00000000
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Error Information Structure 0:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Error Structure Type: cache error
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Check Information: 0x000000000602001f
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Transaction Type: 2, Generic
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Operation: 0, generic error
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Level: 0
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Processor Context Corrupt: true
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Uncorrected: true
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Context Information Structure 0:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Register Array Size: 0x0050
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: MSR Address: 0xc0002051
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Context Information Structure 1:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Register Context Type: Unclassified Data
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Register Array Size: 0x0030
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: Register Array:
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: 00000000: 00000010 00000000 1c3010c0 fffffffe
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: 00000010: 00000011 00000000 cb300024 00000000
Apr 05 00:03:06 proxmox-host kernel: [Hardware Error]: 00000020: 00000017 00000000 cb300024 00000000
Apr 05 00:03:06 proxmox-host kernel: BERT: Total records found: 1
Apr 05 00:03:06 proxmox-host kernel: mce: [Hardware Error]: Machine check events logged
Apr 05 00:03:06 proxmox-host kernel: PM: Magic number: 5:744:8
Apr 05 00:03:06 proxmox-host kernel: mce: [Hardware Error]: CPU 54: Machine Check: 0 Bank 5: aea0000000000108
Apr 05 00:03:06 proxmox-host kernel: clockevents clockevent82: hash matches
Apr 05 00:03:06 proxmox-host kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffffc04dc0ee MISC d0140ff600000000 PPIN 2b0c17e59888012 SYND 4d000000 IPID 500>
Apr 05 00:03:06 proxmox-host kernel: memory memory52: hash matches
Apr 05 00:03:06 proxmox-host kernel: mce: [Hardware Error]: PROCESSOR 2:a10f11 TIME 1743825773 SOCKET 0 APIC 1c microcode a101148
Apr 05 00:03:06 proxmox-host kernel: RAS: Correctable Errors collector initialized.
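
For readers who want to cross-check the `mce:` lines in the log above, here is a minimal Python sketch that pulls the CPU, bank, and status word out of such a line and names the upper status flags. It assumes the standard x86 MCA status layout for bits 63 to 57 and that the `TIME` field is wall-clock seconds since the epoch; the function names `decode_mce_line` and `mce_time_utc` are made up for this illustration and are not part of any tool.

```python
import re
from datetime import datetime, timezone

# Upper flag bits of an x86 MCA status register (assumed layout, bits 63..57).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Matches e.g. "CPU 54: Machine Check: 0 Bank 5: aea0000000000108"
MCE_LINE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]{16})")


def decode_mce_line(line):
    """Return cpu, bank, raw status and the set flag names, or None if no match."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    status = int(match.group(3), 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "cpu": int(match.group(1)),
        "bank": int(match.group(2)),
        "status": status,
        "flags": flags,
    }


def mce_time_utc(seconds):
    """Convert the TIME field of an mce line to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


line = ("Apr 05 00:03:06 proxmox-host kernel: mce: [Hardware Error]: "
        "CPU 54: Machine Check: 0 Bank 5: aea0000000000108")
print(decode_mce_line(line)["flags"])  # ['VAL', 'UC', 'MISCV', 'ADDRV', 'PCC']
print(mce_time_utc(1743825773))        # 2025-04-05T04:02:53+00:00
```

The UC and PCC flags line up with the "Uncorrected: true" and "Processor Context Corrupt: true" fields of the BERT record, so both records appear to describe the same event. The lower bits of the status word are vendor-specific and are deliberately not interpreted here.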
I ran `stress-ng` while booted on the 6.11 kernel for 6+ hours on all cores without any problems. Here are some additional stats about my system that may be relevant:

I made a patch for kernel 6.14.0-1-pve.
BERT: Error records from previous boot:
BERT is an ACPI table and the kernel only reads it.
That said, the newer kernel could use instructions or code paths that trigger issues that were not present on the older kernel, so it might correlate.
I'd check for firmware updates and potentially talk with your system vendor; if it is indeed the hardware, it might not be something that gets triggered by pure load like stress-ng, but something more subtle.
FWIW, I've got an EPYC 9475F Turin-based test system here that has no BERT records triggered by booting 6.14. While it's a different CPU generation and so definitely not 1:1 comparable, it's at least not something that happens on recent EPYC generations in general.

I appreciate the added context. I was able to successfully boot with 6.14 using the `mce=off` kernel command line param, but this isn't ideal, so I've reverted to 6.11 for now. I'll reach out to ASRock Rack this week and see if there might be a newer BIOS update available that isn't listed on their website.