Random 6.8.4-2-pve kernel crashes

So, basically, updating to 6.8.9 fixes all that freezing and crashing?
That's easier than I thought.
Yes

- either someone backports the fixes to PVE
- or upstream Debian updates to a more recent kernel (and doesn't break it with their own "things")

The problem is fixed, at least in vanilla >= 6.8.8 / 6.8.9.

Thank you, Proxmox, for letting me find this out myself. I'll tell my friends!

(Go and tell your customers what YOU know now!)
 
We are doing fine. Our main production servers are still on vSphere 6.7/7, as we were evaluating all sorts of "jumping the sinking Broadcom ship" solutions just before the PVE 8.2 release.

A few internal servers are all on PVE 8.1; as long as they don't update themselves, it's all OK.

I'd say we really dodged a bullet by inches this time.
 
I'd say we really dodged a bullet by inches this time.
You don't "need" 6.8.x at this point.

However.

If you get a 92-core EPYC, you can't wait.

It's more about how the situation was handled than the broken kernel itself. Even Linus had to roll back a kernel he had just released.

I was absolutely left out in the rain with a gigantic problem, and nobody cared.
 
You don't "need" 6.8.x at this point.

However.

If you get a 92-core EPYC, you can't wait.

It's more about how the situation was handled than the broken kernel itself. Even Linus had to roll back a kernel he had just released.

I was absolutely left out in the rain with a gigantic problem, and nobody cared.
I don't recall any news of Linux kernel 6.5 not properly supporting high core count EPYC servers.

I could be wrong on this one; 32 cores per socket is very much enough for us, and anything higher causes heat problems.
 
I don't recall any news of Linux kernel 6.5 not properly supporting high core count EPYC servers.

I could be wrong on this one; 32 cores per socket is very much enough for us, and anything higher causes heat problems.

As mentioned before: 2x 92 cores will be the new normal. There are patches for >= 512 cores.

In other words, a possibility to easily get a custom kernel for PVE (and the fact that eventually I'll have to make that happen): that's my takeaway from this bug.
 
Can anyone summarize the current state of the kernel crashes and potential fixes?
 
Can anyone summarize the current state of the kernel crashes and potential fixes?
All PVE kernel 6.8 versions are unstable. Keep clear of those. Stay with 8.1/kernel 6.5.
No ETA for stable PVE kernel 6.8.
No tested fix available.

If necessary, do a full offline reinstall to roll back to 8.1.
Here's the link to the 8.1 ISO:
https://enterprise.proxmox.com/iso/proxmox-ve_8.1-2.iso

I think that sums it up pretty well.
And I'm also pretty sure this post should be pinned at the top of this forum, so no more users get hurt by an unstable kernel.
 
I have a root server at Hetzner (AX41-NVMe, AMD Ryzen 5 3600). I have also suffered from the instabilities of kernel 6.8.4-2: host crashes, no longer responding. According to a Hetzner technician, my server showed no video output and didn't respond to any keystrokes.
Impossible to see anything in the logs apart from a whole series of unreadable characters.

I went back to kernel version 6.5.13-5 a few days ago, which got me back to a stable server. This was the last stable kernel my server ran before upgrading to 6.8.
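For anyone who wants to roll back to an older kernel without a full reinstall: a minimal sketch using `proxmox-boot-tool`. The version string `6.5.13-5-pve` is just an example; use one from your own kernel list. The pinning commands need root on a PVE host, so they are shown commented out.

```shell
# Show which kernel is currently running
uname -r

# On a PVE host (as root), list the kernels the bootloader knows about:
# proxmox-boot-tool kernel list

# Pin a known-good 6.5 kernel so it is booted by default
# (example version string; substitute one from the list above):
# proxmox-boot-tool kernel pin 6.5.13-5-pve
# reboot
```

After a reboot, `uname -r` should report the pinned version; `proxmox-boot-tool kernel unpin` reverts to the default.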
 
I keep having reboots after these messages, reverting to kernel 6.5 didn't help

Code:
May 10 18:17:01 CRON[15203]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 10 18:17:01 CRON[15204]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
May 10 18:17:01 CRON[15203]: pam_unix(cron:session): session closed for user root
 
I keep having reboots after these messages, reverting to kernel 6.5 didn't help

Code:
May 10 18:17:01 CRON[15203]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 10 18:17:01 CRON[15204]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
May 10 18:17:01 CRON[15203]: pam_unix(cron:session): session closed for user root
How is that a problem?
 
https://en.wikipedia.org/wiki/Cron

Read the section about "@hourly".

A totally normal operation, even for the Linux that runs on a Raspberry Pi Zero.
I know very well what cron is, and I also checked that no new job has been added to the system. Just FYI, I have been running Proxmox since v4.

The coincidence is, I get a reboot every single time after those cron jobs complete.

Now, either you are willing to come up with an idea, or just stop pretending to help with Wikipedia links.
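One way to settle the cron question: `run-parts --test` prints which scripts would be executed without actually running them, and a persistent journal makes it possible to see what happened right before the reboot. A sketch (the journald step needs root, so it is commented out):

```shell
# Print the scripts run-parts would execute from cron.hourly,
# without running any of them
run-parts --test /etc/cron.hourly

# Keep logs across reboots so the messages just before a crash survive
# (requires root; systemd-journald picks up the directory automatically):
# mkdir -p /var/log/journal
# systemctl restart systemd-journald
```

With a persistent journal, `journalctl -b -1 -e` shows the tail end of the previous boot, which is usually more telling than the cron session lines themselves.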
 
in case the Proxmox ISO doesn't even boot
I would recommend switching to something else, like another Linux distro + VirtualBox headless + phpVirtualBox.

The unseen fate has decided that PVE is not for that machine, for reasons we mortals cannot understand.
 
I would recommend switching to something else, like another Linux distro + VirtualBox headless + phpVirtualBox.

The unseen fate has decided that PVE is not for that machine, for reasons we mortals cannot understand.
I've seen that very often: the PVE ISO not booting.

To be honest, I've also seen Debian being "so stable" that the default kernel doesn't boot either.

That is actually my takeaway, from the bug and from Proxmox's reaction.

We need more (pve-) kernel options. I am working on that.
 
Has anyone with kernel crashes tried disabling Hyperthreading?

I have no kernel crashes here, but I do have issues related to the scheduler; it's not working how it should (I think). Still debugging the issue.
My issue is definitely related to HT, because everything that runs on HT cores gets only about 20-30% of the performance it would have on a physical core.
The CPU is otherwise idle, so those tasks should normally land on physical cores anyway, but the scheduler somehow prioritizes the HT cores, and those are ultra slow.

I'm not sure if the issue is related, but you could test whether your freezes/crashes go away with HT disabled.
I'm still debugging the root cause here.

The affected systems are two Genoa servers (both with a 9374F + 12x 64 GB DIMMs + 8x Micron 7450 MAX).
Turning Hyperthreading off doesn't hurt me much, because I still have 32 cores per server.

Cheers
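For anyone who wants to try this without touching the BIOS: on reasonably recent x86 kernels, SMT can be checked and toggled via sysfs. A sketch (the toggle needs root, so it is commented out; the sysfs path assumes a kernel built with SMT hotplug support):

```shell
# 1 = SMT/Hyperthreading active, 0 = inactive
if [ -r /sys/devices/system/cpu/smt/active ]; then
    cat /sys/devices/system/cpu/smt/active
else
    echo "SMT sysfs interface not available on this kernel"
fi

# Disable SMT at runtime, no reboot needed (requires root):
# echo off > /sys/devices/system/cpu/smt/control

# Or disable it permanently: add "nosmt" to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then run update-grub and reboot.
```

The runtime toggle offlines the sibling threads immediately, so it is a quick way to A/B-test whether freezes correlate with HT before committing to a boot parameter.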
 
