Kernel 6.8.4-2 causes random server freezing


This thread is dedicated to the issue where the server just freezes without logging anything.

If the kernel does print error messages when the server crashes, there is a separate thread for that: https://forum.proxmox.com/threads/random-6-8-4-2-pve-kernel-crashes.145760

This is also not the AMD GPU related issue described in https://gitlab.freedesktop.org/drm/amd/-/issues/3173

(The problems may be related, however... but who knows.)




Hi,
I updated an 8-server cluster to Proxmox 8.2 on 28 April.
On 1 May I had the first problems: 3 of the 8 servers got stuck.
Then, after a couple of hours, another server froze, and so on.
Not even six hours pass before I have to reboot one of the servers.
(And if I'm unlucky and don't restart the server quickly,
another 1 or 2 servers freeze in the meantime and the cluster dies,
because we are using Ceph on all 8 nodes.)


The server is in a "frozen" state.
It still displays the login prompt on the connected monitor, and nothing else,
but it does not react to a USB keyboard; even Num Lock does not work.
There is no segfault or any other message in dmesg or syslog.
I have to hard reset the frozen node.

The server is an ASUS RS500A-E11 with an AMD Epyc Milan series CPU (motherboard KMPA-U16).
It is not plausible that all the servers suddenly have hardware problems at the same time.
All packages are updated, and I have the latest BIOS, released a few months ago.

Today I will try to downgrade the kernel to 6.5.
Any other ideas? Thank you in advance.
 
After many attempts:
* replacing PSU, CPU, RAM, etc.
* trying different kernel parameters
* checking IPMI and the system for any errors and logs
* checking all power cables and upgrading the UPS
* updating BIOS, firmware and BMC/IPMI
* trying different BIOS options
* installing a new version of amd-firmware

nothing really helped.

Yesterday I pinned the kernel version to 6.5,
and so far there has been no problem.

The funny thing is that the system just freezes: no logs, no kernel crash, nothing.
And we are experiencing this problem on 12 different AMD Epyc servers.
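
After pinning, it is worth confirming on each node that the pinned kernel is really the one running after the reboot. Below is a minimal Python sketch for that check. It assumes release strings of the form reported by `uname -r` (for example `6.8.4-2-pve`), and the helper names `kernel_series` and `runs_series` are made up for illustration. The pinning itself is normally done with `proxmox-boot-tool kernel pin`; check its help output for the exact syntax on your version.

```python
import platform
import re
from typing import Tuple


def kernel_series(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a kernel release string such as '6.8.4-2-pve'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def runs_series(release: str, series: Tuple[int, int]) -> bool:
    """Return True if the release belongs to the given (major, minor) series."""
    return kernel_series(release) == series


def running_release() -> str:
    """Release string of the currently running kernel (same as `uname -r` on Linux)."""
    return platform.release()
```

On a node, `runs_series(running_release(), (6, 5))` should be true once the pin has taken effect.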
 
An even funnier thing is that PVE claims it's using the Ubuntu kernel, but I couldn't find anything about random kernel freezes/crashes/panics on Ubuntu 24.04, which also uses kernel 6.8.

Meanwhile PVE 8.2 is freezing and crashing left and right here. And we don't see any explanation of what is going wrong with the kernel or when a fix will be available, just people telling each other to roll back to kernel 6.5.

If the devs have no idea what is going on, pull kernel 6.8 and make 6.5 the default again.

At least ESXi had the guts to admit that something was wrong and that they had no idea how to fix it, so they pulled the update.
 
We need to know what changed in kernel 6.8
and when it will be safe to go upstream again.
Are there any Proxmox-specific kernel patches?


I tried kernel 6.8 with the following parameter sets, all without success:
* IOMMU disabled: amd_iommu=off iommu=off
* default, no parameters
* Ceph optimizations: amd_iommu=on iommu=pt pcie_aspm=off
* limited C-states (deeper C-states can cause this kind of freezing): idle=poll processor.max_cstate=0
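
When testing several parameter sets like these, it is easy to end up on a node where the intended parameters never reached the running kernel. The following Python sketch parses a kernel command line and reports which expected parameters are missing or have a different value. The function names are made up for illustration, and quoted values containing spaces are not handled.

```python
from pathlib import Path
from typing import Dict, List, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {parameter: value}.

    Flags without '=' (e.g. 'quiet') map to None. Splitting on whitespace
    is a simplification: quoted values with spaces are not supported.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline: str, expected: Dict[str, Optional[str]]) -> List[str]:
    """Return the expected parameters that are absent or set to another value."""
    active = parse_cmdline(cmdline)
    return [
        key
        for key, value in expected.items()
        if key not in active or active[key] != value
    ]


def read_running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text().strip()
```

For the C-state test above, `missing_params(read_running_cmdline(), {"idle": "poll", "processor.max_cstate": "0"})` should return an empty list if both parameters are active.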

For the VMs we use cpu=host,
and this is probably the problem.

There are quite a lot of reported problems with 6.7/6.8:
* freezes on suspend, resume and hibernate: not this case
* freezes at boot before login: not this case
* freezes during video playback in Firefox (commit "Destroy DC context while keeping DML and DML2"): not this case
 
Solved!

It is definitely a kernel 6.8 bug.


I need to know which kernel patch/commit is causing this regression.
I'm looking into the Ubuntu kernel. Are there any Proxmox-specific patches?

I will share this kernel problem with the vendors, ASRock and ASUS.
 
ASRock and ASUS are server vendors, and we have a good relationship with them.

And this is not the RX570 case. As I wrote in my previous post, I'm aware of three different situations in which 6.8 freezes; none of them applies to an Epyc (non-GPU) server.

The RX570 issue relates to the kernel commit "Destroy DC context while keeping DML and DML2", and you can compile your kernel without this commit: https://git.kernel.org/pub/scm/linu...t?id=06ad7e16425619a4a77154c2e85bededb3e04a4f

What is special about the server freezes described here is that they are not crashes with logs: it is a complete freeze without any logs, and you need to hard reset the server...
 
I have the exact same effect on a 5-node Epyc Milan / 7313P cluster (board: Supermicro H12SSW-NTR).

It ran stable for a year; since 8.2 / kernel 6.8 there have been random lockups of single nodes after 1-3 days. The console freezes, no error messages anywhere, just frozen.

I rebooted all nodes back to 6.5 today and will report on it.

regards.
 
Confirming I've seen the same thing with Intel-based systems as well. No errors, just a frozen console and nothing in any logs to indicate a crash or other issue. Some systems were going 2-3 days before crashing, others froze within 6 hours.

Pinning the kernel to 6.5 currently appears to have corrected the issue.
 

Please help identify the common factors that cause this problem.

I can confirm that I had exactly the same issue as described by @Tim-AU and @Lephisto.
I'm still looking for the code that was changed or added in kernel 6.8 and causes this issue.

* VM CPU type "host" (is this the same in your case?)
* Proxmox on ZFS
* BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet
* ASPEED Technology IPMI and Graphics
* Ceph 18.2.2
* VM disks with IO thread enabled and Async IO threads
 
I am really a little bit surprised that they put this one in the enterprise repos. I thought they were supposed to be tested for extra stability :(
This happens once in ten years...

I've now had some very hot moments at a large scientific organization where I run an entire rack of Proxmox servers and a Ceph cluster.
Even the basic infrastructure was randomly down (firewalls, DNS, DHCP...).

But it's OK. Sh1t happens. That's life!
The important part is to identify what exactly is causing this problem
and to get confirmation that the bug is fixed in a new version,
so we can get back upstream...

@Moayad could we solve this problem together with Proxmox Server Solutions GmbH? I think this will not be an isolated case.
 
I had 3 freezes in the last 3 days,
on both Intel and AMD CPUs.

* Proxmox system on BTRFS
* VMs with CPU type "x86-64-v3"
* VMs on Ceph 8.2.2
* VMs with and without IO Thread enabled (we disabled IO Thread on our database VMs because the VMs got stuck with it enabled)
* Ceph OSDs are present on the frozen nodes, but no Ceph "mgr", "mon" or "mds" is present on them
 
I had a freeze on a Mac mini 2012 that I run test VMs on. I applied the iommu=off fix on that machine and it has been fine since. But today I had a random reboot on my main server, which we use for our email and storage VMs, and it's not an Intel. It's an AMD Ryzen that has been solid for a good while. I actually use the IOMMU on this server for my storage VM, so I can't turn it off.

I checked all the logs but couldn't find any indication of a problem. There is just a gap, and then all the usual boot logging.
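
Since the only trace of such a freeze is a gap in the logs, the gap itself can at least be used to pin down when the node stopped. Here is a rough Python sketch that scans log lines and reports every pause longer than a threshold. It assumes each line starts with a timestamp like `2024-05-01T10:00:00+0200`, which is what `journalctl -o short-iso` prints; the function name `find_gaps` is made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Timestamp layout assumed for the first field of every log line.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def find_gaps(
    lines: Iterable[str], threshold: timedelta
) -> List[Tuple[datetime, datetime]]:
    """Return (last_before, first_after) pairs for pauses longer than threshold."""
    gaps: List[Tuple[datetime, datetime]] = []
    previous = None
    for line in lines:
        first_field = line.split(" ", 1)[0]
        try:
            stamp = datetime.strptime(first_field, TS_FORMAT)
        except ValueError:
            # Not a timestamped line (e.g. a boot separator); skip it.
            continue
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

The first element of each pair is the last message before the freeze, which gives a starting point for correlating with IPMI or switch logs. A threshold of a few minutes is a guess; busy nodes may need less, idle ones more.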

So I don't really have any useful info to add right now except, me too. :-(
 
I updated our small cluster to the test repository and started netconsole in the hope of catching more information.
But since then I haven't had any more freezes. I can't say whether there are changes in the test repository that explain why the nodes are no longer freezing, or whether it has just been a good run for the last 5 days.
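
For anyone who wants to try the same, netconsole sends kernel messages as plain UDP datagrams, so the receiving side can be very simple. The following is a minimal Python sketch of a receiver, not the setup actually used here. Port 6666 is the usual netconsole target port but depends on how the module was configured, and the log file name and function names are assumptions for illustration.

```python
import socket
from datetime import datetime, timezone
from typing import Tuple


def format_record(payload: bytes, sender: Tuple[str, int], received_at: datetime) -> str:
    """Turn one received datagram into a single timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{received_at.isoformat()} {sender[0]} {text}"


def listen(host: str = "0.0.0.0", port: int = 6666, log_path: str = "netconsole.log") -> None:
    """Receive netconsole datagrams forever and append them to log_path.

    The file is opened line-buffered so the last messages before a freeze
    are written out immediately instead of sitting in a buffer.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, open(
        log_path, "a", buffering=1
    ) as log:
        sock.bind((host, port))
        while True:
            payload, sender = sock.recvfrom(65535)
            now = datetime.now(timezone.utc)
            log.write(format_record(payload, sender, now) + "\n")
```

Calling `listen()` on a machine that is not part of the affected cluster keeps the receiver alive when a node freezes, and the sender address in each line shows which node the last messages came from.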

Another possible difference:

Now all my VMs have been stopped and started under kernel 6.8.x.
Before, all my VMs had been started on kernel 6.5.x and were migrated to the updated 6.8.x nodes.
(But I don't think that @Kevo migrated his VMs from another running Mac mini to the updated Mac mini.)

So either I'm just having a good run, or something has changed for the better in the test repository.
 
Please keep us posted.

Over here we had a complete shutdown of the whole cluster after the update, so all VMs were "born" on 6.8.
 