Kernel 6.8.4-2 causes random server freezing

antonin.chadima · May 3, 2024

This thread is dedicated to the issue where the server just freezes.

If the kernel prints error messages when the server crashes,

there is a separate thread: https://forum.proxmox.com/threads/random-6-8-4-2-pve-kernel-crashes.145760

This thread is also not about the AMD GPU issue tracked in https://gitlab.freedesktop.org/drm/amd/-/issues/3173

(the problems may be related, however... but who knows)

Hi,
I updated an 8-server cluster to Proxmox 8.2 on 28.4.
On 1.5. I had the first problems: 3 of the 8 servers got stuck.
Then, a couple of hours later, another server froze, and so on.
Not even six hours pass before I have to reboot one of the servers.
(And if I'm unlucky and don't restart the server quickly,
another 1 or 2 servers freeze in the meantime and the cluster dies,
because we are using Ceph on all 8 nodes.)

The server is in a "frozen" condition.
The connected monitor shows the login prompt and nothing else,
but the server does not react to a USB keyboard; even Num Lock does not work.
No segfault or other messages in dmesg or syslog.
I have to hard reset the frozen node.

The server is an ASUS RS500A-E11 with an AMD Epyc Milan series CPU (motherboard KMPA-U16).
It is not plausible that all the servers suddenly have HW problems.
All packages are updated, and I have the latest BIOS, released a few months ago.

Today I will try to downgrade the kernels to 6.5
Any other ideas? Thank you in advance.

Moayad · May 6, 2024

Hello,

Do you see any issues with your servers when you boot an older Proxmox kernel? If the older kernel runs fine, you can pin the older Proxmox kernel 6.5 [0].

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_kernel_pin
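For anyone scripting this across several nodes, here is a minimal sketch of how the pin target could be chosen. It assumes you already have the installed kernel version strings (for example from `proxmox-boot-tool kernel list`). The version strings below are example values only, and the helper names `newest_in_series` and `pin_command` are made up for illustration. The sketch only builds the command line and does not run it.

```python
import re

def newest_in_series(installed, series="6.5"):
    """Return the newest installed kernel version in the given series, or None."""
    def key(version):
        # "6.5.13-5-pve" -> (6, 5, 13, 5): compare all numeric parts as integers
        return tuple(int(n) for n in re.findall(r"\d+", version))
    candidates = [v for v in installed if v.startswith(series + ".")]
    return max(candidates, key=key, default=None)

def pin_command(version):
    """Build the pin command described in the admin guide section linked above."""
    return ["proxmox-boot-tool", "kernel", "pin", version]

# Example values only; replace with the versions installed on your node.
installed = ["6.5.11-8-pve", "6.5.13-5-pve", "6.8.4-2-pve"]
target = newest_in_series(installed)
if target is not None:
    print(" ".join(pin_command(target)))
```

The pinned kernel only takes effect after the next reboot of the node.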

antonin.chadima · May 6, 2024

After many attempts:
replacing PSU, CPU, RAM etc.
trying different kernel parameters
checking IPMI and the system for any errors and logs
checking all power cables and upgrading the UPS
updating BIOS, firmware and BMC/IPMI
trying different BIOS options
installing a new version of amd-firmware

nothing really helped.

Yesterday I pinned the kernel version to 6.5,
and so far no problems.

The funny thing is that the system just freezes: no logs, no kernel crash, nothing.
And we are experiencing this problem on 12 different AMD Epyc servers.

zzz09700 · May 6, 2024

A funnier thing is that PVE claims it's using the Ubuntu kernel, but I couldn't find anything related to random kernel freezes/crashes/panics on Ubuntu 24.04, which also uses kernel 6.8.

Meanwhile PVE 8.2 is freezing and crashing left and right here. And we don't see any explanation of what is going wrong with the kernel or when a fix will be available, just people telling each other to roll back to kernel 6.5.

If the devs have no idea what is going on, pull kernel 6.8 and make 6.5 the default again.

At least ESXi had the guts to admit that something was wrong and that they had no idea how to fix it, so they pulled the update.

antonin.chadima · May 7, 2024

we need to know what changed in kernel 6.8
and when it will be safe to go upstream again
are there any Proxmox-specific kernel patches?

tried kernel 6.8 with the following parameters, without success:
amd_iommu=off iommu=off
default - no parameters
Ceph optimizations: amd_iommu=on iommu=pt pcie_aspm=off
limiting C-states (deeper C-states can cause this kind of freezing): idle=poll processor.max_cstate=0
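When testing parameter sets like these, it is easy to reboot into a kernel that did not actually pick them up. Below is a minimal sketch that compares a kernel command line against an expected set. The sample command line is made up for illustration; on a real node you would read the text from /proc/cmdline. The names `parse_cmdline`, `mismatches` and `ATTEMPTS` are hypothetical.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; flags without '=' map to True.

    Simplification: quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params

def mismatches(cmdline, expected):
    """Return the expected parameters that are missing or have another value."""
    active = parse_cmdline(cmdline)
    return {k: active.get(k) for k, v in expected.items() if active.get(k) != v}

# Parameter sets taken from the post above
ATTEMPTS = {
    "iommu off": {"amd_iommu": "off", "iommu": "off"},
    "ceph": {"amd_iommu": "on", "iommu": "pt", "pcie_aspm": "off"},
    "c-states": {"idle": "poll", "processor.max_cstate": "0"},
}

# Made-up sample; on a node use: open("/proc/cmdline").read()
sample = "BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve ro quiet amd_iommu=on iommu=pt pcie_aspm=off"

for name, expected in ATTEMPTS.items():
    diff = mismatches(sample, expected)
    print(name, "active" if not diff else f"not active: {diff}")
```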

for the VM we use cpu=host
and this is probably the problem

there are quite a lot of reported problems with 6.7/6.8:
freezes with suspend, resume and hibernate - not this case
freezes at boot before login - not this case
freezes during video playback in Firefox (Destroy DC context while keeping DML and DML2) - not this case

antonin.chadima · May 7, 2024

solved!

it is definitely a kernel 6.8 bug

I need to know which kernel patch/commit is causing this regression.
I'm looking into the Ubuntu kernel. Are there any Proxmox-specific patches?

I will share this kernel problem with the vendors, ASRock and ASUS.

zzz09700 · May 7, 2024

antonin.chadima said:
I will share this kernel problem with the vendors, ASRock and ASUS.

They probably would dismiss the case since it appears to be PVE only. Ubuntu 24.04 LTS seems to be doing okay with the new kernel.

leesteken · May 7, 2024

zzz09700 said:
They probably would dismiss the case since it appears to be PVE only. Ubuntu 24.04 LTS seems to be doing okay with the new kernel.

The amdgpu driver of Ubuntu 24.04 (kernel 6.8) crashes on a RX570. Definitely not Proxmox but generic Linux 6.8: https://forum.proxmox.com/threads/random-6-8-4-2-pve-kernel-crashes.145760/post-661753

antonin.chadima · May 8, 2024

ASRock and ASUS are server vendors, and we have a good relationship with them.

And this is not the RX570 case. As I wrote in my previous post, I'm aware of three different situations of 6.8 freezing; none of them applies to an Epyc (non-GPU) server.

The RX570 relates to the Destroy DC context while keeping DML and DML2 kernel commit and you can compile your kernel without this commit. https://git.kernel.org/pub/scm/linu...t?id=06ad7e16425619a4a77154c2e85bededb3e04a4f

What is special about the server freezes described here is that they are not crashes with logs: it is a complete freeze without any logs, and you need to hard reset the server...

Lephisto · May 9, 2024

I have the exact same effect on a 5-node Epyc Milan / 7313P (board Supermicro H12SSW-NTR) cluster.

It ran stable for a year; since 8.2/kernel 6.8 there are random lockups of single nodes after 1-3 days. The console freezes, no error messages anywhere, just frozen.

Rebooted all nodes today back to 6.5 - I will report on it.

regards.

Tim-AU · May 10, 2024

Confirming I've seen the same thing with Intel-based systems as well. No errors, just a frozen console and nothing in any logs to indicate a crash or other issue. Some systems were going 2-3 days before freezing, others froze within 6 hours.

Pinning the kernel to 6.5 currently appears to have corrected the issue.

Lephisto · May 10, 2024

I am really a little bit surprised that they put this one in the enterprise repos. I thought they were supposed to be tested for extra stability

antonin.chadima · May 10, 2024

Please help identify the common factors that cause this problem.

I can confirm that I had exactly the same issue as described by @Tim-AU and @Lephisto.
I'm still looking for the code that was changed or added in kernel 6.8 and causes this issue.

* I'm using CPU type host in the VMs; is this the same in your case?
* Proxmox on ZFS
* BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet
* ASPEED Technology IPMI and Graphics
* Ceph 18.2.2
* VM disks with IO thread enabled and Async IO set to threads

antonin.chadima · May 10, 2024

Lephisto said:
I am really a little bit surprised that they put this one in the enterprise repos. I thought they were supposed to be tested for extra stability

This happens once in ten years...

I've now had some very hot moments at a large scientific organization where I run an entire rack of Proxmox servers and a Ceph cluster.
Even the basic infrastructure was randomly down (Firewalls, DNS, DHCP ...).

But it's OK. Sh1t happens. That's life!
The important part is to identify what exactly is causing this problem.
And to get a confirmation that this bug is solved in a new version,
to be able to get back upstream...
@Moayad could we solve this problem together with Proxmox Server Solutions GmbH? I think this will not be an isolated case.

BenediktS · May 10, 2024

I had 3 freezes in the last 3 days.
Intel and AMD CPUs.

Proxmox system on BTRFS
VMs CPU "x86-64-v3"
VMs on Ceph 8.2.2
VMs with and without IO Thread enabled. (We disabled IO Thread on our database VMs, because the VMs got stuck with IO Thread enabled.)
Ceph OSDs are present on the frozen nodes, but no ceph "mgr", "mon" or "mds" is present on the frozen nodes.

Kevo · May 10, 2024

I had a freeze on a Mac mini 2012 that I run test VMs on. I applied the iommu=off fix on that machine and it has been fine since. But today I had a random reboot on my main server, which we use for our email and storage VMs, and it's not an Intel. It's an AMD Ryzen that has been solid for a good while. I do actually use IOMMU on this server for my storage VM and can't turn it off.

I checked all the logs, but couldn't find any indication of a problem. Just a gap and then all the usual boot logging.

So I don't really have any useful info to add right now except, me too. :-(

BenediktS · Tuesday at 13:56

I updated our small cluster to the test repository and started netconsole in the hope of catching more information.
But since then I haven't had any more freezes. I can't say whether there are changes in the test repository that could explain why the nodes are no longer freezing, or whether it is just a good run for the last 5 days.

Another possible difference is:

Now all my VMs have been stopped and started under kernel 6.8.x.
Before, all my VMs had been started on kernel 6.5.x and were migrated to the updated 6.8.x nodes.
(But I don't think that @Kevo migrated his VMs from another running Mac mini to the updated Mac mini.)

So either it is just a good run, or something has changed for the better in the test repository.
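For anyone who wants to try the same approach: netconsole sends kernel log lines as plain UDP datagrams (port 6666 is the usual default target port), so any machine that stays up can collect them. Here is a minimal receiver sketch. The function names and the log file name are made up for illustration, and it assumes netconsole on the node is already pointed at the receiving host.

```python
import socket

def open_listener(host="0.0.0.0", port=6666):
    """Open a UDP socket for incoming netconsole datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock

def receive_lines(sock, logfile, max_packets=None):
    """Append each datagram to logfile, prefixed with the sender's IP.

    Runs forever when max_packets is None; returns the number of packets handled.
    """
    count = 0
    while max_packets is None or count < max_packets:
        data, (sender, _port) = sock.recvfrom(65535)
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        logfile.write(f"{sender} {text}\n")
        logfile.flush()  # write through immediately so nothing sits in a buffer
        count += 1
    return count

# Usage (blocks until interrupted), hypothetical log file name:
#   with open("netconsole.log", "a") as fh:
#       receive_lines(open_listener(), fh)
```

Prefixing every line with the sender address makes it possible to tell the nodes apart when several of them log to the same receiver.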

antonin.chadima · Tuesday at 19:25

Maybe you have to restart all the VMs?
(did a live migrate...)

Lephisto · Wednesday at 16:15

BenediktS said:
I updated our small cluster to the test repository and started netconsole in the hope of catching more information.
But since then I haven't had any more freezes. I can't say whether there are changes in the test repository that could explain why the nodes are no longer freezing, or whether it is just a good run for the last 5 days.

Another possible difference is:

Now all my VMs have been stopped and started under kernel 6.8.x.
Before, all my VMs had been started on kernel 6.5.x and were migrated to the updated 6.8.x nodes.
(But I don't think that @Kevo migrated his VMs from another running Mac mini to the updated Mac mini.)

So either it is just a good run, or something has changed for the better in the test repository.

Please keep us posted.

Over here we had a complete shutdown of the whole cluster after the update, so all VMs have been "born" on 6.8.

eebgmbh · Friday at 11:05

Did anybody try the latest kernel, i.e. proxmox-kernel-6.8.4-3-pve-signed? Did it fix the problem?
Thx
