Random 6.8.4-2-pve kernel crashes

Der Harry

With the recent PVE 8.2.2 update, 6.8.x kernels are the default.

I was kindly asked by @t.lamprecht to create a new thread (https://forum.proxmox.com/threads/proxmox-ve-8-2-released.145723/post-657135)

- https://forum.proxmox.com/threads/proxmox-ve-8-2-released.145723/#post-656947

- https://www.reddit.com/r/Proxmox/comments/1c4z9xh/new_to_proxmox_issues_dl380_g9_random/

- also mentioned here: https://forum.proxmox.com/threads/o...e-8-available-on-test-no-subscription.144557/

- edit1: Epyc CPU mentioned here https://forum.proxmox.com/threads/proxmox-ve-8-2-released.145723/page-4#post-657302


I have a Ryzen 5700G / Asus ROG STRIX B550-A GAMING / old BIOS: 3002 / no Secure Boot / no UEFI boot.

That might not be the problem, as others have Xeons and 13th-gen Intel CPUs.

In my case nfsd crashed several times. After I stopped the LXC container that provides my NFS service, other things crashed with null pointers.

Willing to help / test.
 
Could you please share the logs of the crash as far as possible? (If nothing else works, even a screenshot might help.)
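If the node was rebooted after the crash and the journal is persistent, the kernel messages of the previous boot can usually be read with `journalctl -k -b -1`. A minimal sketch for cutting that output down to the parts around typical crash markers; the function name `crash_context` and the list of markers are assumptions for illustration, not an official recipe:

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints every line
# that contains a common crash marker, plus the 30 lines that follow it
# (which normally hold the call trace).
crash_context() {
    grep -E -A 30 'BUG:|Oops|Call Trace|general protection fault|NULL pointer dereference'
}

# Usage (as root, after the reboot that followed the crash):
#   journalctl -k -b -1 --no-pager | crash_context > crash.txt
```

If the machine is still alive but "sick", `dmesg | crash_context` gives the same view of the current boot.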
 
I have 2 crashes.

The kernel was "sick" but the machine was still alive.

I didn't take any screenshots of the first 1-2 rounds.

(Please also check the links in my first post - it looks like it's a Xeon, Intel 13th gen, Epyc and Ryzen issue)


(Willing to do more testing - but I have to install Proxmox on a USB stick / 2nd disk)
 

Hm, as far as I can see the issue does not seem to be related to nfsd (the traces point to mdraid I/O).

In any case, especially with newer hardware, I'd suggest trying a BIOS upgrade (new kernels sometimes expose bugs in older BIOS versions that do not show up with older kernels).

From a very quick look it seems there is a version for your board from this year:
https://rog.asus.com/motherboards/rog-strix/rog-strix-b550-a-gaming-model/helpdesk_bios/
(of course please verify it's the correct one)
 
If I were "the only person", yeah, that would be a good idea. But I am not :)

There are Xeons, Intels and Epycs that are crashing, and just patching the BIOS (which makes my CPU slower) might be an option, but it's a last resort.

It's very, very unlikely that it's "only" the BIOS.

At the moment I am happy with 6.5, and I will wait some weeks or months until more is discovered.
 
Yes, but unless they have exactly the same traces as yours, the issues are usually not related :)

New kernels do show various bugs in different pieces of hardware.

In my experience, upgrading the firmware of components has more often than not made a bug stop showing up, hence the suggestion...
 
Hi,

since the update to Proxmox 8.2.2 and the new 6.8 kernel I have problems with my server.
When no VM is running, everything is fine. As soon as I start a VM, Proxmox crashes within a very short time. Only a hard reset helps. Strangely, the host remains accessible, but only via SSH. I have now pinned the 6.5 kernel and everything is running as before.
The only noticeable thing is that all KVM processes show 100% CPU load in the process overview.
Neither the VMs nor the web interface are accessible anymore.
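For anyone who wants to do the same: pinning is done with `proxmox-boot-tool kernel pin <version>`. A minimal sketch for picking the newest installed kernel of a given series to pin; the helper name `newest_kernel` is hypothetical, it expects one version string per line on stdin, and it relies on GNU `sort -V`. The version strings in the demo line are illustrative only:

```shell
#!/bin/sh
# Hypothetical helper: from a list of kernel versions on stdin, print the
# newest one that starts with the given prefix (e.g. "6.5.").
newest_kernel() {
    awk -v p="$1" 'index($0, p) == 1' | sort -V | tail -n 1
}

# Demo with illustrative version strings:
printf '%s\n' 6.5.11-8-pve 6.5.13-5-pve 6.8.4-2-pve | newest_kernel 6.5.

# On a real host the installed versions can be taken from /lib/modules:
#   ls /lib/modules | newest_kernel 6.5.
# Then, as root:
#   proxmox-boot-tool kernel list
#   proxmox-boot-tool kernel pin <version>
#   proxmox-boot-tool kernel unpin      # to go back to the default later
```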

Ryzen 3700x
ASRock Rack (B550D4-4L)
64GB ECC
2x 1TB NVME SSD

Further post (in German): see here

P.S.: Dedicated Server from Hetzner

Addition:
I copied everything out of the syslog in the GUI

Bash:
# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
Vendor: American Megatrends International, LLC.
Version: L0.27
Release Date: 12/08/2022
[...]

# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: ASRockRack
Product Name: B565D4-V1L
Version:
Serial Number: xxxxxxxxxx
Asset Tag:
Features:
Board is a hosting board
Board is replaceable
Location In Chassis:
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0
[...]
 

Also crashes here with 6.8.4 and not with 6.5.13:

Supermicro X10DRU-i+, Bios 3.4
E5-2620 v4
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
88:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

I have 11 machines running these X10DRU-i+ boards; only the three Ceph nodes, which have the ConnectX-3 and run no VMs at all, crash.
The other VM hosts without ConnectX-3 run just fine with kernel 6.8.4.
 

For systems using Intel CPUs: please try adding `intel_iommu=off` to the kernel cmdline, see https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline (for both systemd-boot and GRUB, pressing `e` in the boot menu allows editing the command line before the kernel is booted).

Background: the default got changed to on between 6.5 and 6.8, and this can cause issues on older machines with Intel CPUs (also on newer ones, but there the chances are higher that a BIOS update will fix it eventually).

I hope this helps - please report back in any case!
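To make the setting persistent instead of editing it at every boot, the parameter goes into `/etc/default/grub` (GRUB) or `/etc/kernel/cmdline` (systemd-boot), as described in the linked documentation. A minimal sketch of the idea; the helper name `add_kernel_param` is hypothetical and only shows how to append the parameter to an existing command-line string without duplicating it:

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to a command-line string
# unless it is already present, so re-running does not duplicate it.
add_kernel_param() {
    cmdline=$1
    param=$2
    case " $cmdline " in
        *" $param "*) printf '%s\n' "$cmdline" ;;
        *) printf '%s\n' "${cmdline:+$cmdline }$param" ;;
    esac
}

# Example: the new value for a command line that so far only held "quiet".
add_kernel_param "quiet" "intel_iommu=off"

# After editing the boot configuration by hand, apply it (as root):
#   GRUB:          update-grub
#   systemd-boot:  proxmox-boot-tool refresh
# Then reboot and verify the result with: cat /proc/cmdline
```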
 
Did Proxmox silently enable intel_iommu by default again? Were there not enough issues last time this was done in PVE 7? Can someone please update the manual (which was only recently corrected after the last time ;)) and the Wiki pages?

Also, the amdgpu driver that comes with 6.8.4-2-pve crashes on Radeon RX 570s, but that's probably an upstream issue.
 
That was actually Ubuntu, and we stuck with it in order to create as little divergence as possible.
The reasoning sounds sensible to me:
https://git.launchpad.net/~ubuntu-k.../?id=77e530c1a864c601b96622db03bc1f38e51155f1

But yes, we'll add it to the known issues for the 8.2 release and adapt our documentation.
 

Looks like a plan!


Maybe that's the smoking gun on my AMD, too.

amd_iommu=on iommu

There is no BIOS update for the NUC i5 (NUC6i5SYB). Maybe for it (or for Xeons) turning off the IOMMU isn't an option.
 
Since updating to 8.2 we have had multiple crashes on four different servers, all of them:

* 2U GIGABYTE barebone server R272-Z34
* AMD EPYC 75F3 (single socket)

The strange thing is that even rolling back and pinning kernel 6.5 (which ran fine before without issues) does not help. We will update to the latest BIOS soon and see if the errors stay the same. The logs unfortunately do not tell us anything useful. All of those servers had successfully used amd_iommu=on and iommu=pt before.
 
Were you able to capture a kernel dump?
And was it a crash or did the node freeze?
 
a) amd_iommu has defaulted to on for a long time (if not from the beginning), so there should be no need to add that to the command line
b) Why is setting intel_iommu=off not an option? Did you have it explicitly enabled with the 6.5 kernel series? If not, then it was off with that series (and all kernels before) and has now been turned on, which might cause the issues you're having
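Whether a parameter is set explicitly can be read from the running kernel's command line in `/proc/cmdline`. A minimal sketch with a hypothetical helper `iommu_param` that prints the last value given for a parameter, or `unset` if it does not appear (in which case the kernel's built-in default applies):

```shell
#!/bin/sh
# Hypothetical helper.
# $1: parameter name (e.g. intel_iommu), $2: kernel command line string.
# Prints the last value given for the parameter, or "unset".
iommu_param() (
    set -f   # no filename expansion while splitting the command line
    value=unset
    for word in $2; do
        case $word in
            "$1"=*) value=${word#*=} ;;
        esac
    done
    printf '%s\n' "$value"
)

# Check the currently booted kernel:
iommu_param intel_iommu "$(cat /proc/cmdline)"
iommu_param amd_iommu "$(cat /proc/cmdline)"
```

Running this once under the 6.5 kernel and once under 6.8 shows whether the setting came from the boot configuration or from a changed default.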
 
a) I didn't know that.
a+b) For testing I am more than happy to turn it off! For production it's not an option for everybody.
 
