Random 6.8.4-2-pve kernel crashes

Hi, this is interesting. I also have 20 nodes running fine on 6.8.4-2 (no Ceph, no OSDs) for 3 weeks, but 2 nodes with Ceph OSDs crash within 24h.
They are Lenovo EPYC v3 servers with NVMe drives.

Do you use encryption for your OSDs? (I have a trace related to storage/dm-crypt.)

I'm currently on holiday; I'll try a newer kernel version next week.
Yes, we use encryption on our OSDs, and I remember seeing storage/dm-crypt related traces!
 
Do you use encryption for your OSDs? (I have a trace related to storage/dm-crypt.)
Good point. There are no fixes for dm-crypt explicitly, but there are a few in the block subsystem (and AFAIR the bottom of the traces indicated issues there).
I tried running a Debian VM with a dm-crypt disk setup here, with both our latest kernel and 6.8.4-2-pve, but did not run into issues (tested with a short fio run).
 
Coming back with data ^^^.

- In my setup, the 6.8.x-zabbly+ kernel + QEMU on Debian 12 works fine.
- PVE 6.8.6-3 only works with intel_iommu=off.
One test that might make sense is booting the last 6.5 PVE kernel with intel_iommu=on (I hope I did not overlook that you already tried that?).


Does the NUC run, apart from the warnings in dmesg?
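
For the intel_iommu=on / intel_iommu=off tests discussed above, it can help to confirm which setting a machine actually booted with before comparing results. A minimal sketch, assuming that the last occurrence of the parameter on the command line is the one that counts; the function name `iommu_setting` is made up for illustration:

```shell
#!/bin/sh
# Print the value of intel_iommu= found in a kernel command line string,
# or "unset" if the parameter is absent.
# Assumption: if the parameter appears more than once, the last one wins.
iommu_setting() {
    value="unset"
    # Intentionally unquoted: split the command line on whitespace.
    for arg in $1; do
        case "$arg" in
            intel_iommu=*) value="${arg#intel_iommu=}" ;;
        esac
    done
    printf '%s\n' "$value"
}

# On a live system, /proc/cmdline holds the command line of the running kernel.
if [ -r /proc/cmdline ]; then
    iommu_setting "$(cat /proc/cmdline)"
fi
```

Posting this value together with `uname -r` makes test reports from different machines easier to compare.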
 

1) Yes, I did.

2) Yes, but what's the point? The zabbly kernels don't have the problem, only the Ubuntu-based ones. So we just "need" to do what zabbly does in:

- the Ubuntu kernel
- the PVE patches

(I personally don't give a damn about patching Ubuntu kernels - I care about the PVE patches.)

Again - I can do that. I know what to do, who to ask, where to put it, how to make a pipeline for regression tests, etc. But I am not hired by or related to Proxmox GmbH. After working 5 days on this, my budget for non-profit beneficial work / donation to the world is really exhausted. Out of 10,000 people on Facebook, Discord, and probably a lot of people here, not a single person could or wanted to help.

So thanks.

My approach is: I take the zabbly kernels and just apply the PVE patches. This solves the issue for me, my computers, and my customers. Everybody has the GPLed source code and is invited to do the same.

----

- I have no CephFS
- I have no ZFS
- I don't have the amdgpu issue (it looks like that is a generic 6.8.x thing, but I have gotten no feedback yet, and I don't have that issue myself)


I would love to have someone work with me on this.
 
my budget for non-profit beneficial work / donation to the world is really exhausted.
Well, I guess the staff at PVE and basically everyone using PVE 8.2 owe you a big THANK YOU, man.

I figure PVE should gift you a year of premium subscription for what you have done. But I guess they don't feel like it.

I'd also personally thank the PVE community for spreading the broken-kernel warning, though: we could have ended up with a few Ubuntu workstations in a rather useless state (and spent a few days rolling them back to 20.04 LTS) if PVE hadn't gotten SO broken with kernel 6.8.
 

I asked for a call / talk with a kernel dev to join forces in a way where we are not all just throwing rocks at a space shuttle...

There are many 6.8.x issues that need to be put in a list, and "they" have to make a decision on what to care about and how to handle these kinds of problems in the future:

- A lot of old ESXi customers are arriving in the next months.
- I am sure you will find a few who would, on purpose, set aside one of their 50 machines as a canary / kernel test / stress machine and run regression tests.

We will soon have 192-core (384-thread) CPUs; even 512+ are now (unofficially) announced. It will get much harder to maintain all these things.
 
I asked for a call / talk with a kernel dev to join forces in a way where we are not all just throwing rocks at a space shuttle...
It's their space shuttle, not yours. We users are merely passengers onboard, and we do have the option to jump ship.
For my company it's not completely my call, but we are definitely counting the days until the PVE team releases a fixed kernel or provides another works-out-of-the-box solution. Telling people to manually pin an older kernel version is not a solution; it's a last resort.

The handling of the PVE 8.2 incident gave me the feeling that people at Proxmox do not give a damn about their users, paid or not. Otherwise kernel 6.8 would have been in the trash bin like a week ago, and PVE 8.2 would be running kernel 6.5 by default now.

Bottom line: when people really go out of their way to look into bugs in PVE, they should not be dismissed with some customer care bullshit.

Don't know how this plays out for others, but in my company, if we have to fix the kernel ourselves, then the offer from VMware/Broadcom suddenly becomes, should I say, rather attractive.
 
@Der Harry what does that tell me now? I just wanted to know whether I still have to be careful when I reboot, or whether there is already a concrete "list" of conditions that have to coincide for the crash to occur?
 

Anything written here is about 6.8.4-3 - please have a look at the link I provided.

Thank you.
 

I had a problem with Linux 6.8.4-3-pve: GeForce passthrough to Win11 was broken, with this output:

Task viewer: VM 300 - Start
swtpm_setup: Not overwriting existing state file.
kvm: ../hw/pci/pci.c:1637: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
stopping swtpm instance (pid 7163) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1

Returning to Linux 6.8.4-2-pve fixed it.
AMD AM5 platform.
 
Yeah, obviously they broke PCIe passthrough with -3.
Now I'm counting how many things can get broken before we see a stable release again.

If the PVE team is waiting for Ubuntu to fix the kernel, then we'll probably have to wait until around August, at which point the .1 for Ubuntu 24.04 will likely show up.
 
Just for info (this is not production!)

6.8.9-zabbly+ kernels work on my NUC (with severe missing features compared to PVE kernels!)

- no CephFS
- no ZFS
- screen is blank (I think zabbly has, like Debian, a lot of console VESA drivers enabled... using PVE's make menuconfig entries will make this go away)

Again - this is just info! (seeing is believing...)

The right approach would be:

- bring the bugfix of zabbly to PVE

The "works for me" approach:

- I just patched zabbly with the PVE patches


 
Just tested the new Linux 6.8.4-3-pve kernel; unfortunately the same problem again, even though the server ran much longer.
All VMs crashed, and there was no more access to the GUI, only via SSH.

While I was writing out the crash log (via SSH), it crashed completely. The file was then unfortunately gone, so I can't attach it anymore. I will now try again in the hope of at least getting a log this time to help troubleshoot the problem.

Edit: Crashlog after Reset:
Edit2: second crash
 


If you have no CephFS / ZFS, you can give 6.8.9-zabbly+ a try on Proxmox.

Proxmox will have severe issues, but I would love it if you could also show "no crash".
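
Before trying a kernel without ZFS/CephFS support, it is worth checking whether the host currently has those modules loaded at all. A minimal sketch, assuming the module names of interest are `zfs`, `ceph`, `libceph` and `rbd`; the function name `storage_modules_loaded` is made up for illustration:

```shell
#!/bin/sh
# Print the names of loaded ZFS/Ceph-related kernel modules found in a
# /proc/modules-style listing (the module name is the first field).
# Reads the file given as $1, or /proc/modules if no argument is passed.
# Assumption: zfs, ceph, libceph and rbd are the modules that matter here.
storage_modules_loaded() {
    awk '$1 == "zfs" || $1 == "ceph" || $1 == "libceph" || $1 == "rbd" { print $1 }' \
        "${1:-/proc/modules}"
}
```

Calling `storage_modules_loaded` with no argument on the host prints nothing if none of these modules are loaded. That only says they are not in use right now; a pool or mount that is configured but currently inactive would not show up.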
 
I am speechless again.

zabbly has no patches - it's the vanilla 6.8.9.

He only has a GitHub Action to build it and a README.md.

Code:
$ diff -Naur linux-6.8.9 linux-zabbly-6.8.9 > zabbly-6.8.9.patch
$ cat zabbly-6.8.9.patch
diff --color -Naur linux-6.8.9/.github/FUNDING.yml linux-zabbly-6.8.9/.github/FUNDING.yml
--- linux-6.8.9/.github/FUNDING.yml    1970-01-01 01:00:00.000000000 +0100
+++ linux-zabbly-6.8.9/.github/FUNDING.yml    2024-05-03 17:36:05.000000000 +0200
@@ -0,0 +1,5 @@
+# Frequent committers who contribute to Incus on their own time can add
+# themselves to the list here so users who feel like sponsoring can find
+# them.
+github:
+ - stgraber
diff --color -Naur linux-6.8.9/.github/workflows/commits.yml linux-zabbly-6.8.9/.github/workflows/commits.yml
--- linux-6.8.9/.github/workflows/commits.yml    1970-01-01 01:00:00.000000000 +0100
+++ linux-zabbly-6.8.9/.github/workflows/commits.yml    2024-05-03 17:36:05.000000000 +0200
@@ -0,0 +1,40 @@
+name: Commits
+on:
+  - pull_request
+
+permissions:
+  contents: read
+
+jobs:
+  dco-check:
+    permissions:
+      pull-requests: read  # for tim-actions/get-pr-commits to get list of commits from the PR
+    name: Signed-off-by (DCO)
+    runs-on: ubuntu-22.04
+    steps:
+    - name: Get PR Commits
+      id: 'get-pr-commits'
+      uses: tim-actions/get-pr-commits@master
+      with:
+        token: ${{ secrets.GITHUB_TOKEN }}
+
+    - name: Check that all commits are signed-off
+      uses: tim-actions/dco@master
+      with:
+        commits: ${{ steps.get-pr-commits.outputs.commits }}
+
+  target-branch:
+    permissions:
+      contents: none
+    name: Branch target
+    runs-on: ubuntu-22.04
+    steps:
+    - name: Check branch target
+      env:
+        TARGET: ${{ github.event.pull_request.base.ref }}
+      run: |
+        set -x
+        [ "${TARGET}" = "main" ] && exit 0
+
+        echo "Invalid branch target: ${TARGET}"
+        exit 1
diff --color -Naur linux-6.8.9/README.md linux-zabbly-6.8.9/README.md
--- linux-6.8.9/README.md    1970-01-01 01:00:00.000000000 +0100
+++ linux-zabbly-6.8.9/README.md    2024-05-03 17:36:05.000000000 +0200
@@ -0,0 +1,108 @@
+# Linux stable kernel builds
+Those are kernel builds made and supported by Zabbly.
+They track the latest stable mainline kernel and are build for both `x86_64` and `aarch64`.
+
+The general goal behind those kernel builds is to provide a recent
+stable mainline kernel with wide hardware support and a configuration
+that's optimal for running [Incus](https://github.com/lxc/incus) containers and VMs.
+
+Those are usually updated weekly, shortly after a new bugfix release.
+They do not immediately roll to a new kernel release, instead waiting for its first bugfix release to be out.
+
+## Availability
+Those kernels are built for:
+
+ * Ubuntu 20.04 LTS (`focal`)
+ * Ubuntu 22.04 LTS (`jammy`)
+ * Ubuntu 24.04 LTS (`noble`)
+ * Debian 11 (`bullseye`) (`x86_64` only)
+ * Debian 12 (`bookworm`)
+
+## Installation
+
+All commands should be run as root.
+
+### Repository key
+
+Packages provided by the repository are signed. In order to verify the integrity of the packages, you need to import the public key. First, verify that the fin
+
+```sh
+curl -fsSL https://pkgs.zabbly.com/key.asc | gpg --show-keys --fingerprint
+```
+
+```sh
+pub   rsa3072 2023-08-23 [SC] [expires: 2025-08-22]
+      4EFC 5906 96CB 15B8 7C73  A3AD 82CC 8797 C838 DCFD
+uid                      Zabbly Kernel Builds <info@zabbly.com>
+sub   rsa3072 2023-08-23 [E] [expires: 2025-08-22]
+```
+
+If so, save the key locally:
+
+```sh
+mkdir -p /etc/apt/keyrings/
+curl -fsSL https://pkgs.zabbly.com/key.asc -o /etc/apt/keyrings/zabbly.asc
+```
+
+### Stable repository
+
+On any of the distributions above, you can add the package repository at `/etc/apt/sources.list.d/zabbly-kernel-stable.sources`.
+
+Run the following command to add the stable repository:
+
+```sh
+sh -c 'cat <<EOF > /etc/apt/sources.list.d/zabbly-kernel-stable.sources
+Enabled: yes
+Types: deb
+URIs: https://pkgs.zabbly.com/kernel/stable
+Suites: $(. /etc/os-release && echo ${VERSION_CODENAME})
+Components: main
+Architectures: $(dpkg --print-architecture)
+Signed-By: /etc/apt/keyrings/zabbly.asc
+
+EOF'
+```
+
+### Installing the kernel
+
+Finally, install the kernel, with: `apt-get install linux-zabbly`.
+
+## Secure boot
+As those kernels aren't signed by a trusted distribution key, you may
+need to turn off Secure Boot on your system in order to boot this kernel.
+
+## Configuration
+The kernel configuration is a derivative of the Ubuntu configuration for the matching architecture.
+That is, almost everything is enabled and as many components as possible are built as modules.
+
+## Additional changes
+On top of the mainline kernel, the following changes have been made:
+
+ * Support for VFS idmap mounts for cephfs (both architectures)
+ * Revert of a PCIe change breaking Qualcomm servers (aarch64 only)
+ * Revert of the change making `kernel_neon_begin` and `kernel_neon_end` GPL-only (breaks ZFS) (aarch64 only)
+
+## Ceph VFS idmap
+The Ceph VFS idmap support requires protocol changes which haven't been included in upstream Ceph yet.
+To function with stable Ceph, the module must be loaded with the `enable_unsafe_idmap=Y` option.
+
+This can be easily done by creating a file at `/etc/modprobe.d/ceph.conf` containing:
+```
+options ceph enable_unsafe_idmap=Y
+```
+
+## ZFS availability
+For users who need ZFS support, an up to date ZFS package repository can be found: [here](https://github.com/zabbly/zfs)
+That ZFS package repository is tested prior to new kernels being rolled out and so will avoid breakages due to upstream kernel changes.
+
+## Support
+Commercial support for those kernel packages is provided by [Zabbly](https://zabbly.com).
+
+You can also help support the work on those packages through:
+
+ - [Github Sponsors](https://github.com/sponsors/stgraber)
+ - [Patreon](https://patreon.com/stgraber)
+ - [Ko-Fi](https://ko-fi.com/stgraber)
+
+## Repository
+This repository gets actively rebased as new releases come out, DO NOT expect a linear git history.
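
To double-check which files a patch like the one above actually touches, the file headers can be pulled out of the diff. A minimal sketch, assuming `diff -Naur`-style headers and paths without whitespace; the function name `patched_files` is made up for illustration, and the header detection is only a heuristic:

```shell
#!/bin/sh
# List the files a unified diff touches, one per line, sorted and de-duplicated.
# A "+++ <dir>/<path>" line counts as a file header only if the line directly
# before it starts with "---" (heuristic; assumes paths contain no whitespace).
# The leading directory component (e.g. "linux-zabbly-6.8.9/") is stripped.
patched_files() {
    awk '
        $1 == "---" { after_minus = 1; next }
        after_minus && $1 == "+++" {
            path = $2
            sub("^[^/]*/", "", path)
            print path
        }
        { after_minus = 0 }
    ' "$1" | LC_ALL=C sort -u
}
```

Running `patched_files zabbly-6.8.9.patch` then gives a quick overview of whether anything outside `.github/` and `README.md` differs from the vanilla tree.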
 
