Random 6.8.4-2-pve kernel crashes

Has anyone with kernel crashes tried disabling Hyper-Threading?

1) No crashes with 6.8.7, 6.8.8, 6.8.9 vanilla. Same hardware, same VM/qcow2 files. (Debian 12 with libvirtd)

2) I wrote about 6.8.x scheduler issues in one of my first posts. Linus himself reverted a patch: https://www.phoronix.com/news/Linux-6.8-Sched-Regression. I didn't follow the full path of this individual story in the kernel - but you may be right about this. Which is academic, as newer kernels fix the issue.
 
So far, everything I've read in this thread makes sense and explains our crashes perfectly.

The only nodes that crash for us are the NVMe-only nodes that run our hyper-converged Ceph, and setting IOMMU to off doesn't help.

Just as other comments have explained, there have been no dmesg or journal entries that help explain the failure.

It's clear that "Der Harry" has bent over backward to help and deserves a serious reward for all his hard work. Why? Because this bad kernel was placed in the "Enterprise Repository" that is being paid for by customer subscriptions. In other words, being "Enterprise", it is assumed to be stable and KNOWN GOOD before being placed there. From all the comments, it's clear that's just not the case, and chasing down this issue was done for free by someone other than the folks being paid to do so. Not cool, guys.

Perhaps all of us as users of Proxmox, and those of us who are PAYING customers (we've been using Proxmox for our customers since version 2.0), need to speak up and ask for a clear explanation of why Ubuntu kernels are still being used as the basis for Proxmox kernels, and insist that it stop. Why aren't Debian kernels used as the base for pve kernels? Or, if something else is felt to be needed, why not build on the vanilla kernels straight from kernel.org?

It should be noted that Canonical has a relationship with Microsoft (WSL & WSL2 and the patches that go along with them). MS would love nothing more than to see Canonical get a "black eye" in the community and among its Enterprise customers. So why trust them? What good, solid basis is there? What proof is there that Canonical/Ubuntu can be trusted when they partner with untrustworthy companies?

Debian isn't perfect by any stretch of the imagination, but it does much better than Canonical/Ubuntu. Look at Debian as the base operating system: very stable, and that has always been the aim of the Debian team. Proxmox has stood on the shoulders of giants and should show gratitude for it. They didn't make ZFS, they didn't produce Debian, they didn't produce KVM or Ceph, etc., and yet all those tools are here and usable.

A little humility goes a long way and helps greatly in restoring customer confidence.
 
1) No crashes with 6.8.7, 6.8.8, 6.8.9 vanilla. Same hardware, same VM/qcow2 files. (Debian 12 with libvirtd)

2) I wrote about 6.8.x scheduler issues in one of my first posts. Linus himself reverted a patch: https://www.phoronix.com/news/Linux-6.8-Sched-Regression. I didn't follow the full path of this individual story in the kernel - but you may be right about this. Which is academic, as newer kernels fix the issue.
Hi @Der Harry ,

I have started a similar thread https://forum.proxmox.com/threads/kernel-6-8-4-2-causes-random-server-freezing.146327
My 12 servers are "freezing": no logs, no segfault, no memory leak, no amdgpu issue, no IOMMU issue, etc.
Yes, we have Ceph and some NVMe drives for Ceph OSDs. This is the only thing we have in common with others seeing similar issues.
And I have a couple of LXC containers.

I'm now happily running a pinned 6.5 kernel - but this cannot work forever...
I need to know what is causing the problem and when there will be an updated kernel (and maybe other fixes) from the Proxmox team.
I have spent a week of my time going through vanilla kernel commits, Ubuntu patches and the PVE kernel.
I saw a lot of work done in the Ceph area of the kernel in the 6.8 RCs and later, but maybe it's not Ceph-related. I have also tried to focus on AMD EPYC.
For lack of time I didn't find the exact commit/patch that causes the regression (you have to wait roughly 6-24 h for the sh1t to happen).
(The week before, I had hunted the problem through BIOS, kernel parameters, VM and Ceph config - and HW issues...
without any success; only pinning the kernel to a lower version helped.)
And I can afford only one server in the cluster to be "infected" with the 6.8 kernel :)

1. Did we experience the same problem?
2. Do you have any knowledge of what is causing the problem?
3. Or can you find out if/when a fixed PVE 6.8 kernel will be released?
4. I can offer my time and one server for testing different kernels.

PS: You can write personal messages in German.
 
Hi, our server with an H12SSL-CT board (2.7 BIOS, legacy boot) and an EPYC 7313P (IOMMU on) has been running 6.8.4-3-pve for 2 days just fine.
No Ceph/cluster, not using any M.2/NVMe, just dozens of SATA and SAS drives, ZFS only. Running a couple of KVMs, no LXC.
 
Hi,
here is the common crash trace I get with 6.8.4-2-pve, related to a Ceph OSD with dmcrypt. I was able to reproduce it 8 times, always with the same log (I'm going to test 6.8.4-3-pve).

Code:
May  6 03:19:17 server1 kernel: [111961.629710] BUG: kernel NULL pointer dereference, address: 0000000000000cd4
May  6 03:19:17 server1 kernel: [111961.629760] #PF: supervisor write access in kernel mode
May  6 03:19:17 server1 kernel: [111961.629781] #PF: error_code(0x0002) - not-present page
May  6 03:19:17 server1 kernel: [111961.629799] PGD 0 P4D 0 
May  6 03:19:17 server1 kernel: [111961.629814] Oops: 0002 [#1] PREEMPT SMP NOPTI
May  6 03:19:17 server1 kernel: [111961.629834] CPU: 28 PID: 795688 Comm: kworker/u197:25 Tainted: P           O       6.8.4-2-pve #1
May  6 03:19:17 server1 kernel: [111961.629856] Hardware name: Lenovo ThinkSystem SR645/7D2XCTO1WW, BIOS D8E136E-3.30 02/21/2024
May  6 03:19:17 server1 kernel: [111961.629880] Workqueue: kcryptd/252:7 _crypt [dm_crypt]
May  6 03:19:17 server1 kernel: [111961.629911] RIP: 0010:_raw_spin_lock_irqsave+0x2c/0x80
May  6 03:19:17 server1 kernel: [111961.629930] Code: 44 00 00 55 48 89 e5 41 54 53 48 89 fb 9c 58 0f 1f 40 00 49 89 c4 fa 0f 1f 44 00 00 65 ff 05 13 45 0e 47 31 c0 ba 01 00 00 00 <f0> 0f b1 13 75 20 4c 89 e0 5b 41 5c 5d 31 d2 31 c9 31 f6 31 ff 45
May  6 03:19:17 server1 kernel: [111961.629969] RSP: 0018:ffff96976c607be0 EFLAGS: 00010046
May  6 03:19:17 server1 kernel: [111961.629985] RAX: 0000000000000000 RBX: 0000000000000cd4 RCX: ffff96975e0b7d08
May  6 03:19:17 server1 kernel: [111961.630006] RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000cd4
May  6 03:19:17 server1 kernel: [111961.630027] RBP: ffff96976c607bf0 R08: 0000000000000000 R09: 0000000000000000
May  6 03:19:17 server1 kernel: [111961.630047] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000007
May  6 03:19:17 server1 kernel: [111961.630066] R13: 0000000000000cd4 R14: 0000000000000002 R15: 0000000000000003
May  6 03:19:17 server1 kernel: [111961.630087] FS:  0000000000000000(0000) GS:ffff8be10e600000(0000) knlGS:0000000000000000
May  6 03:19:17 server1 kernel: [111961.630109] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May  6 03:19:17 server1 kernel: [111961.630129] CR2: 0000000000000cd4 CR3: 00000001c8efa001 CR4: 0000000000f70ef0
May  6 03:19:17 server1 kernel: [111961.630150] PKRU: 55555554
May  6 03:19:17 server1 kernel: [111961.630162] Call Trace:
May  6 03:19:17 server1 kernel: [111961.630175]  <TASK>
May  6 03:19:17 server1 kernel: [111961.630189]  ? show_regs+0x6d/0x80
May  6 03:19:17 server1 kernel: [111961.630207]  ? __die+0x24/0x80
May  6 03:19:17 server1 kernel: [111961.630223]  ? page_fault_oops+0x176/0x500
May  6 03:19:17 server1 kernel: [111961.630242]  ? srso_alias_return_thunk+0x5/0xfbef5
May  6 03:19:17 server1 kernel: [111961.630260]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
May  6 03:19:17 server1 kernel: [111961.630280]  ? do_user_addr_fault+0x2f9/0x6b0
May  6 03:19:17 server1 kernel: [111961.630297]  ? skcipher_walk_skcipher+0xd0/0x100
May  6 03:19:17 server1 kernel: [111961.630316]  ? exc_page_fault+0x83/0x1b0
May  6 03:19:17 server1 kernel: [111961.630334]  ? asm_exc_page_fault+0x27/0x30
May  6 03:19:17 server1 kernel: [111961.630354]  ? _raw_spin_lock_irqsave+0x2c/0x80
May  6 03:19:17 server1 kernel: [111961.630372]  try_to_wake_up+0x56/0x5f0
May  6 03:19:17 server1 kernel: [111961.630389]  wake_up_process+0x15/0x30
May  6 03:19:17 server1 kernel: [111961.630406]  aio_complete+0x1aa/0x280
May  6 03:19:17 server1 kernel: [111961.630425]  aio_complete_rw+0xe9/0x200
May  6 03:19:17 server1 kernel: [111961.630828]  blkdev_bio_end_io_async+0x3b/0xa0
May  6 03:19:17 server1 kernel: [111961.631131]  bio_endio+0xee/0x180
May  6 03:19:17 server1 kernel: [111961.631411]  __dm_io_complete+0x210/0x340
May  6 03:19:17 server1 kernel: [111961.631682]  ? srso_alias_return_thunk+0x5/0xfbef5
May  6 03:19:17 server1 kernel: [111961.631936]  clone_endio+0x141/0x1f0
May  6 03:19:17 server1 kernel: [111961.632173]  bio_endio+0xee/0x180
May  6 03:19:17 server1 kernel: [111961.632407]  crypt_dec_pending+0x90/0x120 [dm_crypt]
May  6 03:19:17 server1 kernel: [111961.632648]  kcryptd_crypt_read_convert+0x8f/0x190 [dm_crypt]
May  6 03:19:17 server1 kernel: [111961.632891]  kcryptd_crypt+0x1f/0x40 [dm_crypt]
May  6 03:19:17 server1 kernel: [111961.633130]  process_one_work+0x16d/0x350
May  6 03:19:17 server1 kernel: [111961.633366]  worker_thread+0x306/0x440
May  6 03:19:17 server1 kernel: [111961.633595]  ? __pfx_worker_thread+0x10/0x10
May  6 03:19:17 server1 kernel: [111961.633821]  kthread+0xf2/0x120
May  6 03:19:17 server1 kernel: [111961.634044]  ? __pfx_kthread+0x10/0x10
May  6 03:19:17 server1 kernel: [111961.634267]  ret_from_fork+0x47/0x70
May  6 03:19:17 server1 kernel: [111961.634488]  ? __pfx_kthread+0x10/0x10
May  6 03:19:17 server1 kernel: [111961.634707]  ret_from_fork_asm+0x1b/0x30
May  6 03:19:17 server1 kernel: [111961.634928]  </TASK>
May  6 03:19:17 server1 kernel: [111961.635136] Modules linked in: veth dm_crypt ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_CT iptable_raw xt_mac ipt_REJECT nf_reject_ipv4 xt_physdev xt_addrtype xt_multiport xt_conntrack xt_comment xt_NFLOG xt_tcpudp xt_set xt_mark iptable_filter ip_set_hash_net ip_set sctp scsi_transport_iscsi nf_tables nvme_fabrics vrf vxlan ip6_udp_tunnel udp_tunnel dummy bonding softdog sunrpc nfnetlink_log nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd ipmi_ssif rapl wmi_bmof pcspkr mgag200 i2c_algo_bit ipmi_si ipmi_devintf ccp ptdma k10temp ipmi_msghandler joydev input_leds mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 msr efi_pstore dmi_sysfs ip_tables x_tables autofs4 xfs btrfs blake2b_generic xor raid6_pq mlx5_ib ib_uverbs macsec
May  6 03:19:17 server1 kernel: [111961.635249]  ib_core hid_generic usbmouse usbkbd usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32_pclmul mlx5_core nvme xhci_pci mlxfw nvme_core psample ahci xhci_pci_renesas tls nvme_auth libahci xhci_hcd pci_hyperv_intf i2c_piix4 wmi
May  6 03:19:17 server1 kernel: [111961.638019] CR2: 0000000000000cd4
May  6 03:19:17 server1 kernel: [111961.638266] ---[ end trace 0000000000000000 ]---
May  6 03:19:17 server1 kernel: [111961.717384] RIP: 0010:_raw_spin_lock_irqsave+0x2c/0x80
May  6 03:19:17 server1 kernel: [111961.717647] Code: 44 00 00 55 48 89 e5 41 54 53 48 89 fb 9c 58 0f 1f 40 00 49 89 c4 fa 0f 1f 44 00 00 65 ff 05 13 45 0e 47 31 c0 ba 01 00 00 00 <f0> 0f b1 13 75 20 4c 89 e0 5b 41 5c 5d 31 d2 31 c9 31 f6 31 ff 45
May  6 03:19:17 server1 kernel: [111961.718183] RSP: 0018:ffff96976c607be0 EFLAGS: 00010046
May  6 03:19:17 server1 kernel: [111961.718461] RAX: 0000000000000000 RBX: 0000000000000cd4 RCX: ffff96975e0b7d08
May  6 03:19:17 server1 kernel: [111961.718742] RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000cd4
May  6 03:19:17 server1 kernel: [111961.719023] RBP: ffff96976c607bf0 R08: 0000000000000000 R09: 0000000000000000
May  6 03:19:17 server1 kernel: [111961.719304] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000007
May  6 03:19:17 server1 kernel: [111961.719587] R13: 0000000000000cd4 R14: 0000000000000002 R15: 0000000000000003
May  6 03:19:17 server1 kernel: [111961.719868] FS:  0000000000000000(0000) GS:ffff8be10e600000(0000) knlGS:0000000000000000
May  6 03:19:17 server1 kernel: [111961.720154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May  6 03:19:17 server1 kernel: [111961.720439] CR2: 0000000000000cd4 CR3: 00000001c8efa001 CR4: 0000000000f70ef0
May  6 03:19:17 server1 kernel: [111961.720728] PKRU: 55555554
May  6 03:19:17 server1 kernel: [111961.721016] note: kworker/u197:25[795688] exited with irqs disabled
May  6 03:19:17 server1 kernel: [111961.721324] note: kworker/u197:25[795688] exited with preempt_count 3
 
I've read several times that the freezes could be linked to machines with (only?) NVMe SSDs. This is exactly my case (and I suffered the kernel 6.8 freezes): I have two NVMe SSDs in software RAID 1 and an AMD Ryzen 5 3600. My two cents :)
 
Yep, same as us .. not an Intel vs AMD thing for sure .. definitely pegged to NVMe/ZFS

We have ZFS mirrors for the OS on all Ceph nodes, and then the Ceph OSDs are direct access to NVMe drives via LVM volumes, as Ceph always does

Unfortunately, we get no kernel panic nor anything else in the logs that would be helpful .. all our nodes that run VMs are running on Dell RAID and the 6.8 kernel is no problem there .. again, NVMe only
 
Yep, same as us .. not an Intel vs AMD thing for sure .. definitely pegged to NVMe/ZFS

We have ZFS mirrors for the OS on all Ceph nodes, and then the Ceph OSDs are direct access to NVMe drives via LVM volumes, as Ceph always does

Unfortunately, we get no kernel panic nor anything else in the logs that would be helpful .. all our nodes that run VMs are running on Dell RAID and the 6.8 kernel is no problem there .. again, NVMe only

Here's my suggestion.

- Install Debian 12.5
- Install libvirtd
- Create 10 VMs (check the thread for cook.sh / cook2.sh - I have 2 versions; a rough stand-in is sketched below)
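For anyone who can't find cook.sh / cook2.sh in the thread, a rough stand-in along these lines should do the job (not the original script - the base image path, VM sizing and the libvirt network name are just placeholders, adjust to your setup):

Bash:
#!/bin/bash
# Hypothetical stand-in for cook.sh: define 10 small throwaway test VMs via libvirt,
# each thin-cloned from one prepared Debian 12 qcow2 base image.
BASE=/var/lib/libvirt/images/debian12-base.qcow2   # assumed pre-installed base image
for i in $(seq 1 10); do
  # thin clone backed by the base image
  qemu-img create -f qcow2 -b "$BASE" -F qcow2 "/var/lib/libvirt/images/test$i.qcow2"
  # define and start the VM; "default" is libvirt's stock NAT network
  virt-install --name "test$i" --memory 2048 --vcpus 2 \
    --disk "/var/lib/libvirt/images/test$i.qcow2,bus=virtio" \
    --import --os-variant debian12 --network network=default \
    --graphics none --noautoconsole
done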

It will probably run fine with stock Debian 12 (and its ancient 6.1.x kernel).

Then update the kernel to Zabbly (https://github.com/zabbly/linux) - that is (vanilla) 6.8.9 at the moment.

Run your 10 VMs again.

(Unfortunately you need more than one node for the CephFS test...)

My (wasted) time on this:

- stock = 100% OK.
- pve (without the PVE patches - I removed them, guys - https://git.proxmox.com/?p=pve-kern...e72b828e42baa8d986589a9a384a2784e35ec;hb=HEAD) - just with ZFS + upstream patches - 100% bad

^^^ i.e. it's the Ubuntu kernel...
 
Just wanted to make a quick note to thank the OP for the heads up on this one.
It is something for us to be aware of.

I manage 4 Proxmox clusters day to day. (Plus a few more for customers when they need outside help)
I run a cluster at home, and 3 for work.
The work ones are: customer hosting production, internal production, and internal dev.

I have a 5-node cluster at home, running Ceph with NVMe OSDs. (Just a single OSD per node, so only small.)
All nodes are Intel NUC type units. (Mix of Intel and Asus NUCs)
2 of the nodes I recently rebooted to kernel 6.8.4-3, and 3 nodes are still on 6.5.13-3 waiting for a reboot.
I cannot say I have seen any stability issues with the 6.8 nodes, and those have been running on 6.8 for a bit over a week.

At work the dev cluster is still on 6.5 so not relevant to this thread.

The work internal prod cluster is a 7-node Ceph cluster of mixed AMD EPYC and Intel Xeon hosts; just last week I upgraded all nodes to 6.8.4-2.
No stability issues on any nodes.
However, all nodes only have SATA drives; no NVMe drives are involved in that cluster.

The customer hosting cluster is a 5-node EPYC cluster with all-NVMe storage (52 NVMe OSDs total).
That is still on Proxmox 7, however I am planning to upgrade that cluster to Proxmox 8 sometime within the next 2 weeks.
I am glad I saw this thread as I could have tripped over this kernel problem.

I think it might be wise for me when I upgrade the customer hosting cluster to Proxmox 8 to pin kernel 6.5 instead of running 6.8 based on what I am reading.

Side note: we also had massive problems with kernel 5.15 when it came out, and we pinned 5.13 at the time for stability reasons.
Nodes were crashing with kernel panics when LXC containers were shut down and their ethernet interfaces brought down.
5.13 was fine but 5.15 introduced a regression.

Anyway thanks again for the OP's research on this matter, and I will definitely keep the main customer hosting cluster on 6.5 when I do the Proxmox 7 to 8 upgrade until I see that these problems have been sorted out.
 
Hi,
here is the common crash trace I get with 6.8.4-2-pve, related to a Ceph OSD with dmcrypt. I was able to reproduce it 8 times, always with the same log (I'm going to test 6.8.4-3-pve).

After 24 h, no crash with 6.8.4-3-pve. (I'm crossing my fingers, because it was crashing multiple times a day with 6.8.4-2-pve.)

BTW, I have another cluster of 20 nodes on 6.8.4-2-pve without any problem for 3-4 weeks, also with ceph-osd running on it. (The OS is installed on 2 NVMe M.2 drives in hardware RAID 1.)

The only difference on the crashing nodes is the NVMe drives (U.2) used by ceph-osd (with dmcrypt; I don't have a cluster without dmcrypt to test).
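If anyone else wants to quickly check which of their nodes actually have dmcrypt OSDs, something like this should show them (generic device-mapper/lsblk commands, nothing Ceph-specific):

Bash:
# device-mapper devices using the "crypt" target (dmcrypt-encrypted OSDs show up here)
dmsetup ls --target crypt
# or look for "crypt" entries in the block device tree
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT | grep -w crypt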
 
On Friday I upgraded my Proxmox to 6.8.4-3-pve and unfortunately, no luck.

Code:
[78019.882528] BUG: kernel NULL pointer dereference, address: 0000000000000008
[78019.889875] #PF: supervisor write access in kernel mode
[78019.895442] #PF: error_code(0x0002) - not-present page
[78019.900904] PGD 0 P4D 0
[78019.903772] Oops: 0002 [#1] PREEMPT SMP NOPTI
[78019.908442] CPU: 9 PID: 1351 Comm: kvm Tainted: P           O       6.8.4-3-pve #1
[78019.916338] Hardware name: GIGABYTE MX33-BS1-V1/MX33-BS1-V1, BIOS F09d 08/27/2023
 
After 24 h, no crash with 6.8.4-3-pve. (I'm crossing my fingers, because it was crashing multiple times a day with 6.8.4-2-pve.)

BTW, I have another cluster of 20 nodes on 6.8.4-2-pve without any problem for 3-4 weeks, also with ceph-osd running on it. (The OS is installed on 2 NVMe M.2 drives in hardware RAID 1.)

The only difference on the crashing nodes is the NVMe drives (U.2) used by ceph-osd (with dmcrypt; I don't have a cluster without dmcrypt to test).
Hi,

can you go into a little bit more detail (in terms of hardware)?

We have already had a case where Samsung SSDs were kind of problematic in combination with LSI controllers and ZFS. The same drive attached directly, without a controller, didn't reproduce the problem.

thx
 
Getting random freezes on a test workstation with proxmox-kernel-6.8.4-3-pve-signed. It has a pair of NVMe drives in RAID 1. It happens whenever there is at least one QEMU VM running :)

Code:
2024-05-22T14:15:53.397552+02:00 ryzen-cobaya kernel: [  353.242845] BUG: kernel NULL pointer dereference, address: 0000000000000008
2024-05-22T14:15:53.397561+02:00 ryzen-cobaya kernel: [  353.242866] #PF: supervisor write access in kernel mode
2024-05-22T14:15:53.397562+02:00 ryzen-cobaya kernel: [  353.242877] #PF: error_code(0x0002) - not-present page
2024-05-22T14:15:53.397562+02:00 ryzen-cobaya kernel: [  353.242887] PGD 0 P4D 0
2024-05-22T14:15:53.397563+02:00 ryzen-cobaya kernel: [  353.242896] Oops: 0002 [#1] PREEMPT SMP NOPTI
2024-05-22T14:15:53.397563+02:00 ryzen-cobaya kernel: [  353.242906] CPU: 3 PID: 2613 Comm: kvm Tainted: P           O       6.8.4-3-pve #1
2024-05-22T14:15:53.397564+02:00 ryzen-cobaya kernel: [  353.242919] Hardware name: ASUS System Product Name/TUF GAMING B550-PLUS, BIOS 3002 02/23/2023


Going back to 6.5.13-5-pve-signed fixes the problem.
 
After 24 h, no crash with 6.8.4-3-pve. (I'm crossing my fingers, because it was crashing multiple times a day with 6.8.4-2-pve.)

BTW, I have another cluster of 20 nodes on 6.8.4-2-pve without any problem for 3-4 weeks, also with ceph-osd running on it. (The OS is installed on 2 NVMe M.2 drives in hardware RAID 1.)

The only difference on the crashing nodes is the NVMe drives (U.2) used by ceph-osd (with dmcrypt; I don't have a cluster without dmcrypt to test).
Same here, 4+ days uptime without crash with 6.8.4-3-pve now. Seems to fix one of the issues.
 
I got a reply to one of my LinkedIn posts that recommends the following parameters. Can someone with 6.8 and crashes try them out: add pcie_port_pm=off libata.force=noncq to /etc/kernel/cmdline, then do a proxmox-boot-tool refresh and reboot?

There are also a few new issues, mostly related to the big jump in kernel version:
- when using Veeam agents installed on nodes, DKMS silently fails to build and prepare them, which can lead to a broken initramfs rebuild - leaving you with an unbootable kernel.
Resolve: apt purge veeam blksnapd
- some 'older' hardware seems to have problems with power management on SSD/NVMe drives (magical slowdowns and hangs after some I/O operations).
Resolve: add the kernel parameters
pcie_port_pm=off libata.force=noncq
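Spelled out, roughly (assuming the host is managed by proxmox-boot-tool / systemd-boot, where /etc/kernel/cmdline applies; on a GRUB install you would instead add the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run update-grub):

Bash:
# /etc/kernel/cmdline is a single line; append the suggested parameters to it
sed -i '$ s/$/ pcie_port_pm=off libata.force=noncq/' /etc/kernel/cmdline
# rewrite the boot entries, then reboot to pick up the new command line
proxmox-boot-tool refresh
reboot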
 
I got a reply to one of my LinkedIn posts that recommends the following parameters. Can someone with 6.8 and crashes try them out: add pcie_port_pm=off libata.force=noncq to /etc/kernel/cmdline, then do a proxmox-boot-tool refresh and reboot?
Trying this now... Tried so many things:
1. Disable cstates
2. Disable e-cores
3. Pin to older kernel
4. Trying different Intel microcode (tried the May release yesterday and got a kernel panic overnight).

But whenever the NVMe is hit hard (e.g. during a backup), the kernel will sometimes crash. It's the unpredictability that's a PITA. Will try another NVMe when I get to the shops next time.
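Before buying a new drive, it might be worth pulling the SMART data and NVMe error log to see whether the drive reports media errors or resets around the crashes - roughly like this, assuming smartmontools and nvme-cli are installed and the device in question is /dev/nvme0:

Bash:
apt install -y smartmontools nvme-cli
# SMART overview incl. firmware revision, temperature and media errors (adjust the device name)
smartctl -a /dev/nvme0
nvme smart-log /dev/nvme0
# controller error-log entries sometimes hint at resets/timeouts before a freeze
nvme error-log /dev/nvme0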
 
I had some severe I/O problems with ZFS and kernel 6.8 three weeks ago.
After that I also pinned to 6.5.13 (after restoring the data from a backup).

After that experience I don't trust kernel 6.8 - maybe it was released too early...
It's now the default kernel of PVE 8.2, so you cannot easily revert.
:(
 
I had some severe I/O problems with ZFS and kernel 6.8 three weeks ago.
After that I also pinned to 6.5.13 (after restoring the data from a backup).
What do you mean? Did you have to restore any data?

After that experience I don't trust kernel 6.8 - maybe it was released too early...
It's now the default kernel of PVE 8.2, so you cannot easily revert.
:(
It's pretty easy to use an older kernel. Just download the kernel (if it's not already there, e.g. when upgrading from an older version of Proxmox) and then run something like this:
Bash:
proxmox-boot-tool kernel pin 6.5.13-5-pve

To go back to the default, IIRC the command is:
Bash:
proxmox-boot-tool kernel unpin
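To see which kernels are installed and which one is currently pinned, and to confirm the running version after the reboot:

Bash:
# list installed kernels and show the pinned/selected one
proxmox-boot-tool kernel list
# after the reboot, confirm what is actually running
uname -r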

I'm back on kernel 6.8 and applying jsterr's kernel cmdline. It has been good for around 4 days, the longest run on 6.8.x thus far.

Earlier today I noticed split lock detections freezing my VMs when they are doing very high I/O (transferring data at 250 MB/s to 450 MB/s). Turns out the VMs are not technically frozen, just "slowed down"? Disabling the split lock mitigation seems to sort that out.
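For reference, these are the two knobs I believe control this (double-check on your kernel version; the sysctl only exists on newer 6.x kernels):

Bash:
# runtime: stop the kernel from throttling split-lock offenders
sysctl kernel.split_lock_mitigate=0
# or disable split lock detection entirely via the kernel command line:
#   split_lock_detect=off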

Will keep both changes and see how long the machine can stay up without kernel panic.

So far it seems Proxmox 8.2 will misbehave when there are bursts or sustained I/O on NVMe drives (either heavy activity within the VMs, or when doing a backup). This is probably hardware-related, and I bet switching to a different NVMe drive will fix it in my case at least; YMMV.
 
It's been more than 2 weeks since I applied the changes suggested by jsterr. Happy to report the node is stable so far. Going to re-enable VM backups on this node and that will be the ultimate stability test. Fingers crossed.
 
