Random 6.8.4-2-pve kernel crashes

Some information about my crashing Hetzner server:

Code:
Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
    Manufacturer: ASRockRack
    Product Name: B450D4U-V1L
    Version:                      
    Serial Number: M80-D5016200816
    Asset Tag:                      
    Features:
        Board is a hosting board
        Board is replaceable
    Location In Chassis:                      
    Chassis Handle: 0x0003
    Type: Motherboard
    Contained Object Handles: 0
   
Handle 0x0015, DMI type 4, 48 bytes
Processor Information
    Socket Designation: CPU1
    Type: Central Processor
    Family: Zen
    Manufacturer: Advanced Micro Devices, Inc.
    ID: 10 0F 87 00 FF FB 8B 17
    Signature: Family 23, Model 113, Stepping 0
    Flags:
        FPU (Floating-point unit on-chip)
        VME (Virtual mode extension)
        DE (Debugging extension)
        PSE (Page size extension)
        TSC (Time stamp counter)
        MSR (Model specific registers)
        PAE (Physical address extension)
        MCE (Machine check exception)
        CX8 (CMPXCHG8 instruction supported)
        APIC (On-chip APIC hardware supported)
        SEP (Fast system call)
        MTRR (Memory type range registers)
        PGE (Page global enable)
        MCA (Machine check architecture)
        CMOV (Conditional move instruction supported)
        PAT (Page attribute table)
        PSE-36 (36-bit page size extension)
        CLFSH (CLFLUSH instruction supported)
        MMX (MMX technology supported)
        FXSR (FXSAVE and FXSTOR instructions supported)
        SSE (Streaming SIMD extensions)
        SSE2 (Streaming SIMD extensions 2)
        HTT (Multi-threading)
    Version: AMD Ryzen 5 3600 6-Core Processor            
    Voltage: 1.1 V
    External Clock: 100 MHz
    Max Speed: 4200 MHz
    Current Speed: 3600 MHz
    Status: Populated, Enabled
    Upgrade: Socket AM4
    L1 Cache Handle: 0x0012
    L2 Cache Handle: 0x0013
    L3 Cache Handle: 0x0014
    Serial Number: Unknown
    Asset Tag: Unknown
    Part Number: Unknown
    Core Count: 6
    Core Enabled: 6
    Thread Count: 12
    Characteristics:
        64-bit capable
        Multi-Core
        Hardware Thread
        Execute Protection
        Enhanced Virtualization
        Power/Performance Control
 

Maybe it's related to the Intel NIC?
B450 and B550 are mostly the same so the platforms are very similar.
 
My X540-AT2-based X10DRU-i+ systems run fine when they have no NVMe storage, no ConnectX-3, and no ceph-osd.

So I do not see the Intel NIC as the culprit.
 
Also having problems with the 6.8 update here, had to pin to 6.5.

Hardware:
Supermicro X13DEI-T
Mellanox ConnectX-5

I am getting the following logs, which seem to point to the Broadcom driver for the Mellanox card?:

May 06 14:20:43 exegol kernel: ------------[ cut here ]------------
May 06 14:20:43 exegol kernel: UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
May 06 14:20:43 exegol kernel: shift exponent 64 is too large for 64-bit type 'long unsigned int'
May 06 14:20:43 exegol kernel: CPU: 27 PID: 1541 Comm: (udev-worker) Tainted: P O 6.8.4-2-pve #1
May 06 14:20:43 exegol kernel: Hardware name: Supermicro Super Server/X13DEI-T, BIOS 2.1 12/13/2023
May 06 14:20:43 exegol kernel: Call Trace:
May 06 14:20:43 exegol kernel: <TASK>
May 06 14:20:43 exegol kernel: dump_stack_lvl+0x48/0x70
May 06 14:20:43 exegol kernel: dump_stack+0x10/0x20
May 06 14:20:43 exegol kernel: __ubsan_handle_shift_out_of_bounds+0x1ac/0x360
May 06 14:20:43 exegol kernel: bnxt_qplib_alloc_init_hwq.cold+0x8c/0xd7 [bnxt_re]
May 06 14:20:43 exegol kernel: bnxt_qplib_create_qp+0x1d5/0x8c0 [bnxt_re]
May 06 14:20:43 exegol kernel: bnxt_re_create_qp+0x71d/0xf30 [bnxt_re]
May 06 14:20:43 exegol kernel: ? bnxt_qplib_create_cq+0x247/0x330 [bnxt_re]
May 06 14:20:43 exegol kernel: ? __kmalloc+0x1ab/0x400
May 06 14:20:43 exegol kernel: create_qp+0x17a/0x290 [ib_core]
May 06 14:20:43 exegol kernel: ? create_qp+0x17a/0x290 [ib_core]
May 06 14:20:43 exegol kernel: ib_create_qp_kernel+0x3b/0xe0 [ib_core]
May 06 14:20:43 exegol kernel: create_mad_qp+0x8e/0x100 [ib_core]
May 06 14:20:43 exegol kernel: ? __pfx_qp_event_handler+0x10/0x10 [ib_core]
May 06 14:20:43 exegol kernel: ib_mad_init_device+0x2c2/0x8a0 [ib_core]
May 06 14:20:43 exegol kernel: add_client_context+0x127/0x1c0 [ib_core]
May 06 14:20:43 exegol kernel: enable_device_and_get+0xe6/0x1e0 [ib_core]
May 06 14:20:43 exegol kernel: ib_register_device+0x506/0x610 [ib_core]
May 06 14:20:43 exegol kernel: bnxt_re_probe+0xe7d/0x11a0 [bnxt_re]
May 06 14:20:43 exegol kernel: ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
May 06 14:20:43 exegol kernel: auxiliary_bus_probe+0x3e/0xa0
May 06 14:20:43 exegol kernel: really_probe+0x1c9/0x430
May 06 14:20:43 exegol kernel: __driver_probe_device+0x8c/0x190
May 06 14:20:43 exegol kernel: driver_probe_device+0x24/0xd0
May 06 14:20:43 exegol kernel: __driver_attach+0x10b/0x210
May 06 14:20:43 exegol kernel: ? __pfx___driver_attach+0x10/0x10
May 06 14:20:43 exegol kernel: bus_for_each_dev+0x8a/0xf0
May 06 14:20:43 exegol kernel: driver_attach+0x1e/0x30
May 06 14:20:43 exegol kernel: bus_add_driver+0x156/0x260
May 06 14:20:43 exegol kernel: driver_register+0x5e/0x130
May 06 14:20:43 exegol kernel: __auxiliary_driver_register+0x73/0xf0
May 06 14:20:43 exegol kernel: ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
May 06 14:20:43 exegol kernel: bnxt_re_mod_init+0x3e/0xff0 [bnxt_re]
May 06 14:20:43 exegol kernel: ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
May 06 14:20:43 exegol kernel: do_one_initcall+0x5b/0x340
May 06 14:20:43 exegol kernel: do_init_module+0x97/0x290
May 06 14:20:43 exegol kernel: load_module+0x213a/0x22a0
May 06 14:20:43 exegol kernel: init_module_from_file+0x96/0x100
May 06 14:20:43 exegol kernel: ? init_module_from_file+0x96/0x100
May 06 14:20:43 exegol kernel: idempotent_init_module+0x11c/0x2b0
May 06 14:20:43 exegol kernel: __x64_sys_finit_module+0x64/0xd0
May 06 14:20:43 exegol kernel: do_syscall_64+0x84/0x180
May 06 14:20:43 exegol kernel: ? syscall_exit_to_user_mode+0x86/0x260
May 06 14:20:43 exegol kernel: ? do_syscall_64+0x93/0x180
May 06 14:20:43 exegol kernel: ? syscall_exit_to_user_mode+0x86/0x260
May 06 14:20:43 exegol kernel: ? do_syscall_64+0x93/0x180
May 06 14:20:43 exegol kernel: ? exc_page_fault+0x94/0x1b0
May 06 14:20:43 exegol kernel: entry_SYSCALL_64_after_hwframe+0x73/0x7b
May 06 14:20:43 exegol kernel: RIP: 0033:0x72645a931719
May 06 14:20:43 exegol kernel: Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
May 06 14:20:43 exegol kernel: RSP: 002b:00007ffed52a8998 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
May 06 14:20:43 exegol kernel: RAX: ffffffffffffffda RBX: 0000608b2a9f24a0 RCX: 000072645a931719
May 06 14:20:43 exegol kernel: RDX: 0000000000000000 RSI: 000072645aac4efd RDI: 0000000000000006
May 06 14:20:43 exegol kernel: RBP: 000072645aac4efd R08: 0000000000000000 R09: 0000608b2a9a89b0
May 06 14:20:43 exegol kernel: R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
May 06 14:20:43 exegol kernel: R13: 0000000000000000 R14: 0000608b2aa20560 R15: 0000608b29633ec1
May 06 14:20:43 exegol kernel: </TASK>
May 06 14:20:43 exegol kernel: ---[ end trace ]---

(...)
May 06 14:21:43 exegol systemd-udevd[1502]: bnxt_en.rdma.0: Worker [1541] processing SEQNUM=18440 is taking a long time
May 06 14:21:43 exegol systemd-udevd[1502]: bnxt_en.rdma.1: Worker [1605] processing SEQNUM=18443 is taking a long time
May 06 14:22:23 exegol kernel: bnxt_en 0000:3d:00.0: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xe]=0x3 waited (100205 > 100000) msec active 1
May 06 14:22:23 exegol kernel: bnxt_en 0000:3d:00.0 bnxt_re0: Failed to modify HW QP
May 06 14:22:23 exegol kernel: infiniband bnxt_re0: Couldn't change QP1 state to INIT: -110
May 06 14:22:23 exegol kernel: infiniband bnxt_re0: Couldn't start port
May 06 14:22:23 exegol kernel: bnxt_en 0000:3d:00.0 bnxt_re0: Failed to destroy HW QP
May 06 14:22:23 exegol kernel: ------------[ cut here ]------------
May 06 14:22:23 exegol kernel: WARNING: CPU: 53 PID: 1541 at drivers/infiniband/core/cq.c:322 ib_free_cq+0x109/0x150 [ib_core]
May 06 14:22:23 exegol kernel: Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iaa_crypto kvm >
May 06 14:22:23 exegol kernel: spi_intel_pci i2c_i801 bnxt_en nvme_auth xhci_hcd libahci spi_intel pci_hyperv_intf i2c_ismt i2c_smbus wmi pinctrl_emmitsburg
May 06 14:22:23 exegol kernel: CPU: 53 PID: 1541 Comm: (udev-worker) Tainted: P O 6.8.4-2-pve #1
May 06 14:22:23 exegol kernel: Hardware name: Supermicro Super Server/X13DEI-T, BIOS 2.1 12/13/2023
May 06 14:22:23 exegol kernel: RIP: 0010:ib_free_cq+0x109/0x150 [ib_core]
May 06 14:22:23 exegol kernel: Code: e8 fc 9c 02 00 65 ff 0d 9d a7 37 3f 0f 85 70 ff ff ff 0f 1f 44 00 00 e9 66 ff ff ff 48 8d 7f 50 e8 6c 5a a5 fa e9 35 ff ff ff <0f> 0b 31 c0 31 f6 31 ff c3 cc cc cc cc 0f 0b eb 80 44 0f b6 25 64
May 06 14:22:23 exegol kernel: RSP: 0018:ff470edb78f43670 EFLAGS: 00010202
May 06 14:22:23 exegol kernel: RAX: 0000000000000002 RBX: 0000000000000001 RCX: 0000000000000000
May 06 14:22:23 exegol kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff45cdbf8bf05800
May 06 14:22:23 exegol kernel: RBP: ff470edb78f436e0 R08: 0000000000000000 R09: 0000000000000000
May 06 14:22:23 exegol kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff45cdbf8aa00000
May 06 14:22:23 exegol kernel: R13: ff45cd9fc95cc300 R14: 00000000ffffff92 R15: ff45cdbf507ee000
May 06 14:22:23 exegol kernel: FS: 000072645a2248c0(0000) GS:ff45cddebf880000(0000) knlGS:0000000000000000
May 06 14:22:23 exegol kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 06 14:22:23 exegol kernel: CR2: 0000608b2a9de018 CR3: 00000020c7a80001 CR4: 0000000000f71ef0
May 06 14:22:23 exegol kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 06 14:22:23 exegol kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
May 06 14:22:23 exegol kernel: PKRU: 55555554
May 06 14:22:23 exegol kernel: Call Trace:
May 06 14:22:23 exegol kernel: <TASK>
May 06 14:22:23 exegol kernel: ? show_regs+0x6d/0x80
May 06 14:22:23 exegol kernel: ? __warn+0x89/0x160
May 06 14:22:23 exegol kernel: ? ib_free_cq+0x109/0x150 [ib_core]
May 06 14:22:23 exegol kernel: ? report_bug+0x17e/0x1b0
May 06 14:22:23 exegol kernel: ? handle_bug+0x46/0x90
May 06 14:22:23 exegol kernel: ? exc_invalid_op+0x18/0x80
May 06 14:22:23 exegol kernel: ? asm_exc_invalid_op+0x1b/0x20
May 06 14:22:23 exegol kernel: ? ib_free_cq+0x109/0x150 [ib_core]
May 06 14:22:23 exegol kernel: ? ib_mad_init_device+0x54c/0x8a0 [ib_core]
May 06 14:22:23 exegol kernel: add_client_context+0x127/0x1c0 [ib_core]
May 06 14:22:23 exegol kernel: enable_device_and_get+0xe6/0x1e0 [ib_core]
May 06 14:22:23 exegol kernel: ib_register_device+0x506/0x610 [ib_core]
May 06 14:22:23 exegol kernel: bnxt_re_probe+0xe7d/0x11a0 [bnxt_re]
May 06 14:22:23 exegol kernel: ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
May 06 14:22:23 exegol kernel: auxiliary_bus_probe+0x3e/0xa0
May 06 14:22:23 exegol kernel: really_probe+0x1c9/0x430
May 06 14:22:23 exegol kernel: __driver_probe_device+0x8c/0x190
May 06 14:22:23 exegol kernel: driver_probe_device+0x24/0xd0
May 06 14:22:23 exegol kernel: __driver_attach+0x10b/0x210
May 06 14:22:23 exegol kernel: ? __pfx___driver_attach+0x10/0x10
May 06 14:22:23 exegol kernel: bus_for_each_dev+0x8a/0xf0
May 06 14:22:23 exegol kernel: driver_attach+0x1e/0x30
May 06 14:22:23 exegol kernel: bus_add_driver+0x156/0x260
May 06 14:22:23 exegol kernel: driver_register+0x5e/0x130
May 06 14:22:23 exegol kernel: __auxiliary_driver_register+0x73/0xf0
May 06 14:22:23 exegol kernel: ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
May 06 14:22:23 exegol kernel: bnxt_re_mod_init+0x3e/0xff0 [bnxt_re]
May 06 14:22:23 exegol kernel: ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
May 06 14:22:23 exegol kernel: do_one_initcall+0x5b/0x340
May 06 14:22:23 exegol kernel: do_init_module+0x97/0x290
May 06 14:22:23 exegol kernel: load_module+0x213a/0x22a0
May 06 14:22:23 exegol kernel: init_module_from_file+0x96/0x100
May 06 14:22:23 exegol kernel: ? init_module_from_file+0x96/0x100
May 06 14:22:23 exegol kernel: idempotent_init_module+0x11c/0x2b0
May 06 14:22:23 exegol kernel: __x64_sys_finit_module+0x64/0xd0
May 06 14:22:23 exegol kernel: do_syscall_64+0x84/0x180
May 06 14:22:23 exegol kernel: ? syscall_exit_to_user_mode+0x86/0x260
May 06 14:22:23 exegol kernel: ? do_syscall_64+0x93/0x180
May 06 14:22:23 exegol kernel: ? syscall_exit_to_user_mode+0x86/0x260
May 06 14:22:23 exegol kernel: ? do_syscall_64+0x93/0x180
May 06 14:22:23 exegol kernel: ? exc_page_fault+0x94/0x1b0
May 06 14:22:23 exegol kernel: entry_SYSCALL_64_after_hwframe+0x73/0x7b
May 06 14:22:23 exegol kernel: RIP: 0033:0x72645a931719
May 06 14:22:23 exegol kernel: Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
May 06 14:22:23 exegol kernel: RSP: 002b:00007ffed52a8998 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
May 06 14:22:23 exegol kernel: RAX: ffffffffffffffda RBX: 0000608b2a9f24a0 RCX: 000072645a931719
May 06 14:22:23 exegol kernel: RDX: 0000000000000000 RSI: 000072645aac4efd RDI: 0000000000000006
May 06 14:22:23 exegol kernel: RBP: 000072645aac4efd R08: 0000000000000000 R09: 0000608b2a9a89b0
May 06 14:22:23 exegol kernel: R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
May 06 14:22:23 exegol kernel: R13: 0000000000000000 R14: 0000608b2aa20560 R15: 0000608b29633ec1
May 06 14:22:23 exegol kernel: </TASK>
May 06 14:22:23 exegol kernel: ---[ end trace 0000000000000000 ]---
May 06 14:22:23 exegol kernel: bnxt_en 0000:3d:00.0 bnxt_re0: Free MW failed: 0xffffff92
May 06 14:22:23 exegol kernel: infiniband bnxt_re0: Couldn't open port 1

However, various processes simply crash after the machine has booted up, and the network doesn't work either.
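
For what it's worth: bnxt_re is the Broadcom NetXtreme RDMA driver (the ConnectX-5 would use mlx5_core/mlx5_ib), so the trace likely comes from an onboard Broadcom NIC rather than the Mellanox card. A possible stopgap - a sketch, assuming RDMA/RoCE on that Broadcom port is not needed - is to blacklist only the RDMA module while keeping the base NIC driver:

Bash:
# Keep the base NIC driver (bnxt_en) but stop udev from loading the
# crashing RDMA module (bnxt_re).
echo "blacklist bnxt_re" > /etc/modprobe.d/blacklist-bnxt_re.conf
update-initramfs -u -k all   # apply the blacklist in early boot too
reboot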
 
I can confirm that 6.8.4-3-pve is still broken on my nuc.
This seems like the intel_iommu issue - does the issue remain if you add intel_iommu=off to the kernel command line?
If not - then just leave it at that - I assume that the NUC never worked with intel_iommu=on ...
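
(For reference, a sketch of how such a flag is typically added on a PVE host - the exact file depends on the bootloader:)

Bash:
# GRUB-booted hosts: append the flag to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub, e.g.  GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"
update-grub

# systemd-boot hosts (e.g. ZFS-on-root UEFI installs): append the flag
# to the single line in /etc/kernel/cmdline instead, then:
proxmox-boot-tool refresh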
 

Yes - intel_iommu=off "fixes" it. So no change on 6.8.4-2 -> 6.8.4-3 (on Intel)
 
Then I think this is the resolution to the issue on that NUC.
The default will most likely stay 'on', so for systems with broken implementations, disabling it is a sensible solution.
 

I want to be super super nice.

I have been on this issue for 4 business days now - my own time, budget, brain, and head-smashing.

6.8.4 and 6.8.8 vanilla "zabbly" https://github.com/zabbly/linux

They work out of the box - with intel_iommu=on - with libvirt and QEMU on Debian 12, on the same hardware and with identical VMs.

So with all respect - and I really mean that with all respect (!) - it's a PVE kernel issue. Not a hardware or upstream kernel issue.

In other words: only the 6.8.4-X-pve kernels crash.


That is what I am interested in fixing.

Why? If you roll out the vanilla Proxmox ISO (8.2.2) on existing Hetzner hardware, you have hundreds of computers not booting. :) Some of them will eventually become a Harry problem.
 
Did you add intel_iommu=on to the command line?
Asking because the package downloaded from: https://pkgs.zabbly.com/kernel/stable/pool/main/l/linux-zabbly-6.8.8-amd64-202404301432-debian12/linux-image-6.8.8-zabbly+_6.8.8-amd64-202404301432-debian12_amd64.deb
does not enable intel_iommu by default
(you can check it yourself with `grep -i iommu /boot/config-6.8.8-zabbly+`).

Does dmesg on this kernel indicate that the IOMMU is active?
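
(A sketch of both checks, using the config path of the zabbly kernel mentioned above:)

Bash:
# Which IOMMU options is the kernel built with? Look for
# CONFIG_INTEL_IOMMU_DEFAULT_ON and CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON.
grep -i iommu /boot/config-6.8.8-zabbly+

# Does the running kernel report an active IOMMU?
dmesg | grep -i -e DMAR -e iommu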

Triple-checked :)

I ran this on 6.8.6-zabbly+ and 6.8.8-zabbly+.

Only the 6.8.4-X-pve kernels crash... (same hardware, same VM ISO, ...)

(The difference that makes the difference is probably QEMU & libraries - but it's the same Debian 12 core.)

Bash:
root@nuc:~# date
Mon May  6 18:52:16 CEST 2024
root@nuc:~# cat /etc/debian_version
12.5
root@nuc:~# dmesg | grep iommu
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.8-zabbly+ root=UUID=5421a3a9-ee54-4fef-a15f-38275e1971a6 ro quiet intel_iommu=on
[    0.047004] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.8-zabbly+ root=UUID=5421a3a9-ee54-4fef-a15f-38275e1971a6 ro quiet intel_iommu=on
[    0.447127] iommu: Default domain type: Translated
[    0.447127] iommu: DMA domain TLB invalidation policy: lazy mode
[    0.526160] pci 0000:00:02.0: Adding to iommu group 0
[    0.527548] pci 0000:00:00.0: Adding to iommu group 1
[    0.527577] pci 0000:00:14.0: Adding to iommu group 2
[    0.527590] pci 0000:00:14.2: Adding to iommu group 2
[    0.527610] pci 0000:00:16.0: Adding to iommu group 3
[    0.527625] pci 0000:00:17.0: Adding to iommu group 4
[    0.527649] pci 0000:00:1c.0: Adding to iommu group 5
[    0.527672] pci 0000:00:1e.0: Adding to iommu group 6
[    0.527686] pci 0000:00:1e.6: Adding to iommu group 6
[    0.527727] pci 0000:00:1f.0: Adding to iommu group 7
[    0.527742] pci 0000:00:1f.2: Adding to iommu group 7
[    0.527757] pci 0000:00:1f.3: Adding to iommu group 7
[    0.527771] pci 0000:00:1f.4: Adding to iommu group 7
[    0.527786] pci 0000:00:1f.6: Adding to iommu group 7
[    0.527811] pci 0000:01:00.0: Adding to iommu group 8
[    1.519975] platform idma64.0: Adding to iommu group 9
[    1.520358] platform dw-apb-uart.0: Adding to iommu group 10

cook.sh

Run it with ./cook.sh 1 ... ./cook.sh 10 (a loop example follows the script).

Bash:
#!/bin/bash
# Create one Debian 12 test VM per invocation; the instance number is passed as $1.
virt-install --virt-type kvm --name bookworm-amd64-$1 \
        --cdrom ./debian-12.5.0-amd64-netinst.iso \
        --os-variant debian11 \
        --disk size=10 --memory 1024
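
A usage sketch for spawning all ten test VMs sequentially:

Bash:
for i in $(seq 1 10); do ./cook.sh "$i"; done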


No kernel panics. No oopses. ...

Again - I am willing to jump on a call with your kernel devs or work on this via a GitHub repo. It works on zabbly.
 

Attachments

  • nuc.png (420.5 KB)

Hm - thanks for verifying this and for your efforts so far!

Could you try the zabbly kernel for 6.8.4 (as that's what our kernel is actually based on)
and the Ubuntu 24.04 kernel:
https://packages.ubuntu.com/noble/amd64/linux-image-6.8.0-31-generic/download

One further difference I noticed in the zabbly config vs. ours: `CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON` is set in ours.
Could you try booting the PVE kernel with `intel_iommu=sm_off`? (Not 100% sure that it disables that part - but it would still be interesting to see if this works on your machine.)

Thanks!
 

Tomorrow :)

- I will create scripts to download + install the Ubuntu Kernels on Debian 12
- Probably a git repo for this.



If this isn't working - let's jump into the git kernel tree diff madness. That will be a 5-cups-of-coffee day - but I am willing to do it.

From what I understand, the zabbly kernel adds KVM patches. I also get less "noise" about a broken BIOS (in contrast to the Debian kernel).

-------------------------

And Proxmox guys - please promise one thing.

Get some of the machines you killed (or ask Hetzner for a donation...) and add them to your QA/unit tests. It's a no-brainer. They are cheap, and you need to test them for 8.2.2 -> 8.x updates.

Ping me if you need help. I am doing this for very selfish reasons! I love to "make" things. We all hate (re-)fixing bugs.
 
Earlier in this thread it was asked whether folks could report if "intel_iommu=off" fixed the crashing.

For us it does not. Booting into kernel 6.8.4 still crashes the machine hard.

This is on kernel 6.8.4; our fix is to pin our boot to 6.5.13-5-pve at this time.
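
(For reference, a sketch of pinning a boot kernel on PVE 8 with proxmox-boot-tool:)

Bash:
proxmox-boot-tool kernel list             # show installed/pinned kernels
proxmox-boot-tool kernel pin 6.5.13-5-pve # boot this version by default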

Our machines are Dell R740s - NVMe-only nodes with no hardware RAID; the only RAID involved is ZFS mirrors for the OS. Ceph accesses the NVMe drives directly.

Ceph version is 18.2.2

Current working kernel is 6.5.13-5-pve

Hope this extra bit of information helps

Thanks
 
Do you have any logs from the crash? Otherwise it's hard to see where the issue is rooted...
Also, did you try `6.8.4-3-pve`? It was released to the no-subscription repository yesterday and contains quite a few fixes.
 

Ubuntu kernel

(Tested on Debian 12)

Bash:
# Fetch the Ubuntu 24.04 (noble) 6.8.0-31 kernel image + modules
wget -c -t0 http://de.archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-modules-6.8.0-31-generic_6.8.0-31.31_amd64.deb
wget -c -t0 http://de.archive.ubuntu.com/ubuntu/pool/main/l/linux-signed/linux-image-6.8.0-31-generic_6.8.0-31.31_amd64.deb

# Install the modules first, then the signed image
sudo dpkg -i linux-modules-6.8.0-31-generic_6.8.0-31.31_amd64.deb
sudo dpkg -i linux-image-6.8.0-31-generic_6.8.0-31.31_amd64.deb

# Regenerate the GRUB menu so the new kernel shows up
sudo update-grub

  • I select "generic-6.8.0-31.31" in the Debian GRUB boot menu
  • The kernel boots
  • Basically "nothing" works - no WiFi, no X11 (only 640x480)
  • libvirtd errors (I have no idea why and I don't care why)
  • We have a /dev/kvm
BUT: the PVE kernels (without any VMs!) crashed at this point - no VM started, just after boot. Check my logs :)
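
(A generic sanity check that KVM is usable under a given kernel - a sketch, not from the original post:)

Bash:
ls -l /dev/kvm                       # the KVM device node should exist
lsmod | grep -E 'kvm(_intel|_amd)'   # are the KVM modules loaded?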


PVE Kernel (intel_iommu=sm_off)

(Tested on Proxmox 8.2.2, kernel 6.8.4-3-pve)

Didn't help.

Bash:
root@nuc:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.4-3-pve root=UUID=fa27a3ec-e659-4b5d-8416-ac160913f16b ro quiet intel_iommu=sm_off
root@nuc:~# dmesg  | more
[    1.674009] intel-lpss 0000:00:1e.0: enabling device (0000 -> 0002)
[    1.674362] ------------[ cut here ]------------
[    1.674364] WARNING: CPU: 3 PID: 141 at drivers/iommu/intel/iommu.c:167 intel_iommu_probe_device+0x26d/0x8d0
[    1.674371] Modules linked in: intel_lpss_pci(+) i2c_i801 xhci_pci_renesas crc32_pclmul intel_lpss e1000e i2c_smbus cqhci xhci_hcd ahci sdhci idma64 libahci video wmi pinctrl_sunrisepoint
[    1.674391] CPU: 3 PID: 141 Comm: (udev-worker) Tainted: G          I        6.8.4-3-pve #1
[    1.674395] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0073.2020.0909.1625 09/09/2020
[    1.674397] RIP: 0010:intel_iommu_probe_device+0x26d/0x8d0
[    1.674401] Code: b7 f6 0f b7 42 d4 48 8d 4a 10 66 c1 c0 08 0f b7 c0 39 c6 0f 8c 90 00 00 00 0f 8f 86 00 00 00 4c 89 fe 4c 89 f7 e8 23 27 6a 00 <0f> 0b 48 c7 c0 ef ff ff ff 4c 89 ef 48 89 45 c0 e8 6e a8 95 ff 48
[    1.674404] RSP: 0018:ffffa9e8403b3518 EFLAGS: 00010246
[    1.674408] RAX: 0000000000000000 RBX: ffff9a4090999c10 RCX: 0000000000000000
[    1.674410] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    1.674412] RBP: ffffa9e8403b3560 R08: 0000000000000000 R09: 0000000000000000
[    1.674414] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a40802c6000
[    1.674416] R13: ffff9a4081c59660 R14: ffff9a40802c6148 R15: 0000000000000246
[    1.674419] FS:  0000753d2b5ce8c0(0000) GS:ffff9a43f2380000(0000) knlGS:0000000000000000
[    1.674422] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.674425] CR2: 00005751bb740108 CR3: 0000000110178003 CR4: 00000000003706f0
[    1.674427] Call Trace:
[    1.674429]  <TASK>
[    1.674432]  ? show_regs+0x6d/0x80
[    1.674438]  ? __warn+0x89/0x160
[    1.674443]  ? intel_iommu_probe_device+0x26d/0x8d0
[    1.674446]  ? report_bug+0x17e/0x1b0
 
amdgpu still crashes (looks like the same log) and leaves the RX 570 in an unusable state.

Here is my suggestion:

Let's wait for a final reaction from @t.lamprecht or @Stoiko Ivanov - I am willing to fix the problem, whatever it takes.

I am on day 5.

> Unfortunately, my suggestion to jump into a debug session with a Proxmox kernel dev didn't get an answer.

Are you skilled to help / do you have someone in your company willing to help?

The next steps are clear - build 5-10 test kernels based on the Ubuntu kernel, leaving out the Proxmox patches - and test them.

I ran Gentoo for 10+ years. I know what to do - but - I have no idea how the PVE build system works.
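
(For reference, the PVE kernel sources and build are public; a rough sketch, assuming a Debian/PVE build host with the usual build dependencies installed - the exact make targets may differ:)

Bash:
git clone git://git.proxmox.com/git/pve-kernel.git
cd pve-kernel
ls patches/kernel/    # the Proxmox patches one could selectively drop
make                  # builds the kernel .deb packages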


Greetings from Denmark.
Harald
 
It's just me at home on a day off. 6.8.8-zabbly+ won't boot because it lacks ZFS, and Ubuntu 6.8.0-31 does not come with amdgpu(?).
I tried the Ubuntu 24.04 installer (with kernel 6.8.0-31), and it also crashed the GPU when amdgpu was loaded automatically. So my issue is an upstream kernel 6.8 or Ubuntu issue.
I could try a nested Proxmox (without ZFS) and GPU passthrough to test further, but it looks like Ubuntu kernel 6.8 is just an unlucky choice (at the moment).
 
