Random 6.8.4-2-pve kernel crashes

Hello everyone,

I'm a completely new user of Proxmox. I started with an N100 system. At first everything seemed to be running fine, but after nearly 6 hours the system crashed without any response - the Ethernet lights kept flickering, and that was all. I rebooted, and the system ran again for nearly 6 hours. I started to investigate and read that a VM can cause this kind of crash (in posts from 2022...). I checked everything and found that my VM, with 4GB assigned, was at its memory limit. I raised it to 8GB and the system ran for nearly 30 hours before crashing again. This time I had attached a screen to see what was going on: it looks like a kernel panic, and then the system crashes. The N100 box is brand new, and then I found this thread. Most of the cases I read here are Intel systems - so maybe there is a general issue with Intel CPUs?

Thanks.

Swen
 
I'm a completely new user of Proxmox
You haven't provided any focused details of your HW, NW & setup. What is your VM's config, etc.?
New N100 (mini?) PC. Run a complete stress test with a regular distro before committing it to a HV. Thermals, RAM & PSU must all be (heavily) tested.
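If it helps, this is roughly the kind of burn-in I'd run from a regular distro (or a live USB) before trusting the box as a HV. Package names and the NVMe device path below are just examples for a Debian/Ubuntu-based system, so adjust for your hardware:

Code:
# install the usual burn-in tools (Debian/Ubuntu package names)
apt install stress-ng memtester lm-sensors smartmontools

# a few hours of CPU + memory pressure; watch temperatures in a second shell
stress-ng --cpu 4 --vm 2 --vm-bytes 75% --timeout 4h --metrics-brief
watch -n 5 sensors

# exercise most of the free RAM (a full memtest86+ pass from boot is even better)
memtester 10G 2

# check the (unbranded) NVMe SSD for errors and temperature afterwards
smartctl -a /dev/nvme0n1

If it gets through that without throttling hard or logging disk errors, the HW is at least less suspect.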
 
Thanks for your answer. It's a mini PC, a Nipogi AK2 Plus with an Intel N100 (Alder Lake) CPU, 16GB RAM and a 512GB disk. I didn't run a stress test beforehand - it's new to me to stress-test hardware before using it. Normally I only do that when I have trouble or when I've built the system myself. I can do it, though.
6.8.12-1-pve -> crash after 2-6h
6.8.4-2-pve -> crash after 30h
It feels more like a bug in the kernel, with Proxmox being affected by it.

I have only one VM at this time: Ubuntu Server 22.04 LTS with 8GB RAM and 2 CPUs.

(screenshots attached)
 
Nipogi AK2 Plus
I researched it online. The one thing I noticed in the available reviews is the all-plastic design & lack of adequate venting. That is surely not optimal for a 24/7 HV. Most reviews also mention the non-branded (unknown), inaccessible M.2 SSD with lackluster performance - again something relevant when running a HV with LXCs/VMs. I didn't find anything on the NIC's brand & performance, but that may be relevant too. (I assume you are not using Wi-Fi - that wouldn't be good anyway.)

The one thing I would change/try in the VM's config is removing that USB passthrough; I suspect that matchbox mb/chipset/CPU can't perform well doing passthrough (it only has a maximum of 9 PCIe lanes). I realize you might absolutely require that USB device in the VM for some purpose, but give it a try & see if that stops the crashes.
 
I need the USB device, and it's connected via ETH.
Just let the Proxmox host take that ETH connection too. Then add (another) Linux Bridge as vmbr1 on that ETH port, and give the VM (another) Network Device (net1). All of this can be done in the GUI. You'll get the same result - but with no passthrough.
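For reference, the stanza the GUI ends up writing to /etc/network/interfaces looks roughly like this - enp2s0 and the VMID are placeholders for whatever your second NIC and your VM are actually called:

Code:
# second bridge, no IP on the host side, tied to the second physical port
auto vmbr1
iface vmbr1 inet manual
        bridge-ports enp2s0
        bridge-stp off
        bridge-fd 0

# then give the VM a second NIC on that bridge (VMID 100 is just an example)
# qm set 100 --net1 virtio,bridge=vmbr1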
 
FWIW, I've updated to the latest Intel microcode (i9-13900H) and moved back to 6.8.12-1... It's been 7 days without trouble. I don't think this CPU is affected by the high-current bug; however, this latest microcode also seems to limit turbo mode more than previous firmware did (really hard to tell, so this is just anecdotal).

Backups have been running fine, and compilation works (but it seems slower than I remember, so I moved the dev VM to another node).

Everything worked really well with 6.5.x. And for > 7 days now, everything works well again for me on 6.8.12 after the microcode update (maybe it runs slower, but that's probably just my imagination).

I know correlation doesn't imply causation, but my system is more or less stable now, and I'm happy. Thanks to everybody who has helped me with this.
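In case anyone wants to try the same microcode route: on a stock PVE 8 / Debian bookworm install it boils down to pulling intel-microcode from the non-free-firmware component. Roughly like this, but double-check the repo line against the Proxmox wiki for your setup:

Code:
# add the non-free-firmware component (skip if your sources already carry it)
echo "deb http://deb.debian.org/debian bookworm non-free-firmware" > /etc/apt/sources.list.d/firmware.list
apt update
apt install intel-microcode

# reboot so the new microcode is loaded early, then verify the revision
reboot
journalctl -k | grep -i microcode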
 
Hi all,

I don't think it's always an Intel CPU problem; we have a lot of Proxmox servers with Intel and AMD CPUs without problems.

On our new server, we updated the BIOS on Saturday and the system had been working without any problems on the 6.8.4-2 kernel. Last night, CPU usage started growing from ~5% to 80/90%, blocking the VM. Investigating the issue, apparently (I didn't have much time to research) a cron task started at that time to update the package list.
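If that theory holds, the stock Debian apt timers are easy to line up against the CPU spike (generic commands, nothing Proxmox-specific):

Code:
# when do the package-list / upgrade timers fire?
systemctl list-timers 'apt-daily*'

# what did they actually do around the time of the spike?
journalctl -u apt-daily.service -u apt-daily-upgrade.service --since "2 days ago"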

FWIW, I've updated to the latest Intel microcode (i9-13900H) and moved back to 6.8.12-1... It's been 7 days without trouble.

Everything worked really well with 6.5.x. And for > 7 days now, everything works well again for me on 6.8.12 after the microcode update (maybe it runs slower, but that's probably just my imagination).

That's interesting information - do you remember which 6.5.x kernel version you used?

Thanks to all, Fernando.
 
That's interesting information - do you remember which 6.5.x kernel version you used?
It was the latest 6.5 from Proxmox.

Note that what works for my setup may well not work for others - there's really a YMMV on this. There seem to be a lot of unknowns, and I honestly cannot find a pattern. E.g. I know of a cluster setup with 3x Lenovo P3 Ultras (IIRC)... I can't remember if they're 13th gen or 14th gen though.

Two of 'em work, while the remaining one just crashes at idle - for no apparent reason. That symptom is very different from mine, where the machine crashed during daily backups or heavy VM use.

All the P3s are built identically... only one has the problem. The problem machine is pinned to 6.5 and has been good since. I did ask him to try 6.8.12-1 with the new microcode update, but he's not game to do so :D.

I also don't think it's an Intel-only problem; there are AMD problems mentioned in this thread as well. There could be a single root cause behind all this, or maybe not. Everything looks too random to me.

My machine has been up 8 days, 19 hours and 30 minutes. Hope I don't jinx it. :p
 
I always hit this problem when Docker keeps the CPU under sustained high load.
kernel: Linux 6.8.8-3-pve (2024-07-16T16:16Z)
CPU: 12600T
storage: 4x 4TB Intel P4510 in ZFS
So is the only solution right now to downgrade to the 6.5 kernel?

Code:
Sep 09 23:08:06 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000620
Sep 09 23:08:06 pve kernel: #PF: supervisor read access in kernel mode
Sep 09 23:08:06 pve kernel: #PF: error_code(0x0000) - not-present page
Sep 09 23:08:06 pve kernel: PGD 0 P4D 0
Sep 09 23:08:06 pve kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Sep 09 23:08:06 pve kernel: CPU: 5 PID: 3545995 Comm: pt_main_thread Tainted: P     U     OE      6.8.8-3-pve #1
Sep 09 23:08:06 pve kernel: Hardware name: ASUS System Product Name/ROG STRIX Z690-I GAMING WIFI, BIOS 3302 02/21/2024
Sep 09 23:08:06 pve kernel: RIP: 0010:folio_lruvec_lock_irqsave+0x4e/0xa0
Sep 09 23:08:06 pve kernel: Code: 8b 17 48 c1 ea 36 48 8b 14 d5 60 de 37 bc 66 90 48 63 8a 40 9e 02 00 48 85 c0 48 0f 44 05 8a a6 11 02 48 8b 9c c8 90 08 00 00 <48> 3b 93 20 06 00 00 75 31 48 8d 7b 50 e8 b0 a0 ce 00 49 89 04 24
Sep 09 23:08:06 pve kernel: RSP: 0018:ffffb1413431b638 EFLAGS: 00010286
Sep 09 23:08:06 pve kernel: RAX: ffff8ef7ca3e0000 RBX: 0000000000000000 RCX: 0000000000000000
Sep 09 23:08:06 pve kernel: RDX: ffff8f063f7d5000 RSI: ffffb1413431b680 RDI: ffffed60e62a2c00
Sep 09 23:08:06 pve kernel: RBP: ffffb1413431b648 R08: 0000000000000000 R09: 0000000000000000
Sep 09 23:08:06 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffb1413431b680
Sep 09 23:08:06 pve kernel: R13: ffff8ef6ae088000 R14: ffff8efcca535648 R15: 0000000000000007
Sep 09 23:08:06 pve kernel: FS:  00007582ed600640(0000) GS:ffff8f05ff480000(0000) knlGS:0000000000000000
Sep 09 23:08:06 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 09 23:08:06 pve kernel: CR2: 0000000000000620 CR3: 00000002506c4000 CR4: 0000000000f52ef0
Sep 09 23:08:06 pve kernel: PKRU: 55555554
Sep 09 23:08:06 pve kernel: Call Trace:
Sep 09 23:08:06 pve kernel:  <TASK>
Sep 09 23:08:06 pve kernel:  ? show_regs+0x6d/0x80
Sep 09 23:08:06 pve kernel:  ? __die+0x24/0x80
Sep 09 23:08:06 pve kernel:  ? page_fault_oops+0x176/0x500
Sep 09 23:08:06 pve kernel:  ? do_user_addr_fault+0x2f9/0x6b0
Sep 09 23:08:06 pve kernel:  ? exc_page_fault+0x83/0x1b0
Sep 09 23:08:06 pve kernel:  ? asm_exc_page_fault+0x27/0x30
Sep 09 23:08:06 pve kernel:  ? folio_lruvec_lock_irqsave+0x4e/0xa0
Sep 09 23:08:06 pve kernel:  release_pages+0x267/0x4c0
Sep 09 23:08:06 pve kernel:  ? smp_call_function_many_cond+0x113/0x500
Sep 09 23:08:06 pve kernel:  free_pages_and_swap_cache+0x4a/0x60
Sep 09 23:08:06 pve kernel:  tlb_batch_pages_flush+0x43/0x80
Sep 09 23:08:06 pve kernel:  tlb_flush_mmu+0x3d/0x110
Sep 09 23:08:06 pve kernel:  unmap_page_range+0xd36/0x11c0
Sep 09 23:08:06 pve kernel:  unmap_single_vma+0x89/0xf0
Sep 09 23:08:06 pve kernel:  unmap_vmas+0xb5/0x190
Sep 09 23:08:06 pve kernel:  unmap_region+0xe8/0x180
Sep 09 23:08:06 pve kernel:  do_vmi_align_munmap+0x3e8/0x5b0
Sep 09 23:08:06 pve kernel:  do_vmi_munmap+0xdf/0x190
Sep 09 23:08:06 pve kernel:  __vm_munmap+0xad/0x180
Sep 09 23:08:06 pve kernel:  __x64_sys_munmap+0x27/0x40
Sep 09 23:08:06 pve kernel:  x64_sys_call+0x1b1f/0x24b0
Sep 09 23:08:06 pve kernel:  do_syscall_64+0x81/0x170
Sep 09 23:08:06 pve kernel:  ? __update_load_avg_cfs_rq+0x380/0x3f0
Sep 09 23:08:06 pve kernel:  ? update_load_avg+0x82/0x830
Sep 09 23:08:06 pve kernel:  ? trigger_load_balance+0x167/0x370
Sep 09 23:08:06 pve kernel:  ? scheduler_tick+0x134/0x320
Sep 09 23:08:06 pve kernel:  ? account_user_time+0xa2/0xc0
Sep 09 23:08:06 pve kernel:  ? update_process_times+0x8e/0xb0
Sep 09 23:08:06 pve kernel:  ? tick_sched_handle+0x32/0x70
Sep 09 23:08:06 pve kernel:  ? timerqueue_add+0xa6/0xd0
Sep 09 23:08:06 pve kernel:  ? ktime_get+0x45/0xc0
Sep 09 23:08:06 pve kernel:  ? __pfx_tick_nohz_highres_handler+0x10/0x10
Sep 09 23:08:06 pve kernel:  ? lapic_next_deadline+0x2c/0x50
Sep 09 23:08:06 pve kernel:  ? clockevents_program_event+0xb3/0x140
Sep 09 23:08:06 pve kernel:  ? tick_program_event+0x43/0xa0
Sep 09 23:08:06 pve kernel:  ? hrtimer_interrupt+0x11f/0x250
Sep 09 23:08:06 pve kernel:  ? irqentry_exit_to_user_mode+0x7e/0x260
Sep 09 23:08:06 pve kernel:  ? irqentry_exit+0x43/0x50
Sep 09 23:08:06 pve kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
Sep 09 23:08:06 pve kernel: RIP: 0033:0x7583f3c98a7b
Sep 09 23:08:06 pve kernel: Code: 8b 15 b9 b3 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 b3 0f 00 f7 d8 64 89 01 48
Sep 09 23:08:06 pve kernel: RSP: 002b:00007582ed5fd348 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
Sep 09 23:08:06 pve kernel: RAX: ffffffffffffffda RBX: ffffffffffffff80 RCX: 00007583f3c98a7b
Sep 09 23:08:06 pve kernel: RDX: 0000000000000000 RSI: 0000000008955000 RDI: 0000758126c00000
Sep 09 23:08:06 pve kernel: RBP: 0000000000000022 R08: 0000758126c00000 R09: 0000000000000000
Sep 09 23:08:06 pve kernel: R10: 00000000000000ee R11: 0000000000000206 R12: 0000758126c00030
Sep 09 23:08:06 pve kernel: R13: 0000000000000000 R14: 0000758126c00080 R15: 0000758126c00068
Sep 09 23:08:06 pve kernel:  </TASK>
Sep 09 23:08:06 pve kernel: Modules linked in: uas usb_storage msr macvlan nf_conntrack_netlink xt_nat nft_chain_nat xt_MASQUERADE nf_nat xfrm_user xfrm_algo overlay ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_multiport nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nft_compat tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 softdog nf_tables nvme_fabrics sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_tcc_cooling i915(OE) snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof x86_pkg_temp_thermal intel_powerclamp snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match kvm_intel
Sep 09 23:08:06 pve kernel:  snd_soc_acpi soundwire_generic_allocation soundwire_bus kvm xe iwlmvm snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_usb_audio snd_hda_intel crct10dif_pclmul drm_gpuvm snd_intel_dspcfg polyval_clmulni drm_exec polyval_generic snd_intel_sdw_acpi mac80211 snd_usbmidi_lib ghash_clmulni_intel gpu_sched snd_ump sha256_ssse3 snd_hda_codec drm_buddy snd_rawmidi sha1_ssse3 drm_suballoc_helper aesni_intel btusb drm_ttm_helper snd_seq_device snd_hda_core btrtl ttm mc snd_hwdep crypto_simd btintel mei_pxp mei_hdcp libarc4 cryptd snd_pcm btbcm drm_display_helper btmtk snd_timer cec rapl cmdlinepart bluetooth iwlwifi intel_pmc_core spi_nor rc_core mei_me snd ecdh_generic intel_cstate pcspkr eeepc_wmi asus_nb_wmi wmi_bmof cfg80211 pmt_telemetry mtd mei i2c_algo_bit soundcore ecc intel_vsec plx_dma pmt_class zfs(PO) acpi_pad acpi_tad joydev input_leds mac_hid spl(O) vhost_net vhost vhost_iotlb tap pkcs8_key_parser vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd nct6775 nct6775_core hwmon_vid
Sep 09 23:08:06 pve kernel:  coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbmouse usbkbd usbhid hid mfd_aaeon asus_wmi ledtrig_audio sparse_keymap nvme xhci_pci platform_profile crc32_pclmul xhci_pci_renesas spi_intel_pci i2c_i801 thunderbolt intel_lpss_pci ahci nvme_core spi_intel intel_lpss i2c_smbus xhci_hcd igc libahci idma64 vmd nvme_auth video wmi pinctrl_alderlake [last unloaded: cpuid]
Sep 09 23:08:06 pve kernel: CR2: 0000000000000620
Sep 09 23:08:06 pve kernel: ---[ end trace 0000000000000000 ]---
Sep 09 23:08:06 pve kernel: RIP: 0010:folio_lruvec_lock_irqsave+0x4e/0xa0
Sep 09 23:08:06 pve kernel: Code: 8b 17 48 c1 ea 36 48 8b 14 d5 60 de 37 bc 66 90 48 63 8a 40 9e 02 00 48 85 c0 48 0f 44 05 8a a6 11 02 48 8b 9c c8 90 08 00 00 <48> 3b 93 20 06 00 00 75 31 48 8d 7b 50 e8 b0 a0 ce 00 49 89 04 24
Sep 09 23:08:06 pve kernel: RSP: 0018:ffffb1413431b638 EFLAGS: 00010286
Sep 09 23:08:06 pve kernel: RAX: ffff8ef7ca3e0000 RBX: 0000000000000000 RCX: 0000000000000000
Sep 09 23:08:06 pve kernel: RDX: ffff8f063f7d5000 RSI: ffffb1413431b680 RDI: ffffed60e62a2c00
Sep 09 23:08:06 pve kernel: RBP: ffffb1413431b648 R08: 0000000000000000 R09: 0000000000000000
Sep 09 23:08:06 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffb1413431b680
Sep 09 23:08:06 pve kernel: R13: ffff8ef6ae088000 R14: ffff8efcca535648 R15: 0000000000000007
Sep 09 23:08:06 pve kernel: FS:  00007582ed600640(0000) GS:ffff8f05ff480000(0000) knlGS:0000000000000000
Sep 09 23:08:06 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 09 23:08:06 pve kernel: CR2: 0000000000000620 CR3: 00000002506c4000 CR4: 0000000000f52ef0
Sep 09 23:08:06 pve kernel: PKRU: 55555554
Sep 09 23:08:06 pve kernel: note: pt_main_thread[3545995] exited with irqs disabled
 
I always hit this problem when Docker keeps the CPU under sustained high load.
kernel: Linux 6.8.8-3-pve (2024-07-16T16:16Z)
CPU: 12600T
storage: 4x 4TB Intel P4510 in ZFS
So is the only solution right now to downgrade to the 6.5 kernel?
Why not try 6.8.12-1?

That has a fix for a NULL pointer dereference (unsure if it's the same bug as yours, but it can't hurt to try).
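Both options are quick to try. A rough sketch - either pull in the current 6.8.12 kernel, or pin an installed 6.5 one (the 6.5 version string below is only an example; check what you actually have):

Code:
# option 1: a regular upgrade brings in 6.8.12-1 (or newer)
apt update && apt full-upgrade
reboot

# option 2: boot a 6.5 kernel by default instead
# (if none is installed any more, the proxmox-kernel-6.5 meta-package brings one back)
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.5.13-6-pve
reboot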
 
