Random 6.8.4-2-pve kernel crashes

Hello everyone,

I'm a completely new user of Proxmox. I started with an N100 system. At first everything seemed to be running fine, but after nearly 6 hours the system crashed without any response - the Ethernet lights kept flickering, and that was all. I rebooted, and the system ran again for nearly 6 hours. I started to investigate and read that a VM can cause this kind of crash (in posts from 2022...). I checked everything and found that my VM, with 4GB assigned, was at its memory limit. I raised it to 8GB and the system ran for nearly 30 hours before crashing again. This time I had attached a screen to see what was going on: it looks like a kernel panic, and then the system crashes. The N100 box is brand new, and then I found this thread. Most of the cases I read here are Intel systems - so maybe there is a general issue with Intel CPUs?

Thanks.

Swen
 
I'm a completely new user of Proxmox
You haven't provided any focused details of your HW, NW & setup. What is your VM's config, etc.?
New N100 (mini?) PC. Run a complete stress test with a regular distro before committing it to a HV. Thermals, RAM & PSU must all be (heavily) tested.
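If it helps, this is roughly the kind of burn-in I'd run from a regular distro (or a live USB) before trusting the box as a HV. Package names and the NVMe device path below are just examples for a Debian/Ubuntu-based system, so adjust for your hardware:

Code:
# install the usual burn-in tools (Debian/Ubuntu package names)
apt install stress-ng memtester lm-sensors smartmontools

# a few hours of CPU + memory pressure; watch temperatures in a second shell
stress-ng --cpu 4 --vm 2 --vm-bytes 75% --timeout 4h --metrics-brief
watch -n 5 sensors

# exercise most of the free RAM (a full memtest86+ pass from boot is even better)
memtester 10G 2

# check the (unbranded) NVMe SSD for errors and temperature afterwards
smartctl -a /dev/nvme0n1

If it gets through that without throttling hard or logging disk errors, the HW is at least less suspect.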
 
Thanks for your answer. It's a mini PC, a Nipogi AK2 Plus with an Intel N100 (Alder Lake) CPU, 16GB RAM and a 512GB disk. I didn't run a stress test beforehand - it's new to me to stress-test hardware before using it. Normally I only do that when I have trouble or when I've built the system myself. I can do it, though.
6.8.12-1-pve -> crash after 2-6h
6.8.4-2-pve -> crash after 30h
It feels more like a bug in the kernel, with Proxmox being affected by it.

I have only one VM at this time: Ubuntu Server 22.04 LTS with 8GB RAM and 2 CPUs.

(screenshots attached)
 
Nipogi AK2 Plus
I researched it online. The one thing I noticed in the available reviews is the all-plastic design & lack of adequate venting. That is surely not optimal for a 24/7 HV. Most reviews also mention the non-branded (unknown), inaccessible M.2 SSD with lackluster performance - again something relevant when running a HV with LXCs/VMs. I didn't find anything on the NIC's brand & performance, but that may be relevant too. (I assume you are not using Wi-Fi - that wouldn't be good anyway.)

The one thing I would change/try in the VM's config is removing that USB passthrough; I suspect that matchbox mb/chipset/CPU can't perform well doing passthrough (it only has a maximum of 9 PCIe lanes). I realize you might absolutely require that USB device in the VM for some purpose, but give it a try & see if that stops the crashes.
 
I need the USB device, and it's connected via ETH.
Just let the Proxmox host take that ETH connection too. Then add (another) Linux Bridge as vmbr1 on that ETH port, and give the VM (another) Network Device (net1). All of this can be done in the GUI. You'll get the same result - but with no passthrough.
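For reference, the stanza the GUI ends up writing to /etc/network/interfaces looks roughly like this - enp2s0 and the VMID are placeholders for whatever your second NIC and your VM are actually called:

Code:
# second bridge, no IP on the host side, tied to the second physical port
auto vmbr1
iface vmbr1 inet manual
        bridge-ports enp2s0
        bridge-stp off
        bridge-fd 0

# then give the VM a second NIC on that bridge (VMID 100 is just an example)
# qm set 100 --net1 virtio,bridge=vmbr1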
 
FWIW, I've updated to the latest Intel microcode (i9-13900H) and moved back to 6.8.12-1... It's been 7 days without trouble. I don't think this CPU is affected by the high-current bug; however, this latest microcode also seems to limit turbo mode more than previous firmware did (really hard to tell, so this is just anecdotal).

Backups have been running fine, and compilation works (but it seems slower than I remember, so I moved the dev VM to another node).

Everything worked really well with 6.5.x. And for > 7 days now, everything works well again for me on 6.8.12 after the microcode update (maybe it runs slower, but that's probably just my imagination).

I know correlation doesn't imply causation, but my system is more or less stable now, and I'm happy. Thanks to everybody who has helped me with this.
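In case anyone wants to try the same microcode route: on a stock PVE 8 / Debian bookworm install it boils down to pulling intel-microcode from the non-free-firmware component. Roughly like this, but double-check the repo line against the Proxmox wiki for your setup:

Code:
# add the non-free-firmware component (skip if your sources already carry it)
echo "deb http://deb.debian.org/debian bookworm non-free-firmware" > /etc/apt/sources.list.d/firmware.list
apt update
apt install intel-microcode

# reboot so the new microcode is loaded early, then verify the revision
reboot
journalctl -k | grep -i microcode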
 
Hi all,

I don't think it's always an Intel CPU problem; we have a lot of Proxmox servers with Intel and AMD CPUs without problems.

On our new server, we updated the BIOS on Saturday and the system had been working without any problems on the 6.8.4-2 kernel. Last night, CPU usage started growing from ~5% to 80/90%, blocking the VM. Investigating the issue, apparently (I didn't have much time to research) a cron task started at that time to update the package list.
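If that theory holds, the stock Debian apt timers are easy to line up against the CPU spike (generic commands, nothing Proxmox-specific):

Code:
# when do the package-list / upgrade timers fire?
systemctl list-timers 'apt-daily*'

# what did they actually do around the time of the spike?
journalctl -u apt-daily.service -u apt-daily-upgrade.service --since "2 days ago"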

FWIW, I've updated to the latest Intel microcode (i9-13900H) and moved back to 6.8.12-1... It's been 7 days without trouble.

Everything worked really well with 6.5.x. And for > 7 days now, everything works well again for me on 6.8.12 after the microcode update (maybe it runs slower, but that's probably just my imagination).

That's interesting information - do you remember which 6.5.x kernel version you used?

Thanks to all, Fernando.
 
That's interesting information - do you remember which 6.5.x kernel version you used?
It was the latest 6.5 from Proxmox.

Note that what works for my setup may well not work for others - there's really a YMMV on this. There seem to be a lot of unknowns, and I honestly cannot find a pattern. E.g. I know of a cluster setup with 3x Lenovo P3 Ultras (IIRC)... I can't remember if they're 13th gen or 14th gen though.

Two of 'em work, while the remaining one just crashes at idle - for no apparent reason. That symptom is very different from mine, where the machine crashed during daily backups or heavy VM use.

All the P3s are built identically... only one has the problem. The problem machine is pinned to 6.5 and has been good since. I did ask him to try 6.8.12-1 with the new microcode update, but he's not game to do so :D.

I also don't think it's an Intel-only problem; there are AMD problems mentioned in this thread as well. There could be a single root cause behind all this, or maybe not. Everything looks too random to me.

My machine has been up 8 days, 19 hours and 30 minutes. Hope I don't jinx it. :p
 
I always hit this problem when Docker keeps the CPU under sustained high load.
kernel: Linux 6.8.8-3-pve (2024-07-16T16:16Z)
CPU: 12600T
storage: 4x 4TB Intel P4510 in ZFS
So is the only solution right now to downgrade to the 6.5 kernel?

Code:
Sep 09 23:08:06 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000620
Sep 09 23:08:06 pve kernel: #PF: supervisor read access in kernel mode
Sep 09 23:08:06 pve kernel: #PF: error_code(0x0000) - not-present page
Sep 09 23:08:06 pve kernel: PGD 0 P4D 0
Sep 09 23:08:06 pve kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Sep 09 23:08:06 pve kernel: CPU: 5 PID: 3545995 Comm: pt_main_thread Tainted: P     U     OE      6.8.8-3-pve #1
Sep 09 23:08:06 pve kernel: Hardware name: ASUS System Product Name/ROG STRIX Z690-I GAMING WIFI, BIOS 3302 02/21/2024
Sep 09 23:08:06 pve kernel: RIP: 0010:folio_lruvec_lock_irqsave+0x4e/0xa0
Sep 09 23:08:06 pve kernel: Code: 8b 17 48 c1 ea 36 48 8b 14 d5 60 de 37 bc 66 90 48 63 8a 40 9e 02 00 48 85 c0 48 0f 44 05 8a a6 11 02 48 8b 9c c8 90 08 00 00 <48> 3b 93 20 06 00 00 75 31 48 8d 7b 50 e8 b0 a0 ce 00 49 89 04 24
Sep 09 23:08:06 pve kernel: RSP: 0018:ffffb1413431b638 EFLAGS: 00010286
Sep 09 23:08:06 pve kernel: RAX: ffff8ef7ca3e0000 RBX: 0000000000000000 RCX: 0000000000000000
Sep 09 23:08:06 pve kernel: RDX: ffff8f063f7d5000 RSI: ffffb1413431b680 RDI: ffffed60e62a2c00
Sep 09 23:08:06 pve kernel: RBP: ffffb1413431b648 R08: 0000000000000000 R09: 0000000000000000
Sep 09 23:08:06 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffb1413431b680
Sep 09 23:08:06 pve kernel: R13: ffff8ef6ae088000 R14: ffff8efcca535648 R15: 0000000000000007
Sep 09 23:08:06 pve kernel: FS:  00007582ed600640(0000) GS:ffff8f05ff480000(0000) knlGS:0000000000000000
Sep 09 23:08:06 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 09 23:08:06 pve kernel: CR2: 0000000000000620 CR3: 00000002506c4000 CR4: 0000000000f52ef0
Sep 09 23:08:06 pve kernel: PKRU: 55555554
Sep 09 23:08:06 pve kernel: Call Trace:
Sep 09 23:08:06 pve kernel:  <TASK>
Sep 09 23:08:06 pve kernel:  ? show_regs+0x6d/0x80
Sep 09 23:08:06 pve kernel:  ? __die+0x24/0x80
Sep 09 23:08:06 pve kernel:  ? page_fault_oops+0x176/0x500
Sep 09 23:08:06 pve kernel:  ? do_user_addr_fault+0x2f9/0x6b0
Sep 09 23:08:06 pve kernel:  ? exc_page_fault+0x83/0x1b0
Sep 09 23:08:06 pve kernel:  ? asm_exc_page_fault+0x27/0x30
Sep 09 23:08:06 pve kernel:  ? folio_lruvec_lock_irqsave+0x4e/0xa0
Sep 09 23:08:06 pve kernel:  release_pages+0x267/0x4c0
Sep 09 23:08:06 pve kernel:  ? smp_call_function_many_cond+0x113/0x500
Sep 09 23:08:06 pve kernel:  free_pages_and_swap_cache+0x4a/0x60
Sep 09 23:08:06 pve kernel:  tlb_batch_pages_flush+0x43/0x80
Sep 09 23:08:06 pve kernel:  tlb_flush_mmu+0x3d/0x110
Sep 09 23:08:06 pve kernel:  unmap_page_range+0xd36/0x11c0
Sep 09 23:08:06 pve kernel:  unmap_single_vma+0x89/0xf0
Sep 09 23:08:06 pve kernel:  unmap_vmas+0xb5/0x190
Sep 09 23:08:06 pve kernel:  unmap_region+0xe8/0x180
Sep 09 23:08:06 pve kernel:  do_vmi_align_munmap+0x3e8/0x5b0
Sep 09 23:08:06 pve kernel:  do_vmi_munmap+0xdf/0x190
Sep 09 23:08:06 pve kernel:  __vm_munmap+0xad/0x180
Sep 09 23:08:06 pve kernel:  __x64_sys_munmap+0x27/0x40
Sep 09 23:08:06 pve kernel:  x64_sys_call+0x1b1f/0x24b0
Sep 09 23:08:06 pve kernel:  do_syscall_64+0x81/0x170
Sep 09 23:08:06 pve kernel:  ? __update_load_avg_cfs_rq+0x380/0x3f0
Sep 09 23:08:06 pve kernel:  ? update_load_avg+0x82/0x830
Sep 09 23:08:06 pve kernel:  ? trigger_load_balance+0x167/0x370
Sep 09 23:08:06 pve kernel:  ? scheduler_tick+0x134/0x320
Sep 09 23:08:06 pve kernel:  ? account_user_time+0xa2/0xc0
Sep 09 23:08:06 pve kernel:  ? update_process_times+0x8e/0xb0
Sep 09 23:08:06 pve kernel:  ? tick_sched_handle+0x32/0x70
Sep 09 23:08:06 pve kernel:  ? timerqueue_add+0xa6/0xd0
Sep 09 23:08:06 pve kernel:  ? ktime_get+0x45/0xc0
Sep 09 23:08:06 pve kernel:  ? __pfx_tick_nohz_highres_handler+0x10/0x10
Sep 09 23:08:06 pve kernel:  ? lapic_next_deadline+0x2c/0x50
Sep 09 23:08:06 pve kernel:  ? clockevents_program_event+0xb3/0x140
Sep 09 23:08:06 pve kernel:  ? tick_program_event+0x43/0xa0
Sep 09 23:08:06 pve kernel:  ? hrtimer_interrupt+0x11f/0x250
Sep 09 23:08:06 pve kernel:  ? irqentry_exit_to_user_mode+0x7e/0x260
Sep 09 23:08:06 pve kernel:  ? irqentry_exit+0x43/0x50
Sep 09 23:08:06 pve kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
Sep 09 23:08:06 pve kernel: RIP: 0033:0x7583f3c98a7b
Sep 09 23:08:06 pve kernel: Code: 8b 15 b9 b3 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 b3 0f 00 f7 d8 64 89 01 48
Sep 09 23:08:06 pve kernel: RSP: 002b:00007582ed5fd348 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
Sep 09 23:08:06 pve kernel: RAX: ffffffffffffffda RBX: ffffffffffffff80 RCX: 00007583f3c98a7b
Sep 09 23:08:06 pve kernel: RDX: 0000000000000000 RSI: 0000000008955000 RDI: 0000758126c00000
Sep 09 23:08:06 pve kernel: RBP: 0000000000000022 R08: 0000758126c00000 R09: 0000000000000000
Sep 09 23:08:06 pve kernel: R10: 00000000000000ee R11: 0000000000000206 R12: 0000758126c00030
Sep 09 23:08:06 pve kernel: R13: 0000000000000000 R14: 0000758126c00080 R15: 0000758126c00068
Sep 09 23:08:06 pve kernel:  </TASK>
Sep 09 23:08:06 pve kernel: Modules linked in: uas usb_storage msr macvlan nf_conntrack_netlink xt_nat nft_chain_nat xt_MASQUERADE nf_nat xfrm_user xfrm_algo overlay ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_multiport nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nft_compat tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 softdog nf_tables nvme_fabrics sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_tcc_cooling i915(OE) snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof x86_pkg_temp_thermal intel_powerclamp snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match kvm_intel
Sep 09 23:08:06 pve kernel:  snd_soc_acpi soundwire_generic_allocation soundwire_bus kvm xe iwlmvm snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_usb_audio snd_hda_intel crct10dif_pclmul drm_gpuvm snd_intel_dspcfg polyval_clmulni drm_exec polyval_generic snd_intel_sdw_acpi mac80211 snd_usbmidi_lib ghash_clmulni_intel gpu_sched snd_ump sha256_ssse3 snd_hda_codec drm_buddy snd_rawmidi sha1_ssse3 drm_suballoc_helper aesni_intel btusb drm_ttm_helper snd_seq_device snd_hda_core btrtl ttm mc snd_hwdep crypto_simd btintel mei_pxp mei_hdcp libarc4 cryptd snd_pcm btbcm drm_display_helper btmtk snd_timer cec rapl cmdlinepart bluetooth iwlwifi intel_pmc_core spi_nor rc_core mei_me snd ecdh_generic intel_cstate pcspkr eeepc_wmi asus_nb_wmi wmi_bmof cfg80211 pmt_telemetry mtd mei i2c_algo_bit soundcore ecc intel_vsec plx_dma pmt_class zfs(PO) acpi_pad acpi_tad joydev input_leds mac_hid spl(O) vhost_net vhost vhost_iotlb tap pkcs8_key_parser vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd nct6775 nct6775_core hwmon_vid
Sep 09 23:08:06 pve kernel:  coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbmouse usbkbd usbhid hid mfd_aaeon asus_wmi ledtrig_audio sparse_keymap nvme xhci_pci platform_profile crc32_pclmul xhci_pci_renesas spi_intel_pci i2c_i801 thunderbolt intel_lpss_pci ahci nvme_core spi_intel intel_lpss i2c_smbus xhci_hcd igc libahci idma64 vmd nvme_auth video wmi pinctrl_alderlake [last unloaded: cpuid]
Sep 09 23:08:06 pve kernel: CR2: 0000000000000620
Sep 09 23:08:06 pve kernel: ---[ end trace 0000000000000000 ]---
Sep 09 23:08:06 pve kernel: RIP: 0010:folio_lruvec_lock_irqsave+0x4e/0xa0
Sep 09 23:08:06 pve kernel: Code: 8b 17 48 c1 ea 36 48 8b 14 d5 60 de 37 bc 66 90 48 63 8a 40 9e 02 00 48 85 c0 48 0f 44 05 8a a6 11 02 48 8b 9c c8 90 08 00 00 <48> 3b 93 20 06 00 00 75 31 48 8d 7b 50 e8 b0 a0 ce 00 49 89 04 24
Sep 09 23:08:06 pve kernel: RSP: 0018:ffffb1413431b638 EFLAGS: 00010286
Sep 09 23:08:06 pve kernel: RAX: ffff8ef7ca3e0000 RBX: 0000000000000000 RCX: 0000000000000000
Sep 09 23:08:06 pve kernel: RDX: ffff8f063f7d5000 RSI: ffffb1413431b680 RDI: ffffed60e62a2c00
Sep 09 23:08:06 pve kernel: RBP: ffffb1413431b648 R08: 0000000000000000 R09: 0000000000000000
Sep 09 23:08:06 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffb1413431b680
Sep 09 23:08:06 pve kernel: R13: ffff8ef6ae088000 R14: ffff8efcca535648 R15: 0000000000000007
Sep 09 23:08:06 pve kernel: FS:  00007582ed600640(0000) GS:ffff8f05ff480000(0000) knlGS:0000000000000000
Sep 09 23:08:06 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 09 23:08:06 pve kernel: CR2: 0000000000000620 CR3: 00000002506c4000 CR4: 0000000000f52ef0
Sep 09 23:08:06 pve kernel: PKRU: 55555554
Sep 09 23:08:06 pve kernel: note: pt_main_thread[3545995] exited with irqs disabled
 
I always hit this problem when Docker keeps the CPU under sustained high load.
kernel: Linux 6.8.8-3-pve (2024-07-16T16:16Z)
CPU: 12600T
storage: 4x 4TB Intel P4510 in ZFS
So is the only solution right now to downgrade to the 6.5 kernel?
Why not try 6.8.12-1?

That has a fix for a NULL pointer dereference (unsure if it's the same bug as yours, but it can't hurt to try).
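Both options are quick to try. A rough sketch - either pull in the current 6.8.12 kernel, or pin an installed 6.5 one (the 6.5 version string below is only an example; check what you actually have):

Code:
# option 1: a regular upgrade brings in 6.8.12-1 (or newer)
apt update && apt full-upgrade
reboot

# option 2: boot a 6.5 kernel by default instead
# (if none is installed any more, the proxmox-kernel-6.5 meta-package brings one back)
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.5.13-6-pve
reboot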
 
