PVE-Node crashed, root cause?

Jan 13, 2025
Hello,
one of my PVE nodes (Proxmox VE 8.3) crashed, but I can't find the reason or root cause for the crash.
The log file is full of entries like these (see spoiler).
Do you have any idea?

Jan 14 17:58:58 pve1 kernel: BUG: unable to handle page fault for address: ffff907848e0be60
Jan 14 17:58:58 pve1 kernel: #PF: supervisor read access in kernel mode
Jan 14 17:58:58 pve1 kernel: #PF: error_code(0x0000) - not-present page
Jan 14 17:58:58 pve1 kernel: PGD 12ca01067 P4D 12ca01067 PUD 0
Jan 14 17:58:58 pve1 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 14 17:58:58 pve1 kernel: CPU: 1 PID: 810550 Comm: pvedaemon worke Tainted: P O 6.8.12-4-pve #1
Jan 14 17:58:58 pve1 kernel: Hardware name: Default string Default string/Default string, BIOS 5.27 09/28/2023
Jan 14 17:58:58 pve1 kernel: RIP: 0010:kmem_cache_alloc+0xce/0x370
Jan 14 17:58:58 pve1 kernel: Code: 83 78 10 00 48 8b 38 0f 84 48 02 00 00 48 85 ff 0f 84 3f 02 00 00 41 8b 44 24 28 49 8b 9c 24 b8 00 00 00 49 8b 34 24 48 01 f8 <48> 33 18 48 89 c1 48 89 f8 48 0f c9 48 31 cb 48 8d 8a 00 20 00 00
Jan 14 17:58:58 pve1 kernel: RSP: 0018:ffffb4584ff5fa00 EFLAGS: 00010286
Jan 14 17:58:58 pve1 kernel: RAX: ffff907848e0be60 RBX: 96fe745d01f3e4c0 RCX: 0000000000000000
Jan 14 17:58:58 pve1 kernel: RDX: 000000b4fe34e001 RSI: 000000000003cf20 RDI: ffff907848e0be40
Jan 14 17:58:58 pve1 kernel: RBP: ffffb4584ff5fa50 R08: 0000000000000000 R09: 0000000000000000
Jan 14 17:58:58 pve1 kernel: R10: ffff9070691c5f40 R11: 0000000000000000 R12: ffff9070401e2c00
Jan 14 17:58:58 pve1 kernel: R13: 0000000000000cc0 R14: 0000000000000040 R15: ffffffffb1ff3f2a
Jan 14 17:58:58 pve1 kernel: FS: 00007523d5710b80(0000) GS:ffff90779fa80000(0000) knlGS:0000000000000000
Jan 14 17:58:58 pve1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 14 17:58:58 pve1 kernel: CR2: ffff907848e0be60 CR3: 000000053ceea002 CR4: 0000000000f72ef0
Jan 14 17:58:58 pve1 kernel: PKRU: 55555554
Jan 14 17:58:58 pve1 kernel: Call Trace:
Jan 14 17:58:58 pve1 kernel: <TASK>
Jan 14 17:58:58 pve1 kernel: ? show_regs+0x6d/0x80
Jan 14 17:58:58 pve1 kernel: ? __die+0x24/0x80
Jan 14 17:58:58 pve1 kernel: ? page_fault_oops+0x176/0x500
Jan 14 17:58:58 pve1 kernel: ? kmem_cache_alloc+0xce/0x370
Jan 14 17:58:58 pve1 kernel: ? kernelmode_fixup_or_oops.constprop.0+0x69/0x90
Jan 14 17:58:58 pve1 kernel: ? __bad_area_nosemaphore+0x19d/0x270
Jan 14 17:58:58 pve1 kernel: ? bad_area_nosemaphore+0x16/0x30
Jan 14 17:58:58 pve1 kernel: ? do_kern_addr_fault+0x7b/0xa0
Jan 14 17:58:58 pve1 kernel: ? exc_page_fault+0x10d/0x1b0
Jan 14 17:58:58 pve1 kernel: ? asm_exc_page_fault+0x27/0x30
Jan 14 17:58:58 pve1 kernel: ? anon_vma_fork+0x9a/0x150
Jan 14 17:58:58 pve1 kernel: ? kmem_cache_alloc+0xce/0x370
Jan 14 17:58:58 pve1 kernel: ? anon_vma_clone+0x126/0x1d0
Jan 14 17:58:58 pve1 kernel: anon_vma_fork+0x9a/0x150
Jan 14 17:58:58 pve1 kernel: copy_process+0x22c6/0x2550
Jan 14 17:58:58 pve1 kernel: kernel_clone+0xbd/0x440
Jan 14 17:58:58 pve1 kernel: ? do_syscall_64+0x8d/0x170
Jan 14 17:58:58 pve1 kernel: __do_sys_clone+0x66/0xa0
Jan 14 17:58:58 pve1 kernel: __x64_sys_clone+0x25/0x40
Jan 14 17:58:58 pve1 kernel: x64_sys_call+0x1d0e/0x24b0
Jan 14 17:58:58 pve1 kernel: do_syscall_64+0x81/0x170
Jan 14 17:58:58 pve1 kernel: ? ptep_set_access_flags+0x4a/0x70
Jan 14 17:58:58 pve1 kernel: ? wp_page_reuse+0x95/0xc0
Jan 14 17:58:58 pve1 kernel: ? do_wp_page+0xf5/0xb80
Jan 14 17:58:58 pve1 kernel: ? __pte_offset_map+0x1c/0x1b0
Jan 14 17:58:58 pve1 kernel: ? __handle_mm_fault+0xbc6/0xf20
Jan 14 17:58:58 pve1 kernel: ? hrtimer_try_to_cancel+0x1b/0x120
Jan 14 17:58:58 pve1 kernel: ? __count_memcg_events+0x6f/0xe0
Jan 14 17:58:58 pve1 kernel: ? count_memcg_events.constprop.0+0x2a/0x50
Jan 14 17:58:58 pve1 kernel: ? handle_mm_fault+0xad/0x380
Jan 14 17:58:58 pve1 kernel: ? do_user_addr_fault+0x337/0x660
Jan 14 17:58:58 pve1 kernel: ? irqentry_exit_to_user_mode+0x7b/0x260
Jan 14 17:58:58 pve1 kernel: ? irqentry_exit+0x43/0x50
Jan 14 17:58:58 pve1 kernel: ? exc_page_fault+0x94/0x1b0
Jan 14 17:58:58 pve1 kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
Jan 14 17:58:58 pve1 kernel: RIP: 0033:0x7523d5822313
Jan 14 17:58:58 pve1 kernel: Code: 00 00 00 00 00 66 90 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75 2c 64 48 8b 04 25 10 00 00
Jan 14 17:58:58 pve1 kernel: RSP: 002b:00007ffcfca1f0a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Jan 14 17:58:58 pve1 kernel: RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007523d5822313
Jan 14 17:58:58 pve1 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Jan 14 17:58:58 pve1 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Jan 14 17:58:58 pve1 kernel: R10: 00007523d5710e50 R11: 0000000000000246 R12: 0000000000000001
Jan 14 17:58:58 pve1 kernel: R13: 00007ffcfca1f1c0 R14: 00007ffcfca1f240 R15: 00007523d5a4b020
Jan 14 17:58:58 pve1 kernel: </TASK>
Jan 14 17:58:58 pve1 kernel: Modules linked in: tcp_diag inet_diag nf_conntrack_netlink xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat overlay cfg80211 veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics bonding tls softdog sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common xe snd_sof_pci_intel_tgl snd_sof_intel_hda_common x86_pkg_temp_thermal soundwire_intel intel_powerclamp drm_gpuvm snd_sof_intel_hda_mlink snd_hda_codec_hdmi soundwire_cadence drm_exec kvm_intel snd_sof_intel_hda gpu_sched snd_sof_pci drm_suballoc_helper snd_sof_xtensa_dsp drm_ttm_helper snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core kvm snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus snd_soc_core irqbypass crct10dif_pclmul polyval_clmulni snd_compress polyval_generic ghash_clmulni_intel
Jan 14 17:58:58 pve1 kernel: ac97_bus sha256_ssse3 snd_pcm_dmaengine sha1_ssse3 aesni_intel snd_hda_intel crypto_simd snd_intel_dspcfg snd_intel_sdw_acpi cryptd mei_pxp mei_hdcp i915 rapl snd_hda_codec snd_hda_core snd_hwdep snd_pcm btusb btrtl btintel snd_timer btbcm drm_buddy btmtk ttm bluetooth drm_display_helper cmdlinepart snd spi_nor intel_cstate pcspkr wmi_bmof soundcore mtd ecdh_generic ecc mei_me cec mei rc_core i2c_algo_bit igen6_edac intel_pmc_core intel_vsec pmt_telemetry acpi_pad pmt_class acpi_tad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c xhci_pci crc32_pclmul nvme xhci_pci_renesas spi_intel_pci xhci_hcd nvme_core spi_intel igc i2c_i801 i2c_smbus nvme_auth ahci libahci video wmi
Jan 14 17:58:58 pve1 kernel: CR2: ffff907848e0be60
Jan 14 17:58:58 pve1 kernel: ---[ end trace 0000000000000000 ]---
Jan 14 17:58:58 pve1 kernel: RIP: 0010:kmem_cache_alloc+0xce/0x370
Jan 14 17:58:58 pve1 kernel: Code: 83 78 10 00 48 8b 38 0f 84 48 02 00 00 48 85 ff 0f 84 3f 02 00 00 41 8b 44 24 28 49 8b 9c 24 b8 00 00 00 49 8b 34 24 48 01 f8 <48> 33 18 48 89 c1 48 89 f8 48 0f c9 48 31 cb 48 8d 8a 00 20 00 00
Jan 14 17:58:58 pve1 kernel: RSP: 0018:ffffb4584ff5fa00 EFLAGS: 00010286
Jan 14 17:58:58 pve1 kernel: RAX: ffff907848e0be60 RBX: 96fe745d01f3e4c0 RCX: 0000000000000000
Jan 14 17:58:58 pve1 kernel: RDX: 000000b4fe34e001 RSI: 000000000003cf20 RDI: ffff907848e0be40
Jan 14 17:58:58 pve1 kernel: RBP: ffffb4584ff5fa50 R08: 0000000000000000 R09: 0000000000000000
Jan 14 17:58:58 pve1 kernel: R10: ffff9070691c5f40 R11: 0000000000000000 R12: ffff9070401e2c00
Jan 14 17:58:58 pve1 kernel: R13: 0000000000000cc0 R14: 0000000000000040 R15: ffffffffb1ff3f2a
Jan 14 17:58:58 pve1 kernel: FS: 00007523d5710b80(0000) GS:ffff90779fa80000(0000) knlGS:0000000000000000
Jan 14 17:58:58 pve1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 14 17:58:58 pve1 kernel: CR2: ffff907848e0be60 CR3: 000000053ceea002 CR4: 0000000000f72ef0
Jan 14 17:58:58 pve1 kernel: PKRU: 55555554
Jan 14 17:58:58 pve1 kernel: note: pvedaemon worke[810550] exited with irqs disabled
...
This was followed by 6 more occurrences of the same kind, and then the server crashed.
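To compare the repeated occurrences without reading every register dump, something like the following could pull out the faulting address, the process, and the kernel function for each one. This is only a rough sketch: the function name `summarize_oopses` and the regular expressions are my own assumptions, written against the line format shown in the log above, and they may need adjusting for other kernels.

```python
import re

# Patterns written against the log lines quoted above (assumed format).
FAULT_RE = re.compile(r"BUG: unable to handle page fault for address: ([0-9a-f]+)")
COMM_RE = re.compile(r"PID: (\d+) Comm: (.+?) (?:Tainted|Not tainted)")
# Segment 0010 is the kernel code segment; the user-space RIP (0033) is skipped.
RIP_RE = re.compile(r"RIP: 0010:(\S+)")


def summarize_oopses(log_text):
    """Return one dict per 'unable to handle page fault' record in log_text."""
    records = []
    current = None
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            # A new BUG line starts a new record.
            current = {"address": match.group(1), "pid": None,
                       "comm": None, "rip": None}
            records.append(current)
            continue
        if current is None:
            continue
        match = COMM_RE.search(line)
        if match and current["comm"] is None:
            current["pid"] = int(match.group(1))
            current["comm"] = match.group(2)
        match = RIP_RE.search(line)
        if match and current["rip"] is None:
            # Keep only the first kernel RIP; it is repeated after the trace.
            current["rip"] = match.group(1)
    return records
```

Feeding it the kernel messages of the crashed boot (for example the output of `journalctl -k -b -1`, assuming the journal is persistent) gives one entry per fault. If every entry shows a different address and function, that would point more toward hardware than toward a single software bug, though that is a rule of thumb and not proof.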
 
Jan 14 17:58:58 pve1 kernel: BUG: unable to handle page fault for address

Well, I would start by running memtest86+ for multiple passes. (This may take a night, or even a day or two...)

Check the BIOS settings and disable everything that smells like "overclocking".

For more hints you should at least list the hardware details of the server - the more details you post, the higher the chance for a helpful reply ;-)
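If it helps, the hardware details can be gathered in one go with a small script. This is just a sketch: the choice of commands and the names `COMMANDS`, `collect_hardware_report` and `format_report` are my own assumptions, and `dmidecode` usually needs root to print anything useful.

```python
import shutil
import subprocess

# Commands usually present on a Proxmox VE / Debian host (an assumption;
# anything that is not installed is reported as such and skipped).
COMMANDS = {
    "pve_version": ["pveversion", "-v"],
    "kernel": ["uname", "-a"],
    "cpu": ["lscpu"],
    "memory_modules": ["dmidecode", "-t", "memory"],
    "block_devices": ["lsblk", "-o", "NAME,MODEL,SIZE,TYPE"],
    "pci_devices": ["lspci", "-nn"],
}


def collect_hardware_report(commands=COMMANDS, timeout=15):
    """Run each command and return {label: output or a short failure note}."""
    report = {}
    for label, argv in commands.items():
        if shutil.which(argv[0]) is None:
            report[label] = f"({argv[0]} not installed)"
            continue
        try:
            result = subprocess.run(argv, capture_output=True,
                                    text=True, timeout=timeout)
            # Fall back to stderr so permission errors are visible.
            report[label] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[label] = f"({argv[0]} failed: {exc})"
    return report


def format_report(report):
    """Join the sections into plain text that can be pasted into a post."""
    return "\n\n".join(f"== {label} ==\n{text}"
                       for label, text in report.items())
```

Calling `print(format_report(collect_hardware_report()))` as root prints one section per command. Check the output for serial numbers or other details you would rather not publish before posting it.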
 
It's a low-power mini PC with an Intel N100 (no overclocking), 4x Intel I226 2.5 Gbps NICs, one 32 GB DDR5-4800 RAM module, and a 1 TB Kingston M.2 NVMe SSD. It had been running 24/7 for about 6 weeks with 5 LXCs and 1 KVM guest without any issue.
 
