Pagefaults and Hang-Ups ever since Upgrade to 7.x

RedChili

New Member
Mar 19, 2023
Hello,

I've been chasing random hang-ups of my PVE server for months now without much success. I had automatic reboots enabled in the kernel on segfaults to keep the system up as much as possible, and never found a smoking gun in the journal; most likely it just rebooted or died before anything useful could be written.
Now I finally got a detailed trace... see below.

The system is a single-node home lab: AMD Ryzen 5 PRO 4650G with Radeon Graphics, 64 GB RAM, 2 NVMe drives, 2 SSDs, running 7 VMs and 2 LXC containers.
PVE root is on ext4, VMs are on BTRFS (I also had XFS before); not heavily used.
I've checked:
  1. Temperature is not an issue
  2. RAM didn't show any errors with an extensive memtest86
  3. Disks/FS are all OK
Here is the page fault I captured. I have no idea how to interpret it:

Code:
-- Journal begins at Thu 2023-03-23 23:50:09 CET, ends at Sun 2023-06-04 19:21:01 CEST. --
May 27 17:49:42 elcapitan kernel: BUG: unable to handle page fault for address: 000000000000ba72
May 27 17:49:42 elcapitan kernel: #PF: supervisor write access in kernel mode
May 27 17:49:42 elcapitan kernel: #PF: error_code(0x0002) - not-present page
May 27 17:49:42 elcapitan kernel: PGD 0 P4D 0
May 27 17:49:42 elcapitan kernel: Oops: 0002 [#1] SMP NOPTI
May 27 17:49:42 elcapitan kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O      5.15.107-2-pve #1
May 27 17:49:42 elcapitan kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X300M-STX, BIOS P1.40 08/04/2020
May 27 17:49:42 elcapitan kernel: RIP: 0010:__update_load_avg_se+0x12d/0x690
May 27 17:49:42 elcapitan kernel: Code: 00 49 8b 04 24 48 85 c0 74 11 48 c1 e8 0a ba 02 00 00 00 48 83 f8 02 48 0f 42 c2 49 0f af 84 24 88 01 00 00 41 8d 8e 7e b6 00 <00> 31 d2 48 f7 f1 31 d2 49 89 84 24 a0 01 00 00 49 8b 84 24 90 01
May 27 17:49:42 elcapitan kernel: RSP: 0018:ffffb0e080003e40 EFLAGS: 00010046
May 27 17:49:42 elcapitan kernel: RAX: 0000000000000000 RBX: ffff980b8bccc000 RCX: 000000000000ba72
May 27 17:49:42 elcapitan kernel: RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000f40
May 27 17:49:42 elcapitan kernel: RBP: ffffb0e080003ea8 R08: 0000000000000000 R09: 0000000000000003
May 27 17:49:42 elcapitan kernel: R10: 00000000000000b4 R11: 0000000000000000 R12: ffff98140a2a6a00
May 27 17:49:42 elcapitan kernel: R13: 0000000000000000 R14: 00000000000003f4 R15: 0000000000000f40
May 27 17:49:42 elcapitan kernel: FS:  0000000000000000(0000) GS:ffff9819aea00000(0000) knlGS:0000000000000000
May 27 17:49:42 elcapitan kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 27 17:49:42 elcapitan kernel: CR2: 000000000000ba72 CR3: 0000000108dee000 CR4: 0000000000350ef0
May 27 17:49:42 elcapitan kernel: Call Trace:
May 27 17:49:42 elcapitan kernel:  <IRQ>
May 27 17:49:42 elcapitan kernel:  ? sched_clock+0x9/0x10
May 27 17:49:42 elcapitan kernel:  ? sched_clock_local+0x17/0x90
May 27 17:49:42 elcapitan kernel:  update_load_avg+0x4c8/0x640
May 27 17:49:42 elcapitan kernel:  update_blocked_averages+0x58a/0x7d0
May 27 17:49:42 elcapitan kernel:  ? lapic_next_event+0x21/0x30
May 27 17:49:42 elcapitan kernel:  ? clockevents_program_event+0xab/0x130
May 27 17:49:42 elcapitan kernel:  run_rebalance_domains+0x4b/0x80
May 27 17:49:42 elcapitan kernel:  __do_softirq+0xd9/0x2ea
May 27 17:49:42 elcapitan kernel:  irq_exit_rcu+0x94/0xc0
May 27 17:49:42 elcapitan kernel:  sysvec_apic_timer_interrupt+0x80/0x90
May 27 17:49:42 elcapitan kernel:  </IRQ>
May 27 17:49:42 elcapitan kernel:  <TASK>
May 27 17:49:42 elcapitan kernel:  asm_sysvec_apic_timer_interrupt+0x1b/0x20
May 27 17:49:42 elcapitan kernel: RIP: 0010:cpuidle_enter_state+0xd9/0x620
May 27 17:49:42 elcapitan kernel: Code: 3d 64 6c 1e 61 e8 f7 2a 6d ff 49 89 c7 0f 1f 44 00 00 31 ff e8 38 38 6d ff 80 7d d0 00 0f 85 5e 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 6a 01 00 00 4d 63 ee 49 83 fd 09 0f 87 e5 03 00 00
May 27 17:49:42 elcapitan kernel: RSP: 0018:ffffffffa0003da0 EFLAGS: 00000246
May 27 17:49:42 elcapitan kernel: RAX: ffff9819aea30bc0 RBX: ffff980b83d43000 RCX: 0000de52151fa6df
May 27 17:49:42 elcapitan kernel: RDX: 000000000000003c RSI: 0000de52151fa6df RDI: 0000000000000000
May 27 17:49:42 elcapitan kernel: RBP: ffffffffa0003df0 R08: 0000de52151fa71b R09: 00000000000aae60
May 27 17:49:42 elcapitan kernel: R10: 0000000000000004 R11: 071c71c71c71c71c R12: ffffffffa02e7a00
May 27 17:49:42 elcapitan kernel: R13: 0000000000000001 R14: 0000000000000001 R15: 0000de52151fa71b
May 27 17:49:42 elcapitan kernel:  ? sched_clock_local+0x17/0x90
May 27 17:49:42 elcapitan kernel:  cpuidle_enter+0x2e/0x50
May 27 17:49:42 elcapitan kernel:  do_idle+0x20d/0x2b0
May 27 17:49:42 elcapitan kernel:  cpu_startup_entry+0x20/0x30
May 27 17:49:42 elcapitan kernel:  rest_init+0xd3/0x100
May 27 17:49:42 elcapitan kernel:  ? acpi_enable_subsystem+0x21d/0x229
May 27 17:49:42 elcapitan kernel:  arch_call_rest_init+0xe/0x23
May 27 17:49:42 elcapitan kernel:  start_kernel+0x9b2/0x9dc
May 27 17:49:42 elcapitan kernel:  x86_64_start_reservations+0x24/0x2a
May 27 17:49:42 elcapitan kernel:  x86_64_start_kernel+0xfe/0x109
May 27 17:49:42 elcapitan kernel:  secondary_startup_64_no_verify+0xc2/0xcb
May 27 17:49:42 elcapitan kernel:  </TASK>
May 27 17:49:42 elcapitan kernel: Modules linked in: cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat overlay unix_diag tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common edac_mce_amd snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm_amd snd_hda_codec snd_hda_core kvm irqbypass snd_hwdep snd_pcm crct10dif_pclmul ghash_clmulni_intel snd_timer input_leds aesni_intel crypto_simd snd cryptd rapl soundcore ccp k10temp efi_pstore wmi_bmof pcspkr zfs(PO) mac_hid zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nct6775 hwmon_vid drm
May 27 17:49:42 elcapitan kernel:  sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq simplefb hid_generic usbkbd usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c xhci_pci ahci xhci_pci_renesas crc32_pclmul libahci i2c_piix4 xhci_hcd nvme r8169 realtek nvme_core wmi video
May 27 17:49:42 elcapitan kernel: CR2: 000000000000ba72
May 27 17:49:42 elcapitan kernel: ---[ end trace a4de431c6f245348 ]---

(After this first CPU lockup, further lockups with different traces occur on other CPUs until reboot.)
If someone smarter than me has an idea, I'd be happy to get a pointer to a solution :-)
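
As a starting point for reading the first lines of such a trace, here is a minimal Python sketch that pulls the fault address and the `#PF` error code out of journal lines and spells out the low five bits of the x86 page-fault error code. The function names (`decode_pf_error_code`, `parse_page_fault`) are made up for illustration, and the sketch only restates what the kernel already printed (`0x0002` = write access to a not-present page from supervisor mode); it does not identify the root cause.

```python
import re

# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user-mode access", "supervisor (kernel) access"),
    (0x08, "reserved bit set in a page-table entry", None),
    (0x10, "instruction fetch", None),
]

ADDR_RE = re.compile(r"unable to handle page fault for address: ([0-9a-f]+)")
CODE_RE = re.compile(r"#PF: error_code\(0x([0-9a-f]+)\)")


def decode_pf_error_code(code):
    """Return human-readable meanings for the low five error-code bits."""
    meanings = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            meanings.append(if_set)
        elif if_clear is not None:
            meanings.append(if_clear)
    return meanings


def parse_page_fault(journal_lines):
    """Extract (fault address, error code) from oops lines; None if absent."""
    address = code = None
    for line in journal_lines:
        m = ADDR_RE.search(line)
        if m:
            address = int(m.group(1), 16)
        m = CODE_RE.search(line)
        if m:
            code = int(m.group(1), 16)
    return address, code


if __name__ == "__main__":
    # Two lines copied from the trace above.
    sample = [
        "May 27 17:49:42 elcapitan kernel: BUG: unable to handle page fault for address: 000000000000ba72",
        "May 27 17:49:42 elcapitan kernel: #PF: error_code(0x0002) - not-present page",
    ]
    address, code = parse_page_fault(sample)
    print(hex(address), decode_pf_error_code(code))
```

The same functions can be fed the output of `journalctl -k` split into lines, which makes it easier to compare several crashes side by side.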
 
X300M-STX, BIOS P1.40 08/04/2020

I would start by updating the BIOS/UEFI [1] to at least 1.70 or 1.80A, assuming what you have is the ASRock DeskMini X300. Otherwise, check for an update for the mainboard/system you actually have.
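
To check which BIOS a crash log was recorded with, the `Hardware name:` line of the trace can be parsed. The following is a rough Python sketch under a few assumptions: the helper names are hypothetical, the regular expression is only fitted to the line format seen in this thread, and the version comparison simply compares the numbers in the string (so the `P` prefix and a trailing letter such as the `A` in `1.80A` are ignored).

```python
import re
from pathlib import Path

# Fitted to: "Hardware name: <vendor/product>/<board>, BIOS <version> <MM/DD/YYYY>"
HW_RE = re.compile(
    r"Hardware name: .*/(?P<board>[^,/]+), BIOS (?P<version>\S+) "
    r"(?P<date>\d{2}/\d{2}/\d{4})"
)


def parse_hardware_name(line):
    """Return {'board', 'version', 'date'} from an oops line, or None."""
    m = HW_RE.search(line)
    return m.groupdict() if m else None


def version_tuple(version):
    """Heuristic: keep only the numeric parts, e.g. 'P1.40' -> (1, 40)."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def bios_older_than(current, minimum):
    """True if the current BIOS version sorts below the minimum."""
    return version_tuple(current) < version_tuple(minimum)


def read_live_bios_version(dmi_dir="/sys/class/dmi/id"):
    """Read the running system's BIOS version from sysfs, if available."""
    path = Path(dmi_dir) / "bios_version"
    return path.read_text().strip() if path.exists() else None


if __name__ == "__main__":
    line = ("May 27 17:49:42 elcapitan kernel: Hardware name: To Be Filled By "
            "O.E.M. To Be Filled By O.E.M./X300M-STX, BIOS P1.40 08/04/2020")
    info = parse_hardware_name(line)
    print(info, "update suggested:", bios_older_than(info["version"], "1.70"))
```

After flashing, `read_live_bios_version()` gives a quick confirmation that the new version is actually the one running.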

Would also check for firmware updates for all of your SSDs.

If you still encounter problems after the BIOS/UEFI update, I would try the opt-in 6.2 kernel [2].

In addition to the above, installing the amd64-microcode package [3] could not hurt either.
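
To see whether a microcode update actually took effect, the revision can be compared before and after a reboot. This is a small Python sketch, assuming an x86 Linux host where `/proc/cpuinfo` carries a `microcode` field per logical CPU; the function names are made up for illustration.

```python
import re
from pathlib import Path

# Matches lines such as "microcode<TAB>: 0x...." in /proc/cpuinfo.
MICROCODE_RE = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", re.MULTILINE)


def microcode_revisions(cpuinfo_text):
    """Return the set of distinct microcode revisions across logical CPUs."""
    return {int(rev, 16) for rev in MICROCODE_RE.findall(cpuinfo_text)}


def read_cpuinfo(path="/proc/cpuinfo"):
    """Return the contents of /proc/cpuinfo, or an empty string if missing."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


if __name__ == "__main__":
    revisions = microcode_revisions(read_cpuinfo())
    if revisions:
        print("microcode revision(s):", sorted(hex(r) for r in revisions))
    else:
        print("no microcode field found")
```

All cores would normally report the same revision, so more than one value in the set is worth a closer look.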

[1] https://www.asrock.com/nettop/AMD/DeskMini X300 Series/index.asp#BIOS
[2] https://forum.proxmox.com/threads/opt-in-linux-6-2-kernel-for-proxmox-ve-7-x-available.124189
[3] https://wiki.debian.org/Microcode
 
Those are all valuable hints, thank you!
I had missed the BIOS update because the system had been running rather stably for such a long time. I'll update, install the microcode package as well, and see if that changes anything!
Cheers!
 
After the BIOS upgrade the system has been running stable for a week now, though let's see if that continues. I had a week of uptime before as well, and then it would randomly crash again. Let's hope for the best.
 
