Hello,
I've been tracking random hangs of my PVE server for months now without much success. I had the kernel set to reboot automatically on faults to keep the system up as much as possible, and never found a smoking gun in the journal; most likely the machine rebooted or died before anything useful could be written.
Now I finally got a detailed trace (see below).
The system is a single-node home lab: AMD Ryzen 5 PRO 4650G with Radeon Graphics, 64 GB RAM, 2 NVMe drives, 2 SSDs, running 7 VMs and 2 LXC containers.
PVE root is on ext4, the VMs are on BTRFS (previously on XFS as well); none of it is heavily used.
I've checked:
- Temperature is not an issue
- RAM showed no errors in an extensive memtest86 run
- Disks/FS are all OK
Code:
-- Journal begins at Thu 2023-03-23 23:50:09 CET, ends at Sun 2023-06-04 19:21:01 CEST. --
May 27 17:49:42 elcapitan kernel: BUG: unable to handle page fault for address: 000000000000ba72
May 27 17:49:42 elcapitan kernel: #PF: supervisor write access in kernel mode
May 27 17:49:42 elcapitan kernel: #PF: error_code(0x0002) - not-present page
May 27 17:49:42 elcapitan kernel: PGD 0 P4D 0
May 27 17:49:42 elcapitan kernel: Oops: 0002 [#1] SMP NOPTI
May 27 17:49:42 elcapitan kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 5.15.107-2-pve #1
May 27 17:49:42 elcapitan kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X300M-STX, BIOS P1.40 08/04/2020
May 27 17:49:42 elcapitan kernel: RIP: 0010:__update_load_avg_se+0x12d/0x690
May 27 17:49:42 elcapitan kernel: Code: 00 49 8b 04 24 48 85 c0 74 11 48 c1 e8 0a ba 02 00 00 00 48 83 f8 02 48 0f 42 c2 49 0f af 84 24 88 01 00 00 41 8d 8e 7e b6 00 <00> 31 d2 48 f7 f1 31 d2 49 89 84 24 a0 01 00 00 49 8b 84 24 90 01
May 27 17:49:42 elcapitan kernel: RSP: 0018:ffffb0e080003e40 EFLAGS: 00010046
May 27 17:49:42 elcapitan kernel: RAX: 0000000000000000 RBX: ffff980b8bccc000 RCX: 000000000000ba72
May 27 17:49:42 elcapitan kernel: RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000f40
May 27 17:49:42 elcapitan kernel: RBP: ffffb0e080003ea8 R08: 0000000000000000 R09: 0000000000000003
May 27 17:49:42 elcapitan kernel: R10: 00000000000000b4 R11: 0000000000000000 R12: ffff98140a2a6a00
May 27 17:49:42 elcapitan kernel: R13: 0000000000000000 R14: 00000000000003f4 R15: 0000000000000f40
May 27 17:49:42 elcapitan kernel: FS: 0000000000000000(0000) GS:ffff9819aea00000(0000) knlGS:0000000000000000
May 27 17:49:42 elcapitan kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 27 17:49:42 elcapitan kernel: CR2: 000000000000ba72 CR3: 0000000108dee000 CR4: 0000000000350ef0
May 27 17:49:42 elcapitan kernel: Call Trace:
May 27 17:49:42 elcapitan kernel: <IRQ>
May 27 17:49:42 elcapitan kernel: ? sched_clock+0x9/0x10
May 27 17:49:42 elcapitan kernel: ? sched_clock_local+0x17/0x90
May 27 17:49:42 elcapitan kernel: update_load_avg+0x4c8/0x640
May 27 17:49:42 elcapitan kernel: update_blocked_averages+0x58a/0x7d0
May 27 17:49:42 elcapitan kernel: ? lapic_next_event+0x21/0x30
May 27 17:49:42 elcapitan kernel: ? clockevents_program_event+0xab/0x130
May 27 17:49:42 elcapitan kernel: run_rebalance_domains+0x4b/0x80
May 27 17:49:42 elcapitan kernel: __do_softirq+0xd9/0x2ea
May 27 17:49:42 elcapitan kernel: irq_exit_rcu+0x94/0xc0
May 27 17:49:42 elcapitan kernel: sysvec_apic_timer_interrupt+0x80/0x90
May 27 17:49:42 elcapitan kernel: </IRQ>
May 27 17:49:42 elcapitan kernel: <TASK>
May 27 17:49:42 elcapitan kernel: asm_sysvec_apic_timer_interrupt+0x1b/0x20
May 27 17:49:42 elcapitan kernel: RIP: 0010:cpuidle_enter_state+0xd9/0x620
May 27 17:49:42 elcapitan kernel: Code: 3d 64 6c 1e 61 e8 f7 2a 6d ff 49 89 c7 0f 1f 44 00 00 31 ff e8 38 38 6d ff 80 7d d0 00 0f 85 5e 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 6a 01 00 00 4d 63 ee 49 83 fd 09 0f 87 e5 03 00 00
May 27 17:49:42 elcapitan kernel: RSP: 0018:ffffffffa0003da0 EFLAGS: 00000246
May 27 17:49:42 elcapitan kernel: RAX: ffff9819aea30bc0 RBX: ffff980b83d43000 RCX: 0000de52151fa6df
May 27 17:49:42 elcapitan kernel: RDX: 000000000000003c RSI: 0000de52151fa6df RDI: 0000000000000000
May 27 17:49:42 elcapitan kernel: RBP: ffffffffa0003df0 R08: 0000de52151fa71b R09: 00000000000aae60
May 27 17:49:42 elcapitan kernel: R10: 0000000000000004 R11: 071c71c71c71c71c R12: ffffffffa02e7a00
May 27 17:49:42 elcapitan kernel: R13: 0000000000000001 R14: 0000000000000001 R15: 0000de52151fa71b
May 27 17:49:42 elcapitan kernel: ? sched_clock_local+0x17/0x90
May 27 17:49:42 elcapitan kernel: cpuidle_enter+0x2e/0x50
May 27 17:49:42 elcapitan kernel: do_idle+0x20d/0x2b0
May 27 17:49:42 elcapitan kernel: cpu_startup_entry+0x20/0x30
May 27 17:49:42 elcapitan kernel: rest_init+0xd3/0x100
May 27 17:49:42 elcapitan kernel: ? acpi_enable_subsystem+0x21d/0x229
May 27 17:49:42 elcapitan kernel: arch_call_rest_init+0xe/0x23
May 27 17:49:42 elcapitan kernel: start_kernel+0x9b2/0x9dc
May 27 17:49:42 elcapitan kernel: x86_64_start_reservations+0x24/0x2a
May 27 17:49:42 elcapitan kernel: x86_64_start_kernel+0xfe/0x109
May 27 17:49:42 elcapitan kernel: secondary_startup_64_no_verify+0xc2/0xcb
May 27 17:49:42 elcapitan kernel: </TASK>
May 27 17:49:42 elcapitan kernel: Modules linked in: cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat overlay unix_diag tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common edac_mce_amd snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm_amd snd_hda_codec snd_hda_core kvm irqbypass snd_hwdep snd_pcm crct10dif_pclmul ghash_clmulni_intel snd_timer input_leds aesni_intel crypto_simd snd cryptd rapl soundcore ccp k10temp efi_pstore wmi_bmof pcspkr zfs(PO) mac_hid zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nct6775 hwmon_vid drm
May 27 17:49:42 elcapitan kernel: sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq simplefb hid_generic usbkbd usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c xhci_pci ahci xhci_pci_renesas crc32_pclmul libahci i2c_piix4 xhci_hcd nvme r8169 realtek nvme_core wmi video
May 27 17:49:42 elcapitan kernel: CR2: 000000000000ba72
May 27 17:49:42 elcapitan kernel: ---[ end trace a4de431c6f245348 ]---
(After this first CPU lockup, further lockups with different traces occur on other CPUs until the machine is rebooted.)
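For anyone reading the trace: the error_code(0x0002) on the #PF line is the x86 page-fault error code, and its low bits explain the "supervisor write access" and "not-present page" wording. Below is a minimal Python sketch of that decoding. The helper name decode_pf_error_code is my own invention for illustration, not a kernel or library function, and only the five lowest bits are covered.

```python
from typing import List

# Low bits of the x86 page-fault error code:
# (mask, meaning when set, meaning when clear or None if nothing to report)
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor (kernel) mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error_code(code: int) -> List[str]:
    """Return a human-readable description for each decoded bit."""
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


if __name__ == "__main__":
    # Value taken from the "#PF: error_code(0x0002)" line in the trace.
    print(decode_pf_error_code(0x0002))
```

For 0x0002 this yields a not-present page, a write access, and supervisor mode, which matches the two #PF lines in the journal.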
If someone smarter than me has an idea, I'd be happy to get a pointer to a solution.