[SOLVED] Frequent Kernel Panics

Jan 20, 2022
For a while now I have been dealing with frequent kernel crashes, sometimes twice within minutes, sometimes only once every two weeks.
I couldn't really pin it down, but I was getting severe filesystem errors (BTRFS), which now look like an effect of the crashes rather than the actual cause.

I replaced the NVMe drive holding root, which didn't help much.
I did yet another fresh install on Friday, now with v8.2.2 and with EXT4 instead of BTRFS, and have had 4-5 kernel panics since then.
Sometimes I can recover via REISUB, sometimes only a power-cycle will work. Some crashes leave no trace in the log, some I manage to capture.

With the new NVMe drive I think I can exclude drive issues, and with the switch from BTRFS to EXT4 I can exclude filesystem-related issues, so next on my list is a kernel downgrade. With the fresh install I am on Linux pve2 6.8.4-3-pve and haven't installed any older kernels yet.

The only other cause I can think of right now is faulty RAM. There seem to be quite a few page faults in the log, but again, they may be an effect rather than the actual issue.
I am no expert in reading traces, but a few pointers here make me consider faulty RAM.
Do "not-present page" and page_fault_oops point to RAM issues?

Code:
Jun 01 13:36:46 pve2 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000038
Jun 01 13:36:46 pve2 kernel: #PF: supervisor read access in kernel mode
Jun 01 13:36:47 pve2 kernel: #PF: error_code(0x0000) - not-present page
Jun 01 13:36:47 pve2 kernel: PGD 0 P4D 0
Jun 01 13:36:47 pve2 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jun 01 13:36:47 pve2 kernel: CPU: 1 PID: 24154 Comm: postgres Tainted: P           O       6.8.4-3-pve #1
Jun 01 13:36:47 pve2 kernel: Hardware name: To Be Filled By O.E.M. B550 Steel Legend/B550 Steel Legend, BIOS P2.40 10/19/2022
Jun 01 13:36:47 pve2 kernel: RIP: 0010:__memcg_slab_post_alloc_hook+0x9e/0x230
Jun 01 13:36:47 pve2 kernel: Code: 03 05 3e e9 69 01 48 8b 50 08 49 89 c6 f6 c2 01 0f 85 75 01 00 00 0f 1f 44 00 00 49 8b 06 f6 c4 08 b8 00 00 00 00 4c 0f 44 f0 <49> 8b 46 38 48 83 f8 03 77 20 8b 55 c4 31 c9 4>
Jun 01 13:36:47 pve2 kernel: RSP: 0018:ffffbf3a541dfb18 EFLAGS: 00010246
Jun 01 13:36:47 pve2 kernel: RAX: 0000000000000000 RBX: ffff9970ed03b980 RCX: 0000000000000001
Jun 01 13:36:47 pve2 kernel: RDX: dead000000000100 RSI: ffff99704357a880 RDI: 0000000000000cc0
Jun 01 13:36:47 pve2 kernel: RBP: ffffbf3a541dfb60 R08: ffffbf3a541dfb80 R09: ffff9970ed03b980
Jun 01 13:36:47 pve2 kernel: R10: ffff99704357a880 R11: 0000000000000000 R12: ffff99704357a880
Jun 01 13:36:47 pve2 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff996f8022c800
Jun 01 13:36:47 pve2 kernel: FS:  00007d6ad0ea8b48(0000) GS:ffff99769da80000(0000) knlGS:0000000000000000
Jun 01 13:36:47 pve2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 01 13:36:47 pve2 kernel: CR2: 0000000000000038 CR3: 000000022a618000 CR4: 0000000000f50ef0
Jun 01 13:36:47 pve2 kernel: PKRU: 55555554
Jun 01 13:36:47 pve2 kernel: Call Trace:
Jun 01 13:36:47 pve2 kernel:  <TASK>
Jun 01 13:36:47 pve2 kernel:  ? show_regs+0x6d/0x80
Jun 01 13:36:47 pve2 kernel:  ? __die+0x24/0x80
Jun 01 13:36:47 pve2 kernel:  ? page_fault_oops+0x176/0x500
Jun 01 13:36:47 pve2 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Jun 01 13:36:47 pve2 kernel:  ? do_user_addr_fault+0x2f9/0x6b0
Jun 01 13:36:47 pve2 kernel:  ? exc_page_fault+0x83/0x1b0
Jun 01 13:36:47 pve2 kernel:  ? asm_exc_page_fault+0x27/0x30
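
For anyone comparing several of these traces, here is a minimal Python sketch that pulls the key fields out of an oops like the one above. The helper names (`summarize_oops`, `looks_like_null_deref`) and the 4096-byte threshold are my own assumptions, not part of any kernel tool. Note that "not-present page" only means the accessed address was not mapped. Here the address 0x38 matches CR2 while R14 is zero, which reads like a field access through a NULL pointer. Corrupted memory can cause that, but the message alone does not prove a RAM fault.

```python
import re

# Hypothetical helper patterns for the oops lines printed by the kernel.
OOPS_PATTERNS = {
    "fault_address": re.compile(r"BUG: .*address: ([0-9a-f]+)"),
    "error": re.compile(r"#PF: error_code\(0x[0-9a-f]+\) - (.+?)\s*$"),
    "rip_symbol": re.compile(r"RIP: [0-9a-f]+:([^\s+]+)"),
    "comm": re.compile(r"Comm: (\S+)"),
}


def summarize_oops(lines):
    """Return the first match of each pattern found in the log lines."""
    summary = {}
    for line in lines:
        for key, pattern in OOPS_PATTERNS.items():
            if key in summary:
                continue  # keep the first occurrence only
            match = pattern.search(line)
            if match:
                summary[key] = match.group(1)
    return summary


def looks_like_null_deref(summary, limit=4096):
    """True if the fault address is a small offset from address zero.

    The 4096 limit (one page) is an assumption chosen for illustration.
    """
    address = summary.get("fault_address")
    return address is not None and int(address, 16) < limit


if __name__ == "__main__":
    sample = [
        "Jun 01 13:36:46 pve2 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000038",
        "Jun 01 13:36:47 pve2 kernel: #PF: error_code(0x0000) - not-present page",
        "Jun 01 13:36:47 pve2 kernel: CPU: 1 PID: 24154 Comm: postgres Tainted: P           O       6.8.4-3-pve #1",
        "Jun 01 13:36:47 pve2 kernel: RIP: 0010:__memcg_slab_post_alloc_hook+0x9e/0x230",
    ]
    print(summarize_oops(sample))
```

If different crashes keep showing different, unrelated RIP symbols and processes, that pattern hints more at hardware than at a single kernel bug, but a memory test is still the only way to confirm it.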

I am adding the full log here in case someone can read more into it.
 


Thanks, both of you. By now I have also found this thread, https://forum.proxmox.com/threads/random-6-8-4-2-pve-kernel-crashes.145760/page-2 and changed the kernel command line to GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"
I also installed kernel 6.5 via apt install pve-kernel-6.5 and created a MemTest USB stick, which I will let run overnight.
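
A change to GRUB_CMDLINE_LINUX_DEFAULT only takes effect after the GRUB configuration has been regenerated and the machine rebooted. As a rough sketch, assuming the setting lives in /etc/default/grub and the running command line is readable from /proc/cmdline, the following Python compares the two. The function names are hypothetical and made up for this example.

```python
import shlex


def grub_default_params(grub_text):
    """Return the parameters from GRUB_CMDLINE_LINUX_DEFAULT.

    If the variable is assigned more than once, the last assignment wins,
    as it would when the file is sourced by a shell.
    """
    params = []
    for raw in grub_text.splitlines():
        line = raw.strip()
        if line.startswith("GRUB_CMDLINE_LINUX_DEFAULT="):
            value = line.split("=", 1)[1]
            parts = shlex.split(value)  # strips the surrounding quotes
            params = parts[0].split() if parts else []
    return params


def missing_from_running(params, proc_cmdline):
    """Return configured parameters that the running kernel did not get."""
    running = set(proc_cmdline.split())
    return [p for p in params if p not in running]


if __name__ == "__main__":
    # Paths are assumptions for a GRUB-booted system.
    with open("/etc/default/grub") as f:
        configured = grub_default_params(f.read())
    with open("/proc/cmdline") as f:
        print(missing_from_running(configured, f.read()))
```

An empty list means every configured parameter reached the running kernel. Anything listed was set in the file but is not active yet.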

I will report back…

Edit:
Well… one of the 4 RAM modules was defective. The initial RAM check with all 4 modules returned a big FAIL. When I then tested them one by one, the server wouldn't even boot with the faulty one.
Quite surprising that this machine was running at all…

I will let the machine run with half the memory for a while now and, if everything remains stable, do a (hopefully) final reinstall to get back to my preferred BTRFS/systemd-boot setup.

Thanks for helping along!
 