For a while now I have been dealing with frequent kernel crashes, sometimes twice within minutes, sometimes only once every two weeks.
I couldn't really pin it down, but I was getting severe filesystem errors (BTRFS), which now look like an effect of the crashes rather than the actual cause.
I replaced the NVMe drive holding root, which didn't help much.
I did yet another fresh install on Friday, now with v8.2.2, and switched from BTRFS to EXT4. Since then I have had 4-5 kernel panics.
Sometimes I can recover via REISUB, sometimes only a power-cycle will work. Some crashes leave no trace in the log, some I manage to capture.
With the new NVMe drive I think I can exclude drive issues, and with the switch from BTRFS to EXT4 I can exclude filesystem-related issues, so next on my list is a kernel downgrade. With the fresh install I am on Linux pve2 6.8.4-3-pve and haven't installed any older kernels yet.
The only other cause I can think of right now is faulty RAM. There seem to be quite a few page faults in the log, but again, they may be an effect rather than the actual issue.
I am no expert in reading traces, but there are a few pointers here which make me suspect faulty RAM.
Do "not-present page" and "page_fault_oops" sound like RAM issues?
Code:
Jun 01 13:36:46 pve2 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000038
Jun 01 13:36:46 pve2 kernel: #PF: supervisor read access in kernel mode
Jun 01 13:36:47 pve2 kernel: #PF: error_code(0x0000) - not-present page
Jun 01 13:36:47 pve2 kernel: PGD 0 P4D 0
Jun 01 13:36:47 pve2 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jun 01 13:36:47 pve2 kernel: CPU: 1 PID: 24154 Comm: postgres Tainted: P O 6.8.4-3-pve #1
Jun 01 13:36:47 pve2 kernel: Hardware name: To Be Filled By O.E.M. B550 Steel Legend/B550 Steel Legend, BIOS P2.40 10/19/2022
Jun 01 13:36:47 pve2 kernel: RIP: 0010:__memcg_slab_post_alloc_hook+0x9e/0x230
Jun 01 13:36:47 pve2 kernel: Code: 03 05 3e e9 69 01 48 8b 50 08 49 89 c6 f6 c2 01 0f 85 75 01 00 00 0f 1f 44 00 00 49 8b 06 f6 c4 08 b8 00 00 00 00 4c 0f 44 f0 <49> 8b 46 38 48 83 f8 03 77 20 8b 55 c4 31 c9 4>
Jun 01 13:36:47 pve2 kernel: RSP: 0018:ffffbf3a541dfb18 EFLAGS: 00010246
Jun 01 13:36:47 pve2 kernel: RAX: 0000000000000000 RBX: ffff9970ed03b980 RCX: 0000000000000001
Jun 01 13:36:47 pve2 kernel: RDX: dead000000000100 RSI: ffff99704357a880 RDI: 0000000000000cc0
Jun 01 13:36:47 pve2 kernel: RBP: ffffbf3a541dfb60 R08: ffffbf3a541dfb80 R09: ffff9970ed03b980
Jun 01 13:36:47 pve2 kernel: R10: ffff99704357a880 R11: 0000000000000000 R12: ffff99704357a880
Jun 01 13:36:47 pve2 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff996f8022c800
Jun 01 13:36:47 pve2 kernel: FS: 00007d6ad0ea8b48(0000) GS:ffff99769da80000(0000) knlGS:0000000000000000
Jun 01 13:36:47 pve2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 01 13:36:47 pve2 kernel: CR2: 0000000000000038 CR3: 000000022a618000 CR4: 0000000000f50ef0
Jun 01 13:36:47 pve2 kernel: PKRU: 55555554
Jun 01 13:36:47 pve2 kernel: Call Trace:
Jun 01 13:36:47 pve2 kernel: <TASK>
Jun 01 13:36:47 pve2 kernel: ? show_regs+0x6d/0x80
Jun 01 13:36:47 pve2 kernel: ? __die+0x24/0x80
Jun 01 13:36:47 pve2 kernel: ? page_fault_oops+0x176/0x500
Jun 01 13:36:47 pve2 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Jun 01 13:36:47 pve2 kernel: ? do_user_addr_fault+0x2f9/0x6b0
Jun 01 13:36:47 pve2 kernel: ? exc_page_fault+0x83/0x1b0
Jun 01 13:36:47 pve2 kernel: ? asm_exc_page_fault+0x27/0x30
I am adding the full log here in case someone can read more into it.
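In case it helps with sifting through the journal, here is a rough Python sketch for pulling the key fields out of an oops and decoding the `#PF` error code. It is only my own helper, not a Proxmox or kernel tool: the names `decode_pf_error_code` and `summarize_oops` are made up for illustration, the regexes assume the log layout shown above, and only the low three x86 page-fault error-code bits are handled.

```python
import re

# Low three bits of the x86 page-fault error code.
PF_PRESENT = 0x1  # 0 = not-present page, 1 = protection violation
PF_WRITE = 0x2    # 0 = read access, 1 = write access
PF_USER = 0x4     # 0 = supervisor (kernel) mode, 1 = user mode


def decode_pf_error_code(code):
    """Turn a #PF error code into a short human-readable description."""
    cause = "protection violation" if code & PF_PRESENT else "not-present page"
    access = "write" if code & PF_WRITE else "read"
    mode = "user" if code & PF_USER else "supervisor"
    return f"{mode} {access}, {cause}"


def summarize_oops(log_text):
    """Extract a few key fields from the first oops found in journal text.

    Hypothetical helper: the patterns assume the layout of the log above.
    """
    patterns = {
        "address": r"BUG: .*address: ([0-9a-f]+)",
        "error_code": r"error_code\((0x[0-9a-f]+)\)",
        "comm": r"Comm: (\S+)",
        "rip": r"RIP: \d+:(\S+)",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log_text)
        summary[key] = match.group(1) if match else None
    if summary["error_code"] is not None:
        summary["fault"] = decode_pf_error_code(int(summary["error_code"], 16))
    return summary


if __name__ == "__main__":
    # A few lines copied from the log above.
    sample = """\
Jun 01 13:36:46 pve2 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000038
Jun 01 13:36:47 pve2 kernel: #PF: error_code(0x0000) - not-present page
Jun 01 13:36:47 pve2 kernel: CPU: 1 PID: 24154 Comm: postgres Tainted: P O 6.8.4-3-pve #1
Jun 01 13:36:47 pve2 kernel: RIP: 0010:__memcg_slab_post_alloc_hook+0x9e/0x230
"""
    for key, value in summarize_oops(sample).items():
        print(f"{key}: {value}")
```

As far as I understand, the error code only describes the kind of access that faulted (here a kernel-mode read of an unmapped address), so on its own it does not say whether the root cause is hardware or software. Running something like this over several captured crashes would at least show whether they always hit the same function or land in different places each time.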