SimplyNUC Server suddenly crashes with PVE 8.4.11

christian.hahn

New Member
Feb 23, 2024
Hi!

I have been working on a PVE setup for 2 weeks now, and one of the servers is a SimplyNUC machine running PVE 8.4.11. Yesterday it suddenly became unreachable and I had to hard-reboot it. I then checked the log files and found the following:

Bash:
Aug 20 18:27:59 site1-snuc pvedaemon[1518]: starting 1 worker(s)
Aug 20 18:27:59 site1-snuc pvedaemon[1518]: worker 37678 started
Aug 20 18:27:59 site1-snuc kernel: general protection fault, probably for non-canonical address 0x8b4557be6fdd5b6d: 0000 [#1] PREEMPT SMP NOPTI
Aug 20 18:27:59 site1-snuc kernel: CPU: 8 PID: 1489 Comm: pvestatd Tainted: P           O       6.8.12-13-pve #1
Aug 20 18:27:59 site1-snuc kernel: Hardware name: Simply NUC NUC24OXGv9/AHWSA, BIOS AHWSA.1.23 04/12/2024
Aug 20 18:27:59 site1-snuc kernel: RIP: 0010:kmem_cache_alloc+0xce/0x370
Aug 20 18:27:59 site1-snuc kernel: Code: 83 78 10 00 48 8b 38 0f 84 48 02 00 00 48 85 ff 0f 84 3f 02 00 00 41 8b 44 24 28 49 8b 9c 24 b8 00 00 00 49 8b 34 24 48 01 f8 <48> 33 18 48 89 c1 48 89 f8 48 0f c9 48>
Aug 20 18:27:59 site1-snuc kernel: RSP: 0018:ffffb8cc02df7aa0 EFLAGS: 00010282
Aug 20 18:27:59 site1-snuc kernel: RAX: 8b4557be6fdd5b6d RBX: efca6af1a646adb6 RCX: 0000000000000000
Aug 20 18:27:59 site1-snuc kernel: RDX: 00000001365ca008 RSI: 000000000003cfe0 RDI: 8b4557be6fdd5b5d
Aug 20 18:27:59 site1-snuc kernel: RBP: ffffb8cc02df7af0 R08: 0000000000000000 R09: 0000000000000000
Aug 20 18:27:59 site1-snuc kernel: R10: ffff8fc0585b6bc0 R11: 0000000000000000 R12: ffff8fc04020b400
Aug 20 18:27:59 site1-snuc kernel: R13: 0000000000000cc0 R14: 0000000000000028 R15: ffffffffb0eeb1ef
Aug 20 18:27:59 site1-snuc kernel: FS:  000076ab32f0fb80(0000) GS:ffff8fcfaf600000(0000) knlGS:0000000000000000
Aug 20 18:27:59 site1-snuc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 20 18:27:59 site1-snuc kernel: CR2: 00005c0b5061ed48 CR3: 000000012bcc2000 CR4: 0000000000f52ef0
Aug 20 18:27:59 site1-snuc kernel: PKRU: 55555554
Aug 20 18:27:59 site1-snuc kernel: Call Trace:
Aug 20 18:27:59 site1-snuc kernel:  <TASK>
Aug 20 18:27:59 site1-snuc kernel:  ? show_regs+0x6d/0x80
Aug 20 18:27:59 site1-snuc kernel:  ? die_addr+0x37/0xa0
Aug 20 18:27:59 site1-snuc kernel:  ? exc_general_protection+0x1dc/0x480
Aug 20 18:27:59 site1-snuc kernel:  ? asm_exc_general_protection+0x27/0x30
Aug 20 18:27:59 site1-snuc kernel:  ? vm_area_dup+0x4f/0x140
Aug 20 18:27:59 site1-snuc kernel:  ? kmem_cache_alloc+0xce/0x370
Aug 20 18:27:59 site1-snuc kernel:  vm_area_dup+0x4f/0x140
Aug 20 18:27:59 site1-snuc kernel:  ? mas_find+0x76/0x150
Aug 20 18:27:59 site1-snuc kernel:  copy_process+0x1f5e/0x2510
Aug 20 18:27:59 site1-snuc kernel:  kernel_clone+0xbd/0x440
Aug 20 18:27:59 site1-snuc kernel:  ? do_fcntl+0x437/0x6b0
Aug 20 18:27:59 site1-snuc kernel:  __do_sys_clone+0x69/0xa0
Aug 20 18:27:59 site1-snuc kernel:  __x64_sys_clone+0x25/0x40
Aug 20 18:27:59 site1-snuc kernel:  x64_sys_call+0x18f9/0x2480
Aug 20 18:27:59 site1-snuc kernel:  do_syscall_64+0x81/0x170
Aug 20 18:27:59 site1-snuc kernel:  ? do_syscall_64+0x8d/0x170
Aug 20 18:27:59 site1-snuc kernel:  ? __handle_mm_fault+0xba9/0xf70
Aug 20 18:27:59 site1-snuc kernel:  ? do_syscall_64+0x8d/0x170
Aug 20 18:27:59 site1-snuc kernel:  ? __count_memcg_events+0x6f/0xe0
Aug 20 18:27:59 site1-snuc kernel:  ? count_memcg_events.constprop.0+0x2a/0x50
Aug 20 18:27:59 site1-snuc kernel:  ? handle_mm_fault+0xad/0x380
Aug 20 18:27:59 site1-snuc kernel:  ? do_user_addr_fault+0x33f/0x660
Aug 20 18:27:59 site1-snuc kernel:  ? irqentry_exit_to_user_mode+0x7b/0x260
Aug 20 18:27:59 site1-snuc kernel:  ? irqentry_exit+0x43/0x50
Aug 20 18:27:59 site1-snuc kernel:  ? exc_page_fault+0x94/0x1b0
Aug 20 18:27:59 site1-snuc kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
Aug 20 18:27:59 site1-snuc kernel: RIP: 0033:0x76ab33021353
Aug 20 18:27:59 site1-snuc kernel: Code: 00 00 00 00 00 66 90 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75>
Aug 20 18:27:59 site1-snuc kernel: RSP: 002b:00007ffe128aa8e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Aug 20 18:27:59 site1-snuc kernel: RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 000076ab33021353
Aug 20 18:27:59 site1-snuc kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Aug 20 18:27:59 site1-snuc kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Aug 20 18:27:59 site1-snuc kernel: R10: 000076ab32f0fe50 R11: 0000000000000246 R12: 0000000000000001
Aug 20 18:27:59 site1-snuc kernel: R13: 00007ffe128aaa00 R14: 00007ffe128aaa80 R15: 000076ab3324a020
Aug 20 18:27:59 site1-snuc kernel:  </TASK>


I asked ChatGPT to analyze the kernel stack trace, and it suggested that this could be a hardware issue. I then ran 7 passes of memtest86 and several other tools for several hours (stressapptest, stress-ng, mprime), but none of them reported a hardware problem.
Does anyone have an idea what the issue could be, and how to solve it or what to check next? This server will be relocated to a remote facility where it will be hard to reach and maintain.
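
For quick triage, the headline facts of the trace above (faulting address, affected task, kernel version, faulting kernel function) can be pulled out of the log with a small script. This is only a minimal sketch in Python: `summarize_oops` and its regular expressions are hypothetical helpers written against the line format shown in this post, not part of any PVE tooling, and other kernels may format these lines differently.

```python
import re

# Three lines copied verbatim from the log in this post.
SAMPLE = """\
Aug 20 18:27:59 site1-snuc kernel: general protection fault, probably for non-canonical address 0x8b4557be6fdd5b6d: 0000 [#1] PREEMPT SMP NOPTI
Aug 20 18:27:59 site1-snuc kernel: CPU: 8 PID: 1489 Comm: pvestatd Tainted: P           O       6.8.12-13-pve #1
Aug 20 18:27:59 site1-snuc kernel: RIP: 0010:kmem_cache_alloc+0xce/0x370
"""


def summarize_oops(text):
    """Return the headline fields of the first kernel oops found in text.

    Hypothetical helper: the patterns assume the syslog-style layout
    shown in this post ("<timestamp> <host> kernel: <message>").
    """
    summary = {}
    for line in text.splitlines():
        # Keep only kernel messages and drop the syslog prefix.
        _, sep, msg = line.partition(" kernel: ")
        if not sep:
            continue

        # Address reported by the general protection fault.
        m = re.search(r"non-canonical address (0x[0-9a-f]+)", msg)
        if m:
            summary.setdefault("address", m.group(1))

        # CPU, PID, task name, and the kernel release just before "#<build>".
        m = re.match(r"CPU: (\d+) PID: (\d+) Comm: (\S+) .*?(\S+) #\d+$", msg)
        if m:
            summary.setdefault("cpu", m.group(1))
            summary.setdefault("pid", m.group(2))
            summary.setdefault("comm", m.group(3))
            summary.setdefault("kernel", m.group(4))

        # Kernel-mode RIP only (code segment 0010); the later
        # user-mode RIP line (0033) carries no symbol name.
        m = re.match(r"RIP: 0010:([\w.]+)\+0x", msg)
        if m:
            summary.setdefault("rip", m.group(1))
    return summary


if __name__ == "__main__":
    for key, value in summarize_oops(SAMPLE).items():
        print(f"{key}: {value}")
```

The same function can be fed the full saved log text instead of `SAMPLE`. It keeps the first occurrence of each field, so only the initial fault is summarized if several traces follow each other.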