Hi!
I've been working on a PVE setup for two weeks now, and one of the servers is a SimplyNUC machine running PVE 8.4.11. Yesterday it suddenly became unreachable and I had to hard reboot it. I then checked the log files and saw the following:
Bash:
Aug 20 18:27:59 site1-snuc pvedaemon[1518]: starting 1 worker(s)
Aug 20 18:27:59 site1-snuc pvedaemon[1518]: worker 37678 started
Aug 20 18:27:59 site1-snuc kernel: general protection fault, probably for non-canonical address 0x8b4557be6fdd5b6d: 0000 [#1] PREEMPT SMP NOPTI
Aug 20 18:27:59 site1-snuc kernel: CPU: 8 PID: 1489 Comm: pvestatd Tainted: P O 6.8.12-13-pve #1
Aug 20 18:27:59 site1-snuc kernel: Hardware name: Simply NUC NUC24OXGv9/AHWSA, BIOS AHWSA.1.23 04/12/2024
Aug 20 18:27:59 site1-snuc kernel: RIP: 0010:kmem_cache_alloc+0xce/0x370
Aug 20 18:27:59 site1-snuc kernel: Code: 83 78 10 00 48 8b 38 0f 84 48 02 00 00 48 85 ff 0f 84 3f 02 00 00 41 8b 44 24 28 49 8b 9c 24 b8 00 00 00 49 8b 34 24 48 01 f8 <48> 33 18 48 89 c1 48 89 f8 48 0f c9 48>
Aug 20 18:27:59 site1-snuc kernel: RSP: 0018:ffffb8cc02df7aa0 EFLAGS: 00010282
Aug 20 18:27:59 site1-snuc kernel: RAX: 8b4557be6fdd5b6d RBX: efca6af1a646adb6 RCX: 0000000000000000
Aug 20 18:27:59 site1-snuc kernel: RDX: 00000001365ca008 RSI: 000000000003cfe0 RDI: 8b4557be6fdd5b5d
Aug 20 18:27:59 site1-snuc kernel: RBP: ffffb8cc02df7af0 R08: 0000000000000000 R09: 0000000000000000
Aug 20 18:27:59 site1-snuc kernel: R10: ffff8fc0585b6bc0 R11: 0000000000000000 R12: ffff8fc04020b400
Aug 20 18:27:59 site1-snuc kernel: R13: 0000000000000cc0 R14: 0000000000000028 R15: ffffffffb0eeb1ef
Aug 20 18:27:59 site1-snuc kernel: FS: 000076ab32f0fb80(0000) GS:ffff8fcfaf600000(0000) knlGS:0000000000000000
Aug 20 18:27:59 site1-snuc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 20 18:27:59 site1-snuc kernel: CR2: 00005c0b5061ed48 CR3: 000000012bcc2000 CR4: 0000000000f52ef0
Aug 20 18:27:59 site1-snuc kernel: PKRU: 55555554
Aug 20 18:27:59 site1-snuc kernel: Call Trace:
Aug 20 18:27:59 site1-snuc kernel: <TASK>
Aug 20 18:27:59 site1-snuc kernel: ? show_regs+0x6d/0x80
Aug 20 18:27:59 site1-snuc kernel: ? die_addr+0x37/0xa0
Aug 20 18:27:59 site1-snuc kernel: ? exc_general_protection+0x1dc/0x480
Aug 20 18:27:59 site1-snuc kernel: ? asm_exc_general_protection+0x27/0x30
Aug 20 18:27:59 site1-snuc kernel: ? vm_area_dup+0x4f/0x140
Aug 20 18:27:59 site1-snuc kernel: ? kmem_cache_alloc+0xce/0x370
Aug 20 18:27:59 site1-snuc kernel: vm_area_dup+0x4f/0x140
Aug 20 18:27:59 site1-snuc kernel: ? mas_find+0x76/0x150
Aug 20 18:27:59 site1-snuc kernel: copy_process+0x1f5e/0x2510
Aug 20 18:27:59 site1-snuc kernel: kernel_clone+0xbd/0x440
Aug 20 18:27:59 site1-snuc kernel: ? do_fcntl+0x437/0x6b0
Aug 20 18:27:59 site1-snuc kernel: __do_sys_clone+0x69/0xa0
Aug 20 18:27:59 site1-snuc kernel: __x64_sys_clone+0x25/0x40
Aug 20 18:27:59 site1-snuc kernel: x64_sys_call+0x18f9/0x2480
Aug 20 18:27:59 site1-snuc kernel: do_syscall_64+0x81/0x170
Aug 20 18:27:59 site1-snuc kernel: ? do_syscall_64+0x8d/0x170
Aug 20 18:27:59 site1-snuc kernel: ? __handle_mm_fault+0xba9/0xf70
Aug 20 18:27:59 site1-snuc kernel: ? do_syscall_64+0x8d/0x170
Aug 20 18:27:59 site1-snuc kernel: ? __count_memcg_events+0x6f/0xe0
Aug 20 18:27:59 site1-snuc kernel: ? count_memcg_events.constprop.0+0x2a/0x50
Aug 20 18:27:59 site1-snuc kernel: ? handle_mm_fault+0xad/0x380
Aug 20 18:27:59 site1-snuc kernel: ? do_user_addr_fault+0x33f/0x660
Aug 20 18:27:59 site1-snuc kernel: ? irqentry_exit_to_user_mode+0x7b/0x260
Aug 20 18:27:59 site1-snuc kernel: ? irqentry_exit+0x43/0x50
Aug 20 18:27:59 site1-snuc kernel: ? exc_page_fault+0x94/0x1b0
Aug 20 18:27:59 site1-snuc kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
Aug 20 18:27:59 site1-snuc kernel: RIP: 0033:0x76ab33021353
Aug 20 18:27:59 site1-snuc kernel: Code: 00 00 00 00 00 66 90 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75>
Aug 20 18:27:59 site1-snuc kernel: RSP: 002b:00007ffe128aa8e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Aug 20 18:27:59 site1-snuc kernel: RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 000076ab33021353
Aug 20 18:27:59 site1-snuc kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Aug 20 18:27:59 site1-snuc kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Aug 20 18:27:59 site1-snuc kernel: R10: 000076ab32f0fe50 R11: 0000000000000246 R12: 0000000000000001
Aug 20 18:27:59 site1-snuc kernel: R13: 00007ffe128aaa00 R14: 00007ffe128aaa80 R15: 000076ab3324a020
Aug 20 18:27:59 site1-snuc kernel: </TASK>
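As a side note on the first kernel line above: "non-canonical address" means the pointer the kernel tried to dereference is not a valid x86-64 virtual address at all, because its upper bits are not all zeros or all ones. Below is a minimal Python sketch of that check, using the values from the register dump. It assumes 48-bit virtual addresses (4-level paging), which I have not verified for this machine; `is_canonical` is just a helper name I made up for illustration.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if a 64-bit address is canonical for the given VA width.

    Assumption: va_bits=48 (4-level paging). Canonical means that bit
    (va_bits - 1) and all bits above it are identical.
    """
    top = addr >> (va_bits - 1)               # bit 47 and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits is 48
    return top == 0 or top == all_ones


fault_addr = 0x8b4557be6fdd5b6d  # from the "general protection fault" line (same as RAX)
rdi = 0x8b4557be6fdd5b5d         # RDI from the same register dump
stack_ptr = 0xffffb8cc02df7aa0   # RSP, a normal kernel address for comparison

print(is_canonical(fault_addr))  # False
print(is_canonical(stack_ptr))   # True
print(hex(fault_addr - rdi))     # 0x10
```

So the faulting address (RAX) is exactly RDI + 0x10, and both look like random data rather than a kernel pointer. This only shows what the message means, not where the bad value came from.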
I tried to check the kernel stack trace with ChatGPT, and it advised that there could be a hardware issue. I then ran 7 iterations of memtest86 and several other tools for several hours (stressapptest, stress-ng, mprime), but none of them showed any hardware problem.
Does anyone have an idea what the issue could be, and how to solve it or what to check next? This server will be relocated to a remote facility that is hard to reach and maintain afterwards.