Dear all,
I have been fighting repeated Proxmox freezes for months. I now think it could be a hardware fault, but I can't pin it down and would appreciate an opinion or help in narrowing down the problem.
I started with Proxmox kernel 6.8.12, but quickly applied the workaround of installing a newer Ubuntu kernel (as proposed here: https://forum.proxmox.com/threads/proxmox-kernel-6-8-12-2-freezes-again.154875/page-5). I also installed intel-microcode. Both measures increased the time between crashes: at the beginning it was roughly 30 min to 2 hrs, afterwards I had uptimes of around 10 hrs.
I have now installed Proxmox kernel 6.11 (fresh install from ISO, followed by apt update), which also crashes, each time with different messages, so I still can't narrow down the problem. Some examples of recent crashes:
[ 2212.714426] BUG: unable to handle page fault for address: ffffa0a1bffff2b8
<1>[ 2212.714452] #PF: supervisor read access in kernel mode
<1>[ 2212.714460] #PF: error_code(0x0000) - not-present page
<6>[ 2212.714468] PGD 3e7c01067 P4D 3e7c01067 PUD 0
<4>[ 2212.714476] Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[ 2212.714484] CPU: 2 PID: 962 Comm: pve-firewall Tainted: P O 6.8.12-4-pve #1
<4>[ 2212.714503] Hardware name: Shuttle Inc. DL30N/DL30N, BIOS 1.05 07/18/2024
<4>[ 2212.714517] RIP: 0010:vmap_small_pages_range_noflush+0x260/0x530
<4>[ 2212.714533] Code: 0f 00 00 48 01 d0 49 89 c6 0f 84 cf 01 00 00 49 8d 44 24 ff 48 89 5d 90 4c 89 eb 48 89 45 a8 49 8d 87 00 00 20 00 48 8b 75 a8 <4d> 8b 06 48 25 00 00 e0 ff 48 89 c1 48 8d 40 ff 48 39 f0 49 0f 43
<4>[ 2212.714569] RSP: 0018:ffffaca0c11cf6f0 EFLAGS: 00010286
dmesg-efi_pstore-173343389108002:
[ 464.025851] Oops: general protection fault, probably for non-canonical address 0xff738cab014e3770: 0000 [#1] PREEMPT SMP NOPTI
<4>[ 464.025950] CPU: 3 UID: 0 PID: 0 Comm: swapper/3 Tainted: P O 6.11.0-1-pve #1
<4>[ 464.026009] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
<4>[ 464.026046] Hardware name: Shuttle Inc. DL30N/DL30N, BIOS 1.05 07/18/2024
<4>[ 464.026100] RIP: 0010:kmem_cache_alloc_node_noprof+0xb7/0x340
<4>[ 464.026140] Code: 85 c0 0f 84 1c 02 00 00 41 83 fe ff 74 10 48 8b 00 48 c1 e8 36 41 39 c6 0f 85 06 02 00 00 41 8b 44 24 28 49 8b 34 24 48 01 f8 <48> 8b 18 48 89 c1 49 33 9c 24 b8 00 00 00 48 89 f8 48 0f c9 48 31
[31304.497712] BUG: Bad page map in process pve-firewall pte:840100012d0dd805 pmd:105e23067
<1>[31304.497794] addr:00007fffc5991000 vm_flags:00100173 anon_vma:ffff9ee38a716d68 mapping:0000000000000000 index:7ffffff53
<1>[31304.497877] file:(null) fault:0x0 mmap:0x0 read_folio:0x0
<4>[31304.497915] CPU: 2 UID: 0 PID: 905 Comm: pve-firewall Tainted: P O 6.11.0-1-pve #1
<4>[31304.497919] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
<4>[31304.497920] Hardware name: Shuttle Inc. DL30N/DL30N, BIOS 1.05 07/18/2024
<4>[31304.497921] Call Trace:
<4>[31304.497923] <TASK>
<4>[31304.497925] dump_stack_lvl+0x76/0xa0
<4>[31304.497930] dump_stack+0x10/0x20
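Since every crash looks different, it may help to reduce each trace to a few comparable fields (uptime, process, kernel version, faulting function). The following is only a minimal sketch of how that could be done; the function name `summarize_oops` and the regular expressions are my own assumptions based on the lines above, not part of any Proxmox or kernel tooling:

```python
import re

# Hypothetical patterns, written against the log lines posted above.
TS_RE = re.compile(r"\[\s*(\d+\.\d+)\]")                      # [ 2212.714484]
COMM_RE = re.compile(r"Comm: (\S+)")                          # Comm: pve-firewall
KERNEL_RE = re.compile(r"Tainted: .*?(\d+\.\d+\.\d+-\d+-pve)")  # 6.8.12-4-pve
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(text):
    """Extract the first timestamp, process name, kernel version and
    faulting function from one kernel oops; missing fields become None."""
    ts = TS_RE.search(text)
    comm = COMM_RE.search(text)
    kernel = KERNEL_RE.search(text)
    rip = RIP_RE.search(text)
    return {
        "uptime_s": float(ts.group(1)) if ts else None,
        "comm": comm.group(1) if comm else None,
        "kernel": kernel.group(1) if kernel else None,
        "rip": rip.group(1) if rip else None,
    }


if __name__ == "__main__":
    sample = (
        "<4>[ 2212.714484] CPU: 2 PID: 962 Comm: pve-firewall Tainted: P O 6.8.12-4-pve #1\n"
        "<4>[ 2212.714517] RIP: 0010:vmap_small_pages_range_noflush+0x260/0x530\n"
    )
    print(summarize_oops(sample))
```

Run over each saved pstore file, this would give one line per crash, which makes it easier to see that the faulting functions differ while the hardware stays the same.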
So it seems to be related to memory. I ran memtest86 for over 35 hrs without any errors, and a CPU stress test with mprime (while Proxmox was running), also without errors. I also ran memtester to test the RAM while the system was up, again without errors. However, after a first kernel error message (the system does not always crash immediately when an error appears), memtester failed once and so did mprime. memtester reported the following:
pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 10240MB (10737418240 bytes)
got  10240MB (10737418240 bytes), trying mlock ...locked.
Loop 1/1:
Stuck Address: testing 0FAILURE: possible bad address line at offset 0x000000020b8222c8.
Skipping to next test..
Random Value: ok
Compare XOR: ok
Compare SUB: ok
Compare MUL: ok
Compare DIV: ok
Compare OR: ok
Compare AND: ok
Sequential Increment: ok
Solid Bits: ok
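To put the failing offset into perspective, here is a small sketch of the arithmetic. It assumes that the offset memtester reports is a byte offset into its locked 10240MB test buffer (i.e. a virtual offset, not a physical address that could be mapped to a DIMM); the helper name `locate_offset` is hypothetical:

```python
def locate_offset(offset_hex, buffer_mb, page_size=4096):
    """Place a reported byte offset inside a test buffer of buffer_mb MiB.

    Assumption: the offset is relative to the start of the buffer, so it
    says nothing directly about the physical address of the bad cell.
    """
    offset = int(offset_hex, 16)
    buffer_bytes = buffer_mb * 1024 * 1024
    if not 0 <= offset < buffer_bytes:
        raise ValueError("offset lies outside the tested buffer")
    return {
        "offset_bytes": offset,
        "fraction_of_buffer": offset / buffer_bytes,
        "page_index": offset // page_size,   # which 4 KiB page of the buffer
        "offset_in_page": offset % page_size,
    }


if __name__ == "__main__":
    info = locate_offset("0x000000020b8222c8", 10240)
    print(info)
```

With the values from the output above, the failure sits roughly 82 % of the way into the buffer. If later failures were reported at unrelated offsets, that would be one more hint that this is not a single bad memory cell.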
So in the end, I can't figure out whether it is related to software (updating the kernel reduced the crash frequency), to RAM (memtest86 didn't show any errors), to the CPU (the stress test only fails after the system has been up for 10 hours) or to something else. I also doubt that it is a heat problem: after a crash, I switched the system off and booted it again, and it then ran for another 10 hrs.
I am using a SHUTTLE DL30N with an Intel N100 and the latest microcode installed, a dual Ethernet card (Intel), headless, with no VMs / LXC containers running.
I can of course provide other logs or the full logs; however, since they are different every time, I don't want to post too much unnecessary information.
Thank you for your help, best
Christian