Proxmox freezes with Kernel 6.11

cschneider

New Member
Dec 13, 2024
Dear all,

I have been fighting with Proxmox freezing again and again for months. I now think it could be a hardware fault, but I can't figure it out and would like an opinion, or help narrowing down the problem.

I started with Proxmox kernel 6.8.12, but quickly applied the workaround of installing a newer Ubuntu kernel (as proposed here: https://forum.proxmox.com/threads/proxmox-kernel-6-8-12-2-freezes-again.154875/page-5). I also installed intel-microcode; both measures increased the time between crashes. While at the beginning the machine crashed every 30 minutes to 2 hours, I afterwards reached uptimes of around 10 hours.

I have now installed the Proxmox 6.11 kernel (fresh install from ISO plus apt update), which also crashes, with different messages each time, so I cannot narrow the problem down. Some examples of recent crashes:

<1>[ 2212.714426] BUG: unable to handle page fault for address: ffffa0a1bffff2b8
<1>[ 2212.714452] #PF: supervisor read access in kernel mode
<1>[ 2212.714460] #PF: error_code(0x0000) - not-present page
<6>[ 2212.714468] PGD 3e7c01067 P4D 3e7c01067 PUD 0
<4>[ 2212.714476] Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[ 2212.714484] CPU: 2 PID: 962 Comm: pve-firewall Tainted: P O 6.8.12-4-pve #1
<4>[ 2212.714503] Hardware name: Shuttle Inc. DL30N/DL30N, BIOS 1.05 07/18/2024
<4>[ 2212.714517] RIP: 0010:vmap_small_pages_range_noflush+0x260/0x530
<4>[ 2212.714533] Code: 0f 00 00 48 01 d0 49 89 c6 0f 84 cf 01 00 00 49 8d 44 24 ff 48 89 5d 90 4c 89 eb 48 89 45 a8 49 8d 87 00 00 20 00 48 8b 75 a8 <4d> 8b 06 48 25 00 00 e0 ff 48 89 c1 48 8d 40 ff 48 39 f0 49 0f 43
<4>[ 2212.714569] RSP: 0018:ffffaca0c11cf6f0 EFLAGS: 00010286
(from dmesg-efi_pstore-173343389108002)
or

<4>[ 464.025851] Oops: general protection fault, probably for non-canonical address 0xff738cab014e3770: 0000 [#1] PREEMPT SMP NOPTI
<4>[ 464.025950] CPU: 3 UID: 0 PID: 0 Comm: swapper/3 Tainted: P O 6.11.0-1-pve #1
<4>[ 464.026009] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
<4>[ 464.026046] Hardware name: Shuttle Inc. DL30N/DL30N, BIOS 1.05 07/18/2024
<4>[ 464.026100] RIP: 0010:kmem_cache_alloc_node_noprof+0xb7/0x340
<4>[ 464.026140] Code: 85 c0 0f 84 1c 02 00 00 41 83 fe ff 74 10 48 8b 00 48 c1 e8 36 41 39 c6 0f 85 06 02 00 00 41 8b 44 24 28 49 8b 34 24 48 01 f8 <48> 8b 18 48 89 c1 49 33 9c 24 b8 00 00 00 48 89 f8 48 0f c9 48 31
or
<1>[31304.497712] BUG: Bad page map in process pve-firewall pte:840100012d0dd805 pmd:105e23067
<1>[31304.497794] addr:00007fffc5991000 vm_flags:00100173 anon_vma:ffff9ee38a716d68 mapping:0000000000000000 index:7ffffff53
<1>[31304.497877] file:(null) fault:0x0 mmap:0x0 read_folio:0x0
<4>[31304.497915] CPU: 2 UID: 0 PID: 905 Comm: pve-firewall Tainted: P O 6.11.0-1-pve #1
<4>[31304.497919] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
<4>[31304.497920] Hardware name: Shuttle Inc. DL30N/DL30N, BIOS 1.05 07/18/2024
<4>[31304.497921] Call Trace:
<4>[31304.497923] <TASK>
<4>[31304.497925] dump_stack_lvl+0x76/0xa0
<4>[31304.497930] dump_stack+0x10/0x20
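
(Side note: the dmesg-efi_pstore-... name above means these oopses were recovered from the EFI pstore. A minimal sketch of how such preserved dumps can be collected after a crash and reboot; note that systemd-pstore may already have archived them to /var/lib/systemd/pstore:)

# crash dumps preserved by efi_pstore show up here after the next boot
ls /sys/fs/pstore/
mkdir -p /root/crashlogs
cp /sys/fs/pstore/dmesg-efi_pstore-* /root/crashlogs/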

So in the end it seems to be related to memory. I ran memtest86 for over 35 hours with no errors, ran a CPU stress test with mprime (while Proxmox was running) with no errors, and ran memtester to test the RAM while the system was up, also without errors. However, after a first kernel error message (the system does not always crash immediately when an error appears), memtester did fail once, and so did mprime; memtester reported the following:

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 10240MB (10737418240 bytes)
got  10240MB (10737418240 bytes), trying mlock ...locked.
Loop 1/1:
  Stuck Address       : testing   0FAILURE: possible bad address line at offset 0x000000020b8222c8.
Skipping to next test...
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
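
(For reference, a run like the one above can be reproduced with an invocation along these lines; the size and single loop match the output shown, and root is needed to mlock that much memory:)

# test 10240 MB of RAM for one loop on the live system
memtester 10240M 1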

So in the end, I can't figure out whether it is related to software (updating the kernel reduced the crash rate), to RAM (memtest86 didn't show any error), to the CPU (the stress test only fails after the system has been up for about 10 hours), or to something else. I also doubt it is a heat problem: after a crash I switched the system off and booted it again, and it then ran for another 10 hours...


I am using a Shuttle DL30N with an Intel N100 and the latest microcode installed, plus a dual-port Intel Ethernet card; the box runs headless with no VMs or LXC containers running.
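
(To confirm the microcode update actually took effect, the loaded revision can be checked like this; both are standard ways to read it:)

# show the microcode revision the kernel loaded at boot
dmesg | grep -i microcode
grep -m1 microcode /proc/cpuinfo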

I can of course provide other or full logs; however, as the errors are always different, I don't want to post too much unnecessary information.

Thank you for your help, best
Christian
 
I had constant hangs on kernel 6.8.12-4-pve.
I use a mini PC with an Intel(R) Celeron(R) J4125 CPU @ 2.00GHz,
8 GB RAM,
and ZFS.
I was afraid I had a memory issue like you, but I checked the RAM thoroughly and it was good.
Today I updated to 6.8.12-5-pve and so far it looks cured.
 
Yes, I also tried different kernels. The newest Ubuntu kernels lowered the freeze rate, but in the end nothing really helped. I am now using kernel 6.11.0-2 and am still getting crashes. I am currently installing plain Debian to see whether the problem is related to the Proxmox/Ubuntu kernel, or whether I also get crashes with Debian... let's see...
 
I know you want to rule out thermals (and believe you already have!), but that is an obvious suspect in your setup (a fanless mini PC).
Try placing a desk fan pointing at the top of the unit, or even remove the cover and point the fan at the internals themselves. See how long you can get it to run before a crash. If you're in a cold-winter climate, you can even try running the unit outside.
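
(While you run that experiment, it may help to watch the temperatures; a quick sketch, assuming the lm-sensors Debian package is available on the host:)

apt install lm-sensors   # provides the sensors tool
sensors-detect --auto    # probe for supported sensor chips
watch -n 5 sensors       # refresh temperature readings every 5 seconds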

The next thing I'd look at is the PSU, maybe try changing it out & see if that makes a difference.

After that try swapping out the memory - you may make an interesting discovery.
 
Similar unexplained freezes here since a few weeks ago; before that, the system was running well without any issue. Very cumbersome.

PVE 8.x with ZFS

Linux pve 6.11.11-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.11.11-1 (2025-03-16T18:52Z) x86_64 GNU/Linux
 
Similar unexplained freezes here since a few weeks ago; before that, the system was running well without any issue. Very cumbersome.

PVE 8.x with ZFS

Linux pve 6.11.11-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.11.11-1 (2025-03-16T18:52Z) x86_64 GNU/Linux

Anything in the system logs in your case? How do the freezes present themselves?

And 6.11 is a bit outdated, you might want to check the newer 6.14 opt-in kernel, if you do not want to stick with the 6.8 default kernel:
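
(A minimal sketch of opting in, assuming the meta package follows the same naming scheme as the proxmox-kernel-6.11 package mentioned later in this thread:)

apt update
apt install proxmox-kernel-6.14   # opt-in meta package, name assumed per the 6.11 pattern
reboot
uname -r                          # afterwards, verify the new kernel is running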
 
Anything in the system logs in your case? How do the freezes present themselves?

And 6.11 is a bit outdated, you might want to check the newer 6.14 opt-in kernel, if you do not want to stick with the 6.8 default kernel:
Thanks. I'd rather stick with something that 'just works'.

These freezes are a real annoyance with Proxmox. They reappear over time, with just enough of a gap in between to 'almost forget' about them. To be honest, they are a cause for paranoia.

I've considered switching to another hypervisor because of them, since at no time is any discernible cause found anywhere in the logs.

To ensure I catch as much as possible, I've now enabled these audit.rules instead of the defaults; it may be worth considering for Proxmox to bundle these or ship them as a default. I used to work on such a set of audit rules myself, but I simply lack the time and resources (no income) to spend real time on refining them.

https://github.com/Neo23x0/auditd
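
(For anyone wanting to try the same, a minimal sketch of deploying that rule set; the file path inside the repo is assumed from its layout, so adapt before use:)

# back up the current rules, fetch the set, then compile and load it
cp /etc/audit/rules.d/audit.rules /etc/audit/rules.d/audit.rules.bak
wget -O /etc/audit/rules.d/audit.rules \
    https://raw.githubusercontent.com/Neo23x0/auditd/master/audit.rules
augenrules --load   # load all rules from /etc/audit/rules.d
auditctl -s         # verify auditd status and that rules are active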
 
@t.lamprecht, this is unlikely to be related, but I've never seen this before:
cd /var/log/audit

ls -l
total 2771
-rw-r----- 1 root adm 222805 May 25 02:09 audit.log
-r--r----- 1 root adm 8388822 May 25 02:08 audit.log.1
-r--r----- 1 root adm 8388688 May 25 01:46 audit.log.2
-r--r----- 1 root adm 8388709 May 25 01:20 audit.log.3
-r--r----- 1 root adm 8389023 May 25 01:12 audit.log.4

ls -lh
total 2.8M
-rw-r----- 1 root adm 916K May 25 02:11 audit.log
-r--r----- 1 root adm 8.1M May 25 02:08 audit.log.1
-r--r----- 1 root adm 8.1M May 25 01:46 audit.log.2
-r--r----- 1 root adm 8.1M May 25 01:20 audit.log.3
-r--r----- 1 root adm 8.1M May 25 01:12 audit.log.4

du -sh .
2.8M .

du -sh *
609K audit.log
593K audit.log.1
737K audit.log.2
781K audit.log.3
781K audit.log.4
 
Thanks. I'd rather stick with something that 'just works'.
Why post here in the opt-in 6.11 kernel thread? Is the default 6.8-based kernel not working for your setup or hardware?

And you have not yet answered how exactly your freezes manifest themselves and at what rough frequency, e.g. monthly?

To ensure I catch as much as possible, I've now enabled these audit.rules instead of the defaults; it may be worth considering for Proxmox to bundle these or ship them as a default. I used to work on such a set of audit rules myself, but I simply lack the time and resources (no income) to spend real time on refining them.

https://github.com/Neo23x0/auditd
Not sure how much some general audit access rules, a good amount of which (from a quick look) do not apply to a PVE system, will help with diagnostics here; but if the issue is triggered by a program that only runs semi-periodically, they might at least not hurt when trying to pinpoint it.

If you really want to diagnose this and get help, you either need to open a new thread and post many more specific details about your setup (hardware used, cluster setup, basic checks done such as an upgraded system BIOS/firmware, ...) and the issues you see, or you can use our enterprise support, which can assist more proactively and, depending on the level, also gather the relevant information themselves. Either can be fine, but as it stands, I do not see a likelihood that we make progress here.
 
@t.lamprecht, this is unlikely to be related, but I've never seen this before:
cd /var/log/audit

ls -l
total 2771
-rw-r----- 1 root adm 222805 May 25 02:09 audit.log
-r--r----- 1 root adm 8388822 May 25 02:08 audit.log.1
-r--r----- 1 root adm 8388688 May 25 01:46 audit.log.2
-r--r----- 1 root adm 8388709 May 25 01:20 audit.log.3
-r--r----- 1 root adm 8389023 May 25 01:12 audit.log.4

ls -lh
total 2.8M
-rw-r----- 1 root adm 916K May 25 02:11 audit.log
-r--r----- 1 root adm 8.1M May 25 02:08 audit.log.1
-r--r----- 1 root adm 8.1M May 25 01:46 audit.log.2
-r--r----- 1 root adm 8.1M May 25 01:20 audit.log.3
-r--r----- 1 root adm 8.1M May 25 01:12 audit.log.4

du -sh .
2.8M .

du -sh *
609K audit.log
593K audit.log.1
737K audit.log.2
781K audit.log.3
781K audit.log.4
This is fine: ls reports the file size, while du reports the actual disk usage. As ZFS compresses data by default when sensible, the space actually used on disk can be far lower than the file's own size, and text files, log files especially, often compress very well. If you want du to report apparent sizes instead of usage on the block devices, pass it the --apparent-size flag.
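
(For example; the dataset name below is just the PVE default layout, adjust to the actual pool/dataset:)

du -sh --apparent-size /var/log/audit/*   # report file sizes rather than blocks used
zfs get compressratio rpool/ROOT/pve-1    # show how well the dataset compresses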
 
Why post here in the opt-in 6.11 kernel thread? Is the default 6.8-based kernel not working for your setup or hardware?

And you have not yet answered how exactly your freezes manifest themselves and at what rough frequency, e.g. monthly?


Not sure how much some general audit access rules, a good amount of which (from a quick look) do not apply to a PVE system, will help with diagnostics here; but if the issue is triggered by a program that only runs semi-periodically, they might at least not hurt when trying to pinpoint it.

If you really want to diagnose this and get help, you either need to open a new thread and post many more specific details about your setup (hardware used, cluster setup, basic checks done such as an upgraded system BIOS/firmware, ...) and the issues you see, or you can use our enterprise support, which can assist more proactively and, depending on the level, also gather the relevant information themselves. Either can be fine, but as it stands, I do not see a likelihood that we make progress here.

Hey Thomas

My impression is that the freezes happen about every three weeks.

I'm setting up Grafana and other tools to get a better view. I'll update this thread afterwards.

Best

JL
 
I didn't choose to install 6.11; that one booted after an upgrade + reboot.
That cannot really happen, as that kernel is fully opt-in, meaning an admin with access to the machine needs to have used an invocation like apt install proxmox-kernel-6.11 or the like at least once manually to pull that kernel in.

Depending on when that happened it might still show up in the apt logs when using something like zstdgrep --color proxmox-kernel-6.11 /var/log/apt/history.log* - the earliest invocation is the interesting one, further updates will then indeed get pulled in automatically after the manual installation of that meta package.
 
That cannot really happen, as that kernel is fully opt-in, meaning an admin with access to the machine needs to have used an invocation like apt install proxmox-kernel-6.11 or the like at least once manually to pull that kernel in.

Depending on when that happened it might still show up in the apt logs when using something like zstdgrep --color proxmox-kernel-6.11 /var/log/apt/history.log* - the earliest invocation is the interesting one, further updates will then indeed get pulled in automatically after the manual installation of that meta package.
I'll have to check myself then; I don't remember doing this explicitly.
Well, guilty.

The system hung again this morning. I've now switched back to the 6.8 kernel, assuming its version will continue to be updated automatically over time.
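
(A minimal sketch of making sure 6.8 stays the boot default; the exact version string is whatever proxmox-boot-tool kernel list reports on the system:)

proxmox-boot-tool kernel list               # show installed and pinned kernels
proxmox-boot-tool kernel pin 6.8.12-5-pve   # pin a specific 6.8 kernel (version assumed)
# later, to go back to booting the newest installed kernel:
proxmox-boot-tool kernel unpin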
 