Affected PVE kernels: any pve-enterprise / pve-no-subscription kernel since Nov 2023
I am struggling with a single LXC container that has been causing complete system freezes/crashes for 1.5 years now. It started on 2023-11-15, at that time running the 6.2.16-19-pve (pve-enterprise) kernel, and the same still happens today with the latest Proxmox VE kernel, 6.8.12-9-pve (pve-no-subscription).
It is always this same LXC container that triggers it, while 50+ other containers (all running the latest Debian Bookworm, all on the latest Proxmox VE with ZFS) have never caused similar issues. I have already moved this problematic container to 5 different physical servers, and it managed to crash them all. Sometimes the container runs for 5-10 days, sometimes it crashes the host multiple times a day.
In 80% of cases the host just hangs, and the only way to bring it back is a power cycle. In the other 20%, it reboots by itself right after crashing.
I check the logs after every crash and have never found anything suspicious before the crash (apart from long-running PHP-FPM processes, but those are spread throughout the whole day). After rebooting, syslog reports:
Code:
2025-04-03T10:29:13.430445+02:00 hn7 systemd-pstore[833]: PStore dmesg-erst-7489000355091447809.enc.z moved to /var/lib/systemd/pstore/dmesg-erst-7489000355091447809.enc.z
I don't know how to decode this file; my best guess at decompressing it is sketched after the log excerpt below. A previous, unencoded `dmesg-erst-7403549738764599297` (from an older kernel, in this case 6.8.12-1-pve) reported segfaults like these:
Code:
<6>[34266.440354] perf: interrupt took too long (6279 > 6273), lowering kernel.perf_event_max_sample_rate to 31000
<6>[36428.354318] connection[4951]: segfault at 20 ip 000078e4427c8cdc sp 000078e435ffd600 error 4 in libc.so.6[78e442757000+155000] likely on CPU 7 (core 1, socket 1)
<6>[36428.354337] Code: c0 04 0f 85 98 03 00 00 4c 39 c0 72 e9 44 89 6c 24 18 74 61 48 8b 46 28 66 48 0f 6e c6 66 48 0f 6e d0 66 0f 6c c2 0f 11 42 20 <48> 39 70 20 0f 85 be 01 00 00 48 89 56 28 48 8b 42 28 49 89 f0 48
<6>[36428.354430] connection[2915]: segfault at 0 ip 000057b03063c99f sp 000078e4379788a0 error 4 in mysqld[57b02efaf000+1dc0000] likely on CPU 12 (core 0, socket 0)
<6>[36428.354457] Code: 8d 05 75 fa ed 01 48 0f af d1 48 29 d6 48 8b 10 31 c0 48 39 ce 0f 93 c0 48 0f af c1 48 29 c6 48 8d 04 76 48 c1 e0 06 48 03 02 <8b> 00 48 83 bd a0 f2 ff ff 02 77 0b 3d 00 00 00 10 0f 8f 62 21 00
<6>[36428.367509] php-fpm8.2[514910]: segfault at 0 ip 000060b28a765455 sp 00007ffd3037e800 error 4 in php-fpm8.2[60b28a544000+309000] likely on CPU 11 (core 5, socket 1)
<6>[36428.367522] Code: 31 c0 c7 47 10 ff ff ff ff f3 0f 6f 07 48 c7 47 18 00 00 00 00 48 8d 3d d9 5b 27 00 0f 29 04 24 0f 29 4c 24 10 e8 8b ef ff ff <48> 8b 00 48 8b 00 48 85 c0 74 05 48 89 e7 ff d0 48 8b 44 24 28 64
<4>[36428.367811] slab proc_inode_cache start ffff9d8c06a21e98
<4>[36428.367815] slab proc_inode_cache
<4>[36428.367816] pointer offset 384
<4>[36428.367819] size 704
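For what it's worth, my understanding is that the `.enc.z` suffix means the kernel's pstore compressed the record before systemd-pstore moved it, with deflate being the usual default. So something like the following might decompress it; this is only a guess based on the pstore documentation, not something confirmed to work here (the file name is the one from my syslog above):
Bash:
# Guess: pstore's default compression is deflate, so try a raw deflate
# stream first, then a zlib-wrapped one, then gzip/zlib auto-detection.
python3 - /var/lib/systemd/pstore/dmesg-erst-7489000355091447809.enc.z <<'EOF'
import sys, zlib
data = open(sys.argv[1], 'rb').read()
for wbits in (-15, 15, 47):  # raw deflate, zlib-wrapped, auto-detect
    try:
        sys.stdout.buffer.write(zlib.decompress(data, wbits))
        break
    except zlib.error:
        continue
else:
    sys.exit('could not decompress with any known wbits value')
EOF
If that only produces garbage, the record presumably uses a different pstore compression backend, and I would appreciate a pointer to the right way to decode these files.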
During today's crash, I was logged in on the (physical) host while it happened, and these messages popped up:
Code:
Message from syslogd@hn7 at Apr 3 10:26:13 ...
kernel:[13981.116924] usercopy: Kernel memory overwrite attempt detected to vmalloc 'no area' (offset 0, size 4096)!
Message from syslogd@hn7 at Apr 3 10:26:13 ...
kernel:[13981.116924] usercopy: Kernel memory overwrite attempt detected to vmalloc 'no area' (offset 0, size 4096)!
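Since the host usually hangs before anything more reaches the disk, I am considering streaming the kernel log to a second machine with netconsole so the final messages are not lost. A minimal sketch of what I have in mind (the interface name, IPs and MAC address below are placeholders for my network, not values from the logs above):
Bash:
# On the crashing host: send kernel messages over UDP to a log receiver.
# Format: <src-port>@<src-ip>/<interface>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@192.0.2.7/eno1,6666@192.0.2.99/00:11:22:33:44:55
# On the receiving machine: keep listening for the UDP stream.
nc -klu 6666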
LXC configuration:
Bash:
hn7$ pct config 172
arch: amd64
cpulimit: 16
cpuunits: 256
features: nesting=1
hostname: s002.example.com
memory: 49152
net0: name=eth0,bridge=vmbr0,gw=x.x.x.x,hwaddr=XX:XX:XX:XX:XX:6B,ip=x.x.x.x/25,type=veth
onboot: 1
ostype: debian
rootfs: zfsvols:subvol-172-disk-1,acl=1
swap: 1024
I have never found any OOM kill reported in syslog. In the current setup, this LXC container is the only one running on the host node, which has 128GB of memory, so the 48GB assigned to the container should not be an issue.
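To rule out memory pressure more directly, this is how I would read the container's cgroup counters on a cgroup-v2 host (the path is an assumption based on how Proxmox names LXC cgroups; 172 is the container ID from the config above):
Bash:
# Non-zero "max"/"oom"/"oom_kill" counters would point to memory pressure
# inside the container's cgroup, even without a host-level OOM in syslog.
cat /sys/fs/cgroup/lxc/172/memory.events
cat /sys/fs/cgroup/lxc/172/memory.max
The `oom` and `oom_kill` counters in `memory.events` should stay at 0 if the 48GB limit is never actually hit.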
Any help greatly appreciated!