[SOLVED] Proxmox 5.1 periodic kernel segfault every <1min

Mr Pumo

New Member
Oct 4, 2017
Hi, strange behaviour on my installation: an HP ProLiant MicroServer Gen8 with Proxmox 5.1 (updated to pve-test).
There are many segfault errors in syslog, like:

Dec 20 10:24:55 pve kernel: [100995.340203] ml4[62621]: segfault at 10 ip 00007fc3882f6410 sp 00007fc376c6ac80 error 4 in libc-2.19.so[7fc38824a000+1a1000]
Dec 20 10:25:59 pve kernel: [101059.247552] ml2[63212]: segfault at 10 ip 00007fe43bce6410 sp 00007fe42ae5bc80 error 4 in libc-2.19.so[7fe43bc3a000+1a1000]
Dec 20 10:26:31 pve kernel: [101091.247526] ml4[63456]: segfault at 10 ip 00007faa89487410 sp 00007faa775fac80 error 4 in libc-2.19.so[7faa893db000+1a1000]
Dec 20 10:27:03 pve kernel: [101123.243031] ml2[63799]: segfault at 10 ip 00007f7378434410 sp 00007f73675a9c80 error 4 in libc-2.19.so[7f7378388000+1a1000]
Dec 20 10:27:35 pve kernel: [101155.244188] ml2[64051]: segfault at 10 ip 00007f86abbb5410 sp 00007f869ad2ac80 error 4 in libc-2.19.so[7f86abb09000+1a1000]
Dec 20 10:28:07 pve kernel: [101187.250438] ml2[64288]: segfault at 10 ip 00007f53cfd2d410 sp 00007f53beea2c80 error 4 in libc-2.19.so[7f53cfc81000+1a1000]
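
(Side note: subtracting the reported libc base from the faulting ip gives the same offset in every line, i.e. it is the same instruction each time. A quick check with plain bash arithmetic on the first and last lines above:)

Code:
# offset of the faulting ip inside libc-2.19.so
printf '%#x\n' $(( 0x7fc3882f6410 - 0x7fc38824a000 ))   # first line -> 0xac410
printf '%#x\n' $(( 0x7f53cfd2d410 - 0x7f53cfc81000 ))   # last line  -> 0xac410

(With debug symbols available, something like addr2line -f -e .../libc-2.19.so 0xac410 should name the crashing function.)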

Sometimes I lose control of the node (SSH stops responding, the web UI can't issue commands) and I need to hard-reboot it.

1) Any idea about the cause? A faulty memory module? (I've noticed the error is always at the same relative position: the ip always ends in ...410.)

2) Any suggestions on how to debug this?

Thx a lot


Code:
pveversion -v
proxmox-ve: 5.1-31 (running kernel: 4.13.13-1-pve)
pve-manager: 5.1-40 (running version: 5.1-40/ea05b379)
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.13.13-1-pve: 4.13.13-31
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
 
The libc version looks like it's from Jessie; see this Debian Jessie bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=801638
On my PVE 5.1 system, I have 2.24:
Code:
ii  libc6:amd64        2.24-11+deb9u1        amd64        GNU C Library: Shared libraries

Please check whether you are using the right libc version; maybe it was installed along with some software and is a leftover. Or you have some old repository included that pulls it in.
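
A quick way to check is something like this (a sketch; package names and versions may differ on your system):

Code:
# list installed glibc packages and see where apt would pull libc6 from
dpkg -l 'libc6*'
apt-cache policy libc6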
 
Well... the Proxmox host system's shared libc is the same:

Code:
ii  libc6:amd64                          2.24-11+deb9u1                 amd64        GNU C Library: Shared libraries
ii  libc6-dev:amd64                      2.24-11+deb9u1                 amd64        GNU C Library: Development Libraries and Header Files

I found that the Debian 8 and Ubuntu 14.04 LXC templates built by "dab" use libc 2.19:

Code:
/var/lib/vz/template/builds/debian-8.0-minimal-64/rootfs/lib/x86_64-linux-gnu/libc-2.19.so
/var/lib/vz/template/builds/debian-8.0-minimal/rootfs/lib/i386-linux-gnu/libc-2.19.so
/var/lib/vz/template/builds/ubuntu-14.04-trusty-minimal-64/rootfs/lib/x86_64-linux-gnu/libc-2.19.so
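
(Located with a filesystem search along these lines, assuming the default /var/lib/vz layout:)

Code:
# look for the old libc build under templates and container roots
find /var/lib/vz -name 'libc-2.19.so' 2>/dev/null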

So you suggest investigating all the LXC containers based on these templates?
 
Yes, I would guess the containers, or a program inside them, might be named ml2/ml4.
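
If an instance of the process is still running, one way to map it to its container is via its cgroup (a sketch; on PVE 5.x the cgroup paths contain the container's vmid):

Code:
# find a running instance of the suspect process...
pgrep -a ml2
# ...and check which container cgroup it belongs to (look for .../lxc/<vmid>)
cat /proc/"$(pgrep -n ml2)"/cgroup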
 
SOLVED.
I monitored the errors in the log while stopping each LXC container in turn.
The problem turned out to be a motion daemon running in one container (I think the ml2/ml4 processes belong to it).
I updated the "motion" code to the latest version on git, recompiled... and no more segfaults.

I was confused by the error starting with "pve kernel", thinking it had to be something running at kernel level.
Instead it was something running at user level inside one container.

Thx a lot,
 
That just means that on a host called "pve", the kernel logged the message that follows. Since the kernel is responsible for making sure a process does not access memory it isn't supposed to, it logs when a process performs a faulty/invalid access. If it were the kernel itself attempting such an access, the message would look very different (and could potentially take down the whole host!).
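
As a side note, the "error 4" field in those lines is the x86 page-fault error code; its low bits can be decoded like this (a small bash sketch):

Code:
err=4   # value taken from the log lines above
(( err & 1 )) && echo "protection violation" || echo "page not present"
(( err & 2 )) && echo "write access"         || echo "read access"
(( err & 4 )) && echo "user-mode access"     || echo "kernel-mode access"
# error 4 -> a user-mode read of an unmapped page ("segfault at 10" is a
# near-NULL pointer), i.e. a bug in a userspace process, not in the kernel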
 
