[SOLVED] Proxmox 5.1 periodic kernel segfault every <1min

Mr Pumo

New Member
Oct 4, 2017
Hi, strange behaviour on my installation: an HP ProLiant MicroServer Gen8 with Proxmox 5.1 (updated to pve-test).
There are many segfault errors in syslog, like:

Dec 20 10:24:55 pve kernel: [100995.340203] ml4[62621]: segfault at 10 ip 00007fc3882f6410 sp 00007fc376c6ac80 error 4 in libc-2.19.so[7fc38824a000+1a1000]
Dec 20 10:25:59 pve kernel: [101059.247552] ml2[63212]: segfault at 10 ip 00007fe43bce6410 sp 00007fe42ae5bc80 error 4 in libc-2.19.so[7fe43bc3a000+1a1000]
Dec 20 10:26:31 pve kernel: [101091.247526] ml4[63456]: segfault at 10 ip 00007faa89487410 sp 00007faa775fac80 error 4 in libc-2.19.so[7faa893db000+1a1000]
Dec 20 10:27:03 pve kernel: [101123.243031] ml2[63799]: segfault at 10 ip 00007f7378434410 sp 00007f73675a9c80 error 4 in libc-2.19.so[7f7378388000+1a1000]
Dec 20 10:27:35 pve kernel: [101155.244188] ml2[64051]: segfault at 10 ip 00007f86abbb5410 sp 00007f869ad2ac80 error 4 in libc-2.19.so[7f86abb09000+1a1000]
Dec 20 10:28:07 pve kernel: [101187.250438] ml2[64288]: segfault at 10 ip 00007f53cfd2d410 sp 00007f53beea2c80 error 4 in libc-2.19.so[7f53cfc81000+1a1000]
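
(Side note: subtracting the reported libc base from the faulting ip gives the same offset in every line, i.e. it is the same instruction each time. A quick check with plain bash arithmetic on the first and last lines above:)

Code:
# offset of the faulting ip inside libc-2.19.so
printf '%#x\n' $(( 0x7fc3882f6410 - 0x7fc38824a000 ))   # first line -> 0xac410
printf '%#x\n' $(( 0x7f53cfd2d410 - 0x7f53cfc81000 ))   # last line  -> 0xac410

(With debug symbols available, something like addr2line -f -e .../libc-2.19.so 0xac410 should name the crashing function.)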

Sometimes I lose control of the node (SSH stops responding, the web UI can't issue commands) and I need to hard-reboot it.

1) Any idea about the cause? A faulty memory module? (I've noticed the error is always at the same relative position: the ip always ends in ...410.)

2) Any suggestions on how to debug this?

Thx a lot


Code:
pveversion -v
proxmox-ve: 5.1-31 (running kernel: 4.13.13-1-pve)
pve-manager: 5.1-40 (running version: 5.1-40/ea05b379)
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.13.13-1-pve: 4.13.13-31
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
 
The libc version looks like it's from Jessie; see this Debian Jessie bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=801638
On my PVE 5.1 system, I have 2.24:
Code:
ii  libc6:amd64        2.24-11+deb9u1        amd64        GNU C Library: Shared libraries

Please check whether you are using the right libc version; maybe it was installed along with some software and is a leftover. Or you have some old repository included that pulls it in.
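
A quick way to check is something like this (a sketch; package names and versions may differ on your system):

Code:
# list installed glibc packages and see where apt would pull libc6 from
dpkg -l 'libc6*'
apt-cache policy libc6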
 
Well... the Proxmox host system's shared libc is the same:

Code:
ii  libc6:amd64                          2.24-11+deb9u1                 amd64        GNU C Library: Shared libraries
ii  libc6-dev:amd64                      2.24-11+deb9u1                 amd64        GNU C Library: Development Libraries and Header Files

I found that the Debian 8 and Ubuntu 14.04 LXC templates built by "dab" use libc 2.19:

Code:
/var/lib/vz/template/builds/debian-8.0-minimal-64/rootfs/lib/x86_64-linux-gnu/libc-2.19.so
/var/lib/vz/template/builds/debian-8.0-minimal/rootfs/lib/i386-linux-gnu/libc-2.19.so
/var/lib/vz/template/builds/ubuntu-14.04-trusty-minimal-64/rootfs/lib/x86_64-linux-gnu/libc-2.19.so
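
(Located with a filesystem search along these lines, assuming the default /var/lib/vz layout:)

Code:
# look for the old libc build under templates and container roots
find /var/lib/vz -name 'libc-2.19.so' 2>/dev/null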

So you suggest investigating all the LXC containers based on these templates?
 
Yes, I would guess the containers, or a program inside them, might be named ml2/ml4.
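
If an instance of the process is still running, one way to map it to its container is via its cgroup (a sketch; on PVE 5.x the cgroup paths contain the container's vmid):

Code:
# find a running instance of the suspect process...
pgrep -a ml2
# ...and check which container cgroup it belongs to (look for .../lxc/<vmid>)
cat /proc/"$(pgrep -n ml2)"/cgroup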
 
SOLVED.
I monitored the errors in the log while stopping each LXC container in turn.
The problem turned out to be a motion daemon running in one container (I think the ml2/ml4 processes belong to it).
I updated the "motion" code to the latest version on git, recompiled... and no more segfaults.

I was confused by the error starting with "pve kernel", thinking it had to be something running at kernel level.
Instead it was something running at user level inside one container.

Thx a lot,
 
That just means that on a host called "pve", the kernel logged the message that follows. Since the kernel is responsible for making sure a process does not access memory it isn't supposed to, it logs when a process performs a faulty/invalid access. If it were the kernel itself attempting such an access, the message would look very different (and could potentially take down the whole host!).
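
As a side note, the "error 4" field in those lines is the x86 page-fault error code; its low bits can be decoded like this (a small bash sketch):

Code:
err=4   # value taken from the log lines above
(( err & 1 )) && echo "protection violation" || echo "page not present"
(( err & 2 )) && echo "write access"         || echo "read access"
(( err & 4 )) && echo "user-mode access"     || echo "kernel-mode access"
# error 4 -> a user-mode read of an unmapped page ("segfault at 10" is a
# near-NULL pointer), i.e. a bug in a userspace process, not in the kernel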
 
