KVM Segfault latest community repo on libpthread

robynhub

Hello,

Lately we've been experiencing many segfaults across multiple physical machines (Dell R620) that sporadically cause some VMs to crash:

[433025.858682] kvm[3158]: segfault at 18 ip 00007feee18b8c70 sp 00007feece5e3e38 error 6 in libpthread-2.24.so[7feee18ab000+18000]

We're using the latest enterprise repo (community license) and the latest kernel. Every update has been installed:

# pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.13.16-2-pve)
pve-manager: 5.1-51 (running version: 5.1-51/96be5354)
pve-kernel-4.13: 5.1-44
pve-kernel-4.13.16-2-pve: 4.13.16-47
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.2-pve4
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-18
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-15
pve-cluster: 5.0-25
pve-container: 2.0-22
pve-docs: 5.1-17
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-2
qemu-server: 5.0-25
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9

# uname -v
#1 SMP PVE 4.13.16-47 (Mon, 9 Apr 2018 09:58:12 +0200)

Is anyone having the same issue? Are there any hints on how to get this resolved?

By the way, we're sure it isn't hardware related. The machines had their memory tested for 48 hours without any issues, and the Dell self-diagnostic tool also ran without errors.
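For reference, the faulting offset inside libpthread can be mapped to the nearest symbol with gdb. This is only a sketch, assuming the usual Debian Stretch library path; the offset is the instruction pointer minus the mapping base from the log line above (0x7feee18b8c70 - 0x7feee18ab000 = 0xdc70):

# gdb -batch -ex 'info symbol 0xdc70' /lib/x86_64-linux-gnu/libpthread-2.24.so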

Thank you in advance
 
Check all the involved files for corruption (the debsums package might be of help). If that turns out okay, please attempt to get a backtrace of the crashed kvm process after installing the pve-qemu-kvm-dbg package (e.g. with systemd-coredump).
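For example, a minimal sketch of that check (the package names are the ones mentioned above):

# apt update
# apt install debsums
# debsums -s                        (silent mode: only files whose checksum does not match are printed)
# debsums -s libc6 pve-qemu-kvm     (or limit the check to the packages involved in the crash)
# apt install pve-qemu-kvm-dbg systemd-coredump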
 
Hello fabian,

Thank you for the answer. debsums didn't find any errors or MD5 mismatches. I've installed the pve-qemu-kvm-dbg and systemd-coredump packages. Do you have a guide or documentation on how to use them?

Thank you in advance.
 
The systemd-coredump package ships a man page that tells you how to configure your system to use it. Depending on the amount of RAM your VM has, you probably need to bump the coredump file size limit as well. Then wait for a segfault, and check "coredumpctl list" for a new entry.
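As a rough example of what that configuration could look like (a sketch only; the 16G limits are an assumption sized for an 8 GB VM, see coredump.conf(5) for the details):

/etc/systemd/coredump.conf:

[Coredump]
Storage=external
Compress=yes
ProcessSizeMax=16G
ExternalSizeMax=16G

Then, after the next crash:

# coredumpctl list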
 
Here is the coredumpctl info output:

# coredumpctl info kvm

PID: 25687 (kvm)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Tue 2018-05-15 12:56:12 CEST (17min ago)
Command Line: /usr/bin/kvm -id 134 -name TestVM -chardev socket,id=qmp,path=/var/run/qemu-server/134.qmp,server,nowait -mon chardev=qmp,mode=control -pidfile /var/run/qemu-server/134.pid -daemonize -smp 2,sockets=1,cores=2,maxcpus=2 -nodefaults -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg -vga std -vnc unix:/var/run/qemu-server/134.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 8192 -device pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e -device piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=tablet,bus=uhci.0,port=1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -iscsi initiator-name=iqn.1993-08.org.debian:01:c7e46effc298 -drive if=none,id=drive-ide2,media=cdrom,aio=threads -device ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=100 -drive file=gluster://172.20.101.2/GoldStorage/images/134/vm-134-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on -device virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=200 -drive file=gluster://172.20.101.2/GoldStorage/images/134/vm-134-disk-2.qcow2,if=none,id=drive-virtio1,format=qcow2,cache=none,aio=native,detect-zeroes=on -device virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb -drive file=gluster://172.20.101.2/GoldStorage/images/134/vm-134-disk-3.qcow2,if=none,id=drive-virtio2,format=qcow2,cache=none,aio=native,detect-zeroes=on -device virtio-blk-pci,drive=drive-virtio2,id=virtio2,bus=pci.0,addr=0xc -netdev type=tap,id=net0,ifname=tap134i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on -device virtio-net-pci,mac=DE:99:1C:D8:5B:F2,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300 -machine type=pc-i440fx-2.11 -incoming unix:/run/qemu-server/134.migrate -S
Executable: /usr/bin/qemu-system-x86_64
Control Group: /
Slice: -.slice
Boot ID: 86cbea1ac01946f59045bd348bc4493b
Machine ID: 5a2bd0d11a6c41f9a33fd527751224ea
Hostname: VMFOA03
Storage: /var/lib/systemd/coredump/core.kvm.0.86cbea1ac01946f59045bd348bc4493b.25687.1526381772000000000000.lz4
Message: Process 25687 (kvm) of user 0 dumped core.

Stack trace of thread 25735:
#0 0x00007f464b4c6c70 n/a (n/a)

How could this help?
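Since pve-qemu-kvm-dbg is installed, a symbol-resolved backtrace of all threads can usually be pulled straight from the stored dump (a sketch; the PID is the one from the output above):

# apt install gdb
# coredumpctl gdb 25687
(gdb) thread apply all bt
(gdb) quit

Posting the full output of "thread apply all bt" would show where in the QEMU/libpthread code the crash happens.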
 
Does this only happen with VMs that have disks on gluster?
 
I think so. In this cluster every VM has its disk on gluster storage. This latest crash involved a VM that usually does a lot of I/O, but I don't know if that could be related.
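As a quick cross-check (a sketch assuming the standard /etc/pve layout), the VM configs on a node can be grepped for gluster-backed drives and correlated with the kernel log entries for crashed kvm processes:

# grep -l gluster /etc/pve/qemu-server/*.conf
# journalctl -k | grep -i segfault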
 
I think you should upgrade to one of the GlusterFS versions supported upstream - the one in Debian Stretch is very outdated and buggy.
 
I've only used the version shipped with the latest Proxmox VE ISO. Do you think I could upgrade only the clients instead of the whole Gluster cluster? The cluster is in production and there are ~200 VMs running on it. I can try to install the new gluster client on a single node and see what happens, but upgrading all the gluster servers could be a bit tricky...
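Roughly, a client-only test on one PVE node could look like the sketch below. The repository path and key location are assumptions based on the usual download.gluster.org layout; verify them against the upstream install instructions for the release you pick, and treat mixed client/server versions only as a temporary test:

# echo "deb [arch=amd64] https://download.gluster.org/pub/gluster/glusterfs/4.0/LATEST/Debian/stretch/amd64/apt stretch main" > /etc/apt/sources.list.d/gluster.list
# wget -O - https://download.gluster.org/pub/gluster/glusterfs/4.0/rsa.pub | apt-key add -
# apt update
# apt install glusterfs-client glusterfs-common
(the repository path above is an assumption -- check download.gluster.org for the current layout)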
 
Is the GlusterFS cluster on the same nodes as PVE, or is it a separate cluster?
 
It's a separate cluster. In detail we have:

-- Gluster Cluster --

4 nodes, each with: Debian Stretch, 32 GB RAM, Xeon E5-1620, dual 10GbE NICs (LACP bonding), Areca RAID controller with 24 disks (16 SAS 10K and 8 SATA).

-- Proxmox Cluster --

8 nodes, each with: Proxmox VE 5.1 (latest) with Community subscription and enterprise repo, 256 GB RAM, Xeon E5-2643, a single 10GbE NIC for storage access and dual Gigabit NICs (active-backup bonding) for clustering.

Each node is connected to a 48-port 10Gbit switch (Cisco Nexus 3000 series).
 
I've managed to upgrade the cluster to version 4.0.2-2 without any downtime. I will let you know if any other crashes happen.
If I don't see any crashes for a week, I will mark this thread as solved.

Thank you for your support.
 
The important thing is that you always use the same version on the server and the client side.
The Debian packages are not as well maintained as the upstream ones, so the recommendation is to use the upstream packages.
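A quick way to verify that (a sketch; run the commands on a gluster server and on a PVE node respectively):

# gluster --version                              (on the gluster servers)
# glusterfs --version                            (on the PVE nodes, i.e. the client side)
# gluster volume get all cluster.op-version      (cluster operating version, on any gluster server)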
 
