KVM Segfault latest community repo on libpthread

robynhub

Hello,

Lately we've been experiencing many segfaults across multiple physical machines (Dell R620) that sporadically cause some VMs to crash:

[433025.858682] kvm[3158]: segfault at 18 ip 00007feee18b8c70 sp 00007feece5e3e38 error 6 in libpthread-2.24.so[7feee18ab000+18000]

We're using the latest enterprise repo (community license) and the latest kernel. Every update has been installed:

# pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.13.16-2-pve)
pve-manager: 5.1-51 (running version: 5.1-51/96be5354)
pve-kernel-4.13: 5.1-44
pve-kernel-4.13.16-2-pve: 4.13.16-47
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.2-pve4
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-18
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-15
pve-cluster: 5.0-25
pve-container: 2.0-22
pve-docs: 5.1-17
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-2
qemu-server: 5.0-25
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9

# uname -v
#1 SMP PVE 4.13.16-47 (Mon, 9 Apr 2018 09:58:12 +0200)

Is anyone having the same issue? Are there any hints on how to get this resolved?

By the way, we're sure it isn't hardware related. The machines had their memory tested for 48 hours without any issues, and the Dell self-diagnostic tool also ran without errors.
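For reference, the faulting offset inside libpthread can be mapped to the nearest symbol with gdb. This is only a sketch, assuming the usual Debian Stretch library path; the offset is the instruction pointer minus the mapping base from the log line above (0x7feee18b8c70 - 0x7feee18ab000 = 0xdc70):

# gdb -batch -ex 'info symbol 0xdc70' /lib/x86_64-linux-gnu/libpthread-2.24.so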

Thank you in advance
 
Check all the involved files for corruption (the debsums package might be of help). If that turns out okay, please attempt to get a backtrace of the crashed kvm process after installing the pve-qemu-kvm-dbg package (e.g. with systemd-coredump).
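For example, a minimal sketch of that check (the package names are the ones mentioned above):

# apt update
# apt install debsums
# debsums -s                        (silent mode: only files whose checksum does not match are printed)
# debsums -s libc6 pve-qemu-kvm     (or limit the check to the packages involved in the crash)
# apt install pve-qemu-kvm-dbg systemd-coredump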
 
Hello fabian,

Thank you for the answer. debsums didn't find any errors or MD5 mismatches. I've installed the pve-qemu-kvm-dbg and systemd-coredump packages. Do you have a guide or documentation on how to use them?

Thank you in advance.
 
The systemd-coredump package ships a man page that tells you how to configure your system to use it. Depending on the amount of RAM your VM has, you probably need to bump the coredump file size limit as well. Then wait for a segfault, and check "coredumpctl list" for a new entry.
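As a rough example of what that configuration could look like (a sketch only; the 16G limits are an assumption sized for an 8 GB VM, see coredump.conf(5) for the details):

/etc/systemd/coredump.conf:

[Coredump]
Storage=external
Compress=yes
ProcessSizeMax=16G
ExternalSizeMax=16G

Then, after the next crash:

# coredumpctl list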
 
Here is the coredumpctl info output:

# coredumpctl info kvm

PID: 25687 (kvm)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Tue 2018-05-15 12:56:12 CEST (17min ago)
Command Line: /usr/bin/kvm -id 134 -name TestVM -chardev socket,id=qmp,path=/var/run/qemu-server/134.qmp,server,nowait -mon chardev=qmp,mode=control -pidfile /var/run/qemu-server/134.pid -daemonize -smp 2,sockets=1,cores=2,maxcpus=2 -nodefaults -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg -vga std -vnc unix:/var/run/qemu-server/134.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 8192 -device pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e -device piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=tablet,bus=uhci.0,port=1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -iscsi initiator-name=iqn.1993-08.org.debian:01:c7e46effc298 -drive if=none,id=drive-ide2,media=cdrom,aio=threads -device ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=100 -drive file=gluster://172.20.101.2/GoldStorage/images/134/vm-134-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on -device virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=200 -drive file=gluster://172.20.101.2/GoldStorage/images/134/vm-134-disk-2.qcow2,if=none,id=drive-virtio1,format=qcow2,cache=none,aio=native,detect-zeroes=on -device virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb -drive file=gluster://172.20.101.2/GoldStorage/images/134/vm-134-disk-3.qcow2,if=none,id=drive-virtio2,format=qcow2,cache=none,aio=native,detect-zeroes=on -device virtio-blk-pci,drive=drive-virtio2,id=virtio2,bus=pci.0,addr=0xc -netdev type=tap,id=net0,ifname=tap134i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on -device virtio-net-pci,mac=DE:99:1C:D8:5B:F2,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300 -machine type=pc-i440fx-2.11 -incoming unix:/run/qemu-server/134.migrate -S
Executable: /usr/bin/qemu-system-x86_64
Control Group: /
Slice: -.slice
Boot ID: 86cbea1ac01946f59045bd348bc4493b
Machine ID: 5a2bd0d11a6c41f9a33fd527751224ea
Hostname: VMFOA03
Storage: /var/lib/systemd/coredump/core.kvm.0.86cbea1ac01946f59045bd348bc4493b.25687.1526381772000000000000.lz4
Message: Process 25687 (kvm) of user 0 dumped core.

Stack trace of thread 25735:
#0 0x00007f464b4c6c70 n/a (n/a)

How could this help?
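Since pve-qemu-kvm-dbg is installed, a symbol-resolved backtrace of all threads can usually be pulled straight from the stored dump (a sketch; the PID is the one from the output above):

# apt install gdb
# coredumpctl gdb 25687
(gdb) thread apply all bt
(gdb) quit

Posting the full output of "thread apply all bt" would show where in the QEMU/libpthread code the crash happens.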
 
Does this only happen with VMs that have disks on gluster?
 
I think so. In this cluster every VM has its disk on gluster storage. This latest crash involved a VM that usually does a lot of I/O, but I don't know if that could be related.
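As a quick cross-check (a sketch assuming the standard /etc/pve layout), the VM configs on a node can be grepped for gluster-backed drives and correlated with the kernel log entries for crashed kvm processes:

# grep -l gluster /etc/pve/qemu-server/*.conf
# journalctl -k | grep -i segfault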
 
I think you should upgrade to one of the GlusterFS versions supported upstream - the one in Debian Stretch is very outdated and buggy.
 
I've only used the version shipped with the latest Proxmox VE ISO. Do you think I could upgrade only the clients instead of the whole Gluster cluster? The cluster is in production and there are ~200 VMs running on it. I can try to install the new gluster client on a single node and see what happens, but upgrading all the gluster servers could be a bit tricky...
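Roughly, a client-only test on one PVE node could look like the sketch below. The repository path and key location are assumptions based on the usual download.gluster.org layout; verify them against the upstream install instructions for the release you pick, and treat mixed client/server versions only as a temporary test:

# echo "deb [arch=amd64] https://download.gluster.org/pub/gluster/glusterfs/4.0/LATEST/Debian/stretch/amd64/apt stretch main" > /etc/apt/sources.list.d/gluster.list
# wget -O - https://download.gluster.org/pub/gluster/glusterfs/4.0/rsa.pub | apt-key add -
# apt update
# apt install glusterfs-client glusterfs-common
(the repository path above is an assumption -- check download.gluster.org for the current layout)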
 
Is the GlusterFS cluster on the same nodes as PVE, or is it a separate cluster?
 
It's a separate cluster. In detail we have:

-- Gluster Cluster --

4 nodes, each with: Debian Stretch, 32 GB RAM, Xeon E5-1620, dual 10GbE NICs (LACP bonding), Areca RAID controller with 24 disks (16 SAS 10K and 8 SATA).

-- Proxmox Cluster --

8 nodes, each with: Proxmox VE 5.1 (latest) with Community subscription and enterprise repo, 256 GB RAM, Xeon E5-2643, a single 10GbE NIC for storage access and dual Gigabit NICs (active-backup bonding) for clustering.

Each node is connected to a 48-port 10Gbit switch (Cisco Nexus 3000 series).
 
I've managed to upgrade the cluster to version 4.0.2-2 without any downtime. I will let you know if any other crashes happen.
If I don't see any crashes for a week, I will mark this thread as solved.

Thank you for your support.
 
The important thing is that you always use the same version on the server and the client side.
The Debian packages are not as well maintained as the upstream ones, so the recommendation is to use the upstream packages.
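A quick way to verify that (a sketch; run the commands on a gluster server and on a PVE node respectively):

# gluster --version                              (on the gluster servers)
# glusterfs --version                            (on the PVE nodes, i.e. the client side)
# gluster volume get all cluster.op-version      (cluster operating version, on any gluster server)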
 
