KVM / CentOS 6.8 random hang with 100% CPU

amace
Dec 17, 2012
I have a problematic VM that randomly hangs with CPU usage pegged at maximum. The load average climbs past 30, and although I can ping the machine as usual, the console never responds and all users are locked out. Until the crash, the system runs under a 1.00 load average. I have to stop/start the VM to get it going again. It runs anywhere from several hours to a week or two before the same thing happens.
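
For anyone following along, this is roughly how I confirm on the host that it's the VM's QEMU process that is pinned (a sketch; VM ID 100 is a placeholder for the real ID):

# On the Proxmox host: find the VM's QEMU process and watch it
qm list
top -p $(cat /var/run/qemu-server/100.pid)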

The server is a Dell with an E5-2630L v3 and 32GB of RAM, running Proxmox VE 4.4-13 with kernel 4.4.44-1-pve. The VM has two virtio disk images, which were encrypted during installation, and 24GB of RAM allocated.

root@pve:~# pveversion -v
proxmox-ve: 4.4-84 (running kernel: 4.4.44-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-96
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80

We keep CentOS current with all updates. The problem has persisted across many kernel updates, from both Proxmox and CentOS.

I've tried a few things to resolve this: switching the CPU type from qemu64 to "host" and back, toggling the ballooning feature off and on (early on it looked like a memory leak), and changing the video type from cirrus to standard VGA and back. I can't find anything else to go on.
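
For reference, those changes were applied on the host roughly like this (a sketch; VM ID 100 stands in for the real VM ID):

# On the Proxmox host
qm set 100 -cpu host     # CPU type qemu64 -> host (revert with -cpu qemu64)
qm set 100 -balloon 0    # disable memory ballooning (0 = off)
qm set 100 -vga std      # video cirrus -> standard VGA (revert with -vga cirrus)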

This is getting out of my field of expertise, but I tried some things suggested in other posts on this forum. I've collected a backtrace and a core dump, and I have the following excerpt from strace:

Process 120508 attached with 35 threads
[pid 120566] futex(0x7f72873cdc84, FUTEX_WAIT_PRIVATE, 1313, NULL <unfinished ...>
[pid 120564] rt_sigtimedwait([BUS USR1], <unfinished ...>
[pid 120563] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120564] <... rt_sigtimedwait resumed> 0x7f72677fe840, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120564] rt_sigpending( <unfinished ...>
[pid 120562] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120564] <... rt_sigpending resumed> [], 8) = 0
[pid 120564] futex(0x564da1555500, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 120563] <... futex resumed> ) = 0
[pid 120564] ioctl(56, KVM_RUN <unfinished ...>
[pid 120563] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120561] rt_sigtimedwait([BUS USR1], <unfinished ...>
[pid 120560] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120561] <... rt_sigtimedwait resumed> 0x7f726a7fe840, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120559] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120561] rt_sigpending([], 8) = 0
[pid 120561] futex(0x564da1555500, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 120558] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120561] <... futex resumed> ) = 1
[pid 120562] <... futex resumed> ) = 0
[pid 120561] ioctl(53, KVM_RUN <unfinished ...>
[pid 120562] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120558] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120557] rt_sigtimedwait([BUS USR1], <unfinished ...>
[pid 120558] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120557] <... rt_sigtimedwait resumed> 0x7f726e7fe840, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120557] rt_sigpending( <unfinished ...>
[pid 120556] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120557] <... rt_sigpending resumed> [], 8) = 0
[pid 120555] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120557] futex(0x564da1555500, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 120555] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120557] <... futex resumed> ) = 1
[pid 120563] <... futex resumed> ) = 0
[pid 120557] ioctl(49, KVM_RUN <unfinished ...>
[pid 120563] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120555] rt_sigtimedwait([BUS USR1], 0x7f72707fe840, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120554] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120555] rt_sigpending( <unfinished ...>
[pid 120553] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
...
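
For reference, the excerpt above was captured by attaching strace to the VM's QEMU process on the host, roughly like this (a sketch; VM ID 100 and the log path are placeholders):

# Follow all threads of the running QEMU process; strace writes to stderr
strace -f -p $(cat /var/run/qemu-server/100.pid) 2> /tmp/vm100-strace.log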

The latest change I've made is to set the guest's clocksource to acpi_pm, since the kernel reported "Clocksource tsc unstable (delta = -17179829681 ns)".
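
A minimal sketch of that change inside the guest, using the standard clocksource sysfs interface:

# Inside the CentOS 6 guest
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# To persist across reboots, append clocksource=acpi_pm to the kernel line
# in /boot/grub/grub.conf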

Now I wait to see if the clocksource change helps. I'm not sure where to go from here, so any suggestions are greatly appreciated!
 
We have narrowed this down to HylaFAX. When we disable HylaFAX during peak usage, the server operates without crashing. Has anyone had problems with HylaFAX in KVM? We are running HylaFAX+ 5.5.9, which connects to modems via a DigiPort terminal server.
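
In case anyone wants to test the same thing, disabling it looked roughly like this (a sketch; the init script name "hylafax" is an assumption and may differ for the hylafax+ package):

# Inside the CentOS 6 guest; service name is an assumption
service hylafax stop
chkconfig hylafax off    # keep it from starting at boot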
 
