I have a problematic VM that randomly hangs with CPU usage pegged at the maximum. The load average climbs past 30, and although I can still ping the machine normally, the console never responds and all users are locked out. Until the hang, the system runs at under a 1.00 load average. I have to stop/start the VM to get it going again. It will run anywhere from several hours to a week or two before the same thing happens.
The host is a Dell server with an E5-2630L v3 and 32 GB of RAM, running Proxmox VE 4.4-13 with kernel 4.4.44-1-pve. The VM has two virtio disk images, which were encrypted during installation, and 24 GB of RAM allocated.
root@pve:~# pveversion -v
proxmox-ve: 4.4-84 (running kernel: 4.4.44-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-96
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
We keep the CentOS guest current with all updates. The problem has persisted across many kernel updates on both the Proxmox and CentOS sides.
I've tried a few things to resolve this: setting the CPU type to "host" from qemu64 and back, turning the ballooning feature off and on (early on it looked like a memory leak), and changing the video type from cirrus to standard VGA and back. I can't find anything else to go on.
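For anyone chasing a similar issue, these are the kinds of qm commands involved (VMID 100 here is just a placeholder for your own VM's ID):

# Switch the CPU type between qemu64 and host
qm set 100 --cpu host
qm set 100 --cpu qemu64

# Disable ballooning (0 turns it off); re-enable with a minimum target in MB
qm set 100 --balloon 0
qm set 100 --balloon 24576

# Switch the emulated display between standard VGA and cirrus
qm set 100 --vga std
qm set 100 --vga cirrus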
This is getting out of my field of expertise, but I tried some things suggested in other posts on this forum. I've collected a backtrace and a core dump, and I have the following excerpt from strace:
Process 120508 attached with 35 threads
[pid 120566] futex(0x7f72873cdc84, FUTEX_WAIT_PRIVATE, 1313, NULL <unfinished ...>
[pid 120564] rt_sigtimedwait([BUS USR1], <unfinished ...>
[pid 120563] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120564] <... rt_sigtimedwait resumed> 0x7f72677fe840, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120564] rt_sigpending( <unfinished ...>
[pid 120562] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120564] <... rt_sigpending resumed> [], 8) = 0
[pid 120564] futex(0x564da1555500, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 120563] <... futex resumed> ) = 0
[pid 120564] ioctl(56, KVM_RUN <unfinished ...>
[pid 120563] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120561] rt_sigtimedwait([BUS USR1], <unfinished ...>
[pid 120560] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120561] <... rt_sigtimedwait resumed> 0x7f726a7fe840, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120559] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120561] rt_sigpending([], 8) = 0
[pid 120561] futex(0x564da1555500, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 120558] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120561] <... futex resumed> ) = 1
[pid 120562] <... futex resumed> ) = 0
[pid 120561] ioctl(53, KVM_RUN <unfinished ...>
[pid 120562] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120558] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120557] rt_sigtimedwait([BUS USR1], <unfinished ...>
[pid 120558] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120557] <... rt_sigtimedwait resumed> 0x7f726e7fe840, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120557] rt_sigpending( <unfinished ...>
[pid 120556] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120557] <... rt_sigpending resumed> [], 8) = 0
[pid 120555] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120557] futex(0x564da1555500, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 120555] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120557] <... futex resumed> ) = 1
[pid 120563] <... futex resumed> ) = 0
[pid 120557] ioctl(49, KVM_RUN <unfinished ...>
[pid 120563] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120555] rt_sigtimedwait([BUS USR1], 0x7f72707fe840, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
[pid 120554] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 120555] rt_sigpending( <unfinished ...>
[pid 120553] futex(0x564da1555500, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
...
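(For reference, that excerpt came from attaching strace to the VM's main KVM process, roughly as follows; the PID is the one shown in the output above, and the output file path is just an example:)

# Attach to the KVM process and all of its threads (-f), logging to a file
strace -f -p 120508 -o /tmp/kvm-120508.strace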
The latest change I've made is to set the clocksource to acpi_pm, since the kernel reported "Clocksource tsc unstable (delta = -17179829681 ns)".
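In case it's useful to others, the switch was done at runtime through sysfs, roughly like this (a clocksource=acpi_pm kernel boot parameter would be needed to make it permanent):

# List the clocksources the kernel has available
cat /sys/devices/system/clocksource/clocksource0/available_clocksource

# Show the clocksource currently in use
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

# Switch to acpi_pm immediately (does not survive a reboot)
echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource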
Now I'm waiting to see whether the clocksource change helps. I'm not sure where to go from here. Any suggestions are greatly appreciated!