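I got one more hang this morning. The load average was above 1200, which I have never seen before.
There were 3 lxcfs processes on the server. One of them (PID 2474) had a huge virtual size: about 27.5 GiB of VSZ, although only about 44 MiB was resident.
I could not debug it: gdb could not find the binary (it is reported as deleted) and could not read the process memory. See the session below: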
Code:
~# ps auxw|grep lxcfs
root 2067 0.0 0.0 751588 1412 ? S 03:49 0:00 /usr/bin/lxcfs /var/lib/lxcfs/
root 2068 0.0 0.0 751720 1412 ? S 03:49 0:00 /usr/bin/lxcfs /var/lib/lxcfs/
root 2474 0.3 0.1 28841028 45420 ? Ssl Mar14 64:32 /usr/bin/lxcfs /var/lib/lxcfs/
~# uptime
09:40:47 up 13 days, 14:39, 2 users, load average: 1221.00, 1172.77, 1120.05
(gdb) attach 2474
Attaching to process 2474
/usr/bin/lxcfs (deleted): No such file or directory.
(gdb)
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x8831ad00:
#0 0x84ff3050 in ?? ()
Cannot access memory at address 0x8831ad00
(gdb) attach 2067
Attaching to process 2067
/usr/bin/lxcfs (deleted): No such file or directory.
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7d7f9750:
#0 0x84ff44c9 in ?? ()
Cannot access memory at address 0x7d7f9750
Kernel log:
[1177216.490435] Memory cgroup out of memory: Kill process 9637 (mysqld) score 166 or sacrifice child
[1177216.490485] Killed process 9637 (mysqld) total-vm:2311652kB, anon-rss:324280kB, file-rss:0kB
[1177317.132890] systemd-journald[289]: /dev/kmsg buffer overrun, some messages lost.
[1177317.472785] do_general_protection: 1 callbacks suppressed
[1177317.472790] traps: sh[15643] general protection ip:7f7eaef3b2fc sp:7ffce35ccec0 error:0 in libc-2.13.so[7f7eaef06000+184000]
[1177317.808201] traps: sh[15634] general protection ip:7f47cdaf82fc sp:7ffe652a60a0 error:0 in libc-2.13.so[7f47cdac3000+184000]
[1177318.773376] traps: sh[15648] general protection ip:7f350d4fe2fc sp:7ffe0ea4dd10 error:0 in libc-2.13.so[7f350d4c9000+184000]
[1177321.036155] traps: sh[15669] general protection ip:7fe8b5c312fc sp:7ffc5184f0c0 error:0 in libc-2.13.so[7fe8b5bfc000+184000]
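For the next hang, here is a minimal sketch for seeing which task states are behind a load average like the one above. On Linux the load average counts both runnable (R) and uninterruptible (D) tasks, so a large D count would point at processes stuck in the kernel rather than at real CPU use. The function names `summarize_states` and `live_states` are my own and purely illustrative; nothing here was run during the session above.

```shell
#!/bin/sh
# Summarize process states read from stdin, one state string per line
# (e.g. "D", "Ssl", "R+"). Only the first character is the state itself;
# the rest are flags. Prints one "STATE=COUNT" line per state, sorted.
summarize_states() {
    awk 'NF { count[substr($1, 1, 1)]++ }
         END { for (s in count) print s "=" count[s] }' | sort
}

# Hypothetical wrapper for a live system: feed the state column of ps
# into the summary (assumes a procps-style ps that accepts "-eo stat=").
live_states() {
    ps -eo stat= | summarize_states
}
```

Calling `live_states` during a hang and comparing the D and R counts against the load average should show whether the load comes from blocked tasks.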
Here's pveversion output:
# pveversion -v
proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-22 (running version: 4.1-22/aca130cf)
pve-kernel-4.2.8-1-pve: 4.2.8-39
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-36
qemu-server: 4.0-64
pve-firmware: 1.1-7
libpve-common-perl: 4.0-54
libpve-access-control: 4.0-13
libpve-storage-perl: 4.0-45
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-52
pve-firewall: 2.0-22
pve-ha-manager: 1.0-25
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
fence-agents-pve: not correctly installed
openvswitch-switch: 2.3.2-2