High load for no reason - containers cannot shut down

SPQRInc

Member
Jul 27, 2015
Hello,

I have a huge problem. For two days now, some of my Proxmox-based LXC containers have been becoming unresponsive, and only a reboot of the node fixes it.

This always happens at the same time at night (I guess something is running in one of the containers that causes heavy load).

The problem is that top/atop/htop do not show anything. The Proxmox nodes respond to SSH connections without problems, but 2 of my 5 nodes are not really usable (I can log in via SSH but cannot enter any commands).

I also have to do a "hard" reboot, because a normal reboot does not work (the LXC containers still have not stopped after 40 minutes).


This is my PVE version:
pveversion -v
proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-15 (running version: 4.1-15/8cd55b52)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-2.6.32-43-pve: 2.6.32-166
pve-kernel-4.2.8-1-pve: 4.2.8-39
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-33
qemu-server: 4.0-62
pve-firmware: 1.1-7
libpve-common-perl: 4.0-49
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-42
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-46
pve-firewall: 2.0-18
pve-ha-manager: 1.0-24
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
Unfortunately, the logs do not show anything.

Syslog:
Mar 15 04:32:31 server pvedaemon[4061]: worker exit
Mar 15 04:32:31 server pvedaemon[1192]: worker 4061 finished
Mar 15 04:32:31 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:32:31 server pvedaemon[1192]: worker 24675 started
Mar 15 04:33:05 server pvedaemon[6601]: worker exit
Mar 15 04:33:05 server pvedaemon[1192]: worker 6601 finished
Mar 15 04:33:05 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:33:05 server pvedaemon[1192]: worker 25112 started
Mar 15 04:34:57 server systemd-timesyncd[559]: interval/delta/delay/jitter/drift 2048s/+0.000s/0.021s/0.001s/+1ppm
Mar 15 04:36:08 server pveproxy[17238]: worker exit
Mar 15 04:36:08 server pveproxy[1212]: worker 17238 finished
Mar 15 04:36:08 server pveproxy[1212]: starting 1 worker(s)
Mar 15 04:36:08 server pveproxy[1212]: worker 28231 started
Mar 15 04:39:48 server pvedaemon[572]: worker exit
Mar 15 04:39:48 server pvedaemon[1192]: worker 572 finished
Mar 15 04:39:48 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:39:48 server pvedaemon[1192]: worker 31498 started
Mar 15 04:40:40 server pveproxy[31690]: worker exit
Mar 15 04:40:40 server pveproxy[1212]: worker 31690 finished
Mar 15 04:40:40 server pveproxy[1212]: starting 1 worker(s)
Mar 15 04:40:40 server pveproxy[1212]: worker 32442 started
Mar 15 04:45:02 server pvedaemon[25112]: <root@pam> successful auth for user 'root@pam'
Mar 15 04:46:27 server pveproxy[28231]: worker exit
Mar 15 04:46:27 server pveproxy[1212]: worker 28231 finished
Mar 15 04:46:27 server pveproxy[1212]: starting 1 worker(s)
Mar 15 04:46:27 server pveproxy[1212]: worker 5082 started
Mar 15 04:48:45 server pveproxy[17122]: worker exit
Mar 15 04:48:45 server pveproxy[1212]: worker 17122 finished
Mar 15 04:48:45 server pveproxy[1212]: starting 1 worker(s)
Mar 15 04:48:45 server pveproxy[1212]: worker 6924 started
Mar 15 04:51:28 server pvedaemon[25112]: worker exit
Mar 15 04:51:28 server pvedaemon[1192]: worker 25112 finished
Mar 15 04:51:28 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:51:28 server pvedaemon[1192]: worker 9770 started
Mar 15 04:51:38 server pveproxy[32442]: worker exit
Mar 15 04:51:38 server pveproxy[1212]: worker 32442 finished
Mar 15 04:51:38 server pveproxy[1212]: starting 1 worker(s)
Mar 15 04:51:38 server pveproxy[1212]: worker 9911 started
Mar 15 04:52:45 server pvedaemon[31498]: worker exit
Mar 15 04:52:45 server pvedaemon[1192]: worker 31498 finished
Mar 15 04:52:45 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:52:45 server pvedaemon[1192]: worker 10794 started
Mar 15 04:55:46 server pvedaemon[24675]: worker exit
Mar 15 04:55:46 server pvedaemon[1192]: worker 24675 finished
Mar 15 04:55:46 server pvedaemon[1192]: starting 1 worker(s)
Mar 15 04:55:46 server pvedaemon[1192]: worker 13187 started
Mar 15 04:57:32 server rrdcached[972]: flushing old values
Mar 15 04:57:32 server rrdcached[972]: rotating journals
Mar 15 04:57:32 server rrdcached[972]: started new journal /var/lib/rrdcached/journal/rrd.journal.1458014252.151024
Mar 15 04:57:32 server rrdcached[972]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1458007052.150971
Mar 15 04:57:40 server puppet-agent[14639]: Finished catalog run in 0.53 seconds
 
I suspect you will have to look in the backup logs. I am guessing you are using a suspend-mode backup and the containers are getting stuck in the "freezing" state.

You can check the status of a container with lxc-info and thaw a stuck container with lxc-unfreeze, but there may still be hanging backup processes afterwards. A lot of these problems have been fixed by updates to lxcfs. Do you have those installed?
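
For example, assuming a container with ID 100 (just a placeholder, use your real CTID):

# show the container state (RUNNING, FREEZING, FROZEN, ...)
lxc-info -n 100

# if it is stuck in FREEZING or FROZEN, thaw it
lxc-unfreeze -n 100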
 
Hello, there are no backup tasks run by Proxmox at that time. It is always a Bacula backup running on the individual VPSs.

The installed lxcfs version is 2.0.0-pve1.
 
OK, could it be that your container comes under memory pressure during the backup? If you cannot execute commands and the container is not freezing or frozen, then perhaps memory is running out or disk I/O is stuck. Can you get ps listings of the container's processes from the host? Can you execute shell commands inside the container that do not use I/O?
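
For example, from the host (container ID 100 and PID 12345 are only placeholders):

# PID of the container's init process as seen from the host
lxc-info -n 100 -p

# full host-side process tree; the container's processes show up as
# children of the corresponding lxc-start process
ps faxl

# see what a single process is blocked on (stuck I/O usually shows a "D" state and a WCHAN entry)
ps -o pid,stat,wchan:32,cmd -p 12345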

Does your Bacula backup exclude directories like /proc and /sys?
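
One way to check this, assuming you have bconsole access on the machine running the Bacula Director:

# print the configured FileSets, including their Include/Exclude lists
echo "show filesets" | bconsole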

Maybe also have a look at this thread to see whether there is something useful in it: https://forum.proxmox.com/threads/pve-suddunly-stopped-working-all-cts-unrecheable.26458/
 
Please provide the following debugging output after installing lxcfs-dbg and gdb ("apt install gdb lxcfs-dbg") on the nodes to help us find out more about the issue at hand.

general:
  1. output of "pveversion -v" on each node
when the issue occurs and you are able to connect via ssh/have a serial console/other access:
  1. complete output of "ps faxl" on the affected node
  2. note the PIDs of all lxcfs processes
  3. execute the following steps with PID replaced by the PID of one lxcfs process, and repeat for all of them. Please collect the gdb output! (A non-interactive sketch is shown after this list.)
    1. "gdb" to start gdb
    2. enter "attach PID" to attach to the lxcfs process
    3. enter "bt" to print the backtrace
    4. enter "detach" to detach from the lxcfs process
    5. enter "quit" to quit gdb
  4. collect system logs with "journalctl -b" and save the output
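
As a non-interactive sketch of the gdb sub-steps above (1234 stands for one of the lxcfs PIDs found via "ps faxl"):

# attach, print the backtrace, then detach and quit in one go
gdb -p 1234 -batch -ex "bt" > lxcfs-1234-bt.txt 2>&1

# system logs for the current boot
journalctl -b > journal-current-boot.txt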