LXC cgroups not cleaned up on container shutdown, can't restart

UrkoM

New Member
Oct 15, 2014
Hello,
One of our hosts is not cleaning up the cgroups of containers that have been shut down, which prevents them from starting again. Here is a snippet of the log file I obtained by starting the container with:
/usr/bin/lxc-start -F --logfile=/root/135.log --logpriority=DEBUG -n 135
Code:
lxc-start 135 20180116013019.550 INFO     lxc_cgroup - cgroups/cgroup.c:cgroup_init:67 - cgroup driver cgroupfs-ng initing for 135
lxc-start 135 20180116013019.550 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1337 - Path "/sys/fs/cgroup/cpu//lxc/135" already existed.
lxc-start 135 20180116013019.550 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1433 - Failed to create "/sys/fs/cgroup/cpu//lxc/135"

When we try to start one of these containers, the web interface becomes unresponsive for that host. Two other VMs on the same host keep running normally and are unaffected.

I've found some conversations online about similar issues when a container restart doesn't give the system enough time to clean up, but in this case the containers had been off for over 10 minutes.

How can I force the cleanup of the cgroups, at least as a workaround? Where can we look for more clues about what may be causing the problem?
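
One place to look for clues is the leftover cgroup directories themselves. The following is only a minimal sketch: it assumes the cgroup v1 layout seen in the log above (`/sys/fs/cgroup/<controller>/lxc/<container id>`), and the names `CGROUP_ROOT` and `list_leftover_cgroups` are made up for illustration, they are not part of LXC. It lists every leftover directory for a container and any PIDs still attached, since a cgroup cannot be removed while tasks remain in it.

```shell
#!/bin/sh
# Sketch: list leftover cgroup directories for a container and any PIDs
# still attached to them. CGROUP_ROOT and list_leftover_cgroups are
# illustrative names, not part of LXC.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

list_leftover_cgroups() {
    ctid="$1"
    # Assumed cgroup v1 layout: <root>/<controller>/lxc/<container id>
    for dir in "$CGROUP_ROOT"/*/lxc/"$ctid"; do
        [ -d "$dir" ] || continue
        echo "leftover: $dir"
        # Each cgroup exposes its member PIDs in cgroup.procs.
        find "$dir" -name cgroup.procs -type f | while read -r procs; do
            while read -r pid; do
                echo "  pid $pid still in ${procs%/cgroup.procs}"
            done < "$procs"
        done
    done
}

# Example: list_leftover_cgroups 135
```

If any PIDs show up, checking what those processes are (for example with `ps -p <pid>`) may explain why the cgroup was never released.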
 
Using commands from this page:
I've been able to clean up all cgroups for the container ID.
Going to /sys/fs/cgroup and running this line gets rid of all of them:
Code:
find <container id> -depth -type d -print -exec rmdir {} \;
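
The same cleanup can be sketched as a small function that walks every controller in one go. This is an illustration under assumptions, not a tested fix: it assumes the cgroup v1 layout `/sys/fs/cgroup/<controller>/lxc/<container id>` seen in the log, and `CGROUP_ROOT` and `remove_container_cgroups` are hypothetical names.

```shell
#!/bin/sh
# Sketch: remove leftover cgroup directories for one container across all
# controllers. CGROUP_ROOT and remove_container_cgroups are illustrative names.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

remove_container_cgroups() {
    ctid="$1"
    if [ -z "$ctid" ]; then
        echo "usage: remove_container_cgroups <container id>" >&2
        return 2
    fi
    status=0
    # Assumed cgroup v1 layout: <root>/<controller>/lxc/<container id>
    for dir in "$CGROUP_ROOT"/*/lxc/"$ctid"; do
        [ -d "$dir" ] || continue
        # -depth visits children before parents, so nested cgroups go first.
        # rmdir fails on a cgroup that still has tasks, so nothing that is
        # still in use gets removed.
        find "$dir" -depth -type d -exec rmdir {} \; 2>/dev/null
        if [ -d "$dir" ]; then
            echo "could not remove $dir" >&2
            status=1
        fi
    done
    return "$status"
}

# Example: remove_container_cgroups 135
```

Using `rmdir` rather than `rm -r` is deliberate here: a non-zero return means something is still attached to one of the cgroups and is worth investigating before retrying.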
Then I found that the container's network interface stays configured on the vswitch. I used this command to remove it:
Code:
ovs-vsctl del-port <port name>
and then restarted the openvswitch service to get rid of the hidden veth port (possibly not the best way to do it):
Code:
systemctl restart openvswitch.service
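
Finding the ports that belong to one container could be sketched like this. It assumes Proxmox-style veth names of the form `veth<container id>i<N>`, which may not match every setup, and `container_ports` is a hypothetical helper name; the bridge name `vmbr0` in the usage comment is also an assumption.

```shell
#!/bin/sh
# Sketch: pick out the OVS ports that belong to one container.
# Assumes veth names of the form veth<container id>i<N>; adjust the
# pattern if the interfaces are named differently.
container_ports() {
    ctid="$1"
    # Reads port names on stdin, one per line, and prints the matching ones.
    grep -E "^veth${ctid}i[0-9]+\$"
}

# Example (bridge name vmbr0 is an assumption):
#   ovs-vsctl list-ports vmbr0 | container_ports 135 |
#       while read -r port; do
#           ovs-vsctl --if-exists del-port vmbr0 "$port"
#       done
```

Anchoring the pattern at both ends keeps container 135 from also matching ports of a container such as 1350.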

After all this, starting the container still fails. The log file from running it with this line:
Code:
/usr/bin/lxc-start -F --logfile=/root/115.log --logpriority=DEBUG -n 115
does not show any errors, but the lxc process dies, the container does not respond, and I need to forcefully kill it.

I am really open to ideas... :)