PVE Crashing Starting LXC Container (5.1)

haxxa

Hello Folks,

I have encountered an annoying issue since upgrading to the 5.1 release. When starting a Debian LXC container on one of my hosts, it gets stuck and freezes. A short time later the start task times out and PVE crashes.

This only happens after stopping and restarting the container; on host boot it starts fine. The container runs OpenVPN as a client and has some file services set up.

I can't figure out the issue, and I am looking for some insight on how to debug it. I have attached the relevant CLI output below:

PVE interface (after crash): [screenshot attached]

PVE interface output: [screenshot attached]

'systemctl status lxc' output: [screenshot attached]

'journalctl -xe' output: [screenshot attached]

I should also mention that one CPU core gets pegged at 100%, and 'systemctl poweroff', 'halt', and 'reboot' no longer work; after restarting the container, the only way to reboot the server is to physically reset it or hold the power button. It seems to me that the LXC container is failing to stop, and that is causing all the issues described.
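
If it helps anyone dig into this, my plan is to try two things next: start the container in the foreground with debug logging, and look for tasks stuck in uninterruptible sleep. Something like this (untested, flags taken from the lxc-start and ps man pages; 101 is just an example ID):

# start the container in the foreground with verbose logging
lxc-start -n 101 -F -l DEBUG -o /tmp/lxc-101.log

# list processes stuck in uninterruptible sleep (state D)
ps -eo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /^D/'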

Any suggestions appreciated,

Thanks - Harrison
 
I should also mention that none of the other containers have this issue, so I think it relates to this container's configuration. I'm not sure how to debug this; any help appreciated...
 
I attempted moving the container to another cluster, still with no luck; it crashes both machines. Any ideas?
 
I tried uninstalling pve-manager and installing it again. That worked for a day, but today the issue has come back.

Edit: More info: the problem affects the web interface and the LXC containers; lxc commands hang. VMs keep working.
Edit 2: It is only one container. One of four CPU cores is pegged at 100% permanently. I think it may be about resource distribution, because the failure starts after a stop-mode backup.
 
I've been having the same issues, but without the 100% CPU usage. When I reboot the system it works normally, but if I start and stop more than one or two containers after the initial startup sequence, the container I'm starting/restarting dies. When that happens, I get the same (?) icons next to my containers in the GUI even though they're working normally (except the one that won't start), and I can't start or shut down the others. It also keeps me from logging into the GUI, though I can still SSH into the server. All status information is missing as well. I've tried restarting all of the PVE daemons to no avail.
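
(For reference, by "all of the PVE daemons" I mean roughly the following; service names typed from memory, so double-check them on your system:)

systemctl restart pvedaemon pveproxy pvestatd pve-cluster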

I've found that I have a D process (uninterruptible sleep due to I/O) that seems to be the root cause, but I don't know what's locking it. For example, if container 103 is the one that got stuck initially:

root@server:/# ps aux |grep 103

root 103 0.0 0.0 0 0 ? S 12:45 0:00 [ksoftirqd/15]
root 10652 0.0 0.0 50216 3844 ? Ss 16:50 0:00 [lxc monitor] /var/lib/lxc 103
root 10876 0.1 0.0 50216 684 ? D 16:50 0:01 [lxc monitor] /var/lib/lxc 103
root 13265 0.0 0.0 41772 4128 ? S 17:12 0:00 lxc-info -n 103 -p
root 13413 0.0 0.0 41772 4180 ? S 17:13 0:00 lxc-info -n 103 -p
root 13419 0.0 0.0 41772 4104 ? S 17:13 0:00 lxc-info -n 103 -p
root 13575 0.0 0.0 41772 4072 ? S 17:15 0:00 lxc-info -n 103 -p
root 13598 0.0 0.0 12788 976 pts/4 S+ 17:15 0:00 grep 103


If I do 'pct list', it hangs until I kill the lxc-info processes; then it works. The lxc-info processes always restart. But I think the real issue is process 10876 in the example above. I just haven't found what is causing it to hang; there are no syslog messages that indicate a lock file or some other I/O problem.
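
One thing that might narrow it down is reading the kernel-side blocking point from procfs (a sketch; 10876 is the stuck monitor PID from my example above, and this needs root):

# kernel call trace of the stuck process
cat /proc/10876/stack
# the kernel function it is currently waiting in
cat /proc/10876/wchan; echo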

The container in question does have an lxc.mount.entry for a USB device that I'm passing through to the container. The device is /dev/ttyACM0 (a Z-Wave stick), and I set it to a+rw mode so that the unprivileged container can write to it. But I've had this happen with other containers too.
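
For reference, the passthrough lines in /etc/pve/lxc/103.conf look roughly like this (typed from memory, so treat it as a sketch; 166 is the usual character major for ttyACM devices):

# allow the character device inside the container's cgroup
lxc.cgroup.devices.allow: c 166:* rwm
# bind-mount the host device into the container
lxc.mount.entry: /dev/ttyACM0 dev/ttyACM0 none bind,optional,create=file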

Today, after a fresh dist-upgrade and reboot, I shut this container down, unplugged the USB stick, plugged it back in, and tried to start the container. There were at least 5 minutes between unplugging and reconnecting the USB device, and the syslog showed no errors. The containers are all on local-lvm storage; other containers keep working normally, including those on iSCSI storage and those accessing NFS shares from a FreeNAS system.
 
Might it be related to a timeout during shutdown of the container, leaving the monitor task frozen?

I cut some long-running processes out of 103 as a test and was able to shut down and restart this container (which is usually my problem child) multiple times with no issue. I'll keep testing...
 
Hi again. I have redistributed the resources; I think the problem was about that. If the LXC container has to start while another VM is using the resources the LXC container needs, it hangs, and it hangs pve-manager too. Because of that, it is possible that everything works fine when you restart the node. At the moment this works for me, but I can't say with 100% certainty that this is the problem.
 
@Ale_IF - I'm not sure what you mean by redistributing the resources. My system is running nowhere near the hardware's limits. I do see that Proxmox assigns specific CPU identifiers to the LXC containers when I limit a container to X cores, and some of them overlap with other containers, but that doesn't seem to be an issue - especially since none of them are running at max capacity.

Extending the shutdown timeout and reducing the container's own shutdown time seems to have helped - so far. But, like you, I'm not ready to call this one closed yet.
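
In case it's useful to someone, the shutdown timeout can also be passed per invocation; the values here are just examples:

# give container 103 two minutes to stop cleanly instead of the default
pct shutdown 103 --timeout 120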
 
I think my problem with resources was the RAM, not the CPUs. I was using dynamic memory allocation (ballooning) for the VMs. The sum of the minimums equaled the physical RAM, but if one VM exceeded its minimum, Proxmox didn't free enough RAM to start the LXC container with its assigned RAM. I don't know if I explained the point well; sorry about my English.
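
To illustrate what I mean (example VMID and sizes, not my real configuration): with ballooning enabled, the memory lines in /etc/pve/qemu-server/100.conf look like this, where 'balloon' is the minimum the VM can be shrunk to:

memory: 8192
balloon: 2048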
 
@Ale_IF - Thanks, I understand. Your English is fine; I just wasn't sure which resource was the issue.

I'm nowhere near the RAM or CPU limits, so for me that isn't the reason for the lockup problem.
 
OK, thank you too. Anyway, as I said before, I'm not sure about the solution. I'll post here how it goes.
 
Just had another occurrence of this. I created an LXC container a few days ago and it had been fine: an Apache DS LDAP system on a Debian container. It had been sitting idle, waiting for me to get home to configure it.

Today I went to use the console via the control panel, but it was unresponsive to anything I typed. Otherwise, everything was still working at that point.

I SSH'd into the hypervisor and then used pct enter to get into the container's shell. I ran the command I needed to run. It seemed to execute, but the result wasn't what I wanted. I tried to reboot the container from the inside instead of resetting it with the button in the hypervisor GUI.

That locked up lxc_monitor again, which is now back in D status (waiting indefinitely), and the hypervisor is in the same state as above; the entire machine will require a reboot to fix it.
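
The only further diagnostic I can think of is asking the kernel to dump all blocked tasks into the log (requires sysrq; a sketch I haven't verified on this box):

echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
echo w > /proc/sysrq-trigger      # dump tasks in uninterruptible (D) state
dmesg | tail -n 50                # read the resulting stack traces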

Is there no way to reset pct/lxc to get it out of this forever frozen state?
 
Once again this occurred. This time I had restarted an LXC container a few times over the past week and all seemed well, except that the dashboard showed network information but 0 CPU and memory utilization (even though the container was in fact running).

I had to make a change to the configuration, so I shut it down successfully, changed the configuration, and tried to restart it. It didn't come up. Once again, I have an lxc_monitor process in a D status, with the same conditions.

Am I the only one with these problems? Where might I check to diagnose this issue? Are there any logs or configuration messages that might help?
 
This sounds like the issue I posted here:
https://forum.proxmox.com/threads/lxc-container-reboot-fails-lxc-becomes-unusable.41264/#post-201351

The LXC bug report I filed at the same time (tl;dr -- it's a kernel bug):
https://github.com/lxc/lxc/issues/2141

And other users reporting the same issue:
https://forum.proxmox.com/threads/p...ntainer-101-service-failed.41878/#post-201388

Unfortunately, the only resolution currently is to build your own 4.14.20+ kernel. Whatever the bug is, it's present in pve-kernel 4.13.13-6 (4.13.13-41), which is current at the time of writing.
 
Thanks! I followed those threads too. It does seem like the same issue - filesystem locking makes sense given that I'm seeing frozen processes.

I am running Proxmox in a homelab, so it's not as mission critical. I'll probably wait for the official fix. It's just irritating to have to reboot the server every time I want to tweak a container. Thankfully the rest of the containers stay running when this happens.
 
We have the same problem and cannot start any of the LXC containers anymore:

pveversion -v
=====================================================
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
openvswitch-switch: 2.6.2~pre+git20161223-3
=====================================================
 
We have the same problem and cannot start any of the LXC containers anymore:

The other threads mentioned above have a new kernel in testing which solves this problem. You can either use that new kernel now or wait until it reaches the stable branch as an update. Unfortunately, until then it seems you need a hard reboot of the host to fix the problem temporarily; it will keep recurring, since it is a kernel bug.
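
For anyone who wants to try the test kernel before it reaches the stable repository, it comes from the pvetest repository (example sources entry for PVE 5.x on stretch; the exact kernel package name may differ):

# /etc/apt/sources.list.d/pvetest.list
deb http://download.proxmox.com/debian/pve stretch pvetest

apt update
apt full-upgrade   # pulls in the test kernel; reboot into it afterwards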
 
