LXC container reboot fails - LXC becomes unusable

Vasu Sreekumar · Mar 7, 2018

I found a work around to avoid node restart to solve the issue in 4.13. kernel.

It is not a very good one, but it avoids node restart.

I will use this until next Proxmox comes.

(I posted this in another thread also)

fabian · Mar 8, 2018

denos said:
Fabian: I do use NFS inside containers on my home server but not on two of the hypervisors that have had a network namespace lockup at work. It has been very easy to duplicate the issue at home (minutes - likely the NFS patch listed) but much harder on the hypervisors at work (up to 3 days of reboots before it occurs). I was excited to have made some progress but I agree with your assessment - the patch looks it's only addressing an NFS namespace issue. I think we're looking at more than one network namespace kernel issue that has been addressed somewhere in 4.14. Which is frustrating for everyone.

I appreciate that this is a very difficult issue to investigate, especially without steps to duplicate and am grateful for everyone's effort trying to pin it down.

If you have landed on this thread and want to confirm that it's relevant, wait for the issue to occur then run this command:

Code:

grep copy_net_ns /proc/*/stack

If that returns anything, this thread will be relevant. If not, you have a different issue.

As noted earlier in this thread, Docker users have reported an error with similar symptoms and a similar stack trace (hang on copy_net_ns):
https://github.com/coreos/bugs/issues/254
The bottom post in that thread carries on to this thread:
https://github.com/moby/moby/issues/5618
where they indicate kernel patches introduced as recently as Feb / 2018 may be relevant.

To reiterate, any server running a plain 4.14.20 kernel or later has had no further recurrence of this issue.

all the linked commits are already cherry-picked/backported into our 4.13.13-6-pve /-41 kernel (except for the NFS one you bisected to, which we will include in the next round of upgrades). we cannot reproduce this issue at all - we've been trying using various setups and machines without any luck. unfortunately this area is one where we have seen frequent regressions in the past, so it is likely there is some race/refcount leak that is triggered by some yet-unknown factor.

there will likely be a preview/test kernel based on Ubuntu Bionic's 4.15 kernel some time next week, it should fix this specific regression based on the information reported so far.

Vasu Sreekumar · Mar 8, 2018

I have 4.13.13-6-pve .

No NFS. Just ZFS.

I have 25 nodes live, everyday atelast 3 nodes going down with issue.

It is a nightmare.

root@Q172:~# systemctl status pve-container@103.service
● pve-container@103.service - PVE LXC Container: 103
Loaded: loaded (/lib/systemd/system/pve-container@.service; static; vendor preset: enabled)
Active: failed (Result: timeout) since Thu 2018-03-08 01:11:40 HST; 14min ago
Docs: man:lxc-start
man:lxc
man

ct
Process: 18338 ExecStart=/usr/bin/lxc-start -n 103 (code=killed, signal=TERM)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/system-pve\x2dcontainer.slice/pve-container@103.service

Mar 08 01:10:10 Q172 systemd[1]: Starting PVE LXC Container: 103...
Mar 08 01:11:40 Q172 systemd[1]: pve-container@103.service: Start operation timed out. Terminating.
Mar 08 01:11:40 Q172 systemd[1]: Failed to start PVE LXC Container: 103.
Mar 08 01:11:40 Q172 systemd[1]: pve-container@103.service: Unit entered failed state.
Mar 08 01:11:40 Q172 systemd[1]: pve-container@103.service: Failed with result 'timeout'.

Vasu Sreekumar · Mar 8, 2018

I have this result also.

root@Q172:~# grep copy_net_ns /proc/*/stack
/proc/10436/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/10464/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/11425/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/11470/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/11475/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/11887/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/12256/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/12487/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/1252/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/12865/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/12957/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/13459/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/13516/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/13708/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/13964/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/14339/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/14372/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/14389/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
/proc/1469/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220

Vasu Sreekumar · Mar 9, 2018

Anyway I have a temporary work around.

When node fails, migrate guest to another node in cluster.

In 1-2 days node heals itself from this error.

FibreFoX · Mar 9, 2018

Hm ... maybe this is correlated to the system, where proxmox is running? Currently we are running proxmox only on one node, but are planning to move other systems to proxmox too. There were no special customizations done after using the installation-wizzard (just added additional network-settings and adding some LDAP for login-auth, .. oh, and creating cluster-configuration).

Here are the specs of the vhost, where I'm running into LXC-containers not responding sometimes:

CPU
8 x AMD Opteron(tm) Processor 4284 (1 Socket)

Kernelversion:
Linux 4.13.13-6-pve #1 SMP PVE 4.13.13-41 (Wed, 21 Feb 2018 10:07:54 +0100)

PVE Manager Version:
pve-manager/5.1-46/ae8241d4

Drives:
2x 4TB WesternDigital drives connected via SATA (without any additional RAID-controller inbetween)

Storage is configured with the installer-wizzard to run ZFS in mirror-mode (raid1). First I created some VMs (QEMU/KVM), they were running normally, doing their work. Later on I created some containers using the "debian-9.0-standard_9.3-1_amd64"-template provided by proxmox-download-servers.

Trying to reproduce this on a separate toy-machine, which has some Intel-CPU does not result in hanging LXC-containers.

There is another user/thread, within the german part of this forum (https://forum.proxmox.com/threads/seit-neuesten-failed-to-start-pve-lxc-container.41815/) who DOES NOT use ZFS, but seems to have the same problems, so this might not be ZFS-related afterall.

Is there anything I can do to provide more details?

denos · Mar 9, 2018

I've seen this on multiple Xeon based servers. The problem is occurring in the kernel's network namespace management so I don't think it's going to be an issue with the filesystem choice, CPU architecture, etc. These may play into a timing component that is exacerbated by particular combinations but it is fully resolved in recent kernels so that has to be where the bug exists (kernel).

Fabian mentioned a 4.15 based kernel is on the way and it will have everything from the 4.14 line so that should take care of the problem. My offer still stands to provide packaged 4.14 kernels to anyone that wants to resolve the issue immediately.

Vasu Sreekumar · Mar 10, 2018

Yes, I also confirm it.

It has nothing to do with file system.

I am still having sleepless nights with the issue with 25 live nodes.

It is very easy to reproduce the issue, with a simple plain node with 5 LXC, and by running a cron to stop and start each LXC.

Sub7 · Mar 11, 2018

how to install 4.14.20?

denos · Mar 11, 2018

UPDATE: Proxmox has released a 4.15 kernel (see below). Please focus your testing on that kernel.

You can download the kernel (required) and headers (optional) from here:
https://www.dropbox.com/sh/k7ad34tvwadsjpv/AABGqx036UWpht1mYDMJTh5Ea?dl=0

To install:

Code:

dpkg -i linux-image-4.14.25-lxcfix-2_amd64.deb

Then reboot. If you have any trouble, you can boot back into any of your existing kernels by selecting the Advanced option on the boot menu. To confirm you are now using 4.14.25:

Code:

uname -a

To uninstall:

Code:

dpkg --purge linux-image-4.14.25-lxcfix

sha256 checksums:

Code:

a19f3b2228be2cbd64aad41049e25e73c42a64f71f7fd0f96ac1b0046f6e99ee  linux-headers-4.14.25-lxcfix-2_amd64.deb
ecfe10018a7093af3d2757041867d65b6954c0be77bb96d74e9c791de53bee0d  linux-image-4.14.25-lxcfix-2_amd64.deb

For compatibility with pve-kernel-4.13.13-6-pve I have included ZFS at version 0.7.6. This kernel is also fully patched against Meltdown and Spectre variants (full generic retpoline).

IMPORTANT: This kernel isn't Proxmox supported so don't ask Proxmox for help with any issues. The kernel is the single most critical package on your system and you assume all risk if you proceed with the install. If you decide to use this kernel, please report whether your issue was resolved on this thread.

I will remove the kernel once there is a Proxmox kernel that is confirmed to work -- or if Proxmox asks me to take it down.

fabian · Mar 12, 2018

4.15 based test/preview kernel is available on pvetest, see https://forum.proxmox.com/threads/4-15-based-test-kernel-for-pve-5-x-available.42097/ for details

Vasu Sreekumar · Mar 13, 2018

We loaded new Kernel 4.15.

Created 5 LXC guests, created cron to stop and start all 5 guests every 5 minutes.

Now it passed 6 hours, no errors yet. We are still running the test.

CPU(s)24 x Intel(R) Xeon(R) CPU L5639 @ 2.13GHz (2 Sockets)
Kernel Version Linux 4.15.3-1-pve #1 SMP PVE 4.15.3-1 (Fri, 9 Mar 2018 14:45:34 +0100)
PVE Manager Version pve-manager/5.1-46/ae8241d4

With same setup 4.13 kernal Proxmox produced error within 30-40 minutes.

denos · Mar 13, 2018

Initial testing looks good. Everything that was in the 4.14 branch should be in 4.15 so I'm expecting that this issue will be fully resolved by the new kernel. Thanks everyone.

Vasu Sreekumar · Mar 14, 2018

Now 18 hours passed, every 5 minutes all 5 guests are getting stopped and started.

No errors yet.

SO yes, this Kernel solves the issue.

FibreFoX · Mar 20, 2018

Even if my test-bed isn't that big, but I can confirm this new kernel not having any negative effect for me, and solves my problem. Thanks a lot to @denos for the amazing research! Thanks to PROXMOX-team too!

denos · Apr 18, 2018

NOTE: This is just a caution for the Proxmox kernel team and anyone that might be building their own kernels. As far as I know, the problematic change is not present in the official Proxmox 4.15 kernel.

An issue with identical symptoms has emerged in 4.16 and is patched in 4.17. See: https://github.com/lxc/lxd/issues/4468

fireon · Jun 30, 2018

Question: This only if you have ZFS as storage? Why i ask this... because on us cluster in the office (qcow2) we haven't this issue. Only on ZFS machines.

denos · Jun 30, 2018

As far as I know, it's not related to ZFS. All of the patches are in the mainline kernel (which doesn't contain any ZFS code) related to network namespacing, NFS and cgroups.

I suspect it's related to specific workloads within the containers as we only ever saw it on a hypervisors in one location (out of 5). We did see the issue once more on a 4.15.17-1-pve kernel (possibly related to my comment #56 above), but it has otherwise been good.

fireon · Jun 30, 2018

Really, really strange... hopefully this will be solved soon

mattlach · Nov 2, 2018

So,

I am on the following kernel:

Code:

Linux proxmox 4.15.18-5-pve #1 SMP PVE 4.15.18-24 (Thu, 13 Sep 2018 09:15:10 +0200) x86_64 GNU/Linux

I just shut down a container today using "pct stop 200".

I went to start it back up again with "pct start 200" and this process just sits there doing nothing until it times out.

Running the following debug command just sits at the command line forever, and nothing ever comes out in the output:

Code:

lxc-start -n 200 -F -l DEBUG -o /root/lxc-200.log

"pct list" will just sit at the command line waiting forever without providing any output.

Doing a systemctl status on 200.service does not give me anything useful to work with:

Code:

systemctl status lxc@200.service
● lxc@200.service - LXC Container: 200
   Loaded: loaded (/lib/systemd/system/lxc@.service; disabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/lxc@.service.d
           └─pve-reboot.conf
   Active: inactive (dead)
     Docs: man:lxc-start
           man:lxc

Nor does:

Code:

# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
   Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2018-10-07 16:44:26 EDT; 3 weeks 4 days ago
  Process: 6972 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
 Main PID: 7102 (pvestatd)
    Tasks: 2 (limit: 4915)
   Memory: 88.2M
      CPU: 1d 1h 58min 48.573s
   CGroup: /system.slice/pvestatd.service
           ├─ 7102 pvestatd
           └─11278 lxc-info -n 200 -p

Oct 07 16:44:25 proxmox systemd[1]: Starting PVE Status Daemon...
Oct 07 16:44:26 proxmox pvestatd[7102]: starting server
Oct 07 16:44:26 proxmox systemd[1]: Started PVE Status Daemon.
Oct 08 14:51:57 proxmox pvestatd[7102]: modified cpu set for lxc/200: 1-2,4,13
Oct 08 15:02:06 proxmox pvestatd[7102]: unable to get PID for CT 225 (not running?)
Oct 08 15:02:06 proxmox pvestatd[7102]: unable to get PID for CT 225 (not running?)

(That CT 225 reference is weird, as no 225 container exists, and I don't recall ever having one)

Seein lxc-info -n 200 here made me sspect maybe this was hung and causing trouble.

Code:

root@proxmox:~/container_config# ps -Af |grep -i lxc-info
root     11278  7102  0 14:28 ?        00:00:00 lxc-info -n 200 -p

I tried killing it, but it just instantly comes back with a new PID.

Is this the same issue, or am I dealing with something completely different?

Much appreciated

LXC container reboot fails - LXC becomes unusable

Active Member

Proxmox Staff Member

Active Member

Active Member

Active Member

Active Member

Well-Known Member

Active Member

Well-Known Member

Well-Known Member

Proxmox Staff Member

Active Member

Well-Known Member

Active Member

Active Member

Well-Known Member

Distinguished Member

Well-Known Member

Distinguished Member

Renowned Member

We value your privacy