[SOLVED] LXC container reboot fails - LXC becomes unusable

Discussion in 'Proxmox VE: Installation and configuration' started by denos, Feb 7, 2018.

  1. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    I found a workaround that avoids a node restart to deal with the issue on the 4.13 kernel.

    It is not a very good one, but it avoids a node restart.

    I will use this until the next Proxmox kernel is released.

    (I posted this in another thread as well.)
     
  2. fabian

    fabian Proxmox Staff Member
    Staff Member

    Joined:
    Jan 7, 2016
    Messages:
    3,183
    Likes Received:
    492
    All the linked commits are already cherry-picked/backported into our 4.13.13-6-pve / -41 kernel (except for the NFS one you bisected to, which we will include in the next round of upgrades). We cannot reproduce this issue at all - we've been trying with various setups and machines without any luck. Unfortunately this area is one where we have seen frequent regressions in the past, so it is likely there is some race/refcount leak that is triggered by some yet-unknown factor.

    There will likely be a preview/test kernel based on Ubuntu Bionic's 4.15 kernel some time next week; based on the information reported so far, it should fix this specific regression.
     
    FibreFoX likes this.
  3. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    I have 4.13.13-6-pve.

    No NFS, just ZFS.

    I have 25 nodes live; every day at least 3 nodes go down with this issue.

    It is a nightmare.

    root@Q172:~# systemctl status pve-container@103.service
    pve-container@103.service - PVE LXC Container: 103
    Loaded: loaded (/lib/systemd/system/pve-container@.service; static; vendor preset: enabled)
    Active: failed (Result: timeout) since Thu 2018-03-08 01:11:40 HST; 14min ago
    Docs: man:lxc-start
    man:lxc
    man:pct
    Process: 18338 ExecStart=/usr/bin/lxc-start -n 103 (code=killed, signal=TERM)
    Tasks: 0 (limit: 4915)
    CGroup: /system.slice/system-pve\x2dcontainer.slice/pve-container@103.service

    Mar 08 01:10:10 Q172 systemd[1]: Starting PVE LXC Container: 103...
    Mar 08 01:11:40 Q172 systemd[1]: pve-container@103.service: Start operation timed out. Terminating.
    Mar 08 01:11:40 Q172 systemd[1]: Failed to start PVE LXC Container: 103.
    Mar 08 01:11:40 Q172 systemd[1]: pve-container@103.service: Unit entered failed state.
    Mar 08 01:11:40 Q172 systemd[1]: pve-container@103.service: Failed with result 'timeout'.
     
    #43 Vasu Sreekumar, Mar 8, 2018
    Last edited: Mar 8, 2018
  4. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    I have this result also.

    root@Q172:~# grep copy_net_ns /proc/*/stack
    /proc/10436/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/10464/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/11425/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/11470/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/11475/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/11887/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/12256/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/12487/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/1252/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/12865/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/12957/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/13459/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/13516/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/13708/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/13964/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/14339/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/14372/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/14389/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
    /proc/1469/stack:[<ffffffff9cddfd1b>] copy_net_ns+0xab/0x220
     
  5. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    Anyway, I have a temporary workaround.

    When a node fails, migrate its guests to another node in the cluster.
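    For one container, that could look something like this (just a sketch; the CT ID 103 and the target node name node2 are placeholders, and it assumes the container is stopped and its storage is reachable from the target node):
    Code:
    # offline-migrate the affected (stopped) container to another node in the cluster
    pct migrate 103 node2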

    Within 1-2 days the node heals itself from this error.
     
  6. FibreFoX

    FibreFoX New Member

    Joined:
    Feb 26, 2018
    Messages:
    10
    Likes Received:
    2
    Hm ... maybe this is correlated with the system Proxmox is running on? Currently we are running Proxmox on only one node, but we are planning to move other systems to Proxmox too. There were no special customizations after the installation wizard (we just added additional network settings, set up LDAP for login auth, and created the cluster configuration).

    Here are the specs of the vhost where I'm running into LXC containers not responding sometimes:

    CPU:
    8 x AMD Opteron(tm) Processor 4284 (1 Socket)

    Kernel version:
    Linux 4.13.13-6-pve #1 SMP PVE 4.13.13-41 (Wed, 21 Feb 2018 10:07:54 +0100)

    PVE Manager Version:
    pve-manager/5.1-46/ae8241d4

    Drives:
    2x 4TB Western Digital drives connected via SATA (without any additional RAID controller in between)

    Storage was configured with the installer wizard to run ZFS in mirror mode (RAID1). First I created some VMs (QEMU/KVM); they were running normally, doing their work. Later on I created some containers using the "debian-9.0-standard_9.3-1_amd64" template provided by the Proxmox download servers.

    Trying to reproduce this on a separate toy machine, which has an Intel CPU, does not result in hanging LXC containers.

    There is another user/thread in the German part of this forum (https://forum.proxmox.com/threads/seit-neuesten-failed-to-start-pve-lxc-container.41815/) who does NOT use ZFS but seems to have the same problems, so this might not be ZFS-related after all.

    Is there anything I can do to provide more details?
     
    #46 FibreFoX, Mar 9, 2018
    Last edited: Mar 9, 2018
  7. denos

    denos Member

    Joined:
    Jul 27, 2015
    Messages:
    72
    Likes Received:
    32
    I've seen this on multiple Xeon-based servers. The problem occurs in the kernel's network namespace management, so I don't think it's an issue with the filesystem choice, CPU architecture, etc. Those may feed into a timing component that is exacerbated by particular combinations, but the issue is fully resolved in recent kernels, so that has to be where the bug exists (the kernel).

    Fabian mentioned a 4.15-based kernel is on the way, and it will have everything from the 4.14 line, so that should take care of the problem. My offer still stands to provide packaged 4.14 kernels to anyone who wants to resolve the issue immediately.
     
    Vasu Sreekumar likes this.
  8. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    Yes, I also confirm it.

    It has nothing to do with the file system.

    I am still having sleepless nights over this issue, with 25 live nodes.

    It is very easy to reproduce the issue on a simple plain node with 5 LXC containers, by running a cron job that stops and starts each container; see the sketch below.
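    A minimal reproducer along these lines (just a sketch; the container IDs 101-105 and the cron file path are illustrative, not my exact setup):
    Code:
    # /etc/cron.d/lxc-cycle -- stop and start five test containers every 5 minutes
    */5 * * * * root for ct in 101 102 103 104 105; do pct stop $ct; pct start $ct; done
    On a node running the affected 4.13 kernel, this eventually leads to the hung/failed container starts described above.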
     
  9. Sub7

    Sub7 New Member

    Joined:
    Jan 13, 2018
    Messages:
    6
    Likes Received:
    1
    How do I install 4.14.20?
     
  10. denos

    denos Member

    Joined:
    Jul 27, 2015
    Messages:
    72
    Likes Received:
    32
    UPDATE: Proxmox has released a 4.15 kernel (see below). Please focus your testing on that kernel.

    You can download the kernel (required) and headers (optional) from here:
    https://www.dropbox.com/sh/k7ad34tvwadsjpv/AABGqx036UWpht1mYDMJTh5Ea?dl=0

    To install:
    Code:
    dpkg -i linux-image-4.14.25-lxcfix-2_amd64.deb
    Then reboot. If you have any trouble, you can boot back into any of your existing kernels by selecting the Advanced option on the boot menu. To confirm you are now using 4.14.25:
    Code:
    uname -a
    To uninstall:
    Code:
    dpkg --purge linux-image-4.14.25-lxcfix
    sha256 checksums:
    Code:
    a19f3b2228be2cbd64aad41049e25e73c42a64f71f7fd0f96ac1b0046f6e99ee  linux-headers-4.14.25-lxcfix-2_amd64.deb
    ecfe10018a7093af3d2757041867d65b6954c0be77bb96d74e9c791de53bee0d  linux-image-4.14.25-lxcfix-2_amd64.deb
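    To verify the downloads before installing (a quick check, assuming both .deb files are in your current directory):
    Code:
    sha256sum linux-headers-4.14.25-lxcfix-2_amd64.deb linux-image-4.14.25-lxcfix-2_amd64.deb
    Compare the output against the sums above before running dpkg -i.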
    For compatibility with pve-kernel-4.13.13-6-pve I have included ZFS at version 0.7.6. This kernel is also fully patched against Meltdown and Spectre variants (full generic retpoline).

    IMPORTANT: This kernel isn't supported by Proxmox, so don't ask Proxmox for help with any issues. The kernel is the single most critical package on your system, and you assume all risk if you proceed with the install. If you decide to use this kernel, please report on this thread whether it resolved your issue.

    I will remove the kernel once there is a Proxmox kernel that is confirmed to work -- or if Proxmox asks me to take it down.
     
    #50 denos, Mar 11, 2018
    Last edited: Mar 13, 2018
    FibreFoX likes this.
  11. fabian

    fabian Proxmox Staff Member
    Staff Member

    Joined:
    Jan 7, 2016
    Messages:
    3,183
    Likes Received:
    492
    FibreFoX likes this.
  12. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    We loaded the new 4.15 kernel.

    Created 5 LXC guests and a cron job to stop and start all 5 guests every 5 minutes.

    6 hours have now passed with no errors yet. We are still running the test.

    CPU(s): 24 x Intel(R) Xeon(R) CPU L5639 @ 2.13GHz (2 Sockets)
    Kernel Version: Linux 4.15.3-1-pve #1 SMP PVE 4.15.3-1 (Fri, 9 Mar 2018 14:45:34 +0100)
    PVE Manager Version: pve-manager/5.1-46/ae8241d4

    With the same setup, the 4.13 kernel produced the error within 30-40 minutes.
     
    #52 Vasu Sreekumar, Mar 13, 2018
    Last edited: Mar 13, 2018
    fireon and FibreFoX like this.
  13. denos

    denos Member

    Joined:
    Jul 27, 2015
    Messages:
    72
    Likes Received:
    32
    Initial testing looks good. Everything that was in the 4.14 branch should be in 4.15 so I'm expecting that this issue will be fully resolved by the new kernel. Thanks everyone.
     
    FibreFoX likes this.
  14. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    18 hours have now passed, with all 5 guests being stopped and started every 5 minutes.

    No errors yet.

    So yes, this kernel solves the issue.
     
    FibreFoX likes this.
  15. FibreFoX

    FibreFoX New Member

    Joined:
    Feb 26, 2018
    Messages:
    10
    Likes Received:
    2
    Even if my test bed isn't that big, I can confirm that this new kernel has no negative effects for me and solves my problem. Thanks a lot to @denos for the amazing research! Thanks to the Proxmox team too!
     
  16. denos

    denos Member

    Joined:
    Jul 27, 2015
    Messages:
    72
    Likes Received:
    32
    NOTE: This is just a caution for the Proxmox kernel team and anyone that might be building their own kernels. As far as I know, the problematic change is not present in the official Proxmox 4.15 kernel.

    An issue with identical symptoms has emerged in 4.16 and is patched in 4.17. See: https://github.com/lxc/lxd/issues/4468
     
    FibreFoX likes this.
  17. fireon

    fireon Well-Known Member
    Proxmox Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,832
    Likes Received:
    162
    Question: does this happen only if you have ZFS as storage? I ask because on our cluster in the office (qcow2) we don't have this issue, only on the ZFS machines.
     
  18. denos

    denos Member

    Joined:
    Jul 27, 2015
    Messages:
    72
    Likes Received:
    32
    As far as I know, it's not related to ZFS. All of the patches are in the mainline kernel (which doesn't contain any ZFS code) and relate to network namespacing, NFS and cgroups.

    I suspect it's related to specific workloads within the containers, as we only ever saw it on hypervisors in one location (out of 5). We did see the issue once more on a 4.15.17-1-pve kernel (possibly related to my comment #56 above), but it has otherwise been good.
     
  19. fireon

    fireon Well-Known Member
    Proxmox Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,832
    Likes Received:
    162
    Really, really strange... hopefully this will be solved soon :)
     
  20. mattlach

    mattlach Member

    Joined:
    Mar 23, 2016
    Messages:
    145
    Likes Received:
    12
    So,

    I am on the following kernel:

    Code:
    Linux proxmox 4.15.18-5-pve #1 SMP PVE 4.15.18-24 (Thu, 13 Sep 2018 09:15:10 +0200) x86_64 GNU/Linux
    I just shut down a container today using "pct stop 200".

    I went to start it back up again with "pct start 200" and this process just sits there doing nothing until it times out.

    Running the following debug command just sits at the command line forever, and nothing ever comes out in the output:
    Code:
    lxc-start -n 200 -F -l DEBUG -o /root/lxc-200.log
    "pct list" will just sit at the command line waiting forever without providing any output.

    Running systemctl status on lxc@200.service does not give me anything useful to work with:

    Code:
    systemctl status lxc@200.service
    ● lxc@200.service - LXC Container: 200
       Loaded: loaded (/lib/systemd/system/lxc@.service; disabled; vendor preset: enabled)
      Drop-In: /lib/systemd/system/lxc@.service.d
               └─pve-reboot.conf
       Active: inactive (dead)
         Docs: man:lxc-start
               man:lxc
    Nor does:

    Code:
    # systemctl status pvestatd
    ● pvestatd.service - PVE Status Daemon
       Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
       Active: active (running) since Sun 2018-10-07 16:44:26 EDT; 3 weeks 4 days ago
      Process: 6972 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
     Main PID: 7102 (pvestatd)
        Tasks: 2 (limit: 4915)
       Memory: 88.2M
          CPU: 1d 1h 58min 48.573s
       CGroup: /system.slice/pvestatd.service
               ├─ 7102 pvestatd
               └─11278 lxc-info -n 200 -p
    
    Oct 07 16:44:25 proxmox systemd[1]: Starting PVE Status Daemon...
    Oct 07 16:44:26 proxmox pvestatd[7102]: starting server
    Oct 07 16:44:26 proxmox systemd[1]: Started PVE Status Daemon.
    Oct 08 14:51:57 proxmox pvestatd[7102]: modified cpu set for lxc/200: 1-2,4,13
    Oct 08 15:02:06 proxmox pvestatd[7102]: unable to get PID for CT 225 (not running?)
    Oct 08 15:02:06 proxmox pvestatd[7102]: unable to get PID for CT 225 (not running?)
    (That CT 225 reference is weird, as no 225 container exists, and I don't recall ever having one)

    Seeing lxc-info -n 200 here made me suspect that maybe this was hung and causing trouble.

    Code:
    root@proxmox:~/container_config# ps -Af |grep -i lxc-info
    root     11278  7102  0 14:28 ?        00:00:00 lxc-info -n 200 -p
    I tried killing it, but it just instantly comes back with a new PID.
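    To check whether this matches the copy_net_ns hang described earlier in this thread, I assume the same diagnostic posted above would apply:
    Code:
    grep copy_net_ns /proc/*/stack
    If processes show up stuck in copy_net_ns, it's presumably the same namespace hang; if not, this may be something else entirely.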

    Is this the same issue, or am I dealing with something completely different?

    Much appreciated
     