[SOLVED] Proxmox 5.1.46 LXC cluster error Job for pve-container@101.service failed

Discussion in 'Proxmox VE: Installation and configuration' started by Vasu Sreekumar, Mar 3, 2018.

  1. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    Hi,


    We have a very serious issue with a Proxmox 5.1.46 LXC cluster on ZFS and we need urgent help.


    When somebody stops an LXC container and then tries to start it again, it does not start and gives the following error:


    Job for pve-container@101.service failed because the control process exited with error code.

    See "systemctl status pve-container@101.service" and "journalctl -xe" for details.

    TASK ERROR: command 'systemctl start pve-container@101' failed: exit code 1


    At this point the Proxmox node and all other LXC containers still respond to ping, but the GUI shows the Proxmox node and the containers as grey.


    1. From the Proxmox node console, run ps aux | grep xxx

    2. In the output, locate the process ID (yyyy) of the lxc monitor process

    3. Then run kill -9 yyyy

    4. This makes the Proxmox node and the other containers turn green in the GUI again. But the failed container stays grey and shows the same error when we try to start it.
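
    As a rough sketch of steps 1 to 3, the helper below picks the PID out of ps aux output. It assumes the monitor shows up as "[lxc monitor] /var/lib/lxc <ID>" with the container ID as the last field. find_lxc_monitor_pid is a made-up name, and kill -9 on the monitor is a last resort that still leaves the container itself broken, as step 4 says.

```shell
#!/bin/sh
# Print the PID of the "lxc monitor" process for one container ID,
# reading `ps aux`-style lines from stdin.
# Assumption: the monitor's command line ends with the container ID, e.g.
#   [lxc monitor] /var/lib/lxc 101
find_lxc_monitor_pid() {
    ctid="$1"
    # Field 2 of `ps aux` is the PID; $NF is the last field on the line.
    awk -v id="$ctid" '/lxc monitor/ && $NF == id { print $2 }'
}

# Possible use on the node (not run here):
#   pid=$(ps aux | find_lxc_monitor_pid 101)
#   [ -n "$pid" ] && kill -9 "$pid"
```

    Matching on the last field avoids picking up the grep process itself or other containers whose command line merely contains the same digits.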


    If we restart the Proxmox node, all errors are cleared and we can start the failed container as well.


    Please investigate and let us know a solution.


    Thanks,


    Vasu
     
    #1 Vasu Sreekumar, Mar 3, 2018
    Last edited: Mar 3, 2018
  2. LnxBil

    LnxBil Well-Known Member

    Joined:
    Feb 21, 2015
    Messages:
    3,789
    Likes Received:
    344
    You use the term 'cloud' a lot and this term does not exist in the Proxmox VE terminology. Please use container, node and VM to clarify what you mean.

    As far as I can interpret what you mean, the reboot on containers got stuck, is that right?
     
  3. Vasu Sreekumar

    Yes, containers. LXC based.
    Yes, after stopping an LXC container, I get this message when I start it again:

    Job for pve-container@101.service failed because the control process exited with error code.
    See "systemctl status pve-container@101.service" and "journalctl -xe" for details.
    TASK ERROR: command 'systemctl start pve-container@101' failed: exit code 1

    At this point the Proxmox IP and all other LXC containers still respond to ping, but the GUI shows grey for both the Proxmox node and the LXC containers.

    I have 4 Proxmox clusters with 5 servers each, and all have the same issue.

    Once I reboot the Proxmox server, everything is normal.
     
  4. denos

    denos Member

    Joined:
    Jul 27, 2015
    Messages:
    74
    Likes Received:
    34
  5. Vasu Sreekumar

    I like your idea.

    But is it better to wait until a new version comes out with the issue solved?

    There should be many people using LXC with Proxmox 5.1, and they must be facing the same issue.
     
  6. denos

    I expected to see a lot of other people running into this issue but there haven't been many "me too" posts. There could be many reasons:
    * LXC container reboots may be infrequent in other environments.
    * The issue may be specific to containers migrated from OpenVZ.
    * Many people may be on earlier versions of Proxmox and/or may not have the latest patches.
    * Other reasons I haven't thought of...

    Proxmox is aware of the issue as they have responded to my original thread. I do expect that whatever resolved the issue in the 4.14.20+ kernels will eventually be backported into 4.13, but I can't wait as I'm responsible for ~1000 instances (LXC and KVM) and this was causing regular outages (i.e., having to reboot the hypervisors).

    I will be checking new kernels as they're released but since we don't know the exact patch that resolved the issue, it's hard to say when pve-kernel will have the fix.
     
  7. Vasu Sreekumar

    The issue is there on fresh containers also. I took a fresh node, installed just 4 containers and left it like that. When I tried to stop and start them, the issue happened.

    I have 25 nodes, and the issue has happened on every node at least once in the last 5 days.
     
  8. Vasu Sreekumar

    After searching more on the internet: it is a kernel bug, not an LXC bug. Kernel 4.14.20 has the issue fixed. We have to wait and see when we get an updated kernel from Proxmox.

    https://github.com/lxc/lxc/issues/2141
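
    To see which nodes are still below that version, something like the following could be used. This is only a sketch: kernel_at_least is a made-up helper, 4.14.20 is simply the version mentioned in this thread, and a plain version comparison cannot tell whether a distribution kernel carries a backported fix.

```shell
#!/bin/sh
# Succeed if a kernel release such as "4.15.3-1-pve" is at least the
# given minimum version (default 4.14.20, the version named in the thread).
# Relies on `sort -V` (version sort from GNU coreutils).
kernel_at_least() {
    release="${1%%-*}"        # drop the "-1-pve" style suffix
    minimum="${2:-4.14.20}"
    lowest=$(printf '%s\n%s\n' "$release" "$minimum" | sort -V | head -n 1)
    # If the minimum sorts first (or both are equal), the release is new enough.
    [ "$lowest" = "$minimum" ]
}

# Possible use on a node (not run here):
#   kernel_at_least "$(uname -r)" || echo "kernel older than 4.14.20"
```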
     
  9. denos

    That's my bug report at LXC for this issue. It doesn't seem difficult for me to duplicate the issue either so it's strange that more people aren't complaining. Anyway, Proxmox is aware that the newer kernels resolve the issue so hopefully an updated pve-kernel in the future will take care of this.
     
  10. Vasu Sreekumar

    It is very useful.

    I reproduced the same error. It does not matter whether the container was converted from OpenVZ or newly created; all have the issue.

    And it is random.

    We need to wait for the next release.
     
  11. Vasu Sreekumar

  12. Vasu Sreekumar

    I reduced the ZFS cache from 8GB to 1GB.

    Since then the issue has not happened YET on any of the 25 nodes for the last 36 hours.
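
    The post does not say how the cache size was changed. One common way, assuming "ZFS cache" means the ARC and that the zfs_arc_max module parameter (a value in bytes) was used, is a modprobe options line. arc_max_line is a made-up helper that only prints that line, treating 1GB as 1 GiB.

```shell
#!/bin/sh
# Print a modprobe options line that caps the ZFS ARC at N GiB.
# zfs_arc_max takes a size in bytes.
arc_max_line() {
    gib="$1"
    echo "options zfs zfs_arc_max=$((gib * 1024 * 1024 * 1024))"
}

# Assumed use as root (not stated in the post, and not run here):
#   arc_max_line 1 >> /etc/modprobe.d/zfs.conf
#   update-initramfs -u     # may be needed when booting from ZFS; then reboot
# The same byte value can also be written to
#   /sys/module/zfs/parameters/zfs_arc_max
# to apply the limit at runtime.
```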
     
  13. Vasu Sreekumar

    I found a workaround that clears the issue without restarting the node.

    It is not a very good one, but it avoids a node restart.

    I will use this until the next Proxmox release comes.
     
  14. Vasu Sreekumar

    We loaded the new kernel 4.15.

    We created 5 LXC guests and a cron job that stops and starts all 5 guests every 5 minutes.

    6 hours have passed now with no errors yet. We are still running the test.

    CPU(s): 24 x Intel(R) Xeon(R) CPU L5639 @ 2.13GHz (2 Sockets)
    Kernel Version: Linux 4.15.3-1-pve #1 SMP PVE 4.15.3-1 (Fri, 9 Mar 2018 14:45:34 +0100)
    PVE Manager Version: pve-manager/5.1-46/ae8241d4

    With the same setup, Proxmox on the 4.13 kernel produced the error within 30-40 minutes.
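
    For anyone who wants to repeat this kind of test, a minimal sketch of such a stop/start job is below. It is not the exact script used here: cycle_containers, the script path and the container IDs 101 to 105 are placeholders, and only pct stop and pct start are real Proxmox commands.

```shell
#!/bin/sh
# Stop and start each listed container once and report failures.
# PCT can be overridden (e.g. PCT=echo) for a dry run without Proxmox.
PCT="${PCT:-pct}"

cycle_containers() {
    failed=0
    for ctid in "$@"; do
        "$PCT" stop "$ctid" || failed=1
        "$PCT" start "$ctid" || { echo "start failed: $ctid" >&2; failed=1; }
    done
    # Non-zero exit status if any stop or start command failed.
    return "$failed"
}

# Hypothetical crontab entry, assuming this file is saved as
# /usr/local/bin/cycle-ct.sh and ends with the line: cycle_containers "$@"
#   */5 * * * * /usr/local/bin/cycle-ct.sh 101 102 103 104 105
```

    A failed start is printed to stderr, so cron would mail it to the job's owner if mail is set up on the node.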
     
    #14 Vasu Sreekumar, Mar 13, 2018
    Last edited: Mar 13, 2018
  15. CashewCaliphate

    Joined:
    Jun 19, 2017
    Messages:
    44
    Likes Received:
    0
    For what it's worth, yeah, me too.

    Single node setup on pve 5.1 with the 4.13 kernel. All of my LXCs remain responsive and reachable, it's just the node itself for whatever reason.
     
  16. Vasu Sreekumar

    For me 18 hours passed with all 5 guests getting restarted every 5 minutes. No issues.
     
  17. CashewCaliphate

    Did you try any LXC backups that involved an NFS share or ZFS?
     
  18. Vasu Sreekumar

    No NFS, I only have ZFS. Backups run fine with ZFS, no issues with kernel 4.15.

    I have the issue on 25 live nodes with ZFS running kernel 4.13, even with no backups running.
     
  19. Vasu Sreekumar

    36 hours passed, no issues.
     
  20. Vasu Sreekumar

    48 hours passed, no issues.
     