[SOLVED] LXC container reboot fails - LXC becomes unusable

Discussion in 'Proxmox VE: Installation and configuration' started by denos, Feb 7, 2018.

  1. denos

    denos Member

    Joined:
    Jul 27, 2015
    Messages:
    72
    Likes Received:
    32
    When the container startup is hung, run:
    Code:
    grep copy_net_ns /proc/*/stack
    If that returns anything, you're having this issue. I can confirm that the issue is still present (but less frequent) on recent 4.15.x Proxmox PVE kernels. The Ubuntu kernel team acknowledged this bug:
    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779678
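    On an affected node the grep returns one or more hung tasks, along these lines (the PID will differ):
    Code:
    /proc/30588/stack:[<0>] copy_net_ns+0xab/0x220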

    The only solution provided was to use kernel 4.18. I was able to build and boot 4.18 with Proxmox, but it breaks the current version of AppArmor with multiple "Profile doesn't conform to protocol" errors. The fix is to build a newer AppArmor, but I haven't gotten that far yet.
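    If you want to see exactly where AppArmor falls over, starting a container in the foreground with debug logging makes the parser errors visible (a sketch; the container ID is just an example):
    Code:
    lxc-start -n 101 -F -l DEBUG -o /tmp/lxc-101.log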
     
    mattlach likes this.
  2. mattlach

    mattlach Member

    Joined:
    Mar 23, 2016
    Messages:
    145
    Likes Received:
    12
    Hmm.

    I will have to check this a little later.

    Does a reboot temporarily solve the issue? I could probably do that overnight, and then go another few months without running into it again.

    My use case doesn't require restarting containers regularly. They start once when the server comes up, and then pretty much run until the server needs to reboot for some reason (like a kernel update).
     
  3. denos

    denos Member

    Joined:
    Jul 27, 2015
    Messages:
    72
    Likes Received:
    32
    Yes, a reboot will clear it up -- I'm not aware of any way to recover a system in this state without a reboot. My experience has been the same as in that Ubuntu kernel bug report; it's an infrequent condition that presents like a deadlock. We typically go months between incidents on 4.15 kernels which is much better than 4.13. It's unfortunate that 4.18 seems to be the only way to completely resolve the issue.
     
    mattlach likes this.
  4. mattlach

    mattlach Member

    Joined:
    Mar 23, 2016
    Messages:
    145
    Likes Received:
    12
    Thanks for the help.

    I rebooted the server today, and it appears to be running normally again.

    Hopefully a 4.18+ PVE Kernel that fixes this issue will be made available quickly.

    I mean, I could easily compile one myself, download a mainline binary kernel, or add the kernel sources from Ubuntu's cosmic-updates repository (now at 4.18.0-10, I believe), but I don't know what magic the Proxmox team works on its kernel releases to make them well suited to use in a hypervisor. I don't want to cause any issues by doing this.
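    (For what it's worth, installing a mainline build is normally just a matter of downloading the amd64 .debs from https://kernel.ubuntu.com/~kernel-ppa/mainline/ and installing them with dpkg - roughly, with the actual file names depending on the release:)
    Code:
    dpkg -i linux-image-*.deb linux-modules-*.deb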
     
  5. Matteo Italia

    Matteo Italia New Member
    Proxmox Subscriber

    Joined:
    Feb 5, 2019
    Messages:
    4
    Likes Received:
    0
    I'm still experiencing this exact same issue on 4.15.18-10-pve; it keeps happening with the same symptoms, so the node is borderline unusable. Trying to start the containers with straight lxc-start results in the dreaded
    Code:
    lxc-start 101 20190214084514.606 ERROR    network - network.c:instantiate_veth:106 - Operation not permitted - Failed to create veth pair "veth101i0" and "vethHP9QUR"
    lxc-start 101 20190214084514.606 ERROR    network - network.c:lxc_create_network_priv:2462 - Failed to create network device
    
    I don't know whether it actually started happening in the same time frame, but the "strangest" thing in our network config is that vmbr0 is marked as VLAN-aware.

    Is there any solution yet? Given that the general consensus seems to be that 4.18 should have mostly fixed this, is there any PVE 4.18 kernel available? I looked in the testing repo but it was nowhere to be found. Would installing an Ubuntu "mainline kernel" be problematic? Or are there any older kernels in the repo which don't exhibit the problem?
     
  6. denos

    denos Member

    Joined:
    Jul 27, 2015
    Messages:
    72
    Likes Received:
    32
    As far as I know, this issue is only resolved by 4.18+. You may be able to use a kernel from Ubuntu or Debian Backports, but I didn't have any luck due to missing ZFS support and/or hardware modules in those kernels. I'm currently building my own kernels to track 4.19 + ZFS + hardware I need. Any kernel 4.18+ that I tried breaks AppArmor. To fix AppArmor I compiled and installed libapparmor and apparmor_parser from source for version 2.13.2.
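    For anyone attempting the same, the AppArmor build went roughly like this (a sketch, assuming the upstream source layout of the 2.13.2 release; the install prefix may need adjusting):
    Code:
    git clone --branch v2.13.2 https://gitlab.com/apparmor/apparmor.git
    cd apparmor/libraries/libapparmor
    sh ./autogen.sh && ./configure --prefix=/usr && make && make install
    cd ../../parser
    make && make install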

    The good news is that I haven't had a single problem on the 4.18+ kernels -- mirroring the findings of that Ubuntu kernel bug report above. Unfortunately, this isn't a practical process for many Proxmox users. It would be ideal if Proxmox would package these changes but 4.18+ kernels aren't core for Debian Stretch (more work to maintain) and this is a rare race condition so I don't think enough of their user base is complaining. We were seeing it on about 10% of our infrastructure over the course of 3 months.
     
    Matteo Italia and fireon like this.
  7. Matteo Italia

    Matteo Italia New Member
    Proxmox Subscriber

    Joined:
    Feb 5, 2019
    Messages:
    4
    Likes Received:
    0
    Thank you for the reply... missing ZFS isn't a big issue, and I think that our node has "normal enough" hardware to be supported by a stock kernel. My main fear is indeed about AppArmor, but wouldn't it be enough to "steal" libapparmor and apparmor_parser from latest LTS Ubuntu?
     
  8. denos

    denos Member

    Joined:
    Jul 27, 2015
    Messages:
    72
    Likes Received:
    32
    If AppArmor doesn't work, you can boot back into your current kernel and it will be fine. I suspect you won't be able to take libapparmor and apparmor (the package that contains apparmor_parser) from Ubuntu's repos without breaking a bunch of dependencies.

    If you decide to try the Ubuntu kernel, please report back here on your findings.
     
  9. Matteo Italia

    Matteo Italia New Member
    Proxmox Subscriber

    Joined:
    Feb 5, 2019
    Messages:
    4
    Likes Received:
    0
    Didn't try it yet, although today I updated the packages and saw a new kernel (4.15.18-11-pve, AKA 4.15.18-33). Is there anywhere I could see a changelog?
     
  10. fabian

    fabian Proxmox Staff Member
    Staff Member

    Joined:
    Jan 7, 2016
    Messages:
    3,183
    Likes Received:
    492
    "apt changelog PACKAGE", or on the Updates tab on the web interface, or next to the binary .deb file on our repository servers ;)
     
    Matteo Italia likes this.
  11. Matteo Italia

    Matteo Italia New Member
    Proxmox Subscriber

    Joined:
    Feb 5, 2019
    Messages:
    4
    Likes Received:
    0
    I'm used to pressing `C` in aptitude; that didn't turn up anything, so I assumed the built-in apt changelog mechanism was broken/unused for that package. However, apt changelog pve-kernel-4.15.18-11-pve did work - bizarre!

    That being said, the new pve-kernel updates the sources to Ubuntu-4.15.0-46.49, whose changelog in turn is https://launchpad.net/ubuntu/+source/linux/4.15.0-46.49; it looks like there are some fixes to the network stack, but nothing that seems particularly related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779678.
     
  12. seneca214

    seneca214 New Member

    Joined:
    Dec 3, 2012
    Messages:
    22
    Likes Received:
    3
    We're experiencing this same issue on various Proxmox v5.3 servers, and our paid Proxmox support hasn't been able to provide a solution. What is the solution here, since this post is marked as solved?
     
    weehooey likes this.
  13. weehooey

    weehooey New Member
    Proxmox Subscriber

    Joined:
    Mar 11, 2019
    Messages:
    3
    Likes Received:
    1
    We are having the same issue. Running this code:
    Code:
    grep copy_net_ns /proc/*/stack
    Provides this output:
    Code:
    /proc/30588/stack:[<0>] copy_net_ns+0xab/0x220
    Our version:
    Code:
    pve-manager/5.3-11/d4907f84 (running kernel: 4.15.18-11-pve)
    Following these steps triggers it:
    1. Create a CentOS 7 LXC.
    2. Try to log in via the console; it doesn't work.
    3. Destroy the LXC.
    4. Can no longer log in to the GUI.
    I think "SOLVED" should be removed from this thread.
     
    seneca214 likes this.
  14. seneca214

    seneca214 New Member

    Joined:
    Dec 3, 2012
    Messages:
    22
    Likes Received:
    3
    Proxmox's support unfortunately hasn't been any help with this. We've been trying to at least find a workaround on our Proxmox servers. As a test, referencing this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1765980, we set up ip6tables on one of our nodes to block all IPv6 traffic. Since the change, we haven't been able to reproduce the issue on our test server with automated shutdown/startup of LXC containers (this server previously exhibited the issue). I'd be curious to hear whether anyone else can try a similar approach and still reproduce the issue afterwards.
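    For anyone who wants to run a similar soak test, a minimal loop along these lines will cycle a container until the hang appears (the VMID and sleep interval are illustrative):
    Code:
    # cycle a test container; when the bug triggers, pct start will hang
    while true; do
        pct stop 101 && pct start 101
        # check for the stuck task (run from a second shell if pct start hangs)
        grep -l copy_net_ns /proc/*/stack && break
        sleep 30
    done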

    It would be helpful to get an update from Proxmox about this issue and remove 'SOLVED' from the thread assuming there isn't a solution here I'm missing.
     
  15. weehooey

    weehooey New Member
    Proxmox Subscriber

    Joined:
    Mar 11, 2019
    Messages:
    3
    Likes Received:
    1
    We would be happy to try it on our test node. Can you share the exact ip6tables rules you implemented?
     
  16. seneca214

    seneca214 New Member

    Joined:
    Dec 3, 2012
    Messages:
    22
    Likes Received:
    3
    /etc/network/ip6tables.up.rules contains:
    Code:
    *filter
    :INPUT DROP [0:0]
    :FORWARD DROP [0:0]
    :OUTPUT DROP [1:56]
    COMMIT
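    To load the file manually (one way to do it, using the stock netfilter tooling):
    Code:
    ip6tables-restore < /etc/network/ip6tables.up.rules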

    After applying, running ip6tables -L shows:
    Code:
    Chain INPUT (policy DROP)
    target prot opt source destination

    Chain FORWARD (policy DROP)
    target prot opt source destination

    Chain OUTPUT (policy DROP)
    target prot opt source destination
     
  17. weehooey

    weehooey New Member
    Proxmox Subscriber

    Joined:
    Mar 11, 2019
    Messages:
    3
    Likes Received:
    1
    @seneca214 two things:
    1. After applying the ip6tables rules, we do not seem to be having the issue.
    2. We are having trouble getting the ip6tables changes to stick after a reboot. Usually we use ip6tables-save, but none of the file names/locations seem to work. Thinking it might be the Proxmox firewall, we looked there, but there is no easy way to block only IPv6 short of disabling it entirely: https://forum.proxmox.com/threads/what-do-i-need-to-do-to-disable-ipv6.42466/
    Running:
    Code:
    ip6tables -L
    we get:
    Code:
    Chain INPUT (policy DROP)
    target     prot opt source               destination        
    
    Chain FORWARD (policy DROP)
    target     prot opt source               destination        
    
    Chain OUTPUT (policy DROP)
    target     prot opt source               destination     
    How do you get your changes to stick after rebooting the node?
     
  18. seneca214

    seneca214 New Member

    Joined:
    Dec 3, 2012
    Messages:
    22
    Likes Received:
    3
    @weehooey

    1) That's great to hear. Let's hope others can help test and confirm.

    2) This is a change we only tested today, and we've been running the rules manually. I'll get back to you on how we set it up to persist across reboots, assuming this does indeed work, until Proxmox comes out with an updated kernel or a better solution.
     
    weehooey likes this.
  19. seneca214

    seneca214 New Member

    Joined:
    Dec 3, 2012
    Messages:
    22
    Likes Received:
    3
    @weehooey on our test server, we've installed the iptables-persistent package. This keeps a file in /etc/iptables/rules.v6 that automatically loads on boot.
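    For anyone replicating this, the sequence is roughly (assuming the rules are already active in the running kernel):
    Code:
    apt-get install iptables-persistent
    ip6tables-save > /etc/iptables/rules.v6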
     
    weehooey likes this.
  20. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    797
    Likes Received:
    65
    Thanks for picking this issue up again - it obviously is affecting users!
    The finding that blocking IPv6 mitigates the issue (by preventing it from being triggered) is quite helpful, and could explain why rather few reports and reproducers came in for this case.

    I'll try to reproduce the problem locally and update the bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=1943
     