LXC container reboot fails - LXC becomes unusable

When the startup is hung, do
Code:
grep copy_net_ns /proc/*/stack

If that returns anything, you're having this issue. I can confirm that the issue is still present (but less frequent) on recent 4.15.x Proxmox PVE kernels. The Ubuntu kernel team acknowledged this bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779678

The only solution provided was to use kernel 4.18. I was able to build and boot 4.18 with Proxmox, but it breaks the current version of AppArmor with multiple "Profile doesn't conform to protocol" errors. The solution is to build a new AppArmor, but I haven't gotten that far yet.
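For anyone checking their own node, a small sketch of how to go from a match to the offending process (run as root; the PID below is just an example, not from a real system):
Code:
# list the PIDs whose kernel stack currently contains copy_net_ns
grep -l copy_net_ns /proc/*/stack

# then look at the process behind a matching PID, e.g. 12345
ps -o pid,ppid,stat,wchan,cmd -p 12345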
 

Hmm.

I will have to check this a little later.

Does a reboot temporarily solve the issue? I could probably do that overnight, and then go another few months without running into it again.

My use case doesn't require restarting containers regularly. They start once when the server goes up, and then pretty much run until the server needs to reboot for some reason (like a kernel update)
 

Yes, a reboot will clear it up -- I'm not aware of any way to recover a system in this state without a reboot. My experience has been the same as in that Ubuntu kernel bug report; it's an infrequent condition that presents like a deadlock. We typically go months between incidents on 4.15 kernels which is much better than 4.13. It's unfortunate that 4.18 seems to be the only way to completely resolve the issue.
 

Thanks for the help.

I rebooted the server today, and it appears to be running normally again.

Hopefully a 4.18+ PVE Kernel that fixes this issue will be made available quickly.

I mean, I could easily compile my own kernel, download a mainline binary kernel, or add the kernel sources from Ubuntu's cosmic-updates repository (now at 4.18.0-10, I believe), but I don't know what magic the Proxmox team does to their kernel releases to make them well suited to being used in a hypervisor. I don't want to cause any issues by doing this.
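If anyone does want to experiment with an Ubuntu mainline build, the rough procedure would look something like the sketch below (the exact file names depend on the build you pick, and the AppArmor breakage mentioned earlier in the thread would presumably still apply):
Code:
# 1) pick a 4.18+ build from https://kernel.ubuntu.com/~kernel-ppa/mainline/
#    and download the amd64 "generic" linux-modules and linux-image(-unsigned) .debs
# 2) install both and refresh the bootloader config
dpkg -i ./linux-modules-*.deb ./linux-image-*.deb
update-grub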
 
I'm still experiencing this exact same issue on 4.15.18-10-pve; it keeps happening with the same symptoms, so the node is borderline unusable. Trying to start the containers with straight lxc-start results in the dreaded
Code:
lxc-start 101 20190214084514.606 ERROR    network - network.c:instantiate_veth:106 - Operation not permitted - Failed to create veth pair "veth101i0" and "vethHP9QUR"
lxc-start 101 20190214084514.606 ERROR    network - network.c:lxc_create_network_priv:2462 - Failed to create network device

I don't know if it actually started to happen in the same time frame, but the "strangest" thing in our network config is that the vmbr0 is marked as VLAN-aware.

Is there any solution yet? Given that the general consensus seems to be that 4.18 should have mostly fixed this, is there any PVE 4.18 kernel available? I looked in the testing repo but it was nowhere to be found. Would installing an Ubuntu "mainline kernel" be problematic? Or are there any older kernels in the repo which don't exhibit the problem?
 

As far as I know, this issue is only resolved by 4.18+. You may be able to use a kernel from Ubuntu or Debian Backports, but I didn't have any luck due to missing ZFS support and/or hardware modules in those kernels. I'm currently building my own kernels to track 4.19 + ZFS + hardware I need. Any kernel 4.18+ that I tried breaks AppArmor. To fix AppArmor I compiled and installed libapparmor and apparmor_parser from source for version 2.13.2.
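For anyone attempting the same, a from-source build of libapparmor and apparmor_parser typically looks roughly like the sketch below (directory names match the upstream apparmor source tree; the autogen step and configure flags are assumptions and may need adjusting on your system):
Code:
# starting inside an unpacked apparmor-2.13.2 source tree

# build and install libapparmor first (apparmor_parser links against it)
cd libraries/libapparmor
sh ./autogen.sh          # only needed if no ./configure script is shipped
./configure --prefix=/usr
make
make install

# then build and install apparmor_parser
cd ../../parser
make
make install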

The good news is that I haven't had a single problem on the 4.18+ kernels -- mirroring the findings of that Ubuntu kernel bug report above. Unfortunately, this isn't a practical process for many Proxmox users. It would be ideal if Proxmox would package these changes, but 4.18+ kernels aren't core for Debian Stretch (more work to maintain), and this is a rare race condition, so I don't think enough of their user base is complaining. We were seeing it on about 10% of our infrastructure over the course of 3 months.
 
Thank you for the reply... missing ZFS isn't a big issue, and I think that our node has "normal enough" hardware to be supported by a stock kernel. My main fear is indeed about AppArmor, but wouldn't it be enough to "steal" libapparmor and apparmor_parser from the latest Ubuntu LTS?
 
If AppArmor doesn't work you can boot back into your current kernel and it will be fine. I suspect you won't be able to take libapparmor and apparmor (the package that contains apparmor_parser) from Ubuntu's repos without breaking a bunch of dependencies.

If you decide to try the Ubuntu kernel, please report back here on your findings.
 
Didn't try it yet, although today I updated the packages and saw a new kernel (4.15.18-11-pve, AKA 4.15.18-33). Is there anywhere I could see a changelog?

"apt changelog PACKAGE", or on the Updates tab on the web interface, or next to the binary .deb file on our repository servers ;)
 
I'm used to pressing `C` in aptitude; that didn't turn up anything, so I assumed the built-in apt changelog mechanism was broken/unused for that package. However, apt changelog pve-kernel-4.15.18-11-pve did work - bizarre!

That being said, the new pve-kernel updates the sources to Ubuntu-4.15.0-46.49, whose changelog in turn is https://launchpad.net/ubuntu/+source/linux/4.15.0-46.49; it looks like there are some fixes to the network stack, but nothing that seems particularly related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779678.
 
We're experiencing this same issue with various Proxmox v5.3 servers and have paid Proxmox support that hasn't been able to provide a solution for us. What is the solution here since this post is marked as solved?
 
We are having the same issue. Running this code:
Code:
grep copy_net_ns /proc/*/stack
Provides this output:
Code:
/proc/30588/stack:[<0>] copy_net_ns+0xab/0x220
Our version:
Code:
pve-manager/5.3-11/d4907f84 (running kernel: 4.15.18-11-pve)

Following these steps:
  1. Create a CentOS 7 LXC.
  2. Try to log in via the console (and can't).
  3. Destroy the LXC.
  4. Now we cannot log in to the GUI.
I think "SOLVED" should be removed from this thread.
 
Proxmox's support unfortunately hasn't been any help with this. We've been trying to at least find a workaround on our Proxmox servers. Simply as a test, referencing this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1765980, we set up ip6tables on one of our nodes to block all IPv6 traffic. Since the change, we haven't been able to reproduce the issue on our test server with automated LXC shutdown/startup of containers (this server previously exhibited the issue). I'd be curious to hear if anyone else is able to try a similar approach and see if they can reproduce the issue afterwards.

It would be helpful to get an update from Proxmox about this issue and remove 'SOLVED' from the thread assuming there isn't a solution here I'm missing.
 
/etc/network/ip6tables.up.rules contains:
Code:
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT DROP [1:56]
COMMIT
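For anyone following along, a file in that format can be loaded with ip6tables-restore (a minimal sketch, not necessarily exactly how it was applied here):
Code:
ip6tables-restore < /etc/network/ip6tables.up.rules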

After applying, running ip6tables -L shows:
Code:
Chain INPUT (policy DROP)
target     prot opt source               destination

Chain FORWARD (policy DROP)
target     prot opt source               destination

Chain OUTPUT (policy DROP)
target     prot opt source               destination
 
@seneca214 two things:
  1. After updating the ip6tables, we do not seem to be having the issue.
  2. We are having trouble getting the ip6tables changes to stick after a reboot. Usually we use ip6tables-save, but none of the file names/locations we tried seem to work. Thinking the Proxmox firewall might be in the way, we could not see an easy way to block only IPv6 short of disabling it entirely: https://forum.proxmox.com/threads/what-do-i-need-to-do-to-disable-ipv6.42466/
Running:
Code:
ip6tables -L
we get:
Code:
Chain INPUT (policy DROP)
target     prot opt source               destination        

Chain FORWARD (policy DROP)
target     prot opt source               destination        

Chain OUTPUT (policy DROP)
target     prot opt source               destination

How do you get your changes to stick after rebooting the node?
 
@weehooey

1) That's great to hear. Let's hope others can help test and confirm.

2) This is a change we've only tested today, and so far we've been applying the rules manually. I'll get back to you on how we set it up to persist across reboots, assuming this does indeed work, until Proxmox comes out with an updated kernel or a better solution.
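In the meantime, one common way to make a rules file like the one above persist across reboots (a sketch, assuming ifupdown manages the interfaces and that pve-firewall isn't rewriting the IPv6 rules) is a pre-up hook in /etc/network/interfaces:
Code:
# /etc/network/interfaces - inside the stanza of the bridge that comes up at boot
auto vmbr0
iface vmbr0 inet static
    # ... keep the existing address/bridge options as they are ...
    pre-up ip6tables-restore < /etc/network/ip6tables.up.rules

Alternatively, the iptables-persistent / netfilter-persistent packages restore whatever is saved in /etc/iptables/rules.v6 at boot.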
 
Thanks for picking that issue up again - since it obviously is affecting users!
The observation that blocking IPv6 mitigates the issue (by not triggering it) is quite helpful, and could explain why only rather few reports and reproducers have come in for this case.

I'll try to reproduce the problem locally and update the bug report: https://bugzilla.proxmox.com/show_bug.cgi?id=1943
 
