LXC container reboot fails - LXC becomes unusable

When the startup is hung, do
Code:
grep copy_net_ns /proc/*/stack

If that returns anything, you're having this issue. I can confirm that the issue is still present (but less frequent) on recent 4.15.x Proxmox PVE kernels. The Ubuntu kernel team acknowledged this bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779678

The only solution provided was to use kernel 4.18. I was able to build and boot 4.18 with Proxmox, but it breaks the current version of AppArmor with multiple "Profile doesn't conform to protocol" errors. The solution is to build a new AppArmor, but I haven't gotten that far yet.
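For anyone checking their own node, a small sketch of how to go from a match to the offending process (run as root; the PID below is just an example, not from a real system):
Code:
# list the PIDs whose kernel stack currently contains copy_net_ns
grep -l copy_net_ns /proc/*/stack

# then look at the process behind a matching PID, e.g. 12345
ps -o pid,ppid,stat,wchan,cmd -p 12345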
 

Hmm.

I will have to check this a little later.

Does a reboot temporarily solve the issue? I could probably do that overnight, and then go another few months without running into it again.

My use case doesn't require restarting containers regularly. They start once when the server goes up, and then pretty much run until the server needs to reboot for some reason (like a kernel update)
 

Yes, a reboot will clear it up -- I'm not aware of any way to recover a system in this state without a reboot. My experience has been the same as in that Ubuntu kernel bug report; it's an infrequent condition that presents like a deadlock. We typically go months between incidents on 4.15 kernels which is much better than 4.13. It's unfortunate that 4.18 seems to be the only way to completely resolve the issue.
 

Thanks for the help.

I rebooted the server today, and it appears to be running normally again.

Hopefully a 4.18+ PVE Kernel that fixes this issue will be made available quickly.

I mean, I could easily compile my own kernel, download a mainline binary kernel, or add the kernel sources from Ubuntu's cosmic-updates repository (now at 4.18.0-10, I believe), but I don't know what magic the Proxmox team does to their kernel releases to make them well suited to being used in a hypervisor. I don't want to cause any issues by doing this.
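If anyone does want to experiment with an Ubuntu mainline build, the rough procedure would look something like the sketch below (the exact file names depend on the build you pick, and the AppArmor breakage mentioned earlier in the thread would presumably still apply):
Code:
# 1) pick a 4.18+ build from https://kernel.ubuntu.com/~kernel-ppa/mainline/
#    and download the amd64 "generic" linux-modules and linux-image(-unsigned) .debs
# 2) install both and refresh the bootloader config
dpkg -i ./linux-modules-*.deb ./linux-image-*.deb
update-grub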
 
I'm still experiencing this exact same issue on 4.15.18-10-pve; it keeps happening with the same symptoms, so the node is borderline unusable. Trying to start the containers with straight lxc-start results in the dreaded
Code:
lxc-start 101 20190214084514.606 ERROR    network - network.c:instantiate_veth:106 - Operation not permitted - Failed to create veth pair "veth101i0" and "vethHP9QUR"
lxc-start 101 20190214084514.606 ERROR    network - network.c:lxc_create_network_priv:2462 - Failed to create network device

I don't know if it actually started to happen in the same time frame, but the "strangest" thing in our network config is that the vmbr0 is marked as VLAN-aware.

Is there any solution yet? Given that the general consensus seems to be that 4.18 should have mostly fixed this, is there any PVE 4.18 kernel available? I looked in the testing repo but it was nowhere to be found. Would installing an Ubuntu "mainline kernel" be problematic? Or are there any older kernels in the repo which don't exhibit the problem?
 

As far as I know, this issue is only resolved by 4.18+. You may be able to use a kernel from Ubuntu or Debian Backports, but I didn't have any luck due to missing ZFS support and/or hardware modules in those kernels. I'm currently building my own kernels to track 4.19 + ZFS + hardware I need. Any kernel 4.18+ that I tried breaks AppArmor. To fix AppArmor I compiled and installed libapparmor and apparmor_parser from source for version 2.13.2.
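For anyone attempting the same, a from-source build of libapparmor and apparmor_parser typically looks roughly like the sketch below (directory names match the upstream apparmor source tree; the autogen step and configure flags are assumptions and may need adjusting on your system):
Code:
# starting inside an unpacked apparmor-2.13.2 source tree

# build and install libapparmor first (apparmor_parser links against it)
cd libraries/libapparmor
sh ./autogen.sh          # only needed if no ./configure script is shipped
./configure --prefix=/usr
make
make install

# then build and install apparmor_parser
cd ../../parser
make
make install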

The good news is that I haven't had a single problem on the 4.18+ kernels -- mirroring the findings of that Ubuntu kernel bug report above. Unfortunately, this isn't a practical process for many Proxmox users. It would be ideal if Proxmox would package these changes, but 4.18+ kernels aren't core for Debian Stretch (more work to maintain), and this is a rare race condition, so I don't think enough of their user base is complaining. We were seeing it on about 10% of our infrastructure over the course of 3 months.
 
Thank you for the reply... missing ZFS isn't a big issue, and I think that our node has "normal enough" hardware to be supported by a stock kernel. My main fear is indeed about AppArmor, but wouldn't it be enough to "steal" libapparmor and apparmor_parser from the latest Ubuntu LTS?
 
If AppArmor doesn't work you can boot back into your current kernel and it will be fine. I suspect you won't be able to take libapparmor and apparmor (the package that contains apparmor_parser) from Ubuntu's repos without breaking a bunch of dependencies.

If you decide to try the Ubuntu kernel, please report back here on your findings.
 
Didn't try it yet, although today I updated the packages and saw a new kernel (4.15.18-11-pve, AKA 4.15.18-33). Is there anywhere I could see a changelog?

"apt changelog PACKAGE", or on the Updates tab on the web interface, or next to the binary .deb file on our repository servers ;)
 
I'm used to pressing `C` in aptitude; that didn't turn up anything, so I assumed the built-in apt changelog mechanism was broken/unused for that package. However, apt changelog pve-kernel-4.15.18-11-pve did work - bizarre!

That being said, the new pve-kernel updates the sources to Ubuntu-4.15.0-46.49, whose changelog in turn is https://launchpad.net/ubuntu/+source/linux/4.15.0-46.49; it looks like there are some fixes to the network stack, but nothing that seems particularly related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779678.
 
We're experiencing this same issue with various Proxmox v5.3 servers and have paid Proxmox support that hasn't been able to provide a solution for us. What is the solution here since this post is marked as solved?
 
We are having the same issue. Running this code:
Code:
grep copy_net_ns /proc/*/stack
Provides this output:
Code:
/proc/30588/stack:[<0>] copy_net_ns+0xab/0x220
Our version:
Code:
pve-manager/5.3-11/d4907f84 (running kernel: 4.15.18-11-pve)

Following these steps:
  1. Create a CentOS 7 LXC.
  2. Try to log in via the console (and can't).
  3. Destroy the LXC.
  4. Now we cannot log in to the GUI.
I think "SOLVED" should be removed from this thread.
 
Proxmox's support unfortunately hasn't been any help with this. We've been trying to at least find a workaround on our Proxmox servers. Simply as a test, referencing this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1765980, we set up ip6tables on one of our nodes to block all IPv6 traffic. Since the change, we haven't been able to reproduce the issue on our test server with automated LXC shutdown/startup of containers (this server previously exhibited the issue). I'd be curious to hear if anyone else is able to try a similar approach and see if they can reproduce the issue afterwards.

It would be helpful to get an update from Proxmox about this issue and remove 'SOLVED' from the thread assuming there isn't a solution here I'm missing.
 
/etc/network/ip6tables.up.rules contains:
Code:
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT DROP [1:56]
COMMIT
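For anyone following along, a file in that format can be loaded with ip6tables-restore (a minimal sketch, not necessarily exactly how it was applied here):
Code:
ip6tables-restore < /etc/network/ip6tables.up.rules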

After applying, running ip6tables -L shows:
Code:
Chain INPUT (policy DROP)
target     prot opt source               destination

Chain FORWARD (policy DROP)
target     prot opt source               destination

Chain OUTPUT (policy DROP)
target     prot opt source               destination
 
@seneca214 two things:
  1. After updating the ip6tables, we do not seem to be having the issue.
  2. We are having trouble getting the ip6tables changes to stick after a reboot. Usually we use ip6tables-save, but none of the file names/locations we tried seem to work. Thinking the Proxmox firewall might be in the way, we could not see an easy way to block only IPv6 short of disabling it entirely: https://forum.proxmox.com/threads/what-do-i-need-to-do-to-disable-ipv6.42466/
Running:
Code:
ip6tables -L
we get:
Code:
Chain INPUT (policy DROP)
target     prot opt source               destination        

Chain FORWARD (policy DROP)
target     prot opt source               destination        

Chain OUTPUT (policy DROP)
target     prot opt source               destination

How do you get your changes to stick after rebooting the node?
 
@weehooey

1) That's great to hear. Let's hope others can help test and confirm.

2) This is a change we've only tested today, and so far we've been applying the rules manually. I'll get back to you on how we set it up to persist across reboots, assuming this does indeed work, until Proxmox comes out with an updated kernel or a better solution.
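In the meantime, one common way to make a rules file like the one above persist across reboots (a sketch, assuming ifupdown manages the interfaces and that pve-firewall isn't rewriting the IPv6 rules) is a pre-up hook in /etc/network/interfaces:
Code:
# /etc/network/interfaces - inside the stanza of the bridge that comes up at boot
auto vmbr0
iface vmbr0 inet static
    # ... keep the existing address/bridge options as they are ...
    pre-up ip6tables-restore < /etc/network/ip6tables.up.rules

Alternatively, the iptables-persistent / netfilter-persistent packages restore whatever is saved in /etc/iptables/rules.v6 at boot.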
 
Thanks for picking that issue up again - since it obviously is affecting users!
The observation that blocking IPv6 mitigates the issue (by not triggering it) is quite helpful, and could explain why only rather few reports and reproducers have come in for this case.

I'll try to reproduce the problem locally and update the bug report: https://bugzilla.proxmox.com/show_bug.cgi?id=1943
 
