veth does not come up anymore on several LXC containers

nacaze

New Member
Jan 13, 2020
I have a crazy problem and have been searching for hours; I'm stuck and can't make sense of it. Any help to troubleshoot and figure this out further would be greatly appreciated.

This is my situation:

On a Proxmox VE 6.3-1 (amd64) host I have various LXC containers, created from the official archlinux-base template, all attached to vmbr1, which is a virtual LAN. The gateway for this vLAN is a VM running pfSense, which is attached both to vmbr1 and to vmbr0, the bridge bound to the physical NIC of the machine.

This setup had been working nicely for several weeks, with me making various configuration changes here and there. Then I did the following (trying to recollect as well as I can; all from the web GUI):

I discovered that an LXC container can be converted to a template, so I wanted to try that out. I created a new container from the arch template, did the initial steps I always do for each of them (basically just making pacman functional and doing a full system update), then converted that container to a template. Then I cloned that template to get a new container for actual use. But I couldn't SSH into that new container, and when entering it from the host CLI I found that the veth did not have the configured static IP and was in fact down:

Bash:
sh-5.0# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
75: eth0@if76: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 9a:e0:3f:ff:8e:20 brd ff:ff:ff:ff:ff:ff link-netnsid 0

It was not clear what was going on here, but then I discovered that from that point on, whenever I rebooted or just started one of my existing LXC containers (which had been running and restarting without issues for weeks), it exhibited the very same issue! All of them but one! I tried various things to find the cause, eventually deleting the newly made container template and the container I had cloned from it, and rebooting the whole system, all to no avail.

There is a workaround though: after the LXC container is up, I can change the static IP configuration via the web GUI; it is applied immediately, the interface comes up, and the network works. Then I change it back to the original IP address, which also works. This is doable but needless to say pretty annoying. (The same can probably be done via the CLI on the host, where it could easily be automated; see the sketch below. But first I would prefer to understand what the hell is going on there.)
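For reference, this is roughly what the host-side equivalent might look like with pct (the container ID and addresses are just placeholders for my setup, untested in this exact form):

Bash:
# temporarily switch the container's net0 to a different static IP, then back
# 802 and the 10.0.0.x addresses are placeholders
pct set 802 -net0 name=eth0,bridge=vmbr1,ip=10.0.0.99/24,gw=10.0.0.1
pct set 802 -net0 name=eth0,bridge=vmbr1,ip=10.0.0.50/24,gw=10.0.0.1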

One container that existed before still works normally. Newly created LXC containers also work normally. All other preexisting containers exhibit the described problem.
The logs I can get from within the containers (journalctl and dmesg) are pretty much identical for the broken and the functional containers!

But comparing the journal on the host while starting the containers shows that for a newly created container there is more log output, and the veth eventually enters forwarding state (which is what we want, I guess); this never happens for a broken one.

Code:
Jan 13 14:04:34 hypervisor kernel: fwbr802i0: port 2(veth802i0) entered forwarding state

(I tried attaching the whole relevant journal section, but it was above the message size limit.)
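In case someone wants to reproduce the comparison, the relevant lines can be filtered out on the host with something like this (802 is just an example container ID):

Bash:
# kernel messages about the container's veth and firewall bridge ports
journalctl -k --since "10 minutes ago" | grep -E 'veth802|fwbr802'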

No indicative error messages as far as I can tell. But I also can't say that I fully understand everything that is going on there. My current gut feeling is that it may be related to AppArmor, though the messages it prints don't seem related to the problem.
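This is how I looked for AppArmor messages on the host, in case that is relevant (nothing obviously related to the problem showed up):

Bash:
# check for recent AppArmor messages in the host kernel log
journalctl -k --since "1 hour ago" | grep -i apparmor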

Please let me know what else I can look into, or what other information I should provide. Any help welcome!
 
hi,

you mentioned archlinux templates. there have been some problems with newer versions of systemd and containers (especially arch, which gets updates early). maybe you somehow stumbled on that?

is the container privileged or unprivileged?
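you can check this quickly on the host, e.g. like this (102 is just a placeholder container ID):

Bash:
# unprivileged containers have "unprivileged: 1" in their config
pct config 102 | grep unprivileged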
 
hi, thanks for your quick response,

the preexisting container that is still working normally is indeed privileged. I did not think of that; it probably plays a role. All other containers are unprivileged.

and to test the systemd hypothesis, I did a full system update on the newly created, still-working arch linux container (so it is now on systemd-244.1-1-x86_64), performed a reboot, and voila: network gone. (only the privileged container still works with the same latest versions)
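For future containers cloned from the template, a possible stopgap could be to keep systemd at a still-working version until this is resolved; a rough, untested sketch:

Bash:
# inside the container: check the installed systemd version
pacman -Q systemd
# then add systemd to IgnorePkg in /etc/pacman.conf to keep it from updating:
#   IgnorePkg = systemd systemd-libs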

You were right on, thank you very much! Gives me a little peace of mind already. I guess I might be better off with the debian template for now.

If there is anything I can contribute to resolve this issue though, let me know. Even with all the troubles of living on the bleeding edge, I just like arch linux :)


PS: I just realized the workaround I described is not sufficient on its own. To reach the internet, one also has to manually add the default gateway.
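Inside the container that means something like this (the gateway address is just an example from my vLAN):

Bash:
# restore the default route after re-applying the static IP
ip route add default via 10.0.0.1 dev eth0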
 
