LXC loses network interface after a few days - possible solution found, but why?

virtManager

Member
Jun 11, 2020
Hi,

I have installed a Turnkey-fileserver (LXC) to create SMB/CIFS and NFS-shares with low resources and fast disk I/O instead of having to create a VM (also because I've read it's a bad idea to install those things directly on the host). The problem is that after a few days, I usually cannot ping the NIC anymore and only the loop-back device is left:

Code:
root@turnkey-fileserver ~# ip -4 a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever

The error message I see is:

Code:
root@turnkey-fileserver ~# systemctl status networking
* networking.service - Raise network interfaces
   Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sat 2022-05-28 20:14:42 CEST; 2 days ago
     Docs: man:interfaces(5)
  Process: 77 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
 Main PID: 77 (code=exited, status=1/FAILURE)
      CPU: 118ms

May 28 20:14:32 turnkey-fileserver ifup[77]: udhcpc: sending discover
May 28 20:14:35 turnkey-fileserver ifup[77]: udhcpc: sending discover
May 28 20:14:38 turnkey-fileserver ifup[77]: udhcpc: sending discover
May 28 20:14:42 turnkey-fileserver ifup[77]: /etc/udhcpc/default.script: Lease failed:
May 28 20:14:42 turnkey-fileserver ifup[77]: udhcpc: no lease, failing
May 28 20:14:42 turnkey-fileserver ifup[77]: ifup: failed to bring up eth0
May 28 20:14:42 turnkey-fileserver systemd[1]: networking.service: Main process exited, code=exited, sta
May 28 20:14:42 turnkey-fileserver systemd[1]: networking.service: Failed with result 'exit-code'.
May 28 20:14:42 turnkey-fileserver systemd[1]: Failed to start Raise network interfaces.
May 28 20:14:42 turnkey-fileserver systemd[1]: networking.service: Consumed 118ms CPU time.

I searched for similar posts and found this: https://forum.proxmox.com/threads/lxc-lose-sometimes-network-connection.68686/ - the answer there was: "I guess the DHCP lease is not renewed any more. I would set the IP in a static way." But I use pfSense (virtualized on the Proxmox server) as the DHCP server, and all other devices get their DHCP leases renewed, so I don't think the problem lies with the DHCP server. I can consistently bring the network back up with the commands below, but that is tedious: it requires me to log in to the LXC every few days, and it shouldn't be necessary, because the interface should always stay up like on my other devices:

Code:
root@turnkey-fileserver ~# systemctl restart networking 
root@turnkey-fileserver ~# systemctl status networking 
* networking.service - Raise network interfaces
   Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
   Active: active (exited) since Mon 2022-05-30 22:09:23 CEST; 5s ago
     Docs: man:interfaces(5)
  Process: 2646 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=0/SUCCESS)
 Main PID: 2646 (code=exited, status=0/SUCCESS)
    Tasks: 1 (limit: 17848)
   Memory: 340.0K
      CPU: 256ms
   CGroup: /system.slice/networking.service
           `-2684 /sbin/udhcpc -n -p /run/udhcpc.eth0.pid -i eth0

May 30 22:09:23 turnkey-fileserver systemd[1]: Starting Raise network interfaces...
May 30 22:09:23 turnkey-fileserver ifup[2646]: udhcpc: started, v1.30.1
May 30 22:09:23 turnkey-fileserver ifup[2646]: udhcpc: sending discover
May 30 22:09:23 turnkey-fileserver ifup[2646]: udhcpc: sending select for 192.168.100.10
May 30 22:09:23 turnkey-fileserver ifup[2646]: udhcpc: lease of 192.168.100.10 obtained, lease time 7200
May 30 22:09:23 turnkey-fileserver ifup[2646]: /etc/udhcpc/default.script: Resetting default routes
May 30 22:09:23 turnkey-fileserver ifup[2646]: SIOCDELRT: No such process
May 30 22:09:23 turnkey-fileserver ifup[2646]: /etc/udhcpc/default.script: Adding DNS 192.168.100.1
May 30 22:09:23 turnkey-fileserver ifup[2646]: /etc/resolvconf/update.d/libc: Warning: /etc/resolv.conf
May 30 22:09:23 turnkey-fileserver systemd[1]: Started Raise network interfaces.

root@turnkey-fileserver ~# ip -4 a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link-netnsid 0
    inet 192.168.100.10/24 brd 192.168.100.255 scope global eth0
       valid_lft forever preferred_lft forever

I've also found this thread and answer: https://askubuntu.com/a/1026911 - which made me directly modify the /etc/network/interfaces file such that instead of "auto eth0" I have "allow-hotplug eth0" (followed by "iface eth0 inet dhcp"). This really seems to bring back network stability! Let me elaborate, because this is really the question I want to ask (the above is just the context):

BEFORE - /etc/network/interfaces (on the LXC fileserver):
Code:
# UNCONFIGURED INTERFACES
# remove the above line if you edit this file

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp

AFTER - /etc/network/interfaces (on the LXC fileserver):
Code:
# UNCONFIGURED INTERFACES
# remove the above line if you edit this file

auto lo
iface lo inet loopback

# "auto eth0" was replaced by "allow-hotplug eth0" but
#   it automatically inserts "auto eth0" after reboot:
#   (attempt to avoid that occasionally eth0 disappears)
allow-hotplug eth0
auto eth0
iface eth0 inet dhcp

Yes, I know I didn't remove the top line even though I modified the file. But why does "allow-hotplug" seem to help or make a difference? I haven't read this in any guides. The Ask Ubuntu answer (link above) does discuss it, but maybe I just don't understand it. Is "allow-hotplug" typically recommended for LXC, or does it have anything to do with eth0 being attached to a virtual bridge (if so, I haven't seen that recommendation)?

Finally, a few extra details: my system is a small low-power HP t730, running Proxmox with pfSense as a VM. The NIC is configured as a vmbr0 "Linux Bridge" interface, which in the LXC "Network" tab is set up as a network bridge with IP address "dhcp". That same "Linux Bridge" interface is also made available to my pfSense VM, which has the vmbr0 bridge configured in the "Hardware" tab as a network device with "no VLAN", model = "VirtIO (paravirtualized)", unchecked boxes at "Firewall" and "Disconnect", nothing at "Rate limit" (= unlimited), and multiqueue left empty. I think the issue lies with the LXC, although the bridge is also used by pfSense. I'm only mentioning this to let you know that the NIC is not a physical NIC...

Does anyone have any ideas? I would be grateful to know for sure whether this is in fact the (recommended) solution, and it would be nice to understand why it works. Thanks!
 
I just started with ProxMox lxc containers and was facing this same exact issue. In my case DHCP is from dnsmasq running in Asus merlin router. I was using a hack of issuing `dhclient -v -r eth0 && dhclient -v eth0` which was enough for it to renew the lease. Just now applied the hot-plug eth0 fix you mentioned. Let me get back if it actually fixes it. Unfortunately I can't really answer your actual question though...
 
Hi @adystech. Unfortunately I don't think the modified /etc/network/interfaces file completely fixed it, because I found out that the problem persisted. My only (last) "solution" was/is to have a cronjob that pings my server every 5 minutes. If it cannot ping, it reboots the LXC... Basically it's a bash script called from a cronjob (where srv="192.168.xx.xx"):

Code:
ping -q -c1 "$srv" >/dev/null || pct reboot 103 --timeout 10

Just before rebooting, I also log the problem to /tmp/reboot.txt or similar so I can see when it happened. I think it's more an LXC issue than a Proxmox issue. Unfortunately I still don't know, and the solution I came up with is really bad, but at least my fileserver is only offline for up to 5 minutes now and I don't want to ping every minute... I hope this helps. Thanks for participating in this discussion, maybe one day someone who knows what's wrong will write the solution! :)
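
A minimal sketch of what such a watchdog script might look like on the Proxmox host. Only the ping / "pct reboot 103 --timeout 10" line and the /tmp/reboot.txt log file come from the post above; the function name check_and_reboot, the variable names, the default IP 192.168.100.10, the 2-second ping timeout and the log wording are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical watchdog for one container; run on the Proxmox host.
# All defaults below are assumptions and can be overridden via the environment.
SRV="${SRV:-192.168.100.10}"      # IP of the container to check (assumed)
CTID="${CTID:-103}"               # container ID used in the post
LOG="${LOG:-/tmp/reboot.txt}"     # log file mentioned in the post
PING_CMD="${PING_CMD:-ping}"      # overridable so the logic can be dry-run
REBOOT_CMD="${REBOOT_CMD:-pct}"   # overridable so the logic can be dry-run

check_and_reboot() {
    # One quiet ping with a 2 second timeout; nothing to do if it answers.
    if "$PING_CMD" -q -c1 -W2 "$SRV" >/dev/null 2>&1; then
        return 0
    fi
    # Record when the outage was noticed, then reboot the container.
    printf '%s container %s unreachable at %s, rebooting\n' \
        "$(date '+%F %T')" "$CTID" "$SRV" >>"$LOG"
    "$REBOOT_CMD" reboot "$CTID" --timeout 10
}

# Only act when called with the argument "run", e.g. from cron.
if [ "${1:-}" = "run" ]; then
    check_and_reboot
fi
```

Saved under a path of your choice and called with the argument "run" from a cron entry that fires every 5 minutes, this would behave like the one-liner above, plus the log entry. Setting REBOOT_CMD=echo lets you check the logic without actually rebooting anything.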
 
@virtManager well, so far for 5 days my two lxc containers are well and talking.. going by the networking service logs, it has renewed the dhcp leases multiple times. I will still give you credit for the hotplug idea :)
 
I'm glad that worked. It's nice that "pct reboot" for my LXC fileserver only takes a few seconds (maybe 5-8 secs and it's up again). My LXC fileserver container is this one: https://www.turnkeylinux.org/fileserver - now it can go up to 3 weeks without a reboot. But in the beginning it was much worse, it could happen a few times per week...

Question: You also use a https://www.turnkeylinux.org LXC container, right? It probably isn't the same as mine, is it? Just trying to see if there's a pattern because the best fix is if we could completely avoid this situation and we didn't have to apply such an ugly "patch" / solution... Maybe one of us/someone should one day raise a support request or something at https://www.turnkeylinux.org/, maybe that would be better... So the real problem can be fixed... But thanks for letting me know you took the same solution as I did, it's also a fast "solution" - or "work-around" is probably a better term to use... :)
 
@virtManager I am using the plain debian-11-standard as my container but the symptom (of networking service failing to renew DHCP lease and losing external connectivity) exists there too.
 
Okay, thanks... hmm, I don't have more ideas. If more people experience this problem, please write which LXC container you use/see this problem with, maybe we'll see a pattern that could lead to a real/proper fix instead of using this "automatic restart"-work-around...
 
I have exactly the same problem with two Debian 11 containers running on two different nodes of a cluster. Interestingly, both seem to lose their lease at approximately the same time if they have been started together. Since they are primary and secondary internal nameservers, they tend to be started together. They are receiving their leases from a Ubiquiti edgerouter. No other dhcp leases seem to be having this problem. Since the failure is infrequent (several days to a couple of weeks) it is really hard to debug.
 
I agree - extremely difficult to debug... But I think it's an LXC issue and not a Proxmox issue; I found a few threads that sound similar:
I have an LXC fileserver that used to have this problem often. I almost don't use it now. It's been running for 3 weeks without a reboot now. It's difficult to track when it happens rarely and it makes it difficult to try to experiment. Also I really prefer using DHCP instead of static IP. Maybe all of us are using DHCP and maybe there's an issue with DHCP for LXC. It would be great if we could describe the problem better in e.g. https://github.com/lxc/lxd/issues/ but seems we're still guessing... I'm still interested in this topic, but luckily only see this rarely now (also weird, I didn't change the config)...
 
The suggestion of disabling IPv6 in the Proxmox container config (set to static without any value for IPv6) seems to be a workable fix. After applying it I haven't lost networking on 3 of my containers for almost a month now.
 
To chime in here, I have the same issue on my LXC reverse proxy, which is a vanilla Debian 11 installation with just nginx and lego. Disabling IPv6 did not fix it, as it was already disabled on the router and in Proxmox. It happens once a day. A reboot fixes it. I have a DHCP lease time of 24 hours, which could indeed point in the right direction: once a day. The router log (the router is the DHCP server) does not show any DHCP error. There is also nothing in the LXC's or Proxmox's logs.
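
To make the "once a day" reasoning concrete, here is a small sketch that prints the default renewal timers for a given lease time. It assumes the RFC 2131 defaults (renew at T1 = 0.5 x lease, rebind at T2 = 0.875 x lease); a DHCP server can override these and individual clients may schedule retries differently. The function name lease_timers is made up for this example.

```shell
# Print the default DHCP timers for a lease time given in seconds.
# Assumes RFC 2131 defaults: T1 = 0.5 * lease, T2 = 0.875 * lease.
lease_timers() {
    lease="$1"
    t1=$((lease / 2))        # client starts renewing with its server
    t2=$((lease * 7 / 8))    # client starts rebinding via broadcast
    echo "lease=${lease}s renew(T1)=${t1}s rebind(T2)=${t2}s"
}
```

For a 24-hour lease, `lease_timers 86400` gives a renew time of 43200s and a rebind time of 75600s. If every renewal attempt fails, the address is only dropped when the lease itself expires, which would fit a failure that shows up roughly once per lease period.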
 
Is there any way to debug a container other than via command line? (Which makes no sense due to long time waiting.)
 
I am having problems with this too. Something is stopping DHCP renewals.

Logging the DHCP server (a pi running pi-hole) and trying to renew the lease with dhclient results in no DHCP packets being received at the server.

Restart the LXC and it works, the packets get through.

Debian 11 based LXC's.
 
Does anyone have a solution? I have the same problem. dhclient does not renew the IP address.

PROXMOX 7.4-14 and Debian 11 with LXC.
 
As mentioned, I'm using the incredibly bad solution ("BAD_SOLUTION") of a cron script that runs every 5 minutes or so (I cannot remember exactly), pings to see if the network works, and automatically restarts the LXC if there is no ping reply. If you mean a "proper/real" solution: I'm also hoping that one day this issue is better understood and a proper solution comes up, or that the issue is even fixed properly so this doesn't happen...
 
I had this exact issue. The problem persisted after I changed IPv6 to static with an empty IP. I then tried setting the IPv4 address to static, but to my surprise the issue still persisted. I checked the /etc/network/interfaces file and found old interfaces in there that I had removed earlier using the Proxmox interface.
For example:
Code:
auto eth1
iface eth1 inet dhcp

Apparently it does not clean those up. I removed them and the issue is resolved for me.
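
A quick way to spot such leftovers might look like the sketch below. It lists every interface that has an "iface" stanza in the file but no matching entry under /sys/class/net inside the container. The function name list_stale_ifaces and its two optional arguments are assumptions for illustration, and the check does not follow "source" includes.

```shell
# List interfaces declared in an interfaces(5) file that do not exist
# on the system. Both arguments are optional and default to the usual paths.
list_stale_ifaces() {
    file="${1:-/etc/network/interfaces}"
    sysnet="${2:-/sys/class/net}"
    # The second field of every "iface" line is the interface name.
    awk '$1 == "iface" { print $2 }' "$file" | sort -u | while read -r name; do
        # Existing interfaces have a directory entry under /sys/class/net.
        [ -e "$sysnet/$name" ] || echo "$name"
    done
}
```

Running `list_stale_ifaces` inside the container prints one name per line. Any name it prints (such as the eth1 above) is a candidate for removal from the file.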
 
Hi,

I also have several CTs with DHCP, have had them for many years, and I did not have any problems. As far as I know, DHCP depends heavily on the time and date being correct on both sides, DHCP server and DHCP client.

For debugging the problem, a good idea is to run tcpdump on both sides.

From what I have seen in the past, the DHCP client fails if the network has problems (discarded/dropped/errored packets).
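
As a rough sketch of such a capture: the helper below limits tcpdump to DHCP traffic (UDP ports 67 and 68) on one interface. The function name dhcp_capture and the DRY_RUN switch are made up for this example; the interface names eth0 (inside the container) and vmbr0 (on the host) are taken from earlier posts and may differ on your system.

```shell
# Capture DHCP traffic on one interface (needs root and tcpdump installed).
# With DRY_RUN set to any non-empty value, the command is only printed.
dhcp_capture() {
    iface="${1:-eth0}"
    # -n: no name resolution, -e: show MAC addresses, -vv: decode DHCP options
    ${DRY_RUN:+echo} tcpdump -n -e -vv -i "$iface" 'udp port 67 or udp port 68'
}
```

One way to use it: run `dhcp_capture eth0` inside the container and `dhcp_capture vmbr0` on the Proxmox host at the same time, then trigger a renewal. If the request shows up in the container but not on the bridge, the packet is lost before it ever reaches the DHCP server.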

Good luck/Bafta !
 
