Network down on all LXC Containers

axion.joey

Hey Guys,

We just experienced something that shook our confidence in Proxmox. We just set up a 5-node Proxmox cluster. All of the nodes are running the exact same version of Proxmox:

root@Proxmox:/var/log# pveversion -v
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-1 (running version: 4.3-1/e7cdc165)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-88
pve-firmware: 1.1-9
libpve-common-perl: 4.0-73
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-61
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.1-6
pve-container: 1.0-75
pve-firewall: 2.0-29
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.4-1
lxcfs: 2.0.3-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80

On Wednesday we created around 30 CentOS 6 containers per host. They worked fine for a few days. Then yesterday we started getting alarms. For some reason every LXC container in this cluster lost its network connectivity.

pct config VMID shows the correct network info. Also, if you enter the container and look at /etc/sysconfig/network-scripts/ifcfg-eth0, that appears to be correct too. But if you run route -n or ifconfig within the container, it shows no information. Running service network restart within the containers restores service.
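For anyone who wants to run the same check across many containers at once, here is a rough host-side sketch. It assumes it runs as root on the Proxmox host, that pct list prints a header row followed by VMID and status columns, and that the ip tool is installed inside the containers. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: report running containers whose eth0 has no IPv4 address.
# Assumptions: run as root on the Proxmox host; `ip` exists in each container.

# has_ipv4: succeeds if stdin (output of `ip -4 -o addr show dev eth0`)
# contains at least one "inet" address entry.
has_ipv4() {
    grep -q ' inet '
}

# running_vmids: reads `pct list` output on stdin, skips the header row,
# and prints the VMID of every container whose status is "running".
running_vmids() {
    awk 'NR > 1 && $2 == "running" { print $1 }'
}

# check_all: loops over running containers and reports the state of eth0.
# stdin of `pct exec` is redirected so it cannot swallow the VMID list.
check_all() {
    pct list | running_vmids | while read -r vmid; do
        if pct exec "$vmid" -- ip -4 -o addr show dev eth0 </dev/null 2>/dev/null | has_ipv4; then
            echo "CT $vmid: eth0 has an IPv4 address"
        else
            echo "CT $vmid: eth0 has NO IPv4 address"
        fi
    done
}
```

Calling check_all on a host prints one line per running container. The containers reported without an address would be the ones that currently need a service network restart.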

Luckily we had only put two of these containers into production. We have two other clusters using containers based on this exact same template, and they've never had this issue. The other clusters are running this version of Proxmox:

proxmox-ve: 4.2-48 (running kernel: 4.4.6-1-pve)
pve-manager: 4.2-2 (running version: 4.2-2/725d76f0)
pve-kernel-4.4.6-1-pve: 4.4.6-48
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-72
pve-firmware: 1.1-8
libpve-common-perl: 4.0-59
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-14
pve-container: 1.0-62
pve-firewall: 2.0-25
pve-ha-manager: 1.0-28
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie

I'm really hoping someone can point us to a fix for this. We've used Proxmox for years and have never experienced any problems this severe.

Thanks in advance for your help.
 
Thanks for responding. NetworkManager isn't installed on any of the containers. It's got to be something on the Proxmox host. This doesn't happen on any of our other clusters, and the only difference is the version of Proxmox.
 
Unfortunately no, but I see this in /var/log/messages:

Oct 22 18:08:08 affinitytarzana kernel: [15518.074006] device veth161i0 entered promiscuous mode
Oct 22 18:08:09 affinitytarzana kernel: [15519.083450] vmbr0: port 17(veth164i0) entered forwarding state
Oct 22 18:08:09 affinitytarzana kernel: [15519.207097] eth0: renamed from veth5Q1G5A
Oct 22 18:08:28 affinitytarzana kernel: [15537.799906] IPv6: ADDRCONF(NETDEV_UP): veth169i0: link is not ready
Oct 22 18:08:28 affinitytarzana kernel: [15538.727904] vmbr0: port 21(veth169i0) entered disabled state
Oct 22 18:08:29 affinitytarzana kernel: [15539.328511] IPv6: ADDRCONF(NETDEV_CHANGE): veth166i0: link becomes ready
Oct 22 18:08:42 affinitytarzana kernel: [15552.595032] vmbr0: port 27(veth173i0) entered disabled state
Oct 22 18:08:51 affinitytarzana kernel: [15561.729794] audit_printk_skb: 57 callbacks suppressed
Oct 22 18:08:58 affinitytarzana kernel: [15568.080694] vmbr0: port 26(veth172i0) entered forwarding state
Oct 22 18:08:58 affinitytarzana kernel: [15568.254434] eth0: renamed from vethNPN355
Oct 22 18:08:59 affinitytarzana kernel: [15569.238174] eth0: renamed from veth62AHBM
Oct 22 18:09:00 affinitytarzana kernel: [15570.186061] vmbr0: port 28(veth175i0) entered forwarding state
Oct 22 18:09:01 affinitytarzana kernel: [15570.831457] audit: type=1400 audit(1477184941.060:567): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/proc/cpuinfo" pid=19655 comm="mount" flags="ro, remount"
Oct 22 18:09:10 affinitytarzana kernel: [15580.513929] audit: type=1400 audit(1477184950.744:598): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/proc/sys/net/" pid=21487 comm="mount" flags="ro, remount"
Oct 22 18:09:10 affinitytarzana kernel: [15580.532101] audit: type=1400 audit(1477184950.760:605): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/proc/cpuinfo" pid=21494 comm="mount" flags="ro, remount"
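
To see at a glance which bridge ports dropped and how many AppArmor denials were logged, messages like the ones above can be filtered with something along these lines. This is only a sketch: the function names are hypothetical, and the patterns are taken from the log lines shown here, so they may need adjusting on other kernels.

```shell
#!/bin/sh
# Sketch only: summarize kernel log lines read from stdin.

# disabled_ports: prints the unique veth interface names that the bridge
# reported as having "entered disabled state".
disabled_ports() {
    sed -n 's/.*port [0-9][0-9]*(\(veth[^)]*\)) entered disabled state.*/\1/p' | sort -u
}

# apparmor_denials: prints the number of lines containing an AppArmor denial.
# `|| true` keeps the exit status zero when the count is 0.
apparmor_denials() {
    grep -c 'apparmor="DENIED"' || true
}
```

For example, disabled_ports < /var/log/messages lists the affected interfaces. The number in each veth name appears to match a container ID here, which would help line the timestamps up with individual containers.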


It just happened again, on all containers on every host in the cluster.
 
Could you try starting one container with debug logging in foreground mode (depending on how often it triggers, it is probably a good idea to wrap it in a tmux/screen/... session): "lxc-start -n ID -F -l DEBUG -o /tmp/lxc-ID-debug.log", where ID is your container's ID. It should print the boot process to stdout, and a lot more debug information to the log file. When the issue has occurred, you can shut the container down (via "pct shutdown ID" or from within the container). Since it seems to affect all containers, feel free to create a new test container for it.
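
A minimal sketch of the tmux wrapping, in case it is useful. The helper names and the session name are arbitrary choices for this example, and it assumes tmux is installed on the host and that the container is stopped before it is started this way.

```shell
#!/bin/sh
# Sketch only: run the debug start inside a detached tmux session.

# debug_cmd ID: prints the lxc-start command line quoted above for container ID.
debug_cmd() {
    printf 'lxc-start -n %s -F -l DEBUG -o /tmp/lxc-%s-debug.log\n' "$1" "$1"
}

# start_debug_session ID: runs that command in a detached tmux session.
# The session name "lxc-debug-ID" is an arbitrary choice for this sketch.
start_debug_session() {
    tmux new-session -d -s "lxc-debug-$1" "$(debug_cmd "$1")"
}
```

With these defined, start_debug_session 108 starts container 108 in the background, and tmux attach -t lxc-debug-108 shows its console. The tmux session ends when the container stops, but the log file in /tmp stays.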
 
That failed right away. Here's the output:

lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_time' (25)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_module' (16)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_rawio' (17)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_nice' (23)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_pacct' (20)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_rawio' (17)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2066 - capabilities have been setup
lxc-start 20161025074154.576 NOTICE lxc_conf - conf.c:lxc_setup:3855 - '108' is setup.
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.deny' set to 'a'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c *:* m'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'b *:* m'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 1:3 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 1:5 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 1:7 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 5:0 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 5:1 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 5:2 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 1:8 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 1:9 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 136:* rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 10:229 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'memory.limit_in_bytes' set to '536870912'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'memory.memsw.limit_in_bytes' set to '1073741824'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'cpu.cfs_period_us' set to '100000'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'cpu.cfs_quota_us' set to '100000'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'cpu.shares' set to '1024'
lxc-start 20161025074154.576 INFO lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1660 - cgroup has been setup
lxc-start 20161025074154.577 INFO lxc_apparmor - lsm/apparmor.c:apparmor_process_label_set:238 - changed apparmor profile to lxc-container-default-cgns
lxc-start 20161025074154.584 NOTICE lxc_start - start.c:start:1436 - exec'ing '/sbin/init'
lxc-start 20161025074154.601 NOTICE lxc_start - start.c:post_start:1447 - '/sbin/init' started with pid '48254'
lxc-start 20161025074154.601 WARN lxc_start - start.c:signal_handler:338 - invalid pid for SIGCHLD
lxc-start 20161025074159.668 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074159.668 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074159.668 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074159.668 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074159.669 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074159.669 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.288 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.288 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.289 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.289 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.289 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.289 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.842 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.843 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.843 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.843 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.843 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.843 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.847 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.847 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074203.786 DEBUG lxc_start - start.c:signal_handler:342 - container init process exited
lxc-start 20161025074203.786 DEBUG lxc_start - start.c:__lxc_start:1382 - Container halting
lxc-start 20161025074203.786 DEBUG lxc_start - start.c:__lxc_start:1397 - Pushing physical nics back to host namespace
lxc-start 20161025074203.786 DEBUG lxc_start - start.c:__lxc_start:1400 - Tearing down virtual network devices used by container
lxc-start 20161025074203.787 WARN lxc_conf - conf.c:lxc_delete_network:2924 - failed to remove interface 77 'eth0'
lxc-start 20161025074203.787 INFO lxc_error - error.c:lxc_error_set_and_log:55 - child <48254> ended on signal (2)
lxc-start 20161025074203.787 WARN lxc_conf - conf.c:lxc_delete_network:2924 - failed to remove interface 77 'eth0'
lxc-start 20161025074203.818 INFO lxc_conf - conf.c:run_script_argv:367 - Executing script '/usr/share/lxc/hooks/lxc-pve-poststop-hook' for container '108', config section 'lxc'
lxc-start 20161025074204.731 INFO lxc_conf - conf.c:run_script_argv:367 - Executing script '/usr/share/lxcfs/lxc.reboot.hook' for container '108', config section 'lxc'
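
With this much DEBUG output it can help to filter the log down to the warnings and errors. A small sketch, assuming the log format shown above where the third whitespace-separated field is the level; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch only: reduce an lxc debug log to its WARN and ERROR lines.

# lxc_log_problems: reads an lxc log on stdin and prints only the lines
# whose level field (third column) is WARN or ERROR.
lxc_log_problems() {
    awk '$3 == "WARN" || $3 == "ERROR"'
}
```

Run as lxc_log_problems < /tmp/lxc-108-debug.log. On the excerpt above this leaves the SIGCHLD warning and the two "failed to remove interface" warnings.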
 
Anyone got any ideas? At this point all we can think of is trying to roll back to an older version of Proxmox.
 
Unfortunately we couldn't wait for a fix, so we wiped all of the servers and installed an older, working version.
 
