Network down on all LXC Containers

axion.joey · Oct 24, 2016

Hey Guys,

We just experienced something that shook our confidence in Proxmox. We recently set up a 5-node Proxmox cluster. All of the nodes are running the exact same version of Proxmox:

root@Proxmox:/var/log# pveversion -v
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-1 (running version: 4.3-1/e7cdc165)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-88
pve-firmware: 1.1-9
libpve-common-perl: 4.0-73
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-61
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.1-6
pve-container: 1.0-75
pve-firewall: 2.0-29
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.4-1
lxcfs: 2.0.3-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80

On Wednesday we created around 30 CentOS 6 containers per host. They worked fine for a few days. Then yesterday we started getting alarms. For some reason every LXC container in this cluster lost its network connectivity.

pct config VMID shows the correct network info. Also if you enter the container and look at /etc/sysconfig/network-scripts/ifcfg-eth0 then that appears to be correct too. But if you run route -n or ifconfig within the container then it shows no information. Doing a service network restart within the containers restores service.
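For anyone who wants to run the same check across many containers at once, here is a minimal sketch. It assumes the container interface is named eth0 and that the `ip` tool is available inside the container; the function names `has_ipv4` and `check_all_containers` are made up for illustration. It only reports containers whose eth0 has no IPv4 address and changes nothing.

```shell
# Returns 0 if the interface listing on stdin contains an IPv4 address.
# Matches both `ip addr` output ("inet 10.0.0.5/24 ...") and the older
# ifconfig format ("inet addr:10.0.0.5 ..."); "inet6" lines are ignored.
has_ipv4() {
    grep -Eq '(^|[[:space:]])inet[[:space:]]'
}

# Walk all running containers on this host and report the ones whose
# eth0 has lost its IPv4 address. Assumes `pct` is available (Proxmox host).
check_all_containers() {
    pct list | awk 'NR > 1 && $2 == "running" {print $1}' | while read -r vmid; do
        # </dev/null keeps pct exec from swallowing the rest of the VMID list
        if ! pct exec "$vmid" -- ip -4 addr show dev eth0 </dev/null | has_ipv4; then
            echo "CT $vmid: eth0 has no IPv4 address"
        fi
    done
}
```

Calling `check_all_containers` on each node should list the affected containers; running `service network restart` inside them is still the manual workaround described above.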

Luckily we had only put two of these containers into production. We have two other clusters using containers based on this exact same template, and they've never had this issue. The other clusters are running this version of Proxmox:

proxmox-ve: 4.2-48 (running kernel: 4.4.6-1-pve)
pve-manager: 4.2-2 (running version: 4.2-2/725d76f0)
pve-kernel-4.4.6-1-pve: 4.4.6-48
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-72
pve-firmware: 1.1-8
libpve-common-perl: 4.0-59
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-14
pve-container: 1.0-62
pve-firewall: 2.0-25
pve-ha-manager: 1.0-28
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie

I'm really hoping someone can point us to a fix for this. We've used Proxmox for years and have never experienced any problems this severe.

Thanks in advance for your help.

mir · Oct 24, 2016

Is NetworkManager installed on the CentOS hosts?
If so remove it since it will control your network interfaces as well.

axion.joey · Oct 24, 2016

Thanks for responding. Network Manager isn't installed on any of the containers. It's got to be something on the proxmox host. This doesn't happen on any of our other clusters, and the only difference is the version of Proxmox.

mir · Oct 24, 2016

Is there anything interesting in the syslog in the container when it loses network?

axion.joey · Oct 24, 2016

Unfortunately no, but I see this in /var/log/messages on the host:

Oct 22 18:08:08 affinitytarzana kernel: [15518.074006] device veth161i0 entered promiscuous mode
Oct 22 18:08:09 affinitytarzana kernel: [15519.083450] vmbr0: port 17(veth164i0) entered forwarding state
Oct 22 18:08:09 affinitytarzana kernel: [15519.207097] eth0: renamed from veth5Q1G5A
Oct 22 18:08:28 affinitytarzana kernel: [15537.799906] IPv6: ADDRCONF(NETDEV_UP): veth169i0: link is not ready
Oct 22 18:08:28 affinitytarzana kernel: [15538.727904] vmbr0: port 21(veth169i0) entered disabled state
Oct 22 18:08:29 affinitytarzana kernel: [15539.328511] IPv6: ADDRCONF(NETDEV_CHANGE): veth166i0: link becomes ready
Oct 22 18:08:42 affinitytarzana kernel: [15552.595032] vmbr0: port 27(veth173i0) entered disabled state
Oct 22 18:08:51 affinitytarzana kernel: [15561.729794] audit_printk_skb: 57 callbacks suppressed
Oct 22 18:08:58 affinitytarzana kernel: [15568.080694] vmbr0: port 26(veth172i0) entered forwarding state
Oct 22 18:08:58 affinitytarzana kernel: [15568.254434] eth0: renamed from vethNPN355
Oct 22 18:08:59 affinitytarzana kernel: [15569.238174] eth0: renamed from veth62AHBM
Oct 22 18:09:00 affinitytarzana kernel: [15570.186061] vmbr0: port 28(veth175i0) entered forwarding state
Oct 22 18:09:01 affinitytarzana kernel: [15570.831457] audit: type=1400 audit(1477184941.060:567): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/proc/cpuinfo" pid=19655 comm="mount" flags="ro, remount"
Oct 22 18:09:10 affinitytarzana kernel: [15580.513929] audit: type=1400 audit(1477184950.744:598): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/proc/sys/net/" pid=21487 comm="mount" flags="ro, remount"
Oct 22 18:09:10 affinitytarzana kernel: [15580.532101] audit: type=1400 audit(1477184950.760:605): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/proc/cpuinfo" pid=21494 comm="mount" flags="ro, remount"

It just happened again, on all containers on every host in the cluster.
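To narrow a log like this down, a small filter can list which containers had their bridge port go to the disabled state. This is only a sketch: it assumes the bridge is vmbr0 as in the excerpt above and relies on the Proxmox naming scheme where veth161i0 belongs to container 161. The function name `disabled_ports` is made up.

```shell
# Read kernel log lines on stdin and print the IDs of containers whose
# veth port on vmbr0 entered the disabled state, one ID per line, sorted.
disabled_ports() {
    sed -n 's/.*vmbr0: port [0-9]*(veth\([0-9]*\)i[0-9]*) entered disabled state.*/\1/p' \
        | sort -un
}
```

Usage would be something like `disabled_ports < /var/log/messages`. A port being disabled is also what happens on a normal container stop, so the output is a starting point rather than proof of a fault.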

mir · Oct 25, 2016

There is nothing suspicious in these log entries.

fabian · Oct 25, 2016

Could you try starting one container with debug logging in foreground mode? Depending on how often it triggers, it is probably a good idea to wrap it in a tmux/screen session: "lxc-start -n ID -F -l DEBUG -o /tmp/lxc-ID-debug.log", where ID is your container's ID. It should print the boot process to stdout and a lot more debug information to the log file. When the issue has occurred, you can shut the container down (via "pct shutdown ID" or from within the container). Since it seems to affect all containers, feel free to create a new test container for it.
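One way to wrap the command suggested above in a tmux session is sketched below. The helper names and the session name lxc-debug-ID are arbitrary choices, not anything Proxmox provides, and the container has to be stopped before lxc-start can bring it up in the foreground.

```shell
# Build the foreground debug command for a given container ID.
lxc_debug_cmd() {
    ctid="$1"
    printf 'lxc-start -n %s -F -l DEBUG -o /tmp/lxc-%s-debug.log\n' "$ctid" "$ctid"
}

# Run that command in a detached tmux session so it survives an SSH
# disconnect. The session ends when the container stops; the debug log
# stays in /tmp/lxc-<ID>-debug.log.
start_debug_session() {
    tmux new-session -d -s "lxc-debug-$1" "$(lxc_debug_cmd "$1")"
}
```

For example, `start_debug_session 108` starts container 108, and `tmux attach -t lxc-debug-108` shows the boot output while it is running.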

axion.joey · Oct 25, 2016

Doing that now. I'll post the output as soon as it happens again.

axion.joey · Oct 25, 2016

That failed right away. Here's the output

lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_time' (25)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_module' (16)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_rawio' (17)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_nice' (23)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_pacct' (20)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2057 - drop capability 'sys_rawio' (17)
lxc-start 20161025074154.576 DEBUG lxc_conf - conf.c:setup_caps:2066 - capabilities have been setup
lxc-start 20161025074154.576 NOTICE lxc_conf - conf.c:lxc_setup:3855 - '108' is setup.
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.deny' set to 'a'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c *:* m'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'b *:* m'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 1:3 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 1:5 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 1:7 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 5:0 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 5:1 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 5:2 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 1:8 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 1:9 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 136:* rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'devices.allow' set to 'c 10:229 rwm'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'memory.limit_in_bytes' set to '536870912'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'memory.memsw.limit_in_bytes' set to '1073741824'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'cpu.cfs_period_us' set to '100000'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'cpu.cfs_quota_us' set to '100000'
lxc-start 20161025074154.576 DEBUG lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1656 - cgroup 'cpu.shares' set to '1024'
lxc-start 20161025074154.576 INFO lxc_cgfsng - cgroups/cgfsng.c:cgfsng_setup_limits:1660 - cgroup has been setup
lxc-start 20161025074154.577 INFO lxc_apparmor - lsm/apparmor.c:apparmor_process_label_set:238 - changed apparmor profile to lxc-container-default-cgns
lxc-start 20161025074154.584 NOTICE lxc_start - start.c:start:1436 - exec'ing '/sbin/init'
lxc-start 20161025074154.601 NOTICE lxc_start - start.c:post_start:1447 - '/sbin/init' started with pid '48254'
lxc-start 20161025074154.601 WARN lxc_start - start.c:signal_handler:338 - invalid pid for SIGCHLD
lxc-start 20161025074159.668 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074159.668 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074159.668 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074159.668 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074159.669 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074159.669 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.288 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.288 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.289 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.289 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.289 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074200.289 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.842 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.843 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.843 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.843 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.843 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.843 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.847 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074201.847 DEBUG lxc_commands - commands.c:lxc_cmd_handler:893 - peer has disconnected
lxc-start 20161025074203.786 DEBUG lxc_start - start.c:signal_handler:342 - container init process exited
lxc-start 20161025074203.786 DEBUG lxc_start - start.c:__lxc_start:1382 - Container halting
lxc-start 20161025074203.786 DEBUG lxc_start - start.c:__lxc_start:1397 - Pushing physical nics back to host namespace
lxc-start 20161025074203.786 DEBUG lxc_start - start.c:__lxc_start:1400 - Tearing down virtual network devices used by container
lxc-start 20161025074203.787 WARN lxc_conf - conf.c:lxc_delete_network:2924 - failed to remove interface 77 'eth0'
lxc-start 20161025074203.787 INFO lxc_error - error.c:lxc_error_set_and_log:55 - child <48254> ended on signal (2)
lxc-start 20161025074203.787 WARN lxc_conf - conf.c:lxc_delete_network:2924 - failed to remove interface 77 'eth0'
lxc-start 20161025074203.818 INFO lxc_conf - conf.c:run_script_argv:367 - Executing script '/usr/share/lxc/hooks/lxc-pve-poststop-hook' for container '108', config section 'lxc'
lxc-start 20161025074204.731 INFO lxc_conf - conf.c:run_script_argv:367 - Executing script '/usr/share/lxcfs/lxc.reboot.hook' for container '108', config section 'lxc'

axion.joey · Oct 26, 2016

Anyone got any ideas? At this point all we can think of is trying to roll back to an older version of Proxmox.

myman03 · Nov 17, 2016

I'm experiencing this problem too... I hope someone can help solve it.

axion.joey · Nov 17, 2016

Unfortunately we couldn't wait for a fix. So we wiped out all of the servers and installed an older, working version.
