PVE 4.2 10Gbit bonds, LXC network issues?

tycoonbob

Member
Aug 25, 2014
I have a fresh PVE 4.2 install on a decent box (dual Xeon E5-2670, 192GB RAM, SSD ZFS mirror for VM/LXC storage, dual 10Gbit SFP+). Both 10Gbit links are connected in a static LAG (configured on my Dell X1052 switch), and I'm using OVS in PVE. I can access the GUI, and LXC connectivity seemed to work fine to begin with.

So far, I have created about 10 LXC containers (all CentOS 7) and was using vmbr0 (the dual 10Gbit links) as the network bridge for them, with an ID of net0 and name eth0. All LXCs, as expected, worked just fine from my main workstation (MacBook Pro). One of these LXCs is running named (BIND), and I migrated to it as the primary DNS for this network (single flat /24 network). Interestingly, I have a Windows 7 box and an iMac that could not ping the IPs of 3 of the LXCs. One of those happened to be .101, which is the named instance, so no DNS was resolving. I rebuilt the LXC, but the issue persisted.

I then set up a second bridge on my PVE host (just an OVS bridge on a single GbE link), powered off that LXC, changed its bridge to vmbr1, and powered it on... and it's now working and reachable from all devices.
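
For anyone who prefers the CLI, I think the equivalent bridge change would be something like this (container 101, same addressing as in the config below):

Code:
# stop the container, point net0 at vmbr1, start it again
pct shutdown 101
pct set 101 -net0 name=eth0,bridge=vmbr1,ip=172.16.1.101/24,gw=172.16.1.254
pct start 101
# note: omitting hwaddr makes PVE generate a new MAC for the interface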

Throughout all of this, the PVE host itself could ping/reach all LXC instances.

With that said, I am new to LXC and containers in general. Is there some networking magic I should know about here? I notice that when I create an LXC, the Network section has a drop-down of 10 items for "ID" (net0-net9). What are these? Is there a limit to how many LXCs can use the same ID?

I'm at a complete loss here, and want to have all instances use the dual 10Gbit connection for their traffic.

Please let me know what other information I can provide and I will get it quickly. Here is some output that may be useful:

Code:
root@jormungandr:~# pveversion --verbose
proxmox-ve: 4.2-48 (running kernel: 4.4.6-1-pve)
pve-manager: 4.2-2 (running version: 4.2-2/725d76f0)
pve-kernel-4.4.6-1-pve: 4.4.6-48
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-72
pve-firmware: 1.1-8
libpve-common-perl: 4.0-59
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-14
pve-container: 1.0-62
pve-firewall: 2.0-25
pve-ha-manager: 1.0-28
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
openvswitch-switch: 2.3.2-3

Code:
root@jormungandr:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface eth4 inet manual

iface eth5 inet manual

allow-vmbr1 eth0
iface eth0 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

allow-vmbr0 bond0
iface bond0 inet manual
    ovs_bonds eth4 eth5
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_options bond_mode=balance-slb

auto vmbr0
iface vmbr0 inet static
    address  172.16.1.202
    netmask  255.255.255.0
    gateway  172.16.1.254
    ovs_type OVSBridge
    ovs_ports bond0

auto vmbr1
iface vmbr1 inet manual
    ovs_type OVSBridge
    ovs_ports eth0

Code:
root@jormungandr:~# cat /etc/pve/nodes/jormungandr/lxc/101.conf
arch: amd64
cpulimit: 2
cpuunits: 1024
hostname: dns01
memory: 1024
net0: bridge=vmbr1,gw=172.16.1.254,hwaddr=66:32:62:33:65:36,ip=172.16.1.101/24,name=eth0,type=veth
onboot: 1
ostype: centos
rootfs: zfs_lxc:subvol-101-disk-1,size=8G
startup: up=5
swap: 512

Thanks!
 
the "netX" is just the configuration key in our container configuration file - it has no meaning for the network itself. it is perfectly fine to use net0 for the first network device of all your containers ;)
 
the "netX" is just the configuration key in our container configuration file - it has no meaning for the network itself. it is perfectly fine to use net0 for the first network device of all your containers ;)

Thanks for confirming what I thought.

Try adding this to ovs_options for bond0:
vlan_mode=native-untagged
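
the bond0 stanza in /etc/network/interfaces would then look roughly like this (keeping your existing balance-slb mode):

Code:
allow-vmbr0 bond0
iface bond0 inet manual
    ovs_bonds eth4 eth5
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_options bond_mode=balance-slb vlan_mode=native-untagged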

I added this, rebooted the host, and am still seeing the same behavior. It's a flat network, no VLANs configured, and it should just work. However, my Windows client (172.16.1.14/24) can't ping half of my LXCs (172.16.1.101/24, .103, .104, .105, .106), whereas my MacBook Pro can ping them all.
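
If anyone has suggestions, my plan is to check something like the following on the host while one of the "bad" containers is unreachable (bond/bridge names as in my config above), to see where the MACs are being learned:

Code:
# which bond member each learned MAC is currently assigned to
ovs-appctl bond/show bond0
# MAC learning table of the OVS bridge
ovs-appctl fdb/show vmbr0
# and from the Windows client, the ARP entry for the unreachable container:
#   arp -a 172.16.1.101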

However, I'm seeing other odd behavior between AppArmor (PVE host) and systemd (CentOS 7 LXC), so I think I'm going to ditch containers and stick with KVM. I suspect that once I rebuild these instances on KVM, all pings will work just like they do on my other PVE 3.4 host. Maybe LXC will be more stable/straightforward in the future.

EDIT:
Well, I was wrong. I spun up a couple of KVM instances and I'm having the same network issues. Clearly it's not an LXC problem, and I'm starting to think my 10Gbit setup is to blame. I have an Intel X520-DA2 (dual SFP+) card in the box (showing as eth4 and eth5, while eth0-3 are the onboard GbE links), connected with two 3m 10Gbit-rated twinax DAC cables.

As I type this, I got to thinking about my switch config, and I think that might be my problem. Two of the SFP+ ports on my switch (Dell X1052) were configured in a static LAG, while the OVS bond for those 2 interfaces was set to "balance-slb" and not "LACP (balance-slb)". I'm not sure the latter will even work with a static LAG, so I reconfigured those switch ports to NOT be in a LAG and instead be standalone ports. I still have network connectivity to the host, and will go through my testing again to see if I still have issues.
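
If I end up putting a LAG back on the switch side, my understanding is that both ends need to agree on LACP, i.e. something like this for the OVS bond (untested on my part):

Code:
allow-vmbr0 bond0
iface bond0 inet manual
    ovs_bonds eth4 eth5
    ovs_type OVSBond
    ovs_bridge vmbr0
    # balance-tcp requires an LACP-negotiated LAG on the switch as well
    ovs_options bond_mode=balance-tcp lacp=active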
 