Network failure

jmz

New Member
Feb 3, 2012
I have VMs on Proxmox 2.0 hosts that occasionally lose networking. Bringing the interface down and up doesn't help, and I don't see any traffic on the host if I tcpdump the tap device. I had this problem in the past with virtio networking under high traffic rates, and switching to e1000 resolved it, but now I am seeing the same problem with e1000 as well as virtio.
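For reference, the kind of host-side check I mean looks roughly like this (the tap device name tap100i0 is just an example for VMID 100, net0; adjust to the affected VM):
Code:
# on the Proxmox host: watch the VM's tap device for any traffic
tcpdump -eni tap100i0

# inside the VM: cycle the interface (which does not help in my case)
ifdown eth0 && ifup eth0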

Code:
pve-manager: 2.0-10 (pve-manager/2.0/7a10f3e6)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 2.0-52
pve-kernel-2.6.32-6-pve: 2.6.32-52
lvm2: 2.02.86-1pve1
clvm: 2.02.86-1pve1
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-1
libqb: 0.6.0-1
redhat-cluster-pve: 3.1.7-1
pve-cluster: 1.0-11
qemu-server: 2.0-6
pve-firmware: 1.0-13
libpve-common-perl: 1.0-8
libpve-access-control: 1.0-2
libpve-storage-perl: 2.0-6
vncterm: 1.0-2
vzctl: 3.0.29-3pve3
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 0.15.0-1
ksm-control-daemon: 1.1-1
 
Test with the latest packages. Run aptitude update && aptitude full-upgrade.
 
I'm having this issue as well. The host seems fine, but the VMs intermittently lose the network just like the OP's. The only thing that seems to work is to ping the gateway from the VM when this happens; after 2-3 seconds it catches, I get a reply, and the route is there again.

Here is my network config. Note that it's a bonded setup, so maybe that configuration is causing this. The logs are not showing anything exciting at all, so I am not sure what the story is here.
Code:
auto lo
iface lo inet loopback

iface eth0 inet manual
iface eth1 inet manual

auto bond0
iface bond0 inet manual
        bond_miimon 100
        bond_mode balance-alb
        bond_downdelay 200
        bond_updelay 200
        slaves eth0 eth1

auto vmbr0
iface vmbr0 inet static
        address  192.168.xxx.xxx
        netmask  255.255.255.0
        gateway  192.168.xxx.xxx
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0
        bridge_maxwait 0
        bridge_maxage 0
        bridge_ageing 0

The NICs are Intel so they shouldn't have issues with the bond setup at all.
Code:
06:07.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05)
07:08.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05)

And here is the version output:
Code:
pve-manager: 2.0-18 (pve-manager/2.0/16283a5a)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 2.0-55
pve-kernel-2.6.32-6-pve: 2.6.32-55
lvm2: 2.02.88-2pve1
clvm: 2.02.88-2pve1
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-1
libqb: 0.6.0-1
redhat-cluster-pve: 3.1.8-3
pve-cluster: 1.0-17
qemu-server: 2.0-13
pve-firmware: 1.0-14
libpve-common-perl: 1.0-11
libpve-access-control: 1.0-5
libpve-storage-perl: 2.0-9
vncterm: 1.0-2
vzctl: 3.0.29-3pve8
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-1
ksm-control-daemon: 1.1-1
 
I have a little more info on my problem.

I have changed the setting to
Code:
bridge_stp on
and immediately after rebooting the host, I started to get LOTS of these messages:
Code:
Feb  4 16:00:03 host1 kernel: vmbr0: topology change detected, propagating
Feb  4 16:00:05 host1 kernel: vmbr0: neighbor 8000.xx:xx:xx:xx:xx:xx lost on port 1(bond0)

The thing is, 8000.xx:xx:xx:xx:xx:xx matches nothing on the host or VMs at all. The first three octets do match the first three octets of my bridge and bond MACs, so they could be related somehow.
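For what it's worth, the 8000. prefix looks like the STP bridge ID format (a two-byte priority in hex, 0x8000 = 32768 being the default, followed by the bridge MAC) rather than part of a MAC address. One way to compare it against the local bridge:
Code:
# show STP details for the bridge, including its bridge ID (priority.MAC)
brctl showstp vmbr0

# and the MACs the bridge and bond are actually using
ip link show vmbr0
ip link show bond0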

As a test, I turned STP back off and also removed the 2nd slave:
Code:
slaves eth0

After a reboot, so far so good after 45 minutes. So in my case the problem seems to be bond related, as the bond is still there with only one active slave. I can't tell if it's happening at the Proxmox layer or at the Linux kernel layer, but I am leaning toward the former, as in my case the host never has issues, only the VMs do.

The host is a Dell PowerEdge 2850, and the switch that both Ethernet links connect to is a Dell PowerConnect 3348, which has been reset to factory defaults with no other changes.

I have set up bonds many times and have not seen this issue until using 2.0. The only issue I have personally seen in the past is that certain combinations of NICs do not support promiscuous mode, and you have to use active-backup instead of balance-alb for the bond to function properly. The NICs in this case fully support promiscuous mode, so that is not the issue here.
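For completeness, for those older NICs the only change would be the bond mode; a minimal sketch assuming the same interface names as above:
Code:
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode active-backup
        bond_downdelay 200
        bond_updelay 200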

I'll post back if I can figure this out further or if the problem comes back with the single slave in place.
 
I have found examples of this issue with the bonded interface (balance-alb) in v1.9 as well, using OpenVZ containers. What seems to be the issue for me is that when I have a VM that uses an IP address and gateway on a different subnet than the host machine, it will lose connectivity after a few minutes of inactivity. What I mean by this is that you will not be able to access the VM remotely; it seems like the VM has stopped listening or the network has gone to sleep.

You can enter it via vzctl from the host (I haven't tried SSH) and initiate a ping to your gateway, and it wakes up until the next iteration of the problem. Usually the first few pings are dead and then it catches afterward.
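A rough sketch of that recovery sequence, assuming CTID 101 and a gateway of 192.168.1.1 (both placeholders):
Code:
# from the host, enter the container and poke the gateway
vzctl enter 101
ping -c 4 192.168.1.1   # the first pings usually time out, then replies resume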

I have no idea what is causing this, but it's bugging the hell out of me. As a potential workaround, I will try a cron job that pings the gateway 4 times every 4 minutes until I figure this out.
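Something along these lines inside the container, just a sketch (the gateway address is a placeholder):
Code:
# /etc/cron.d/keepalive - ping the gateway every 4 minutes to keep the path alive
*/4 * * * * root ping -c 4 192.168.1.1 >/dev/null 2>&1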
 
Just an update to this. The issue is with the balance-alb bond type. I was trying to use it to split the bond over 2 switches for redundancy. I set up another test using 2x Netgear GS716T units, and the only way balance-alb works is if the bridges that the VMs are using are all in the same subnet as the host. There is no issue with anything this way.

However, if you try to use VLANs or place the VMs directly in a different subnet, they work but will randomly lose the connection. Pinging from the VM to the gateway brings it back up in most cases, and the connection loss then repeats.

My conclusion is to stay away from balance-alb if you need to use multiple subnets with your VMs.

What I have done is set up the 802.3ad bond mode using 2 slaves on the Proxmox host, and set up an LACP LAG containing one port on each of the GS716T units. The switches are also connected to each other with a 2-port LACP-style LAG. Then I placed the physical link of each of the 2 Proxmox slaves into the single-port LAG of each switch. Those LAG ports have been added to the main VLAN 1 untagged and have been tagged in all the other VLANs associated with the VMs on the Proxmox host. This is working well with regard to the connection drops.
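On the Proxmox side the bond stanza for that mode would look roughly like this; a sketch assuming the same interface names and delays as before (the xmit hash policy line is an optional extra I have not verified here):
Code:
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode 802.3ad
        bond_downdelay 200
        bond_updelay 200
        bond_xmit_hash_policy layer2+3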

I haven't fully tested whether both inbound and outbound traffic is being properly balanced and whether the speed is actually aggregated. I am hoping this will work exactly as if both slave ports were plugged into a 2-port LAG on the same switch.
 
I haven't completed the tests for the 802.3ad bond but I have had great success with this setup:


  • 2x Netgear GS716T connected to each other with a 2 port LACP LAG
  • Proxmox server with 2x PCI-E Gigabit network cards (RTL8111/8168B based)
  • One PCI-E network card plugged into each switch
  • No special switch configuration
  • A single IP address associated to the host bridge
  • Additional VLANs routed through the bond using distinct bridges for each
  • Bond mode balance-tlb

It seems balance-tlb does not have the ARP issue that balance-alb does. You will not get the use of both ports for traffic uploaded to the server, but outgoing traffic will be balanced. I ran a ping test against both the management IP of the first bridge that the host is using and a container that had an IP from a completely different subnet, routed via the 2nd VLAN bridge. Here is the test:


  • Unplugged eth0 - at this point /proc/net/bonding/bond0 and the syslog showed a switchover to eth1; pings did not stop for either the host or the container
  • Waited 3 minutes and replugged eth0 - the system was still using eth1 as the active slave; pings did not stop for either the host or the container
  • Waited 3 minutes and unplugged eth1 - the system switched to eth0 as the active slave; pings did not stop for either the host or the container

So this was a great test for me. One thing I did notice is that if you unplug and replug both eth0 and eth1 within too short a timeframe (how short I cannot say), the ping to the container on the 2nd VLAN bridge would die, but the ping to the host did not. I figure this is just because the slave did not have time to recover at all network levels before the other slave was unplugged. In this case I had to initiate a ping from the container to the gateway, and connectivity was restored.

For reference here is my network config:
Code:
# network interface settings


auto lo
iface lo inet loopback


iface eth0 inet manual


iface eth1 inet manual


auto bond0
iface bond0 inet manual
    slaves eth0 eth1
    bond_miimon 100
    bond_mode balance-tlb
    bond_downdelay 200
    bond_updelay 200
    mtu 9000


auto bond0.100
iface bond0.100 inet manual
    vlan-raw-device bond0


auto vmbr0
iface vmbr0 inet static
    address  xxx.xxx.xxx.xxx
    netmask  xxx.xxx.xxx.xxx
    gateway  xxx.xxx.xxx.xxx
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0
    bridge_maxwait 0
    bridge_maxage 0
    bridge_ageing 0


auto vmbr100
iface vmbr100 inet manual
    bridge_ports bond0.100
    bridge_stp off
    bridge_fd 0
    bridge_maxwait 0
    bridge_maxage 0
    bridge_ageing 0

And output of /proc/net/bonding/bond0:
Code:
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)


Bonding Mode: transmit load balancing
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200


Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 3
Permanent HW addr: xx:xx:xx:xx:xx:xx
Slave queue ID: 0


Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 3
Permanent HW addr: xx:xx:xx:xx:xx:xx
Slave queue ID: 0

I hope this will help those with low-budget setups who want to successfully play with bonding.
 
