Bonding issue, connection lost after a few minutes of traffic

mrmartin

Member
Oct 11, 2015
Hi,

I have a problem with my single-node Proxmox 4.0 (same issue on 3.4). I have two Intel NICs in a bonding setup, and a Cisco SG-300 with a LAG group of the two NICs using LACP. On boot everything seems fine. After about 10 minutes of use, with 4 guests pushing about 100 Mbps of traffic in total, I lose network connectivity to my Proxmox node. On the switch I see that one of the NICs enters slave mode. After a reboot, everything works again for about 10 minutes.

Do any of you know where to begin to figure out this issue?

Any tips or ways forward is much appreciated!
 
My dual-port Intel NIC is this one:
INTEL-EXPI9402PT-PRO-1000-Dual-Port-Server-Adapter-PCI-E
My /etc/network/interfaces has this setup:

auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

auto bond0
iface bond0 inet manual
    slaves eth1 eth2
    bond_miimon 100
    bond_mode 802.3ad
    bond_xmit_hash_policy layer2+3

auto vmbr0
iface vmbr0 inet static
    address 10.13.37.5
    netmask 255.255.255.0
    gateway 10.13.37.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0

auto vmbr1
iface vmbr1 inet manual
    bridge_ports none
    bridge_stp off
    bridge_fd 0

auto vmbr2
iface vmbr2 inet manual
    bridge_ports none
    bridge_stp off
    bridge_fd 0
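A quick way to see whether the bond itself is dropping a slave (a diagnostic sketch, assuming the bond is named bond0 as in the config above and you have a shell on the node, e.g. via the console when the network is down):

```shell
# Per-slave LACP state lives in /proc/net/bonding/<bond name>:
# aggregator IDs, partner MAC, and link failure counts per slave.
cat /proc/net/bonding/bond0 2>/dev/null || echo "bond0 not present on this host"

# Check the kernel ring buffer for link flaps on the bond or its slaves:
dmesg 2>/dev/null | grep -iE 'bond0|eth1|eth2' | tail -n 20
```

If the two slaves report different aggregator IDs, they never joined the same LAG, and traffic hashed onto the odd slave out will be dropped by the switch.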

I am not sure if the log output below is relevant, but maybe. Going by the logs, 12:38 is about the time I lose connectivity:

In syslog, right before losing connectivity, I see:

Oct 11 12:38:03 cloud pvedaemon[8055]: start VM 105: UPID:cloud:00001F77:00045382:561A3C0B:qmstart:105:root@pam:
Oct 11 12:38:03 cloud pvedaemon[3210]: starting task UPID:cloud:00001F77:00045382:561A3C0B:qmstart:105:root@pam:
Oct 11 12:38:03 cloud kernel: device tap105i0 entered promiscuous mode
Oct 11 12:38:03 cloud ovs-vsctl: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap105i0
Oct 11 12:38:03 cloud ovs-vsctl: ovs|00002|vsctl|ERR|no port named tap105i0
Oct 11 12:38:03 cloud kernel: vmbr0: port 4(tap105i0) entering forwarding state
Oct 11 12:38:05 cloud ntpd[2885]: Listen normally on 13 tap105i0 fe80::38ce:6bff:fee5:a8c6 UDP 123
Oct 11 12:38:05 cloud ntpd[2885]: peers refreshed
Oct 11 12:38:13 cloud kernel: tap105i0: no IPv6 routers present


In messages, right before losing connectivity, I see:

Oct 11 12:09:24 cloud kernel: vmbr1: port 1(tap104i1) entering forwarding state
Oct 11 12:09:24 cloud kernel: device tap104i2 entered promiscuous mode
Oct 11 12:09:24 cloud kernel: vmbr2: port 1(tap104i2) entering forwarding state
Oct 11 12:38:03 cloud kernel: device tap105i0 entered promiscuous mode
Oct 11 12:38:03 cloud kernel: vmbr0: port 4(tap105i0) entering forwarding state


 
bond_xmit_hash_policy layer2+3

The transmit hash policy you've used for slave selection might not be fully LACP (802.3ad) compliant, which could throw the LACP state between the Cisco and your NIC out of sync due to the hash mismatch.

The kernel bonding module docs list only the options "0" and "layer2" as fully 802.3ad compliant.

I'd go with

bond_xmit_hash_policy layer2

or

bond_xmit_hash_policy 0

or, better yet, leaving out bond_xmit_hash_policy entirely and letting the kernel bonding module pick the most compatible option for the link.
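For reference, the bond stanza from the original post with the layer2 policy would read like this (same interface names as above; a sketch, not tested on your hardware):

```
auto bond0
iface bond0 inet manual
    slaves eth1 eth2
    bond_miimon 100
    bond_mode 802.3ad
    bond_xmit_hash_policy layer2
```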

Could be something else - but worth a try?
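After changing the policy and reloading networking (e.g. ifdown bond0 && ifup bond0, or a reboot), you can confirm what the kernel actually applied (a hypothetical check, again assuming the bond is named bond0):

```shell
# Confirm the mode and transmit hash policy the kernel applied to the bond:
grep -E 'Bonding Mode|Transmit Hash Policy' /proc/net/bonding/bond0 2>/dev/null \
  || echo "bond0 not present on this host"
```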
 
