Many TCP Retransmissions and TCP Dup ACKs: Wrong link aggregation configuration?

Oct 28, 2013
Hi there,

On our three-node Proxmox/Ceph cluster we discovered many of the above TCP errors.
We narrowed it down: only outgoing traffic from a VM to a destination that is not on the same Proxmox node is affected.
Each node is connected to a switch via 2x 10G. The relevant network configuration on the node looks like this (VMs are connected to vmbr0):
Code:
auto bond2
iface bond2 inet manual
        bond-slaves enp42s0f3np3 enp61s0f3np3
        bond-miimon 100
        bond-mode 802.3ad
        mtu 9000

auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond2
        bridge-stp off
        bridge-fd 0
        mtu 9000

The switch is a Juniper EX3400-24P; its relevant configuration looks like this:
Code:
> show configuration interfaces xe-0/2/1
ether-options {
    802.3ad ae2;
}

> show configuration interfaces xe-1/2/1
ether-options {
    802.3ad ae2;
}

> show configuration interfaces ae2
mtu 9200;
aggregated-ether-options {
    minimum-links 1;
    link-speed 10g;
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members somevlan;
        }
    }
}

Did I misconfigure something? Or am I looking in the wrong direction?

Thanks for your help, and greetings
Stephan
 
Thanks for your reply!
Hm, (R)STP is configured on all switches, and according to our bandwidth monitoring there is no loop between the switches. What can I do to debug this? I get the feeling this is not a Proxmox issue anymore, though... o_O
It won't be easy, but I think you should check the MAC address table on the different switches and make sure the MAC of your Proxmox node is not flapping between ports or something like that.
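To make the flapping check a bit more concrete, here is a minimal Python sketch. It assumes you have already collected (switch, MAC, port) observations from repeated MAC table dumps on each switch; the switch names, MAC addresses, and port names below are made up for illustration, and the function name is hypothetical. A MAC that shows up on more than one port of the same switch over time is a candidate for flapping.

```python
from collections import defaultdict


def find_flapping_macs(observations):
    """Return {(switch, mac): [ports]} for MACs seen on more than one port.

    observations: iterable of (switch, mac, port) tuples gathered from
    repeated MAC table dumps. Ports are kept in first-seen order.
    """
    ports_seen = defaultdict(list)
    for switch, mac, port in observations:
        key = (switch, mac.lower())  # normalize MAC case before comparing
        if port not in ports_seen[key]:
            ports_seen[key].append(port)
    # Only keep entries where the same switch learned the MAC on several ports.
    return {key: ports for key, ports in ports_seen.items() if len(ports) > 1}


# Hypothetical sample data, not taken from the cluster in this thread.
observations = [
    ("sw1", "AA:BB:CC:00:00:01", "ae2.0"),
    ("sw1", "aa:bb:cc:00:00:01", "ae2.0"),
    ("sw1", "aa:bb:cc:00:00:01", "ae5.0"),  # same MAC, different port
    ("sw1", "aa:bb:cc:00:00:02", "ae3.0"),
    ("sw2", "aa:bb:cc:00:00:01", "ae0.0"),  # other switch, judged separately
]

flapping = find_flapping_macs(observations)
for (switch, mac), ports in flapping.items():
    print(f"{switch}: {mac} seen on {', '.join(ports)}")
```

On the Juniper side the table can be dumped with "show ethernet-switching table". Turning that output into tuples is left out here on purpose, since the column layout differs between Junos releases. With a working LAG, the node's MACs should be learned on the ae interface, not on the individual xe member ports.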
 
I haven't played around much with bonds (other than active-failover ones), so I'm not sure how much I can help.
That said though, I do notice that it says you have STP turned off on your vmbr0 "switch" (unlike what you mention being the case on your physical switches).
One other suggestion, which might let you see whether Proxmox is the cause: temporarily disable, remove, or disconnect one of the slaves from the bond, or turn the bond into an active-failover one, at a time when the average throughput does not get near the single-link speed.
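Before pulling a slave it may also be worth looking at what the bond itself reports. The Linux bonding driver exposes its state in /proc/net/bonding/<bond>, and in 802.3ad mode every slave carries an "Aggregator ID". As far as I understand it, slaves ending up with different aggregator IDs usually means they were not bundled into the same LAG by the switch. Below is a rough Python sketch that checks for this. The sample text is a shortened, made-up excerpt of that file (the real one has many more fields), and the function names are my own.

```python
def parse_bond_status(text):
    """Extract the bonding mode plus MII status and aggregator ID per slave."""
    info = {"mode": None, "slaves": {}}
    current = None  # name of the slave section we are currently inside
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # not a "key: value" line
        value = value.strip()
        if key == "Bonding Mode":
            info["mode"] = value
        elif key == "Slave Interface":
            current = value
            info["slaves"][current] = {}
        elif current is not None and key == "MII Status":
            info["slaves"][current]["mii"] = value
        elif current is not None and key == "Aggregator ID":
            info["slaves"][current]["aggregator"] = value
    return info


def lacp_problems(info):
    """Return a list of human-readable findings; empty if nothing looks off."""
    problems = []
    if "802.3ad" not in (info["mode"] or ""):
        problems.append(f"bond is not in 802.3ad mode: {info['mode']}")
    for name, slave in info["slaves"].items():
        if slave.get("mii") != "up":
            problems.append(f"{name}: MII status is {slave.get('mii')}")
    ids = {s.get("aggregator") for s in info["slaves"].values()}
    if len(ids) > 1:
        problems.append(f"slaves are in different aggregators: {sorted(map(str, ids))}")
    return problems


# Shortened, made-up sample; on a node, read /proc/net/bonding/bond2 instead.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: enp42s0f3np3
MII Status: up
Aggregator ID: 1

Slave Interface: enp61s0f3np3
MII Status: up
Aggregator ID: 2
"""

problems = lacp_problems(parse_bond_status(SAMPLE))
for p in problems:
    print(p)
```

On the node itself you would pass in open("/proc/net/bonding/bond2").read() instead of SAMPLE. An empty result does not prove the LAG is healthy, it only rules out the obvious cases.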
 
