802.3ad failover time when bringing a slave interface down

I noticed something strange recently. If I bring a slave interface of an 802.3ad bond down, the bond loses network connectivity for about 70-80 s. If I unplug the cable instead, I lose connectivity for less than a second (2 lost pings at a 0.1 s interval), which is what I would expect. I can reproduce this on another node running Windows Server 2019 by disabling one of its team NICs. I therefore suspect a switch (a pair of Nexus 93180) or switch config problem, but I cannot help wondering whether anybody else has observed this behavior.
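
For reference, the test is simply this (eno33 is one of the bond slaves on my host, the target is just an address reachable over the bond):

Code:
# on the node under test: take one bond slave administratively down
ip link set eno33 down

# from another machine: ping an address on the bond with a 0.1 s interval
# (sub-0.2 s intervals may require root) and count the missed replies
ping -i 0.1 <address-on-the-bond>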
 
Check that LACP is set up with the fast rate on the switch side.

Something like:

Code:
switch# configure terminal
switch(config)# interface ethernet 1/4
switch(config-if)# lacp rate fast

The default is slow mode, where LACPDUs are only exchanged every 30 s, so failure detection takes much longer.


On the Proxmox side, check that you have the "bond-lacp-rate fast" option set on your bond interface in /etc/network/interfaces.
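
For example, a bond stanza with the fast rate could look like this (the slave names and the hash policy are just placeholders, adjust to your setup):

Code:
auto bond1
iface bond1 inet manual
        bond-slaves eno1 eno2
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4
        bond-lacp-rate fast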
 
Thanks, I already had it set to fast on the host side, but not on the switch side. With lacp rate fast on the switch side the problem is dampened: 2 lost pings at a 1 s interval.

The problem is that the switch still sees the link as up, even though on the host side I ran

Code:
ip link set eno33 down

Therefore it waits until enough LACP PDUs are missed (three is the standard, I think).
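
To see what the bond itself thinks is going on, its 802.3ad state can be inspected on the host while the test runs (bond1 is just my bond's name):

Code:
# LACP rate, aggregator info and per-slave partner state
cat /proc/net/bonding/bond1

# follow kernel messages about bond/slave link changes during the test
dmesg -w | grep -i bond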

I will hunt for a NIC from another vendor (this one is a Mellanox) and check whether the behavior is the same there.
 
It seems at least some NICs do not actually take the physical link down when told to via ifdown/ip link down [1], [2]. I played around with mlxconfig yesterday to query and set the parameters. It did not resolve the issue yet, as KEEP_ETH_LINK_UP_P1=0 resulted in a non-functional network connection via that NIC after the required reboot (despite the OS reporting the link was up).

1: https://community.mellanox.com/s/qu...he-link-down-on-a-port-on-a-connectx4-en-card
2: https://docs.mellanox.com/display/MFTv4110/Supported+Configurations+and+Parameters
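
For reference, the query/set cycle with mlxconfig looks roughly like this; the device path is only an example for a ConnectX-4 Lx, mst status lists the correct one for your card:

Code:
# list the Mellanox devices (needs the MFT tools)
mst start && mst status

# query the current firmware settings, including KEEP_ETH_LINK_UP_P1/P2
mlxconfig -d /dev/mst/mt4117_pciconf0 query

# tell the firmware not to keep the physical link up independently of the OS
# (only takes effect after a reboot)
mlxconfig -d /dev/mst/mt4117_pciconf0 set KEEP_ETH_LINK_UP_P1=0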
 
Just to finish this up if anyone ever comes across something similar:

A link is not necessarily powered down by ip link set $NIC down. We found that across multiple cards from various vendors, and the equivalent holds for disabling a link on Windows.

For our Mellanox cards we could change this by updating the firmware (Dell 16.28.4512), installing the upstream driver (5.3-1.0.0) for Debian 10.5, and finally setting KEEP_ETH_LINK_UP_PX=False. I also opted for the pve-5.11 kernel, since the interfaces got renamed by the driver anyway, but IIRC we tested with pve-5.4 and Ubuntu 20.04 as well and there was no difference.
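
To verify which firmware and driver a port actually ended up with after such an update, ethtool is enough (the interface name is just one of our bond slaves):

Code:
# shows driver, driver version and firmware-version for the slave
ethtool -i eno33np0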

An ip link set $NIC down is now directly recognized by the switch. Failover without fast lacp rate still takes several seconds. Judging from the dmesg messages on bond state and from the ip command taking its time to return, that seems to be a kernel/driver problem.

Another effect is that boot takes longer, as networking.service waits 40+ seconds for the links to come up:

Code:
[    7.260820] mlx5_core 0000:63:00.1 eno34np1: renamed from eth1
[    7.302216] mlx5_core 0000:63:00.0 eno33np0: renamed from eth0
[    7.362683] mlx5_core 0000:a1:00.0 ens6f0np0: renamed from eth2
[    7.398189] mlx5_core 0000:a1:00.1 ens6f1np1: renamed from eth3
[   14.332262] mlx5_core 0000:63:00.0 eno33np0: Link down
[   14.340860] bond1: (slave eno33np0): Enslaving as a backup interface with a down link
[   14.920388] mlx5_core 0000:a1:00.1 ens6f1np1: Link down
[   14.928837] bond1: (slave ens6f1np1): Enslaving as a backup interface with a down link
[   17.912342] mlx5_core 0000:a1:00.1 ens6f1np1: Link up
[   21.980021] mlx5_core 0000:63:00.0 eno33np0: Link up
[   29.317990] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
[   29.322988] bond1: (slave eno33np0): link status definitely up, 25000 Mbps full duplex
[   29.323780] bond1: active interface up!
[   29.325058] bond1: (slave ens6f1np1): link status definitely up, 25000 Mbps full duplex
[   29.335601] IPv6: ADDRCONF(NETDEV_CHANGE): bond1: link becomes ready
[   29.935566] mlx5_core 0000:a1:00.0 ens6f0np0: Link down
[   29.944543] bond0: (slave ens6f0np0): Enslaving as a backup interface with a down link
[   30.548840] mlx5_core 0000:63:00.1 eno34np1: Link down
[   30.556701] bond0: (slave eno34np1): Enslaving as a backup interface with a down link
[   30.574925] vmbr0: port 1(bond0) entered blocking state
[   30.575943] vmbr0: port 1(bond0) entered disabled state
[   32.772163] mlx5_core 0000:a1:00.0 ens6f0np0: Link up
[   33.291796] mlx5_core 0000:63:00.1 eno34np1: Link up
[   56.796432] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
[   56.807420] bond0: (slave ens6f0np0): link status definitely up, 25000 Mbps full duplex
[   56.814744] bond0: active interface up!
[   56.816227] bond0: (slave eno34np1): link status definitely up, 25000 Mbps full duplex
[   56.832818] vmbr0: port 1(bond0) entered blocking state
[   56.833704] vmbr0: port 1(bond0) entered forwarding state
[   56.865236] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
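
If you want to quantify that boot delay on your own box, the generic systemd tooling already shows where the time went:

Code:
# per-unit startup times of the last boot
systemd-analyze blame | head

# the chain of units the boot actually waited on
systemd-analyze critical-chain networking.service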

Would love to investigate this further, but time.
 
