Strange network behavior on one port in a bond

brucexx · Sep 12, 2023

proxmox-ve: 7.4-1
ceph: 17.2.6-pve1

I have 4 node Ceph cluster using Proxmox as a host. Public and Private networks are separated using 2 x 10Gbps ports each (2 cards per node , 4 ports total). All nodes are setup in exactly the same way. Here is an example of Ceph Private config:

auto enp4s0f0
iface enp4s0f0 inet manual
mtu 9000

auto enp5s0f0
iface enp5s0f0 inet manual
mtu 9000

auto bond1
iface bond1 inet manual
bond-slaves enp4s0f0 enp5s0f0
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer2+3
mtu 9000
#Ceph Private

auto vmbr1
iface vmbr1 inet static
address 10.221.2.70/24
bridge-ports bond1
bridge-stp off
bridge-fd 0
mtu 9000
#Ceph Private

I was just testing a new storage and looking at the bandwidth with nload. On one node and ONLY on one node It seems that enp4s0f0 is not sending (outgoing) any traffic or very little. The incoming traffic is working properly.

NODE 1 enp4s0f0
Outgoing
Curr: 0.00 Bit/s
Avg: 32.00 Bit/s
Min: 0.00 Bit/s
Max: 976.00 Bit/s
Ttl: 6.43 MByte
- the 6.42 MByte is there only because I unplugged the enp5s0f0 to see if this port or direction of the traffic is even active/working and it was working!

NODE 1 enp5s0f0
Outgoing:
Curr: 849.15 MBit/s
Avg: 1.00 GBit/s
Min: 3.42 MBit/s
Max: 4.75 GBit/s
Ttl: 1254.14 GByte

The other 3 nodes are working the way I would expect using two ports per bonded interface , here is an example from node 3.

NODE 3 enp4s0f0
Outgoing
Curr: 3.40 GBit/s
Avg: 847.84 MBit/s
Min: 1.74 MBit/s
Max: 3.40 GBit/s
Ttl: 576.82 GByte

NODE 3 enp5s0f0
Outgoing
Curr: 1.07 GBit/s
Avg: 424.29 MBit/s
Min: 732.96 kBit/s
Max: 1.43 GBit/s
Ttl: 272.50 GByte

Even it is up to the server how the traffic is send out I checked the switch and all ports are configured the same way. I looked at the port configuration side by side and it is the same on all 4 nodes (except for ip addresses). I also checked and all the network cards are on the same firmware.
Is there anything about 1st node that would cause that ?
Has anybody seen that behavior on one node only in the cluster ?

Thank you.

jsterr · Sep 12, 2023

I would recommend to check / test the network with iperf or iperf3. For example:

Code:

## IPERF Server
iperf -s -P 64

## IPERF CLIENT
iperf -c 192.168.99.31 -P 64 -t 3600

Please post /etc/pve/ceph.conf
Please post ethtool enp4s0f0 and ethtool enp5s0f0
Please post cat /etc/kernel/cmdline and or /etc/default/grub

brucexx · Sep 12, 2023

Hi,

I am testing by uploading the VMs to the storage and generating traffic that way which is when I took the reading from nload, I can see clearly that the presence on all other ports and or servers for outgoing traffic and absence of that traffic on that particular interface in outgoing direction only (incoming is perfectly fine).

I also made sure that interface in outgoing direction is working by unplugging the enp5s0f0 that is bonded with enp4s0f0 which resulted in immediate picking up of the traffic by enp4s0f0 in outgoing direction - so it is not that the port or cable is defective. It seems like the underlying logic is not sending traffic in outgoing direction on that single port. Would there be any difference in doing it via iperf ?

Ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.221.2.70/24
fsid = 3022ce3e-2622-43e9-a10b-fc6f573b5c69
mon_allow_pool_delete = true
mon_host = 10.221.1.70 10.221.1.71 10.221.1.72 10.221.1.73
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.221.1.70/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.ceph01-01]
public_addr = 10.221.1.70

[mon.ceph01-02]
public_addr = 10.221.1.71

[mon.ceph01-03]
public_addr = 10.221.1.72

[mon.ceph01-04]
public_addr = 10.221.1.73

ethtool enp4s0f0
Settings for enp4s0f0:
Supported ports: [ TP ]
Supported link modes: 100baseT/Full
1000baseT/Full
10000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 100baseT/Full
1000baseT/Full
10000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
MDI-X: Unknown
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes

ethtool enp5s0f0
Settings for enp5s0f0:
Supported ports: [ TP ]
Supported link modes: 100baseT/Full
1000baseT/Full
10000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 100baseT/Full
1000baseT/Full
10000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
MDI-X: Unknown
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes

--I also confirmed that the other servers in the cluster have the same output as above.

cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.116-1-pve root=/dev/mapper/pve-root ro quiet

cat /etc/default/grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX=""

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

Thank you for any sugestions.

jsterr · Sep 13, 2023

You might need amd_iommu=on iommu=pt OR intel_iommu=on iommu=pt for the best network performance. But doing iperf is essential to know what you can archieve,

brucexx · Sep 15, 2023

Cannot test at the moment as the node is in production with 4 node cluster having about 200 VMs on Ceph storage.

Don't really have a maintenance window scheduled for now. It is working on the second slave interface. I looked at the /proc/net/bonding/bond1 across the 4 nodes and they look the same with only node 2 (which is working properly) having aggregator ID 2 vs. 1 for bond1 interface but that is the only difference (mac addresses aside).

I looked at tcpdump src 10.221.2.70 --interface enp4s0f0 and I see 0 (zero) packers of any kind and plenty of packets on enp5s0f0.

Is there a place where I could see the negotiation of the LACP ? Would that tell me anything having in mind I already see the /proc/net/bonding/bond1 configuration ?

The dmesg returns this:

dmesg | grep bond1
[ 16.451242] bond1: (slave enp4s0f0): Enslaving as a backup interface with a down link
[ 16.647253] bond1: (slave enp5s0f0): Enslaving as a backup interface with a down link
[ 17.675842] vmbr1: port 1(bond1) entered blocking state
[ 17.675849] vmbr1: port 1(bond1) entered disabled state
[ 17.676776] device bond1 entered promiscuous mode
[ 22.713465] bond1: (slave enp5s0f0): link status definitely up, 10000 Mbps full duplex
[ 22.713476] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
[ 22.713484] bond1: active interface up!
[ 22.713517] vmbr1: port 1(bond1) entered blocking state
[ 22.713520] vmbr1: port 1(bond1) entered forwarding state
[ 23.457471] bond1: (slave enp4s0f0): link status definitely up, 10000 Mbps full duplex
[11012.838008] bond1: (slave enp4s0f0): speed changed to 0 on port 1
[11012.881520] bond1: (slave enp4s0f0): link status definitely down, disabling slave
[11035.297318] bond1: (slave enp4s0f0): link status definitely up, 10000 Mbps full duplex
[11035.325135] bond1: (slave enp4s0f0): speed changed to 0 on port 1
[11035.401325] bond1: (slave enp4s0f0): link status definitely down, disabling slave
[11037.281315] bond1: (slave enp4s0f0): link status definitely up, 10000 Mbps full duplex
[11130.221838] bond1: (slave enp5s0f0): speed changed to 0 on port 2
[11130.261393] bond1: (slave enp5s0f0): link status definitely down, disabling slave
[11130.261443] bond1: active interface up!
[11164.789181] bond1: (slave enp5s0f0): link status definitely up, 10000 Mbps full duplex

Starting with [11012.xxxxxx] it is me unplugging interfaces to check if restarting the enp4s0f0 would change anything which it did not , then I unplugged the enp5s0f0 and this caused the enp4s0f0 to start working untill the enp5s0f0 was reconnected again whiched killed traffic again on enp4s0f0.

Anything else I could check for the time being ? The 3 other nodes are working properly and I cannot seem to find what is different about node1.

Thank you

Deadpan110 · Sep 15, 2023

Suggestion: Check your managed switch for correct bond/LAG configuration.

brucexx · Sep 15, 2023

Please note , the traffic TO node1 from the switch (Incoming) is fine and working across both interfaces enp4s0f0 and enp5s0f0. The problem is with sending from node1 (Outgoing per nload) , only enp5s0f0 is sending unless it is unplugged and then enp4s0f0 takes over otherwise the enp4s0f0 is not sending traffic.

Other then that the switch config seems to be correct, enp4s0f0 connected to Te1/0/19 and enp5s0f0 connected to Te2/0/19 , vlan31 is the ceph cluster:

interface Te1/0/19
channel-group 19 mode active
switchport access vlan 31
exit

interface Te2/0/19
channel-group 19 mode active
switchport access vlan 31
exit

interface port-channel 19
switchport access vlan 31
exit

console#show interfaces port-channel 19
Channel Ports Ch-Type Hash Type Min-links Local Prf
------- ----------------------------- -------- --------- --------- ---------
Po19 Active: Te1/0/19, Te2/0/19 Dynamic 7 1 Disabled

console#show lacp tengigabitethernet 1/0/19
port Te1/0/19 LACP parameters:
Actor:
system priority: 1
port Admin key: 0
port oper key: 836
port oper priority: 1
port oper timeout: LONG
port Admin timeout: LONG
LACP Activity: ACTIVE
Aggregation: AGGREGATABLE
synchronization: TRUE
collecting: TRUE
distributing: TRUE
expired: FALSE
Partner:
port Admin key: 0
port oper key: 15
port Admin priority: 0
port oper priority: 255
port Oper timeout: LONG
LACP Activity: ACTIVE
Aggregation: AGGREGATABLE
synchronization: TRUE
collecting: TRUE
distributing: TRUE
expired: FALSE
port Te1/0/19 LACP Statistics:
LACP PDUs send: 184553
LACP PDUs received: 175214

console#show lacp tengigabitethernet 2/0/19
port Te2/0/19 LACP parameters:
Actor:
system priority: 1
port Admin key: 0
port oper key: 836
port oper priority: 1
port oper timeout: LONG
port Admin timeout: LONG
LACP Activity: ACTIVE
Aggregation: AGGREGATABLE
synchronization: TRUE
collecting: TRUE
distributing: TRUE
expired: FALSE
Partner:
port Admin key: 0
port oper key: 15
port Admin priority: 0
port oper priority: 255
port Oper timeout: LONG
LACP Activity: ACTIVE
Aggregation: AGGREGATABLE
synchronization: TRUE
collecting: TRUE
distributing: TRUE
expired: FALSE
port Te2/0/19 LACP Statistics:
LACP PDUs send: 184553
LACP PDUs received: 175215

Thank you

brucexx · Sep 18, 2023

Is there anything else that can be checked ?

I cannot do iperf for now. I was able to apply changes to a description of an interface which restarted the network on node1. Now the traffic outbound is using the enp4s0f0 and the enp5s0f0 is not passing any traffic out (outbound) , both interfaces are receiving incoming traffic simultaneously. It does look like the 802.3ad with layer 2+3 is not working as intended, the tcpdump shows me that the traffic out (outbound) is going to 10.221.2.71 , 10.221.2.72 , 10.221.2.73 but it is doing it only over one interface while the other interface in the bond is idle (for outbound only).

Thank you

brucexx · Sep 25, 2023

The issue has been fixed. The Ceph cluster network even with 4 nodes needs to use hash 3+4.

Search

Search

Strange network behavior on one port in a bond

brucexx

Renowned Member

jsterr

Renowned Member

brucexx

Renowned Member

jsterr

Renowned Member

brucexx

Renowned Member

Deadpan110

New Member

brucexx

Renowned Member

brucexx

Renowned Member

brucexx

Renowned Member

We value your privacy