Ceph Cluster Bonded 10GbE error

psionic

Member
May 23, 2019
Seeing the following errors on Proxmox Node Terminal screen:
[screenshot attachment: upload_2019-6-7_9-43-0.png]

ens1f2 and ens1f3 are LACP bonded with a Layer 2 hash policy, using an XS728T smart switch.

ethtool -i bond0
driver: bonding
version: 3.7.1
firmware-version: 2
expansion-rom-version:
bus-info:
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

dmesg | grep -i ethernet
[ 1.745517] i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 2.1.14-k
[ 6.545528] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Any suggestions please?
 
Could you please post the config file rather than a screenshot? Not all parameters are visible in the GUI. Please also post the bond status output.
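For reference, both pieces of requested information can be gathered with standard commands; a minimal sketch, assuming the usual Debian/Proxmox paths (the readability check just lets the snippet run on any machine):

```shell
# Gather the bond config and the kernel's live LACP status.
# /etc/network/interfaces and /proc/net/bonding/bond0 are the
# standard locations on a Debian-based Proxmox node.
report=$(
    for f in /etc/network/interfaces /proc/net/bonding/bond0; do
        echo "== $f =="
        if [ -r "$f" ]; then cat "$f"; else echo "(not readable here)"; fi
    done
)
echo "$report"
```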
 
cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface eno1 inet manual

auto ens1f0
iface ens1f0 inet static
address 10.10.1.40
netmask 255.255.255.0
#Prox-Cluster-R0

auto ens1f1
iface ens1f1 inet static
address 10.10.2.40
netmask 255.255.255.0
#Prox-Cluster-R1

iface ens1f2 inet manual

iface eno2 inet manual

iface ens1f3 inet manual

auto bond0
iface bond0 inet static
address 10.10.3.40
netmask 24
bond-slaves ens1f2 ens1f3
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer2
#Ceph-Cluster

auto vmbr0
iface vmbr0 inet static
address 192.168.1.140
netmask 255.255.0.0
gateway 192.168.2.3
bridge-ports eno1
bridge-stp off
bridge-fd 0
#VM-LAN

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 0c:c4:7a:ea:34:be
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 15
Partner Key: 1009
Partner Mac Address: 28:80:88:f9:49:dd

Slave Interface: ens1f2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 0c:c4:7a:ea:34:be
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 0c:c4:7a:ea:34:be
port key: 15
port priority: 255
port number: 1
port state: 61
details partner lacp pdu:
system priority: 32768
system mac address: 28:80:88:f9:49:dd
oper key: 1009
port priority: 128
port number: 23
port state: 61

Slave Interface: ens1f3
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 0c:c4:7a:ea:34:bf
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 0c:c4:7a:ea:34:be
port key: 15
port priority: 255
port number: 2
port state: 61
details partner lacp pdu:
system priority: 32768
system mac address: 28:80:88:f9:49:dd
oper key: 1009
port priority: 128
port number: 24
port state: 61
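The status above already contains a clue: ens1f3 reports a nonzero Link Failure Count, i.e. that link has flapped. A quick way to spot flapping slaves is to scan the bond status for that counter; a minimal sketch (the sample text is a trimmed copy of the output above, since on a live node you would read /proc/net/bonding/bond0 directly):

```shell
# Flag bond slaves whose Link Failure Count is nonzero.
# 'status' is a trimmed copy of the /proc/net/bonding/bond0 output above;
# on a live node: status=$(cat /proc/net/bonding/bond0)
status='Slave Interface: ens1f2
Link Failure Count: 0
Slave Interface: ens1f3
Link Failure Count: 2'

flapping=$(echo "$status" | awk '
    /^Slave Interface:/    { slave = $3 }
    /^Link Failure Count:/ { if ($4 > 0) print slave " flapped " $4 " time(s)" }')
echo "$flapping"
```

With the numbers from this thread it reports that ens1f3 flapped 2 time(s).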
 
Could you try swapping the cables? Maybe one of your cables or ports is broken.
 
Yes, I tried a cable swap and different ports. Trying a LACP Layer 3+4 capable switch next. I have seen several articles alluding to STP Layer 2 issues with Supermicro X710 NICs that could possibly be fixed with the latest Intel firmware, which Supermicro hasn't approved and released yet. I also purchased an Intel X710 NIC that can be updated, to see if that solves the issue.
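On the Linux side, trying Layer 3+4 hashing needs no new hardware: only the hash-policy line of the bond stanza changes (the switch handles hashing for received traffic independently). A sketch of the bond0 stanza from above with layer3+4 hashing, with the netmask written in dotted form for consistency with the other stanzas; note that per the kernel bonding documentation, layer3+4 is not fully 802.3ad-compliant, since fragments of one flow can hash onto different links:

```
auto bond0
iface bond0 inet static
	address 10.10.3.40
	netmask 255.255.255.0
	bond-slaves ens1f2 ens1f3
	bond-miimon 100
	bond-mode 802.3ad
	bond-xmit-hash-policy layer3+4
#Ceph-Cluster
```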
 
ens1f2 and ens1f3 are LACP bonded with a Layer 2 hash policy, using an XS728T smart switch.
The culprit lies here. At a previous job, we also used the XS728T; it dropped the speed to 0 while happily showing 10GbE in the port view. Setting the port to 1GbE (from auto-neg), applying it, and switching back to auto-neg restored the link speed. This is a bug in that Netgear switch, and I would not recommend using it in production.
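One way to catch the failure mode described (the port negotiates down while the switch UI still claims 10GbE) is to check each slave's negotiated speed from the host side with ethtool. A minimal sketch; the sample line stands in for live ethtool output, and the 1000Mb/s value is an illustrative degraded reading, not taken from this thread:

```shell
# Alert when a slave negotiates below the expected 10GbE.
expected=10000
# On a live node: line=$(ethtool ens1f3 | grep 'Speed:')
line='	Speed: 1000Mb/s'
speed=$(echo "$line" | sed -n 's/.*Speed: \([0-9]*\)Mb\/s.*/\1/p')
if [ "$speed" -lt "$expected" ]; then
    echo "degraded link: ${speed}Mb/s (expected ${expected}Mb/s)"
fi
```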
 
The culprit lies here. At a previous job, we also used the XS728T; it dropped the speed to 0 while happily showing 10GbE in the port view. Setting the port to 1GbE (from auto-neg), applying it, and switching back to auto-neg restored the link speed. This is a bug in that Netgear switch, and I would not recommend using it in production.
Thanks for the info. I am currently upgrading the switches supporting the cluster to the M4300 series...