Ceph Cluster Bonded 10GbE errors

psionic
May 23, 2019
Seeing the following errors on the Proxmox node terminal screen:
[screenshot: upload_2019-6-7_9-43-0.png]

ens1f2 and ens1f3 are LACP bonded with the layer2 hash policy, connected to a Netgear XS728T smart switch.

ethtool -i bond0
driver: bonding
version: 3.7.1
firmware-version: 2
expansion-rom-version:
bus-info:
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
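Note that "ethtool -i bond0" queries the bonding driver itself, not the physical NICs, so the "firmware-version: 2" line above says nothing about the X710 cards. To see the actual NIC driver and firmware level, query one of the slave ports directly:

ethtool -i ens1f2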

dmesg | grep -i ethernet
[ 1.745517] i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 2.1.14-k
[ 6.545528] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
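In case it helps narrow things down, the error counters on the bond and on each slave can be compared (interface names as in this setup):

ip -s link show bond0                      # bond-level RX/TX error and drop counters
ethtool -S ens1f2 | grep -iE 'err|drop'    # per-NIC hardware counters (i40e)
ethtool -S ens1f3 | grep -iE 'err|drop'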

Any suggestions please?
 
Could you please post the config file and not a screenshot? Not all parameters are visible in the GUI. And please post the bond status output, too.
 
cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface eno1 inet manual

auto ens1f0
iface ens1f0 inet static
    address 10.10.1.40
    netmask 255.255.255.0
#Prox-Cluster-R0

auto ens1f1
iface ens1f1 inet static
    address 10.10.2.40
    netmask 255.255.255.0
#Prox-Cluster-R1

iface ens1f2 inet manual

iface eno2 inet manual

iface ens1f3 inet manual

auto bond0
iface bond0 inet static
    address 10.10.3.40
    netmask 24
    bond-slaves ens1f2 ens1f3
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2
#Ceph-Cluster

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.140
    netmask 255.255.0.0
    gateway 192.168.2.3
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
#VM-LAN
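A side note on the hash policy: with layer2, the outgoing slave is chosen from an XOR of the source and destination MAC addresses, so all traffic between any two given hosts always rides a single 10GbE link. If the switch can hash on L3/L4, layer3+4 spreads individual TCP connections across both links (the kernel bonding docs note it is not strictly 802.3ad compliant, but it is widely used). A sketch of the changed stanza, everything else as above:

auto bond0
iface bond0 inet static
    address 10.10.3.40
    netmask 24
    bond-slaves ens1f2 ens1f3
    bond-miimon 100
    bond-mode 802.3ad
    # hash on IP addresses and ports instead of MACs
    bond-xmit-hash-policy layer3+4
#Ceph-Cluster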

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 0c:c4:7a:ea:34:be
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 15
Partner Key: 1009
Partner Mac Address: 28:80:88:f9:49:dd

Slave Interface: ens1f2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 0c:c4:7a:ea:34:be
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 0c:c4:7a:ea:34:be
port key: 15
port priority: 255
port number: 1
port state: 61
details partner lacp pdu:
system priority: 32768
system mac address: 28:80:88:f9:49:dd
oper key: 1009
port priority: 128
port number: 23
port state: 61

Slave Interface: ens1f3
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 0c:c4:7a:ea:34:bf
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 0c:c4:7a:ea:34:be
port key: 15
port priority: 255
port number: 2
port state: 61
details partner lacp pdu:
system priority: 32768
system mac address: 28:80:88:f9:49:dd
oper key: 1009
port priority: 128
port number: 24
port state: 61
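For anyone reading the LACP PDU details: "port state" is a bitmask, and 61 (activity + aggregation + synchronization + collecting + distributing) is the healthy value, so LACP negotiation itself looks fine on both ports here. The one thing that stands out above is "Link Failure Count: 2" on ens1f3. A small bash sketch to decode a port state value:

#!/bin/bash
# Decode an LACP port state bitmask as printed in /proc/net/bonding/bond0.
state=${1:-61}
names=(activity timeout aggregation synchronization collecting distributing defaulted expired)
for i in "${!names[@]}"; do
    # print the name of every bit that is set in the state value
    (( state & (1 << i) )) && echo "bit $i: ${names[$i]}"
done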
 
Could you try to swap cables? Maybe one of your cables or ports is broken.
 
Yes, tried a cable swap and different ports. Trying an LACP Layer 3+4 capable switch next. Have seen several articles alluding to STP Layer 2 issues with Supermicro X710 NICs that could possibly be fixed with the latest Intel firmware, which Supermicro hasn't approved and released yet. I also purchased an Intel X710 NIC that can be updated, to see if that could solve the issue.
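One known X710 quirk in this L2 area is the firmware LLDP agent, which consumes L2 control frames before the driver sees them. Newer i40e drivers expose a private flag to turn it off per port; whether this driver build (2.1.14-k) has it would need to be checked first:

ethtool --show-priv-flags ens1f2
# if "disable-fw-lldp" is listed:
ethtool --set-priv-flags ens1f2 disable-fw-lldp on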
 
ens1f2 and ens1f3 are LACP bonded with the layer2 hash policy, connected to a Netgear XS728T smart switch.
The culprit lies here. At a previous job we also used the XS728T: it dropped the speed to 0 while happily showing 10GbE on the port view. Setting the port to 1 GbE (from auto-neg), applying it, and switching back to auto-neg restored the link speed. This is a bug of that Netgear switch and I would not recommend it for production use.
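If it helps anyone, a crude way to watch for this from the host side is to poll the negotiated speed and link state on both slaves (though if the switch only throttles internally, the host may keep reporting 10000Mb/s):

watch -n 5 "ethtool ens1f2 | grep -E 'Speed|Link detected'; ethtool ens1f3 | grep -E 'Speed|Link detected'"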
 
Thanks for the info. I am currently upgrading the cluster switches to the Netgear M4300 series...
 
