HP DL gen10 no bond communication

tl5k5

Hey all,
I have a DL Gen10 server that I installed an HPE dual-port 10Gb 560FLR-SFP+ adapter into.
I've found that it will not pass bond0/LACP traffic. An Intel X520-T2 with the same bond0/LACP interface config connects without issue.
The switch shows LACP is communicating/connected.

Does anyone see any issues with the config as to why it's not communicating/connecting?
Code:
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address 10.##.###.###/24
        gateway 10.###.###.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
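
For context, a minimal way to apply and then inspect a config like this (a sketch, assuming ifupdown2 as shipped with Proxmox VE; on a plain ifupdown setup a networking restart does the same job):
Code:
# reload /etc/network/interfaces (ifupdown2); or: systemctl restart networking
ifreload -a

# check whether 802.3ad actually negotiated with the switch
cat /proc/net/bonding/bond0

# quick overview of link state for all interfaces
ip -br link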

Here's what iLO sees:

[screenshot: iLO NIC view]

Thanks!!!
 
Can you post your dmesg or syslog? Are there any errors regarding that card?
 
Can you post your dmesg or syslog? Are there any errors regarding that card?
None that I understand. I'm more of a sysadmin...not a Linux admin. See the attached zip file.

UPDATE: You may see vmbr1. This was part of my testing; there's no typo in the config.
 

Attachments

  • syslog_dmesg.zip (45 KB)
What does cat /proc/net/bonding/bond0 show? Do the adapters work without the bond, i.e. eno1 and eno2 on their own?
Can you update the firmware? Is LACP correctly configured on the opposite side? (Layer 2+3 = MAC + IP)
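
A sketch of what to look for in that output (field names as printed by the Linux bonding driver, from memory; an all-zero partner MAC is the typical sign that no LACP reply ever arrived):
Code:
cat /proc/net/bonding/bond0
# Expect: "Bonding Mode: IEEE 802.3ad Dynamic link aggregation"
# Under "802.3ad info" and each slave's "details partner lacp pdu":
#   a real switch system MAC  -> LACP negotiated on that port
#   00:00:00:00:00:00         -> that port never received an LACPDU back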
 
What does cat /proc/net/bonding/bond0 show? Do the adapters work without the bond, i.e. eno1 and eno2 on their own?
Can you update the firmware? Is LACP correctly configured on the opposite side? (Layer 2+3 = MAC + IP)
See attached bond0.
Yes, both ports work on the adapter (10Gb) when not in a bond.
I already updated the adapter firmware to the latest version.
Yes, LACP is configured on the opposite side. Again, an X520-T2 worked in a bond/LACP with the same config. The Intel was swapped out for the HP FlexLOM adapter. The only difference is the card; all other components and configs are the same.

Thanks!

CORRECTION: sorry, the Intel was an X520-DA2.
 

Attachments

  • bond0.zip (697 bytes)
I may have found the issue.
OK HPE...which is it?!?!? You can't have it both ways!

Seems strange this could be different on the same exact chipset!!!

[screenshot: HPE documentation excerpt]

vs.

[screenshot: contradicting HPE documentation excerpt]
 
Can you install ethtool and check with ethtool --show-priv-flags eno1 and ethtool --show-priv-flags eno2 whether LLDP is enabled? Regarding the type of the card, you can check it with lspci, or with lshw (apt install lshw) and then lshw -c network -businfo.

Edit: maybe also check which driver is used for the card, and whether there is one available from Intel with DKMS support.
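
Roughly, those checks as commands (a sketch, assuming a Debian/Proxmox host; package names as on Debian):
Code:
apt install ethtool lshw
ethtool --show-priv-flags eno1    # look for an LLDP-related private flag
ethtool --show-priv-flags eno2
lshw -c network -businfo          # exact NIC model / chipset
lspci | grep -i ethernet          # same information via PCI IDs
ethtool -i eno1                   # driver name and firmware version in use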
 
You said the ports worked when not in a bond. Did you reconfigure your switch and remove the LACP config for the corresponding ports to test this? Are you moving cables to different ports? The question is whether you are doing an apples-to-apples comparison.
You said that a different card worked in a bond - was it physically connected to the same ports/bond on the switch, or elsewhere?

Your log file shows:
Code:
May 26 14:03:19 hostname kernel: [   10.114840] ixgbe 0000:24:00.0: registered PHC device on eno1
May 26 14:03:19 hostname kernel: [   10.224409] bond0: (slave eno1): Enslaving as a backup interface with a down link
May 26 14:03:19 hostname kernel: [   10.286721] ixgbe 0000:24:00.0 eno1: detected SFP+: 4
May 26 14:03:19 hostname kernel: [   10.517704] ixgbe 0000:24:00.1: registered PHC device on eno2
May 26 14:03:19 hostname kernel: [   10.537993] ixgbe 0000:24:00.0 eno1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
May 26 14:03:19 hostname kernel: [   10.624412] bond0: (slave eno2): Enslaving as a backup interface with a down link
May 26 14:03:19 hostname kernel: [   10.640490] vmbr1: port 1(bond0) entered blocking state
May 26 14:03:19 hostname kernel: [   10.640493] vmbr1: port 1(bond0) entered disabled state
May 26 14:03:19 hostname kernel: [   10.686445] ixgbe 0000:24:00.1 eno2: detected SFP+: 3
May 26 14:03:19 hostname kernel: [   10.806289] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
May 26 14:03:19 hostname kernel: [   10.814523] vmbr1: port 1(bond0) entered blocking state
May 26 14:03:19 hostname kernel: [   10.814527] vmbr1: port 1(bond0) entered forwarding state
May 26 14:03:19 hostname kernel: [   10.818105] bond0: (slave eno1): link status definitely up, 10000 Mbps full duplex
May 26 14:03:19 hostname kernel: [   10.818120] bond0: active interface up!
May 26 14:03:19 hostname kernel: [   10.941933] ixgbe 0000:24:00.1 eno2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
May 26 14:03:19 hostname kernel: [   11.026138] bond0: (slave eno2): link status definitely up, 10000 Mbps full duplex
May 26 14:03:20 hostname kernel: [   11.817890] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr1: link becomes ready

This is a pretty descriptive message explaining what Linux wanted but didn't get:
May 26 14:03:19 hostname kernel: [ 10.806289] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond

Double-check all of your configuration. Install tcpdump and run this on each physical interface (on both the working and the non-working card):
# tcpdump -ni <IFNAME> ether proto 0x8809

Do you see LACP PDUs in each case? You should be seeing them regardless of the Linux-side config.
Check the switch for any sort of "banning" of macs/ports due to flapping/changing.
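
For example, a quick way to watch both members in turn (a sketch; tcpdump and coreutils timeout assumed installed, interface names as above - at the slow LACP rate PDUs only arrive every 30 seconds, so capture long enough):
Code:
apt install tcpdump
for ifname in eno1 eno2; do
    echo "== $ifname =="
    # LACPDUs use the slow-protocols ethertype 0x8809
    timeout 90 tcpdump -ni "$ifname" ether proto 0x8809
done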


 
I am cp'ing backups of the bonded and non-bonded interfaces files into place and then running systemctl restart networking.
I have two runs of cables going to two different locations on the switch. One set is a LACP LAG and the other is a switchport.
I move the cable(s) into the same port(s) on the NIC.
So yes...apples to apples.
Yes, the different cards used the same cables and same ports on the switch. The Intel is no longer installed.

Config is exactly the same between the Intel and HP cards. They both show up as eno1 and eno2 since they were never in the server at the same time.
I'll look into tcpdump.
The switch ports and LACP show all is good on that end.
This is a pretty simple config. I'm not sure why this is such an issue.

Thanks for the help!
 
Can you install ethtool and check with ethtool --show-priv-flags eno1 and ethtool --show-priv-flags eno2 whether LLDP is enabled? Regarding the type of the card, you can check it with lspci, or with lshw (apt install lshw) and then lshw -c network -businfo.

Edit: maybe also check which driver is used for the card, and whether there is one available from Intel with DKMS support.
See attached.
 

Attachments

  • hwls_ethtool.zip (598 bytes)
Please post ethtool -m eno1 and ethtool eno1 - also for eno2.
lshw says this is an 82599ES card - googling that together with "bond" brings up a lot of threads where bonding is not working.
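
Something like this would collect the requested output for both ports in one go (a sketch; the output filenames are arbitrary):
Code:
for ifname in eno1 eno2; do
    {
        echo "### ethtool $ifname"       # link state, speed, supported modes
        ethtool "$ifname"
        echo "### ethtool -m $ifname"    # SFP+ module / EEPROM details
        ethtool -m "$ifname"
    } > "ethtool-$ifname.txt"
done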
 
Please post ethtool -m eno1 and ethtool eno1 - also for eno2.
lshw says this is an 82599ES card - googling that together with "bond" brings up a lot of threads where bonding is not working.
Interesting...that makes me wonder even more about Post 6 and how the HPE docs contradict themselves.

See attached ethtool info
 

Attachments

  • ethtool-enoX.zip (1.9 KB)
Thanks everyone for the help.
I'll start troubleshooting again on Monday.
Everyone have a great weekend!
 
