bond interface fails after NIC replacement

CZappe

I've replaced a 10GbE adapter in one of our Proxmox nodes that was running a bonded pair of Cat6 cables to an aggregation switch. The old NIC and its replacement are both Intel Ethernet Controller 10-Gigabit X540-AT2 cards, so I assumed (probably incorrectly) that all I'd need to do was plug it in, find the IDs of the new ports, update the bond interface to use them, jack in the existing Cat6 cables, and off we'd go. No dice.
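For reference, "finding the IDs of the new ports" just means checking which names the kernel assigned after the swap; two quick ways to do that are:

dmesg | grep ixgbe
ip -br link show

The first shows each ixgbe port enumerating along with its PCI address, and the second lists the resulting predictable interface names.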

Externally, the network ports on both the new NIC and the switch show link lights but no activity lights. On the Proxmox node itself, I've confirmed that the NIC is recognized by the system using lshw -class network -short:

[Screenshot: lshw -class network -short output showing the 10GbE interfaces recognized by the system]

Running ip -br -c link show provides the following additional interface status info:

[Screenshot: ip -br -c link show output showing the bonded interfaces in a NO-CARRIER state]

Following that, I checked the systemd journal and can see the bond0 interface enslaving the ports and propagating its MTU setting to them. Beyond this, however, I'm not sure which of the log entries are helpful for rooting out why the ports and the bond interface are down on the new NIC. I can see that the interfaces exist and that when they fail, they fail together, as expected:

root@tsoukalos:~# journalctl | egrep 'enp7s0f*|bond0'
Feb 04 16:59:47 tsoukalos kernel: ixgbe 0000:07:00.0 enp7s0f0: renamed from eth5
Feb 04 16:59:47 tsoukalos kernel: ixgbe 0000:07:00.1 enp7s0f1: renamed from eth2
Feb 04 16:59:55 tsoukalos systemd-udevd[733]: Could not generate persistent MAC address for bond0: No such file or directory
Feb 04 16:59:55 tsoukalos kernel: ixgbe 0000:07:00.0: registered PHC device on enp7s0f0
Feb 04 16:59:55 tsoukalos kernel: bond0: (slave enp7s0f0): Enslaving as a backup interface with a down link
Feb 04 16:59:55 tsoukalos kernel: ixgbe 0000:07:00.1: registered PHC device on enp7s0f1
Feb 04 16:59:55 tsoukalos kernel: bond0: (slave enp7s0f1): Enslaving as a backup interface with a down link
Feb 04 16:59:55 tsoukalos kernel: ixgbe 0000:07:00.0 enp7s0f0: changing MTU from 1500 to 9000
Feb 04 16:59:56 tsoukalos kernel: ixgbe 0000:07:00.1 enp7s0f1: changing MTU from 1500 to 9000
Feb 04 16:59:56 tsoukalos kernel: vmbr2: port 1(bond0) entered blocking state
Feb 04 16:59:56 tsoukalos kernel: vmbr2: port 1(bond0) entered disabled state
Feb 04 16:59:56 tsoukalos kernel: device bond0 entered promiscuous mode
Feb 04 16:59:56 tsoukalos kernel: device enp7s0f0 entered promiscuous mode
Feb 04 16:59:56 tsoukalos kernel: device enp7s0f1 entered promiscuous mode
Feb 04 16:59:57 tsoukalos kernel: 8021q: adding VLAN 0 to HW filter on device enp7s0f0
Feb 04 16:59:57 tsoukalos kernel: 8021q: adding VLAN 0 to HW filter on device enp7s0f1
Feb 04 16:59:57 tsoukalos kernel: 8021q: adding VLAN 0 to HW filter on device bond0
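For anyone else debugging this, the bond driver also publishes its own status via procfs (standard Linux bonding, nothing Proxmox-specific):

cat /proc/net/bonding/bond0

This lists the MII status of the bond and of each slave, plus the 802.3ad aggregator details; in a state like the above, the slaves show MII Status: down, consistent with the "down link" messages in the journal.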

I'm pretty new to both Linux/Proxmox and Ethernet bonding, so any tips to get our node back online are much appreciated!

Chris
 
For additional info, here's what I currently have for the relevant interfaces under /etc/network/interfaces. All of these settings were configured in the Proxmox GUI:

auto enp7s0f0
iface enp7s0f0 inet manual
# Intel Ethernet Controller 10-Gigabit X540-AT2

auto enp7s0f1
iface enp7s0f1 inet manual
# Intel Ethernet Controller 10-Gigabit X540-AT2
...
auto bond0
iface bond0 inet manual
        bond-slaves enp7s0f0 enp7s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2
        mtu 9000
...
auto vmbr2
iface vmbr2 inet static
        address XXX.XXX.XXX.15/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
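As a side note, if you edit this file by hand rather than through the GUI, the changes can be applied without a reboot, assuming the ifupdown2 package is installed (it ships by default on current Proxmox releases):

ifreload -a

With the classic ifupdown, a plain systemctl restart networking does the same job.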
 
So...this was definitely a case of user error.

The reason for the "NO-CARRIER" interface status was, embarrassingly, that I had configured the bond to use the interfaces of an older 10GbE card still installed in the same server, while the Ethernet cables were plugged into the ports on my new card! Running the ethtool blink command (ethtool -p [interface]) on the ports confirmed this mistake, which I then corrected. Incidentally, the ports on the new card did not blink at any point, so I identified them by process of elimination.
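For anyone else in this spot: ethtool -p takes an optional duration in seconds, and ethtool -i maps an interface name back to its PCI address, which is what ties a port to a physical card:

ethtool -p enp7s0f0 10
ethtool -i enp7s0f0 | grep bus-info

Comparing the bus-info value (0000:07:00.0 for this port, per the journal above) against the addresses in the logs is another way to spot that the enp7s0f* names belonged to the older card.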

At this point, I configured the bond to use the correct interfaces and observed that one bonded port was reporting a link speed of 1000Mb/s while the other reported 10000Mb/s. Checking the system logs revealed an ixgbe firmware error involving the new card, likely due to a slightly different hardware revision:

... ixgbe 0000:05:00.0: Warning firmware error detected FWSM: 0x00000000
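For reference, the negotiated speed and link state of each port can be checked directly (same [interface] placeholder convention as above):

ethtool [interface] | egrep 'Speed|Link detected'

On a healthy 10GbE LACP pair, both ports should report Speed: 10000Mb/s and Link detected: yes.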

The forum post PVE 6.0-7 + ixgbe firmware errors was valuable in diagnosing and correcting this: ethtool showed that the firmware on these ports was quite old. I downloaded the latest ixgbe firmware from Intel, installed it following the steps in that thread, rebooted the server, and that cleared up all remaining issues!
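The firmware check mentioned above is ethtool's driver-info output; the firmware-version field is the one to compare between ports and against Intel's current release:

ethtool -i [interface]

This prints the driver name and version, the firmware-version, and the PCI bus-info for the port.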
 
