Round-Robin stuck for 2 out of 7 server Bonds

Raymond Burns

I have a weird problem here. Seven servers are configured exactly the same way, but two of them are stuck in round-robin mode for bond0 even though the configuration specifies active-backup. The two NICs are connected to two completely separate Netgear switches.

Code:
prox-h:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage part of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode active-backup

auto bond1
iface bond1 inet manual
        slaves eth2 eth3
        bond_miimon 100
        bond_mode active-backup

auto vmbr0
iface vmbr0 inet static
        address  10.255.86.27
        netmask  255.255.255.0
        gateway  10.255.86.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0
#255.86.0 LAN (Add-in)

auto vmbr1
iface vmbr1 inet static
        address  10.255.87.27
        netmask  255.255.255.0
        bridge_ports bond1
        bridge_stp off
        bridge_fd 0
#CEPH Storage Network

Code:
prox-h:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 90:e2:ba:11:62:60
Slave queue ID: 0

Slave Interface: eth1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 90:e2:ba:11:62:61
Slave queue ID: 0

Code:
prox-h:~# cat /sys/class/net/bond0/bonding/mode
balance-rr 0
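
A minimal sketch for spotting this mismatch on each node: it compares the `bond_mode` written in an ifupdown-style interfaces file with the mode the kernel reports in sysfs. The function names (`configured_bond_mode`, `running_bond_mode`, `check_bond_mode`) are made up for illustration, and it assumes the mode is configured by name (for example `active-backup`) rather than by number.

```shell
# Print the bond_mode configured for one bond in an interfaces file.
# $1 = bond name (e.g. bond0), $2 = path to the interfaces file
configured_bond_mode() {
    awk -v bond="$1" '
        # an "iface" line starts a new stanza; remember whether it is ours
        $1 == "iface" { in_stanza = ($2 == bond) }
        in_stanza && ($1 == "bond_mode" || $1 == "bond-mode") { print $2; exit }
    ' "$2"
}

# Print the mode name the kernel is actually running.
# $1 = path to the sysfs mode file, whose content looks like "balance-rr 0"
running_bond_mode() {
    awk '{ print $1; exit }' "$1"
}

# Compare configured and running mode; returns non-zero on a mismatch.
# $2 and $3 are optional and default to the usual system paths.
check_bond_mode() {
    bond="$1"
    ifaces="${2:-/etc/network/interfaces}"
    sysfs="${3:-/sys/class/net/$bond/bonding/mode}"
    want=$(configured_bond_mode "$bond" "$ifaces")
    have=$(running_bond_mode "$sysfs")
    if [ "$want" = "$have" ]; then
        echo "$bond: OK ($have)"
    else
        echo "$bond: MISMATCH configured=$want running=$have"
        return 1
    fi
}
```

After sourcing the functions, running `check_bond_mode bond0` on each node would show which nodes ignore the configured mode. It only reports the symptom and does not explain why the mode was not applied.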

Code:
prox-h:~# ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: on (auto)
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes

Code:
prox-h:~# ethtool eth1
Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: Unknown!
        Duplex: Unknown! (255)
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: off (auto)
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: no

Also, I can assure you that eth1 is plugged in, and the switch shows link and activity on that port. It may be a bad cable, but that is beside the point.
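
To compare what the bonding driver sees against what the switch shows, the per-slave link state can be pulled out of the bonding status file. This is only a sketch: `bond_slave_status` is a hypothetical helper name, and it assumes the `/proc/net/bonding/bondX` layout shown above ("Slave Interface:" followed by "MII Status:").

```shell
# Print "<slave> <MII status>" for every slave of a bond.
# $1 = path to the bonding status file (defaults to bond0)
bond_slave_status() {
    awk -F': ' '
        # remember the slave name, then wait for its MII Status line
        /^Slave Interface:/ { slave = $2; next }
        # the bond-level "MII Status" comes before any slave and is skipped
        slave != "" && /^MII Status:/ { print slave, $2; slave = "" }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Running `bond_slave_status /proc/net/bonding/bond0` on each node gives one line per slave, which makes it easy to see where the driver and the switch disagree.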

I have another server with the exact same configuration, and it logs this error:
Code:
received packet on bond0 with own address as source address

The 5 other servers are on the same 2 switches, show no errors, and are properly set to active-backup. All were upgraded from 4.x to 5.1 recently.

Code:
prox-h:~# pveversion
pve-manager/5.1-41/0b958203 (running kernel: 4.13.13-3-pve)


Side note: this is the configuration from a working node without errors.
Code:
Prox-G:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface eth4 inet manual

iface eth5 inet manual

iface eth2 inet manual

iface eth3 inet manual

auto bond0
iface bond0 inet manual
        slaves eth7 eth9
        bond_miimon 100
        bond_mode active-backup
        bond_primary eth9

auto bond1
iface bond1 inet manual
        slaves eth6 eth8
        bond_miimon 100
        bond_mode active-backup
        bond_primary eth8

auto vlan2098
iface vlan2098 inet manual
        vlan-raw-device bond1

auto vmbr0
iface vmbr0 inet static
        address  10.255.86.26
        netmask  255.255.255.0
        gateway  10.255.86.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0

auto vmbr1
iface vmbr1 inet static
        address  10.255.87.26
        netmask  255.255.255.0
        bridge_ports vlan2098
        bridge_stp off
        bridge_fd 0
        bridge_vlan_aware yes
        network 10.255.87.0
Code:
Prox-G:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth9 (primary_reselect always)
Currently Active Slave: eth9
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth7
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:0a:cd:29:5e:41
Slave queue ID: 0

Slave Interface: eth9
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0a:cd:29:5e:40
Slave queue ID: 0
Code:
Prox-G:~# pveversion
pve-manager/5.1-41/0b958203 (running kernel: 4.13.13-3-pve)
 
I don't know if this is the problem, but in your first network config there is no
Code:
       bond_primary ethX
on the bonds.
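
For illustration, the bond0 stanza from the first config with that line added might look like the sketch below. The choice of eth0 as primary is an assumption made only for this example. The bonding driver does not require a primary for active-backup, so this may well not be the cause, but it is a cheap thing to try.

```
# assumption: eth0 picked as primary purely for illustration
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode active-backup
        bond_primary eth0
```

After changing the file, the bond has to be brought down and up again (or the node rebooted) before /proc/net/bonding/bond0 reflects the new settings.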