After upgrade to 5.0, network does not start correctly every time

alchemycs

Hi!
I have a small cluster of Proxmox machines, and I am in the process of upgrading them from 4.4 to 5.0. The two I have converted so far have a problem where, every few reboots, the network simply doesn't come up. I can log in via the console and run /etc/init.d/networking restart, which brings it back up, but that's not a good solution.
I have a fairly standard bonded, VLAN'd setup:
Code:
# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface eth0 inet manual
iface eth1 inet manual

auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode 4

##  live
auto bond0.2
iface bond0.2 inet manual
        vlan-raw-device bond0

##  private
auto bond0.4
iface bond0.4 inet manual
        vlan-raw-device bond0

##  live
auto vmbr0
iface vmbr0 inet manual
        bridge_ports bond0.2
        bridge_stp off
        bridge_fd 0

##  private
auto vmbr1
iface vmbr1 inet static
        address 10.10.10.18
        netmask 255.255.255.0
        gateway 10.10.10.1
        bridge_ports bond0.4
        bridge_stp off
        bridge_fd 0

When it boots up, all the interfaces are "UP", but bond0 is set to round-robin, not 802.3ad, and I don't see how that can happen:
Code:
no-net# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:xx:xx:xx
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:90:xx:xx:xy
Slave queue ID: 0

Here is part of what it looks like when set up correctly:
Code:
good-net# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 00:25:90:08:58:82
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Actor Key: 9
        Partner Key: 19
        Partner Mac Address: f8:c0:01:cb:a1:80

Slave Interface: eth0
...etc...

And the pair of Juniper switches I have shows the same thing - both ports are up at 1 Gb, but the 802.3ad link aggregation is not.
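(For what it's worth, the mode the kernel actually applied can also be read straight from sysfs, assuming the standard bonding sysfs layout, which saves wading through the full /proc output:)
Code:
# prints e.g. "802.3ad 4" when correct, "balance-rr 0" when it comes up round-robin
cat /sys/class/net/bond0/bonding/mode
cat /sys/class/net/bond0/bonding/lacp_rate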

These machines have been working just fine before the upgrade to 5.0 - is there anything that may have changed?

Thanks in advance!
 
The config here isn't so different (apart from some LACP tuning), but the main difference is that we don't explicitly declare any bond0.x sub-interfaces.

Don't know if it matters or not (well, it might, since it always works here).
Our LAG conf:
Code:
auto eno1
iface eno1 inet manual
auto eno2
iface eno2 inet manual
auto eno3
iface eno3 inet manual
auto eno4
iface eno4 inet manual

auto bond0
iface bond0 inet manual
    slaves eno1 eno2 eno3 eno4
    bond_miimon 100
    bond_mode 802.3ad
    bond_xmit_hash_policy encap3+4
    bond_lacp_rate slow

auto vmbr0
iface vmbr0 inet static
    address  192.168.50.11
    netmask  255.255.255.128
    gateway  192.168.50.1
    bridge_ports bond0.1062
    bridge_stp off
    bridge_fd 0

...and so on
 
You should look within /etc to check whether something is creating the bond0 interface before networking starts (the bond0 interface must be down to change its mode).
Was the system properly dist-upgraded? Any rc.local stuff remaining from the past?
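If it helps, a quick way to look is something along these lines (plain grep/lsmod, nothing Proxmox-specific):
Code:
# anything under /etc referencing the bond before ifupdown gets to it?
grep -rl bond /etc 2>/dev/null
# is the bonding module already loaded, and does a bond0 device already exist?
lsmod | grep bonding
ip link show bond0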

While this is not a fix at all, as a workaround you might try something manual/dirty like
Code:
pre-up modprobe bonding
pre-up ip link add bond0 type bond mode 802.3ad || logger something sucks
just after the bond0 declaration...
 
Maybe the dist-upgrade has failed.
 
It was dist-upgraded according to the how-to on the wiki, and I can't think of anything that would cause this; rc.local is still at its default. I added your lines just after the bond0 declaration, and unfortunately there is no difference :-(

I checked dmesg for bond0 messages, and saw this when it does not come up properly:
Code:
bond0: option mode: unable to set because the bond device is up
When it does come up normally, this is all that shows up in the dmesg:
Code:
[    7.335257] bond0: Setting MII monitoring interval to 100
[    7.347282] bond0: Adding slave eth0
[    7.529506] bond0: Enslaving eth0 as a backup interface with a down link
[    7.532085] bond0: Adding slave eth1
[    7.725549] bond0: Enslaving eth1 as a backup interface with a down link

So, clearly something is setting it to the wrong mode early on, and I'm not sure what. Nothing else in /etc/ references bond0, and the if-pre-up.d/ifenslave file is functionally the same as in Proxmox 4.4.

So, back to the drawing board? :)
 
I checked dmesg for bond0 messages, and saw this when it does not come up properly:
Code:
bond0: option mode: unable to set because the bond device is up
Yes, this is the problem: it's already up when it should not be...
Anything mentioning bonding in /etc?

Does lsmod mention bonding?
In the logs you should see something like "Ethernet Channel Bonding Driver", which means the kernel module has loaded. It should be the line just before the "bond0: ..." messages. If not, something is loading the module earlier.
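Something like this should show the bonding-related kernel messages for the current boot in order (journalctl -k works if the journal captures kernel messages; plain dmesg does too):
Code:
journalctl -k -b | grep -iE 'bonding|bond0'
# or
dmesg | grep -iE 'bonding|bond0'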

It would be necessary to trace what really happens; but one possibility to force the mode at kernel module load would probably be:
Code:
echo -e "alias bond0 bonding\noptions bonding mode=4 miimon=100 lacp_rate=0" >  /etc/modprobe.d/bonding.conf
 
This is what lsmod says:
Code:
Module                  Size  Used by
bonding               147456  0

When it boots, it does have this (regardless of whether it boots successfully or not):
Code:
[    7.325589] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

And these are all the files under /etc that have "bond" in them:
Code:
/etc/network/if-up.d/ifenslave
/etc/network/if-post-down.d/ifenslave
/etc/network/if-pre-up.d/ifenslave
/etc/network/interfaces
I'll give your forced module options a try! :)
Thank you!!
 
I don't know if it's related to the network config via /etc/network/interfaces, but there is a known problem with bond0 - a "hidden bond0" created when the module is loaded. Google for "options bonding max_bonds=0".
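To illustrate that hint (untested on my side): a modprobe option like the one below keeps the module from auto-creating its own bond0 on load, so ifupdown/ifenslave can create the device itself with the intended mode:
Code:
# e.g. in /etc/modprobe.d/bonding.conf
options bonding max_bonds=0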
 
I seem to have encountered this exact same issue on a fresh Proxmox 5.1 install. The problem seems to come from upstream Debian Stretch, as I was able to fix it consistently with the attached patch (none of the solutions offered here worked consistently).
The patch introduces a short sleep directly after if-pre-up.d/ifenslave has inserted the `bonding` module into the kernel, or after manual creation of the device (when loaded with max_bonds=0).

I think I'll try opening a bug somewhere upstream shortly, since this probably also impacts all our Debian servers with an LACP bond.
For now I'm going with the patch :p

Code:
--- /etc/network/if-pre-up.d/ifenslave.orig    2018-01-31 00:39:53.408660244 +0100
+++ /etc/network/if-pre-up.d/ifenslave    2018-01-31 00:45:29.668216453 +0100
@@ -12,11 +12,15 @@
     # If the bonding module is not yet loaded, load it.
     if [ ! -r /sys/class/net/bonding_masters ]; then
         modprobe -q bonding
+        # GF20180131 Give the interface a chance to come up
+        sleep 2
     fi
 
     # Create the master interface.
     if ! grep -sq "\\<$BOND_MASTER\\>" /sys/class/net/bonding_masters; then
         echo "+$BOND_MASTER" > /sys/class/net/bonding_masters
+        # GF20180131 ... also with max_bonds=0
+        sleep 2
     fi
 }
 
Same problem here.

The bonding mode is set to bond_mode active-backup in /etc/network/interfaces, but Proxmox reproducibly comes up in load balancing (round-robin) mode (as reported by /proc/net/bonding/).
 
I've had the same issue. In my setup the LAN bond is added to the bridge 'vmbr0' (PVE 5). I added the line:
Code:
    pre-up sleep 2
to the bridge configuration. It's working so far. Didn't need to patch anything.
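In the context of the config posted at the top of the thread, that would look roughly like this (interface names taken from the original post):
Code:
auto vmbr0
iface vmbr0 inet manual
        pre-up sleep 2
        bridge_ports bond0.2
        bridge_stp off
        bridge_fd 0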
 
