After upgrade to 5.0, network does not start correctly every time

Discussion in 'Proxmox VE: Networking and Firewall' started by alchemycs, Sep 2, 2017.

  1. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    30
    Likes Received:
    5
    Hi!
    I have a small cluster of Proxmox machines, and I am in the process of upgrading them from 4.4 to 5.0. On the two that I have converted, every few reboots the network simply doesn't come up. I can log in via the console and run /etc/init.d/networking restart to bring it up, but that's not a good solution.
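    For what it's worth, on 5.0 (Debian Stretch) the restart can also be done via systemd, and the bond mode can be checked quickly after boot; a rough sketch, assuming the bond is still called bond0:
    Code:
    systemctl restart networking
    grep "Bonding Mode" /proc/net/bonding/bond0
    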
    I have a fairly standard bonded, VLAN'd setup:
    Code:
    # cat /etc/network/interfaces
    auto lo
    iface lo inet loopback
    
    iface eth0 inet manual
    iface eth1 inet manual
    
    auto bond0
    iface bond0 inet manual
            slaves eth0 eth1
            bond_miimon 100
            bond_mode 4
    
    ##  live
    auto bond0.2
    iface bond0.2 inet manual
            vlan-raw-device bond0
    
    ##  private
    auto bond0.4
    iface bond0.4 inet manual
            vlan-raw-device bond0
    
    ##  live
    auto vmbr0
    iface vmbr0 inet manual
            bridge_ports bond0.2
            bridge_stp off
            bridge_fd 0
    
    ##  private
    auto vmbr1
    iface vmbr1 inet static
            address 10.10.10.18
            netmask 255.255.255.0
            gateway 10.10.10.1
            bridge_ports bond0.4
            bridge_stp off
            bridge_fd 0
    
    When it boots up, all the interfaces are "UP", but bond0 is set to round-robin, not 802.3ad, and I don't know how that could happen:
    Code:
    no-net# cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
    Bonding Mode: load balancing (round-robin)
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    
    Slave Interface: eth0
    MII Status: up
    Speed: 1000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 00:25:90:xx:xx:xx
    Slave queue ID: 0
    
    Slave Interface: eth1
    MII Status: up
    Speed: 1000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 00:25:90:xx:xx:xy
    Slave queue ID: 0
    
    Here is part of what it looks like when set up correctly:
    Code:
    good-net# cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
    Bonding Mode: IEEE 802.3ad Dynamic link aggregation
    Transmit Hash Policy: layer2 (0)
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    
    802.3ad info
    LACP rate: slow
    Min links: 0
    Aggregator selection policy (ad_select): stable
    System priority: 65535
    System MAC address: 00:25:90:08:58:82
    Active Aggregator Info:
            Aggregator ID: 1
            Number of ports: 2
            Actor Key: 9
            Partner Key: 19
            Partner Mac Address: f8:c0:01:cb:a1:80
    
    Slave Interface: eth0
    ...etc...
    
    And the pair of Juniper switches I have shows the same thing: both ports are up at 1 Gbps, but the 802.3ad link aggregation is not.

    These machines had been working just fine before the upgrade to 5.0. Is there anything that may have changed?

    Thanks in advance!
     
    hitsword likes this.
  2. Symbol

    Symbol Member
    Proxmox VE Subscriber

    Joined:
    Mar 1, 2017
    Messages:
    39
    Likes Received:
    2
    The config here isn't so different (apart from some LACP tuning), but the main difference is that we don't explicitly declare any bond0.x sub-interfaces.

    I don't know if it matters or not (well, it might, since it always works here).
    Our LAG config:
    Code:
    auto eno1
    iface eno1 inet manual
    auto eno2
    iface eno2 inet manual
    auto eno3
    iface eno3 inet manual
    auto eno4
    iface eno4 inet manual
    
    auto bond0
    iface bond0 inet manual
        slaves eno1 eno2 eno3 eno4
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy encap3+4
        bond_lacp_rate slow
    
    auto vmbr0
    iface vmbr0 inet static
        address  192.168.50.11
        netmask  255.255.255.128
        gateway  192.168.50.1
        bridge_ports bond0.1062
        bridge_stp off
        bridge_fd 0
    
    ...and so on
    
     
  3. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    30
    Likes Received:
    5
    Because I didn't have anything else to try, I went ahead and removed all the bond0.x configs, but I still get the same problem.
    Any other thoughts, anyone? :-D
     
  4. Symbol

    Symbol Member
    Proxmox VE Subscriber

    Joined:
    Mar 1, 2017
    Messages:
    39
    Likes Received:
    2
    You should look within /etc to check whether something is creating the bond0 interface before the network starts (the bond0 interface must be down to change its mode).
    Was the system properly dist-upgraded? Any rc.local stuff remaining from the past?
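    For instance, a quick way to search for anything that might create or load bond0 early (only a sketch; some of these paths may not exist on your system):
    Code:
    grep -ri bond /etc/modules /etc/modules-load.d/ /etc/modprobe.d/ /etc/rc.local 2>/dev/null
    grep -ri bond /etc/network/ 2>/dev/null
    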

    While this is not a fix at all, as a workaround you might try something manual/dirty:
    Code:
    pre-up modprobe bonding
    pre-up ip link add bond0 type bond mode 802.3ad || logger something sucks
    
    just after the bond0 declaration...
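    In /etc/network/interfaces that would look something like this (only a sketch, based on the bond0 stanza from the first post):
    Code:
    auto bond0
    iface bond0 inet manual
            pre-up modprobe bonding
            pre-up ip link add bond0 type bond mode 802.3ad || logger something sucks
            slaves eth0 eth1
            bond_miimon 100
            bond_mode 4
    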
     
  5. Yamnsllerty

    Yamnsllerty New Member

    Joined:
    Aug 5, 2017
    Messages:
    6
    Likes Received:
    0
    Maybe it has failed.
     
  6. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    30
    Likes Received:
    5
    It was dist-upgraded according to the how-to on the wiki, and I can't think of anything that would cause this; I keep rc.local at its default. I added your lines just after the bond0 declaration, and unfortunately there is no difference :-(

    I checked dmesg for bond0 messages, and saw this when it does not come up properly:
    Code:
    bond0: option mode: unable to set because the bond device is up
    When it does come up normally, this is all that shows up in the dmesg:
    Code:
    [    7.335257] bond0: Setting MII monitoring interval to 100
    [    7.347282] bond0: Adding slave eth0
    [    7.529506] bond0: Enslaving eth0 as a backup interface with a down link
    [    7.532085] bond0: Adding slave eth1
    [    7.725549] bond0: Enslaving eth1 as a backup interface with a down link
    
    So, clearly something is setting it to the wrong mode early on, and I'm not sure what. Nothing else in /etc/ references bond0, and the if-pre-up.d/ifenslave file is functionally the same as in Proxmox 4.4.

    So, back to the drawing board? :)
     
  7. Symbol

    Symbol Member
    Proxmox VE Subscriber

    Joined:
    Mar 1, 2017
    Messages:
    39
    Likes Received:
    2
    Yes, this is the problem: it's already up when it should not be...
    Anything mentioning bonding in /etc?

    Does lsmod mention bonding?
    In the logs you should see something like "Ethernet Channel Bonding Driver", which means the kernel module has loaded. It should be the line just before "bond0: blah". If not, something is loading the module earlier.
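    A possible way to confirm where the module gets loaded (a sketch; the exact initramfs name depends on the running kernel):
    Code:
    # was the bonding driver logged before ifup ran?
    journalctl -b | grep -i "bonding driver"
    # is the module included in the initramfs?
    lsinitramfs /boot/initrd.img-$(uname -r) | grep -i bonding
    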

    It would be necessary to trace exactly what happens; but one possibility, to force the mode when the kernel module is loaded, would probably be:
    Code:
    echo -e "alias bond0 bonding\noptions bonding mode=4 miimon=100 lacp_rate=0" >  /etc/modprobe.d/bonding.conf
     
  8. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    30
    Likes Received:
    5
    This is what lsmod says:
    Code:
    Module                  Size  Used by
    bonding               147456  0
    
    When it boots, this line always shows up (regardless of whether the network comes up correctly or not):
    Code:
    [    7.325589] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
    And these are all the files under /etc that have "bond" in them:
    Code:
    /etc/network/if-up.d/ifenslave
    /etc/network/if-post-down.d/ifenslave
    /etc/network/if-pre-up.d/ifenslave
    /etc/network/interfaces
    
    I'll give your forced module options a try! :)
    Thank you!!
     
  9. czechsys

    czechsys Member

    Joined:
    Nov 18, 2015
    Messages:
    122
    Likes Received:
    3
    I don't know if it's related to network config via /etc/network/interfaces, but there is a known problem with bond0: a "hidden" bond0 gets created when the module is loaded. Search for "options bonding max_bonds=0".
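    A sketch of that approach (assuming the bonding options live in the file shown below): with max_bonds=0 the module no longer auto-creates bond0 at load time, so ifupdown/ifenslave can create it later with the configured mode.
    Code:
    # /etc/modprobe.d/bonding.conf
    options bonding max_bonds=0
    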
     
  10. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    30
    Likes Received:
    5
    @czechsys, that is interesting, and it does look like the problem that has come up since the upgrade.
    @Symbol, that does seem to have fixed it. Thank you for that! ^_^
     
  11. Gerlof Fokkema

    Gerlof Fokkema New Member

    Joined:
    Jan 31, 2018
    Messages:
    1
    Likes Received:
    2
    I seem to have encountered this exact same issue on a fresh Proxmox 5.1 install. The problem seems to come from upstream Debian Stretch, as I was able to fix it consistently with the attached patch (none of the solutions offered here worked consistently for me).
    The patch introduces a short sleep directly after if-pre-up.d/ifenslave has inserted the `bonding` module into the kernel, or after manual creation of the device (when the module is loaded with max_bonds=0).

    I think I'll try opening a bug upstream shortly, since this probably also impacts all our Debian servers with an LACP bond.
    For now I'm going with the patch :p

    Code:
    --- /etc/network/if-pre-up.d/ifenslave.orig    2018-01-31 00:39:53.408660244 +0100
    +++ /etc/network/if-pre-up.d/ifenslave    2018-01-31 00:45:29.668216453 +0100
    @@ -12,11 +12,15 @@
         # If the bonding module is not yet loaded, load it.
         if [ ! -r /sys/class/net/bonding_masters ]; then
             modprobe -q bonding
    +        # GF20180131 Give the interface a chance to come up
    +        sleep 2
         fi
     
         # Create the master interface.
         if ! grep -sq "\\<$BOND_MASTER\\>" /sys/class/net/bonding_masters; then
             echo "+$BOND_MASTER" > /sys/class/net/bonding_masters
    +        # GF20180131 ... also with max_bonds=0
    +        sleep 2
         fi
     }
    
     
    MR_Andrew and Symbol like this.
  12. volker

    volker New Member

    Joined:
    Sep 9, 2015
    Messages:
    6
    Likes Received:
    0
    Same problem here.

    The bonding mode is set to bond_mode active-backup in /etc/network/interfaces, but Proxmox reproducibly comes up in load balancing (round-robin) mode (as reported by /proc/net/bonding/).
     
  13. Dale Sykora

    Dale Sykora New Member
    Proxmox VE Subscriber

    Joined:
    Jul 5, 2016
    Messages:
    11
    Likes Received:
    0
    Same issue here. Thanks for the patch Gerlof! That seems to solve the problem.
     