Fault-tolerance bond with end-to-end testing

Xandrios

New Member
Mar 11, 2024
Hi,

After realising that NFS trunking/multipath does not offer improved redundancy, I'm looking at getting this done at the network level. Specifically, I would like to use a pair of switches without MLAG-like features and still have redundancy for the storage network. Ideally by giving each node (including the NAS/SAN) two separate uplinks and sending traffic out the link that has end-to-end connectivity to the other side.

Basic link monitoring is insufficient, as switches can easily fail while still keeping their links up. Also, if the cross-connect between the switches were to fail, all links to the hosts would remain up (so no link switching happens), yet hosts that have their primary link towards switch #1 won't be able to see hosts that have their primary link on switch #2. So, we need something 'smarter'.

So, ideally we would want to use LACP from the nodes to two switches, with MLAG between the switches. This allows for high availability and is the most flexible option - but it is most likely also not within the budget for our small deployments (2-4 PVE nodes). Another (cheaper) option would be to stack the switches, however that does not allow the switches to be managed individually - meaning that one config error, or one firmware update, takes down everything.

In this thread I was spitballing some ideas on how to do this at the OS level: using an ARP ping to determine which of the two interfaces has access to the destination, and activating/changing a network route to force outgoing traffic onto that interface. This could be done with a regularly executed script, for instance.
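
For illustration, a rough sketch of what such a script might look like - the interface names, probe target and /32 route are placeholders, and this is untested spitballing rather than something I'm running:

Bash:
#!/bin/bash
# Rough sketch: probe the storage target over the primary NIC and pin the
# host route to the backup NIC if the probe fails. All names/IPs are placeholders.
TARGET=10.200.0.201
PRIMARY=eno7
BACKUP=eno8

if arping -c 3 -w 3 -I "$PRIMARY" "$TARGET" > /dev/null 2>&1; then
    ip route replace "${TARGET}/32" dev "$PRIMARY"
else
    ip route replace "${TARGET}/32" dev "$BACKUP"
fi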

The idea for this mechanism basically comes from the NIC-teaming options that used to be available. NIC teaming itself is no longer used, I believe. However, looking through the man pages for bonding, something similar actually still seems to be available: bonding with an ARP IP target. This allows an active-backup bond to decide which link to use based on an ARP ping rather than on link status. Basically exactly what I'm looking for.

So I'm trying this on a PVE host. The first thing I realised is that, in order for the bond to be able to send ARP requests, the bond needs to have an IP address. This isn't great, as it means I won't be able to attach a vmbr to the bond. But OK, for the sake of testing, this is the config I came up with:

Code:
auto eno7
iface eno7 inet manual

auto eno8
iface eno8 inet manual

auto bond0
iface bond0 inet static
        address 10.200.0.101/24
        bond-slaves eno7 eno8
        bond-mode active-backup
        bond-primary eno7
        bond-arp-interval 100
        bond-arp-ip-target 10.200.0.201
        bond-arp-validate filter

The bond comes up fine. The IP is reachable. It looks fine... but it is not.

Code:
root@pve-gen10:~#  cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v6.5.13-3-pve

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eno7
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Something potentially suspicious is that bond-arp-interval does not show up in the bond status, while MII does. Running a network capture on bond0 (or eno7/eno8) does not show any ARP messages towards 10.200.0.201 (the capture command is shown below for reference). Doing an arping manually works fine:

Code:
root@pve-gen10:~# arping -I bond0 10.200.0.201
ARPING 10.200.0.201
60 bytes from 00:11:32:91:4e:af (10.200.0.201): index=0 time=182.914 usec

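For reference, this is roughly the capture I'm running while waiting for the bond's probes (nothing shows up):

Code:
tcpdump -eni bond0 arp and host 10.200.0.201
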
I'm not sure where to look for the potential error. Looking through the Debian manpages I can find all the bonding options described for ifupdown-ng, however the bookworm manpages for ifupdown2 don't mention bonding at all - which is a little odd, as the basics are definitely supported.

Anybody familiar with this topic who could give me some hints as to where I could look to make this work?
 
Alright, made some progress on the quest for end-to-end redundancy with basic hardware ;)

The bond works when configured manually via sysfs, as described on kernel.org.

Bash:
root@pve-gen10:~# echo +10.200.0.201 > /sys/class/net/bond0/bonding/arp_ip_target
root@pve-gen10:~# echo 1 > /sys/class/net/bond0/bonding/lp_interval
root@pve-gen10:~# echo 2000 > /sys/class/net/bond0/bonding/arp_interval
root@pve-gen10:~# cat /sys/class/net/bond0/bonding/slaves
eno7 eno8
root@pve-gen10:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v6.5.13-3-pve

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eno7 (primary_reselect failure)
Currently Active Slave: eno8
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
ARP Polling Interval (ms): 2000
ARP Missed Max: 2
ARP IP target/s (n.n.n.n form): 10.200.0.200, 10.200.0.201
NS IPv6 target/s (xx::xx form):

Slave Interface: eno7
MII Status: down
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 5
Permanent HW addr: d8:9d:67:25:4a:5e
Slave queue ID: 0

Slave Interface: eno8
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 3
Permanent HW addr: d8:9d:67:25:4a:5f
Slave queue ID: 0

This causes ARP requests to be sent out. I haven't tested many fault scenarios yet, but the basics seem to be working. In order for this to work, there needs to be an IP available as the ARP source address - which I had initially configured on the bond interface itself. That, however, does not allow the bond to also be used for sub-interfaces (VLANs, a vmbr for VMs, etc). It would work for storage network access, but that's about it.

However... the IP address does not have to be on the bond itself. Apparently the mechanism is smart enough to realise that any interface with an IP that is routable over the bond can be used as the source. Therefore, configuring the bond without an IP - but adding a vmbr with an IP - works fine. Just make sure that you don't have another interface, outside the bond, with an IP in the same subnet... as that will cause things to fail.
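
For clarity, the shape that works at this point looks roughly like this (the ARP settings are still being applied by hand via sysfs as above):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno7 eno8
        bond-mode active-backup
        bond-primary eno7

auto vmbr1
iface vmbr1 inet static
        address 10.200.0.101/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0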

So, from a functional perspective the bonding does work. However, the /etc/network/interfaces config posted above does not. It is odd that some of the bonding params are picked up while others are not. I'm starting to think that ifupdown2 is not behaving the way I expect it would/should...
 
And what do you know... there's actually a bug report for this: https://github.com/CumulusNetworks/ifupdown2/issues/199
I've tried all variations of config keys in the interfaces file, but for some reason only a subset of the bonding arguments are actually picked up.

The workaround is to add something like this to the bond0 config:

Code:
post-up echo 0 > /sys/class/net/bond0/bonding/miimon
post-up echo 2000 > /sys/class/net/bond0/bonding/arp_interval
post-up echo +10.200.0.201 > /sys/class/net/bond0/bonding/arp_ip_target || true

So that fixes that. Not pretty, but it will work.

Now the next challenge is VLANs. All of this is supposed to run over a specific VLAN. When running it without VLANs, everything works well:

Code:
eno7  eno8
  |     |
  |    /
  |  /
 bond0
   |
 vmbr1 (10.200.0.101)

This makes the ARP requests, which the bond uses to determine which physical interface to use, go out sourced from 10.200.0.101. That works.

Now, for using a vlan. The typical way to do so would be like this:

Code:
eno7  eno8
  |     |
  |    /
  |  /
 bond0
   |
 vmbr1
  |
 vmbr1.40 (10.200.0.101)

This allows me to manually send out ARPs with VLAN 40 tagging... but only when done specifically through the vmbr1.40 interface. And that kind of makes sense, as the other interfaces have no knowledge of the VLAN.
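
In other words, something like this goes out tagged just fine - but the bond's own probes don't:

Code:
arping -I vmbr1.40 10.200.0.201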

However, in this case we need the ARPs that are sent out at the bond0 level to already be VLAN tagged. So... we may need to make the bond itself aware of the VLAN in some way - meaning we would end up with a bond per VLAN that we have to support (which makes sense, I guess). Perhaps we can do something like the layout below? Not sure if this is a valid topology (a possible config sketch follows the diagram)...

Code:
eno7       eno8
  |          |
eno7.40    eno8.40
  |          |
  |  ________|
  | |
 bond0
   |
 vmbr1 (10.200.0.101)
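
In /etc/network/interfaces terms that would presumably be something like this - completely untested, and I'm not sure the bonding driver will even accept VLAN interfaces as slaves:

Code:
auto eno7.40
iface eno7.40 inet manual

auto eno8.40
iface eno8.40 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno7.40 eno8.40
        bond-mode active-backup
        bond-primary eno7.40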
 
Turns out that the third scenario cannot work: the bonding driver does not accept anything other than hardware interfaces as slaves. So it has to be the second option - which, according to the documentation, should theoretically work. The idea is that the bond 'learns' what networks exist based on traffic passing through it, and then uses that information to send out its ARP probes.

This explains why the scenario in the first diagram actually works: Even though the IP is not active on the bond, it is still used for sending arp messages.

Unfortunately I haven't been able to get this working yet. I've built this scenario:

Code:
eno7  eno8
  |     |
  |    /
  |  /
 bond0
   |
 vmbr1
  |
 vmbr1.40 (10.200.0.101)

Forcing traffic from 10.200.0.101 (VLAN 40) through the bond does not make the bond 'learn' the IP address/VLAN. In fact, even if I connect a VM to vmbr1 and have it ping through the bond, the bond does not learn the IP and start sending ARP requests.

The only way to make it start doing that seems to be to set an IP directly on vmbr1. However... doing so means VLANs can't be added. Am I missing something dumb about how this structure should look?
 
I have got it to work...almost.

The VLAN tag needs to be defined on the interface directly under the bond. So, that means the following two models work:

Code:
eno7  eno8
  |     |
  |    /
  |  /
 bond0
   |
   |----------|
   |          |  
   |          |        
 vmbr1     bond0.40 (10.200.0.101)
   |
   |
  VM
(vlan x)
 (IP y)

Code:
eno7  eno8
  |     |
  |    /
  |  /
 bond0
   |
   |----------------------|----------------------|
   |                      |                      |
 bond0.40               bond0.41               bond0.42
   |                      |                      |
 vmbr40 (10.200.0.101)  vmbr41 (10.41.0.101)   vmbr42 (10.42.0.101)

In the first model, the hypervisor interface is defined separately (VLAN 40 in this example). This IP can only be used by the hypervisor (either management or storage, depending on whether you're bonding the management or the storage interfaces). Further VLANs are defined in the VM config.
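
For reference, the first model would look roughly like this in /etc/network/interfaces - the bond0 stanza stays the same as earlier (with the sysfs post-up workaround), and I'm assuming a VLAN-aware vmbr1 here so the per-VM tags can be set on the VM NIC:

Code:
auto bond0.40
iface bond0.40 inet static
        address 10.200.0.101/24

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094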

The second model allows creating a vmbr for each individual VLAN. This is akin to the VMware model, and I personally like having strongly typed/defined network segments for VM assignment. It also means that you should not define VLANs on the VM NIC config page, or you'll end up with Q-in-Q.

However, there is still one issue. Even though both models work, reloading the network configuration throws the following error:

Code:
error: netlink: bond0: cannot add bridge vlan 40: operation failed with 'Operation not supported' (95)

Which I don't understand. And obviously things still come up and work, even while ifreload is complaining. Creating that bond0.40 interface manually works without errors... so it looks like it should be supported:

Code:
ip link add link bond0 name bond0.40 type vlan id 40
ip link set bond0.40 up

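And checking the result afterwards (the -d flag shows the VLAN details, including the id):

Code:
ip -d link show bond0.40
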
Config file illustrating the scenario:

Code:
auto eno7
iface eno7 inet manual

auto eno8
iface eno8 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno7 eno8
        bond-miimon 0
        bond-mode active-backup
        bond-primary eno7
        post-up echo 0 > /sys/class/net/bond0/bonding/miimon
        post-up echo 2000 > /sys/class/net/bond0/bonding/arp_interval
        post-up echo +10.200.0.200 > /sys/class/net/bond0/bonding/arp_ip_target || true

auto vmbr40
iface vmbr40 inet static
        address 10.200.0.101/24
        bridge-ports bond0.40
        bridge-stp off
        bridge-fd 0

Any ideas?
 