- Sep 13, 2013
I ran my CEPH cluster with both LACP and balance-rr bonding setups. I am finding that although balance-rr with dumb switches has some big advantages, the packet-loss issues outweigh the performance gains. Thanks to jinjer I learned that balance-rr gives you fault tolerance at the switch level, not just for the NICs themselves. Performance-wise I am not noticing any positive difference. After much thinking and debate, I have decided to go back to the LACP method of bonding. Even though I will not get more than 1 Gbps between any two hosts, it seems like the least problematic way to get a stable platform.
Sorry for the late addition to this discussion, but from the networking side there's one huge problem with the multiple-independent-switches theory of balance-rr: all the switches must be interconnected, since balance-rr assumes a common FIB at the switching layer.
In theory, if every single device on the subnet is plugged into the same set of switches, this could work. Otherwise, for a device attached to only one of the n switches, the probability that any given packet lands on the switch it can actually be reached through is exactly 1/n.
So, if you have A CEPH nodes and B PVE nodes, and they all have C NICs dedicated to storage, you could interconnect them with C switches using (A+B) ports on each switch. This can work because any packet in a bond group can be received on any port. You'll see slightly erratic traffic patterns, but given enough traffic it should be approximately equal on all C links from any given server and approximately equal across all C switches.
Where this breaks down completely is if you have any devices at all in that IP subnet that are NOT connected to all C switches. Let's say CEPH node #1 only has two NICs dedicated to storage, but everything else has three NICs dedicated to storage. One out of every three packets destined for CEPH#1 will be delivered to switch #3... and will not reach CEPH#1 because it's not connected to that switch.
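To make that failure mode concrete, here's a toy simulation (not a real network tool; the numbers and names are made up for illustration). A sender round-robins frames across C isolated, non-interconnected switches; a frame only arrives if the destination has a NIC on the switch that frame happened to land on:

```python
# Toy model: balance-rr over C independent switches, where CEPH#1 is only
# connected to two of the three switches. Frames are sprayed round-robin;
# a frame is delivered only if CEPH#1 has a link on that switch.

C = 3                      # switches (= NICs on every fully-connected node)
ceph1_switches = {0, 1}    # CEPH#1 has only two storage NICs

delivered = dropped = 0
for i in range(30000):                 # frames destined for CEPH#1
    switch = i % C                     # balance-rr: strict round-robin
    if switch in ceph1_switches:
        delivered += 1
    else:
        dropped += 1                   # landed on switch #3, no path to CEPH#1

print(f"loss fraction: {dropped / (delivered + dropped):.3f}")   # → 0.333
```

Exactly one frame in three is sent into a switch that has no path to CEPH#1, matching the 1/3 loss described above.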
If the storage subnet is protected by a firewall, the firewall must now also be connected to all three switches using balance-rr, or the same effect occurs.
Note that this means you do NOT get true redundancy with balance-rr: if f of a node's C NICs or cables fail, you start dropping a fraction f/C of the packets destined for that node.
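A quick sketch of that f/C figure (toy numbers, purely illustrative): with isolated switches, senders keep round-robining over all C links even after f of the destination's links fail, so a fixed slice of that node's traffic is simply lost rather than rerouted:

```python
# Sketch: receiver loses f of its C links, but senders (unaware) still
# spray frames round-robin over all C switches.

C, f = 4, 1                       # 4 switches/NICs; 1 failed cable on the receiver
alive = set(range(C - f))         # links 0..2 still up; link 3 dead

lost = sum(1 for i in range(4000) if i % C not in alive)
print(lost / 4000)                # → 0.25, i.e. exactly f/C
```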
While it is possible to mitigate this problem by interconnecting the switches, you then get two new problems. Firstly, you now have a Spanning Tree topology with its attendant problems. (And if you disable STP on your switches, you should be taken out back and shot. The protocol still exists nowadays mainly to save your ass when you do something stupid at 2am.) Secondly, you now get FIB flapping, and the switch CPUs become bottlenecks. FIB flapping is when the switch sees that MAC address A resides on port Y. Oh, wait, it just moved to port Z. Wait, it just moved to port Q. Oh, now it's back to port Y. No, it's back on port Q again! The switch's management plane (i.e. the CPU) has to be involved in updating the forwarding table every time: out which port should a packet destined for MAC address A be sent?
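The flapping is easy to see in a toy model of the switch's MAC table (again, just an illustration, not how any real ASIC is implemented): a balance-rr sender's single MAC address shows up on a different ingress port with every frame, so every single frame forces a table update:

```python
# Toy FIB: MAC -> port. A 3-link balance-rr bond sends 9 frames; the source
# MAC appears on ports 0, 1, 2, 0, 1, ... so the entry "moves" on every frame.

fib = {}                             # the switch's forwarding table
moves = 0
SENDER = "aa:bb:cc:dd:ee:ff"         # made-up MAC for illustration

for i in range(9):
    ingress_port = i % 3             # round-robin arrival port
    if fib.get(SENDER) != ingress_port:
        fib[SENDER] = ingress_port   # relearn: control plane updates the entry
        moves += 1

print(moves)   # → 9: every frame triggered a relearn
```

On real hardware each of those relearns can punt work to the management CPU, which is exactly the bottleneck described above.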
This is also why connecting multiple balance-rr links into a single switch often produces strange results. Most cheaper switches have very low-end CPUs (e.g. 400MHz), because normally the CPU doesn't have to do much work. Using balance-rr is a good way to force it to do a lot of work. (Yes, switch ASICs handle most of the workload, but not 100% of it.)
To recap: balance-rr can be useful for certain specific scenarios ONLY IF the switch(es) and network fabric can handle it correctly. It does not give you good redundancy (except against complete switch failure in the isolated-switch scenario), it's not standardized in any way, and no one other than Linux implements it this way. LACP definitely has limitations, but it handles more situations correctly and is reliable in virtually every circumstance. The way to maximize bandwidth with LACP is to maximize the number of flows and, particularly on dumber switches, to maximize the number of MAC addresses involved, i.e. use multipath if you can. You'll never get more than 1 Gb/sec between any two conversation endpoints, but with e.g. iSCSI multipath you can have multiple conversation endpoints involved simultaneously.
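The "maximize flows and MAC addresses" advice follows from how LACP egress hashing works. A rough sketch of a layer-2 transmit hash (the Linux bonding layer2 xmit_hash_policy is approximately (src MAC XOR dst MAC) mod number-of-links; exact hardware behavior varies by switch, so treat this as an assumption-laden illustration):

```python
# Sketch of an LACP-style layer-2 transmit hash. One MAC pair always hashes
# to the same link (so one conversation never exceeds one link's bandwidth);
# many MAC pairs spread across the links.

def l2_hash(src_last_byte: int, dst_last_byte: int, n_links: int) -> int:
    # roughly what Linux bonding's xmit_hash_policy=layer2 does
    return (src_last_byte ^ dst_last_byte) % n_links

# A single flow: every frame takes the same link.
print({l2_hash(0x01, 0x10, 2) for _ in range(100)})        # → {1}, one link only

# Many peers (e.g. iSCSI multipath endpoints): flows spread across links.
links = {l2_hash(0x01, dst, 2) for dst in range(0x10, 0x20)}
print(links)                                                # → {0, 1}, both links used
```

This is why adding endpoints (multipath sessions, more client MACs) raises aggregate throughput under LACP even though any single session stays at 1 Gb/sec.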
In a distributed-storage environment, the way to maximize bandwidth - generally! - is to maximize the number of nodes and minimize the amount of storage on each of those nodes. Then also maximize the number of clients talking to each of those nodes. A small number of large storage arrays with a small number of clients will run into network limitations unless the vendor does tricks in the protocol (e.g. Equallogic redirects iSCSI sessions to alternate IP addresses, one bound to each NIC; VMware sets up multiple iSCSI initiators, one per NIC; Linux has balance-rr... maybe!).
Note that VMware does something similar to balance-rr, but it adds LACP-like session pinning mostly so that the switches don't suffer from CPU exhaustion. Many higher-end (Cisco, Juniper, HP/3COM, etc.) switches will actually disable the ports involved in a balance-rr group by default, treating them as a deliberate spoofing attack!