Ceph large file (images) resync

athompso · Dec 29, 2013

symmcom said:
I ran my CEPH cluster with both LACP and balance-rr bonding setups. I am finding that although balance-rr with dumb switches has some big advantages, the packet loss outweighs the performance gain. Thanks to jinjer I learned that balance-rr allows for fault-tolerant switches, not just fault-tolerant NICs. Performance-wise I am not noticing any positive difference. After much thinking and debate, I have decided to go back to the LACP method of bonding. Even though I will not get more than 1 Gbps of bandwidth, it seems like the least problematic way to get a stable platform.

Sorry for the late addition to this discussion, but from the networking side there's one huge problem with the multiple-independent-switches theory of balance-rr. All the switches must be interconnected, since balance-rr assumes a common FIB at the switching layer.
In theory, if every single device on the subnet is plugged into the same set of switches, this could work. Otherwise, for a device plugged into only one of the n switches, the odds of a packet getting delivered to the correct switch are exactly 1/n.
So, if you have A CEPH nodes and B PVE nodes, and they all have C NICs dedicated to storage, you could interconnect them with C switches using (A+B) ports on each switch. This can work because any packet can be received on any port inside a bond group. You'll see slightly erratic traffic patterns, but given enough traffic it should be approximately equal on all C links from any given server and approximately equal on all C switches.

Where this breaks down completely is if you have any devices at all in that IP subnet that are NOT connected to all C switches. Let's say CEPH node #1 only has two NICs dedicated to storage, but everything else has three NICs dedicated to storage. One out of every three packets destined for CEPH#1 will be delivered to switch #3... and will not reach CEPH#1 because it's not connected to that switch.
If the storage subnet is protected by a firewall, the firewall now must be connected to all three switches using balance-rr also, or the same effect occurs.
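
The CEPH#1 example can be checked with a small simulation. This is only a sketch under simplifying assumptions: the switches are fully isolated, the sender strictly round-robins over its links, and `delivered_fraction` is a made-up helper name, not part of any bonding tool.

```python
from itertools import cycle

def delivered_fraction(sender_switches, receiver_switches, packets=9000):
    """Fraction of packets that arrive when the sender round-robins over its links.

    Assumes isolated (non-interconnected) switches: a packet sent via switch s
    only arrives if the receiver also has a link to switch s.
    """
    receiver = set(receiver_switches)
    links = cycle(sender_switches)
    delivered = sum(1 for _ in range(packets) if next(links) in receiver)
    return delivered / packets

all_three = [1, 2, 3]
print(delivered_fraction(all_three, all_three))  # receiver on every switch
print(delivered_fraction(all_three, [1, 2]))     # CEPH#1: no link to switch 3
```

The second call loses one packet in three. A failed NIC or cable looks the same to this model as a link that was never there.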

Note that this means you do NOT get true redundancy using balance-rr; if f of a host's C NICs or cables fail, roughly f/C of the packets sent to that host get dropped.

While it is possible to mitigate this problem by interconnecting the switches, you then get two problems. Firstly, you now have a Spanning Tree topology with its attendant problems. (And if you disable STP on your switches, you should be taken out back and shot. The protocol still exists nowadays mainly to save your ass when you do something stupid at 2am.) Secondly, you now have FIB flapping and the switch CPUs become bottlenecks. FIB flapping is when the switch sees that MAC address A resides on port Y. Oh, wait, it just moved to port Z. Wait, it just moved to port Q. Oh, now it's back to port Y. No, it's back on port Q again! The switch's management plane (i.e. the CPU) has to be involved in coordinating the forwarding table: what port should a packet destined for MAC address A be sent out?
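
To make the FIB flapping concrete, here is a toy MAC-learning table that counts how often a known address "moves". It is an illustrative sketch only: `LearningSwitch` and `count_moves` are hypothetical names, and real switches age entries and learn in hardware, which this ignores.

```python
class LearningSwitch:
    """Toy MAC-learning table: remembers the last port each source MAC was seen on."""

    def __init__(self):
        self.fib = {}    # MAC address -> port
        self.moves = 0   # how many times a known MAC changed port

    def learn(self, src_mac, port):
        previous = self.fib.get(src_mac)
        if previous is not None and previous != port:
            self.moves += 1
        self.fib[src_mac] = port


def count_moves(ports_seen, mac="aa:aa:aa:aa:aa:aa"):
    """Feed frames from one MAC, arriving on the given ports, into a fresh switch."""
    switch = LearningSwitch()
    for port in ports_seen:
        switch.learn(mac, port)
    return switch.moves


rr_frames = ["Y", "Z", "Q"] * 100   # balance-rr: same MAC on ports Y, Z, Q in turn
pinned_frames = ["Y"] * 300         # pinned flow: same MAC always on port Y

print(count_moves(rr_frames))       # 299
print(count_moves(pinned_frames))   # 0
```

Every frame after the first forces a table update in the round-robin case, which is the work that lands on the switch CPU.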
This is also why connecting multiple balance-rr links into a single switch often produces strange results. Most cheaper switches have very low-end CPUs (e.g. 400MHz), because normally the CPU doesn't have to do much work. Using balance-rr is a good way to force it to do a lot of work. (Yes, switch ASICs handle most of the workload, but not 100% of it.)

To recap: balance-rr can be useful for certain specific scenarios ONLY IF the switch(es) and network fabric can handle it correctly. It does not give you good redundancy (except against complete switch failure in the isolated-switch scenario), it's not standardized in any way, and no-one other than Linux implements it this way. LACP definitely has limitations, but it handles more situations correctly and is reliable in virtually every circumstance. The way to maximize bandwidth with LACP is to maximize the number of flows, and particularly on dumber switches, maximize the number of MAC addresses involved, i.e. use Multipath if you can. You'll never get more than 1Gb/sec between any two conversation endpoints, but with e.g. iSCSI multipath, you can have multiple conversation endpoints involved simultaneously.
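
Why more flows and more MAC addresses help under LACP can be sketched with a simplified hash. The XOR-of-last-byte rule below is an assumption chosen for illustration; real switches and bonding drivers each use their own hash inputs, and `pick_link` and `links_used` are hypothetical names.

```python
def pick_link(src_mac, dst_mac, n_links):
    """Simplified layer-2 style hash: XOR the last byte of each MAC, modulo link count."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % n_links


def links_used(flows, n_links):
    """Set of member links that carry at least one of the given (src, dst) flows."""
    return {pick_link(src, dst, n_links) for src, dst in flows}


# One conversation, many packets: always the same MAC pair.
one_pair = [("02:00:00:00:00:01", "02:00:00:00:00:10")] * 1000
# Eight different sources talking to the same destination.
many_pairs = [("02:00:00:00:00:%02x" % i, "02:00:00:00:00:10") for i in range(1, 9)]

print(len(links_used(one_pair, 2)))    # 1
print(len(links_used(many_pairs, 2)))  # 2
```

A single pair of endpoints always hashes to the same member link, so it never exceeds one link's bandwidth; only additional endpoints spread the load.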

In a distributed-storage environment, the way to maximize bandwidth - generally! - is to maximize the number of nodes and minimize the amount of storage on each of those nodes. Then also maximize the number of clients talking to each of those nodes. A small number of large storage arrays with a small number of clients will run into network limitations unless the vendor does tricks in the protocol (e.g. Equallogic redirects iSCSI sessions to alternate IP addresses, one bound to each NIC; VMware sets up multiple iSCSI initiators, one per NIC; Linux has balance-rr... maybe!).

Note that VMware does something similar to balance-rr, but it adds LACP-like session pinning mostly so that the switches don't suffer from CPU exhaustion. Many higher-end (Cisco, Juniper, HP/3COM, etc.) switches will actually disable the ports involved in a balance-rr group by default, treating them as a deliberate spoofing attack!

-Adam Thompson

jinjer · Feb 15, 2014

athompso said:
Sorry for the late addition to this discussion, but from the networking side there's one huge problem with the multiple-independent-switches theory of balance-rr. All the switches must be interconnected, since balance-rr assumes a common FIB at the switching layer.
In theory, if every single device on the subnet is plugged into the same set of switches, this could work. Otherwise, for a device plugged into only one of the n switches, the odds of a packet getting delivered to the correct switch are exactly 1/n.

I see where you're going, so I'll skip the rest of the read. You're correct about iSCSI multipath being preferable to balance-rr, but until we have ceph-multipath or gluster-multipath that is not an option for a scale-out cluster.

OTOH I don't remember the last time I saw a switch fail or a nic fail on a server. This was a problem in the days of 10BaseT when switches were built in a much cheaper way and it was possible to physically short a port and burn it. Once a cable is plugged in a server or switch and the server is hosted in a rack there is zero possibility of a network switch/cable/network card failure.

If such a failure happens, you can just shut down the server and fix it, or use a different port on the switch in no time.

As a bonus, you can actually enjoy a much higher speed in the 99.9999 % of the time when things work normally with no failures.

jinjer · Feb 15, 2014

symmcom said:
Just wanted to confirm that increasing the number of OSDs does increase the performance of CEPH Cluster. I went from 6 OSDs to 10 OSDs and the increase of performance was immediate without changing anything else.

This is nice to know. Quite possibly the OSDs are talking to each other using different sockets, which helps the switch balance the traffic via normal means (802.3ad).

I'm going with gluster, where this is less of a possibility, as each brick is a whole server and talks to the others over the same socket.

tarax · Feb 15, 2014

@jinjer about switch failure... I had 2 switch deaths in the last 6 months alone (and at the same client!). Sh@#€%& happens man, deal with it or you'll be bitten hard.

About improving network path load balancing and redundancy... couldn't Multipath TCP offer a good alternative to Ethernet LAGs?
I'm no network wizard, so any light shed on this would be more than appreciated.

Bests

athompso · Feb 15, 2014

jinjer said:
I see where you're going, so I'll skip the rest of the read. You're correct about iSCSI multipath being preferable to balance-rr, but until we have ceph-multipath or gluster-multipath that is not an option for a scale-out cluster.

A scaled-out CEPH cluster inherently uses many, many paths, but not "multipath" in the sense of having redundant paths.

jinjer said:
OTOH I don't remember the last time I saw a switch fail or a nic fail on a server. This was a problem in the days of 10BaseT when switches were built in a much cheaper way and it was possible to physically short a port and burn it. Once a cable is plugged in a server or switch and the server is hosted in a rack there is zero possibility of a network switch/cable/network card failure.

You've been very lucky, then.
I see - across all the switches and servers I'm at least partly responsible for - roughly a 1% network-component failure rate per year per installation. The quality of electrical power seems to make a big difference, and so far the worst offenders overall are Cisco 29xx-series and 35xx-series switches, with a roughly 5% port failure rate per year after the first year. Broadly speaking: Linksys switches tend to die or develop bizarre problems that affect every port; Netgear switches forget their configuration or their power supplies fail; D-Link switches need to be power-cycled. Cat5/Cat6 ports fail at least 10x more often than SFP/SFP+ ports.
I have seen server NICs die, but usually only because of electrical issues.
I have seen cables degrade to the point where the cable causes roughly 0.1% packet loss; this is EXTREMELY difficult to troubleshoot, as it is often dependent on temperature, humidity, EM radiation, etc.! Hand-terminated cables fail about 25x more often than factory-terminated cables. In turn, non-molded boots fail about 100x more often than molded boots. (For those readers who don't know what I'm talking about, look at http://www.belkin.com/us/A3L980-S-Belkin/p/P-A3L980-S/, and notice how the grey plastic of the strain-relief boot extends all the way down the inside of the RJ45 plug. The only times I've seen one of those fail was due to extreme mechanical compression on the cable itself, i.e. rolling an equipment rack right over the cable, never at the ends.)
These are all fairly low probabilities, but a 1% annual failure rate per component means *something* will probably fail once a year if you have >100 ports/NICs/cables/switches/servers.

jinjer said:
If such a failure happens, you can just shut down the server and fix it, or use a different port on the switch in no time.
As a bonus, you can actually enjoy a much higher speed in the 99.9999 % of the time when things work normally with no failures.

As long as MTTR doesn't matter to you, and you have taken all my *other* points into consideration, particularly the default gateway also needing to be plugged into every switch, which I find unlikely, then yes, you're right, balance-rr will provide better throughput. What I take issue with is the 99.9999%, and the "in no time" comment. If the switches all sit right beside your bed, and you never leave your bedroom, then you can say you'll probably work around any switchport failure within ~5 minutes. Assuming you have an alarm set up, that you can wake up or jump out of the shower or bath, that you are never sitting on the toilet, etc... and that you have a monitoring system that somehow tells you exactly what failed...
Realistically, failures always seem to occur at 2am, and always at the data center you can't get into until 8am, that's halfway across the city/state/province/country.
I hear you trying to say Murphy's Law doesn't apply to you, which I find hard to believe.
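
For a sense of scale, the 99.9999% figure can be turned into a yearly downtime budget. This sketch reads the figure as an availability target over a 365-day year and assumes a single outage per year; both are interpretations for illustration, and the function names are made up.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_budget_seconds(availability):
    """Seconds of downtime per year allowed by a given availability (0..1)."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def availability_after_outage(outage_seconds):
    """Availability over one year if there is exactly one outage of this length."""
    return 1.0 - outage_seconds / SECONDS_PER_YEAR


# Budget implied by "six nines": about 31.5 seconds per year.
print(round(downtime_budget_seconds(0.999999), 1))

# One 5-minute fix versus one 2am failure that waits until 8am.
print(availability_after_outage(5 * 60))
print(availability_after_outage(6 * 3600))
```

Even the best-case 5-minute repair uses roughly ten times the yearly budget, and the 2am-to-8am case leaves availability a little above 99.93%.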

Failure analysis says that with the number of non-redundant moving parts in your design, you are likely to - statistically, not necessarily in every case - experience a much lower overall availability. If each node, switch, switchport, NIC or cable has a 1% chance of failure per year, then a system with 3 nodes, each with 3 NICs, each plugged into a different one of 3 switches, has 33 components and, by simple addition, roughly a 33% chance of some failure per year (closer to 28% if the failures are treated as independent). Any one of those 33 components failing will leave you with a system where some packets will not reach their destination. A 4-way symmetric design with 1% AFR per component has 56 components and, by the same addition, a 56% chance of some failure every year (about 43% under independence).

Mathematically, to a first approximation (the sum overstates the result somewhat once the individual probabilities stop being small):
Overall failure probability ≈ (#nodes * failure(nodes)) + (#nodes * #switches * failure(NICs)) + (#nodes * #switches * failure(cables)) + (#nodes * #switches * failure(switchport)) + (#switches * failure(switches)).
This assumes all failure probabilities are normalized to the same time period, typically one year. They can be derived from the MTBF values on most manufacturers' datasheets.
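
The formula can be evaluated directly. The sketch below counts components using the same terms (one NIC, cable and switchport per node per switch) and, as an added comparison not in the post above, also computes the probability of at least one failure assuming the components fail independently. The function names are illustrative.

```python
def component_count(nodes, switches):
    """Count components using the terms of the formula above:
    nodes, plus one NIC, cable and switchport per node per switch, plus switches."""
    per_link = nodes * switches
    return nodes + 3 * per_link + switches


def additive_estimate(n_components, p):
    """Simple sum of per-component failure probabilities."""
    return n_components * p


def at_least_one_failure(n_components, p):
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p) ** n_components


for nodes, switches in [(3, 3), (4, 4)]:
    n = component_count(nodes, switches)
    print(
        f"{nodes} nodes x {switches} switches: {n} components, "
        f"additive {additive_estimate(n, 0.01):.2f}, "
        f"independent {at_least_one_failure(n, 0.01):.3f}"
    )
```

The component counts come out at 33 and 56, matching the figures above; the independence-based probabilities are somewhat lower than the simple sums but lead to the same conclusion.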

I suspect you're committing the classical mistake in failure engineering - just because something has not happened to you yet, does not mean it will continue to not happen to you. Conversely, just because something has happened to you, does not necessarily mean it will continue to happen. So if you have situations where you don't need particularly high availability, or at least you don't need to be able to *guarantee* it, and you've found a combination of equipment with negligible failure rates, and you will only ever use multi-port Linux boxes in your networks (no printers, no embedded devices, no laptops, no desktops) then go ahead and use balance-rr with each link plugged into a different switch.

My point, ultimately, is that you cannot (or at least should not) use balance-rr in *most* situations, not all. You may have found the ideal situation for using it, but please don't recommend it as a general-purpose solution to replace LACP and good network design.

-Adam
