Ceph large file (images) resync

jinjer

Renowned Member
Oct 4, 2010
I'm wondering about the speed of Ceph when resyncing large files (e.g. KVM images) in the event of a failure of one of the nodes.

Say a node in a Ceph cluster is rebooted, taking its copies of the images away for a few minutes.

In GlusterFS, resyncing the images requires a read-back of the whole VM storage.

Is Ceph the same?

This is where DRBD shines; however, it's only a two-node solution.
 

Ceph syncing is indeed fast. In my current test setup I have about 50 testers from the Proxmox community testing remote VM access. Several times I have manually powered down one of the Ceph nodes to simulate a real-world server disaster. They will tell you they still had all their VMs working without much sweat. Altogether, in the entire cluster I have a little over 100 VMs running. None goes down or slows down so much that nobody can work.

Sent from my ASUS Transformer Pad TF700T using Tapatalk
 
How many Ceph nodes do you have in your cluster? What type of hardware are you running on them?

Thank you.
 

I have 3 nodes. All of them have roughly identical specs. They are not mighty powerful, but I wanted to try Ceph on less powerful machines and max out all possible tweaks, so that when I do move to better machines I know for sure everything will work, with some added performance.

Intel Xeon motherboard
Intel i3-2120 CPU
8 GB RAM
Intel RS Series RAID controller
Intel 24-port RAID expander
12-bay hot-swap chassis
4 SATA HDDs in each
500 W PSU
4 Intel NICs (1 for management, 3 bonded with 802.3ad)
Headless

Sent from my ASUS Transformer Pad TF700T using Tapatalk
 
Thank you. Have you considered bonding with balance-rr using multiple switches and no 802.3ad?
Also, did you run any benchmarks from inside a Proxmox KVM guest on the above?
 
I have never set up balance-rr, only 802.3ad. The main goal was to increase the total bandwidth of the storage cluster. I did not find any info pointing out that any other bonding mode increased bandwidth; all of them seemed to be for either NIC redundancy or load balancing. Is that correct?

Countless benchmarks have been run on this cluster regularly, most of them contributed by Proxmox community members connected to our cloud testing platform. To be honest, performance is not amazing, and Windows XP is worse. Due to licensing, I only assign Windows XP VMs for testing, but I personally run a Windows 7 VM and its storage I/O performance is much better. Beta testers are allowed to install their own VMs, though; I only provide bare ISOs for Windows and Linux installation. For benchmark info, check the sticky thread "Volunteer Wanted".

Sent from my ASUS Transformer Pad TF700T using Tapatalk
 
With 802.3ad, each connection is only allowed to travel on one link. This is OK as long as you have multiple concurrent connections to different clients; eventually they even out. Sometimes the hashing algorithm will not allocate evenly between the interfaces.

balance-rr is different: it is load balancing (i.e. load is spread across multiple links) and failover (any link can fail). It transmits each packet of the same connection on a new link in round-robin fashion. This is the only mode that allows you to break the speed limit of a single link (e.g. I get 1.8 Gbit/s over two bonded 1G interfaces in a single iperf test).

It also makes for a nice redundant switch setup. It works across multiple switches with no switch configuration: you basically plug one NIC into one switch and the other NIC into the other switch. Either switch can die and you still have connectivity.
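For reference, a minimal balance-rr bond in a Debian-style /etc/network/interfaces looks roughly like this (the interface names and addresses are only placeholders for the example; eth1 goes to one switch, eth2 to the other):

  auto bond0
  iface bond0 inet static
      address 10.10.10.1
      netmask 255.255.255.0
      bond-slaves eth1 eth2
      bond-mode balance-rr
      bond-miimon 100

No special configuration is needed on the switches themselves.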

I will check the "volunteers needed" thread.
 
You mean to say the famous "iperf" will actually show more than a 1G link?? I have been trying to cross that barrier and spent many hours searching, only to finally settle on the explanation that in 802.3ad mode you will never get more than 1 Gbit/s per connection even though it is bonded; it is when many clients join at the same time that 802.3ad shines.

Right now in my setup the storage nodes and Proxmox nodes are all individually bonded with 3 NICs and all connected to a 48-port switch with LACP configured. Was I correct to assume that a few dozen VM clients will be treated as separate clients by the bond?


Sent from my ASUS Transformer Pad TF700T using Tapatalk
 
LACP requires that each connection goes over a single link. This way packet order is ensured and it mimics a normal Ethernet link.
In LACP, the "sender" decides which link to send packets on based on a hash function taking variables from L2, L3 or L4, or a combination of them (e.g. MAC address, IP address, source and destination port). The hash is then squashed over the available links with a simple modulus (i.e. sending link = HASH % NumberOfActiveLinks).

If you only use L2 (the MAC address), all connections toward the same MAC address (physical machine) will use the same interface. If you add L3 and L4, chances are that different connections running on the same client will be served through different links (this is true when the number of connections is large; for fewer connections it's not always the case).
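Just to make the modulus idea concrete, here is a toy shell sketch (this is not the kernel's real hash and the numbers are made up; it only shows why layer2 pins everything to one slave while layer3+4 can spread connections over several):

  SLAVES=3
  HASH_L2=0x1a2b   # pretend hash of (src MAC, dst MAC) for one client
  echo "layer2 -> slave $((HASH_L2 % SLAVES))"   # every connection to that client: same slave
  for PORT in 50001 50002 50003 50004; do        # layer3+4 also mixes in IPs and ports
      echo "layer3+4 src port $PORT -> slave $(( (HASH_L2 ^ PORT) % SLAVES ))"
  done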

With balance-rr, every packet sent from the bond's outgoing queue uses "the next" Ethernet link, regardless of the connection that packet belongs to. Since you're sending packets "almost in parallel" and they travel different routes and different switches, you can break the single-link limit. You're also making the TCP stack's packet reordering work harder, as packets can be received out of order.

In this setup, it is better to keep the switches separate from each other. This is to prevent possible loops (I know about STP, but that is a source of other issues and is buggy on most budget switches). In your case you need three switches, one for each part of the bond.

Edit: I would like to add that if you use a single switch to connect a bond, once the packets enter the switch they lose their parallelism. That is, the switch decides how to send them to the destination server, and if the destination happens to be another bond, chances are the switch will use some type of xor/hash policy that limits the transmission to a single link. This is another good reason to use multiple switches and a single Ethernet link per switch: that way the switch has no decision to make and can only forward the packet via the one channel it controls.

Slightly off-topic: What is the tool you're using to visualize the bond usage? I've never seen it before
Edit: found it... it's nload :)
 

Attachments: t3.jpg (39 KB)
Looks like a balance-rr setup involves a little more than just changing the mode in the conf file.

I have set up 3 separate switches and connected each node separately to each switch.
Changed the mode in the interfaces config to balance-rr.
Rebooted all nodes.
Ran iperf from one node to another. The stats show I have a speed of 512 kbps!!
Rechecked all connections. Double-checked that the 3 switches are not talking to each other at all.

Any ideas?

Yes, it is indeed nload, which shows the I/O data for each Ceph node. Besides this I also use htop, netperf and iostat.
 

Off the top of my head, the things to try would be:

1. Use an even number of links (start with 2 links).
2. Remove the switches and try connecting two nodes with crossover cables (to test whether the switches are the problem).
3. Do you have VLANs or bridges stacked on the bond interface? Try running a flat bond with no other layers.
4. Do you see retransmissions or ARP floods?
5. Try using ARP monitoring instead of MII (miimon).

If all of the above fails, you really need to resort to tcpdump and timestamps to "see" what actually happens on the lines. It might be the TCP window or buffers in the network stack having trouble reassembling the packets and causing retransmissions.
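For example, something along these lines (classic iperf2, which listens on port 5001 by default; the address and interface names are just examples) lets you see what each slave is actually carrying while a test runs:

  # on node B:
  iperf -s
  # on node A: one stream, then 4 parallel streams
  iperf -c 10.10.10.2 -t 30
  iperf -c 10.10.10.2 -t 30 -P 4
  # meanwhile on node A, peek at each physical slave:
  tcpdump -ni eth1 -c 20 'tcp port 5001'
  tcpdump -ni eth2 -c 20 'tcp port 5001'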

Yes, it can be tricky to set up at first, but it will let you run on cheap switches, and faster than a proprietary stackable solution.
 
Looks like I have hijacked your thread and transformed it into a link aggregation tutorial thread. :)

I went ahead and bought 3 Netgear smart switches and started from scratch. After setting everything up again I now have new issues. If I put all nodes/NICs on the same switch, I get a 2.65 Gbps connection with iperf between nodes. If I separate them across the 3 switches, I get 976 Mbps, which is pretty much the maximum a single gigabit NIC can push.
I do not have any VLANs on this bond. I have not tried tcpdump; that would be the next test.
 
Here are some numbers for balance-rr aggregation with 3 separate switches:
Benchmark with iperf:
rr-1-iperf.png

Benchmark with netperf:
rr-1-netperf.png

Benchmark with nload while running iperf:
rr-1-nload.png
 
I am really glad you made it in the end with the three separate switches (failover and speed together).

I am not sure how much parallelism there is in Ceph during synchronization, but if you're copying from a few nodes at a time to recover a failed node, it's surely a nice thing to be able to break the 1G barrier.

What was the issue that was limiting you to 512K first and 980M after?
 
One of the NICs in the node was not acting right. I took it out and put a new one in, and that seemed to resolve the 512k issue. The next phase of the test today is to simulate multiple switch failures and see if anything goes down :)
 
While trying to dig deeper into link bonding using balance-rr, I came across the following Wikipedia article (http://en.wikipedia.org/wiki/Link_aggregation), which says:
"Single switch
With modes balance-rr, balance-xor, broadcast and 802.3ad all physical ports in the link aggregation group must reside on the same logical switch, which in most scenarios will leave a single point of failure when the physical switch to which both links are connected goes offline. Modes active-backup, balance-tlb, and balance-alb can also be set up with two or more switches. But after failover (like all other modes), in some cases, active sessions may fail (due to ARP problems) and have to be restarted."

Am I facing this issue because I am trying to split the balance-rr bond across 3 separate switches? Should I put all links back on a single switch for balance-rr and set up LACP?
 
I am not sure how much parallelism there is in Ceph during synchronization, but if you're copying from a few nodes at a time to recover a failed node, it's surely a nice thing to be able to break the 1G barrier.

Ceph is a distributed file system. Any data to be written is broken down into chunks and spread across the Ceph nodes, so there is some parallel writing, but it is not as intensive as with GlusterFS/DRBD mirror nodes.
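If you want to watch that behaviour, something like this (standard ceph CLI; "rbd" is just the usual default pool name here) shows the pool's replication factor and lets you follow recovery after a node drops:

  ceph osd pool get rbd size   # replication factor of the pool
  ceph osd tree                # which OSDs sit on which host
  ceph -w                      # watch recovery/backfill live after an OSD or node goes down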
 
Don't forget replication. By default you have 2x replication, so the data gets broken up and evenly distributed, and then, based on the default CRUSH map, each OSD will replicate its data to an OSD in a different chassis.

Also, on the switches, it was my understanding that you needed to have everything on the same switch, but I was thinking maybe jinjer knew something I didn't. In order to do LACP across switches I had to buy special switches that join together into a logical chassis and allow me to create a virtual LAG across two separate switches.
 
