Ceph large file (images) resync

jinjer

Active Member
Oct 4, 2010
Am I facing this issue because I am trying to split balance-rr bonding across 3 separate switches? Should I put all links back on a single switch for balance-rr and set up LACP?
I see no problem in your setup. It's actually an old, little-known, well-kept secret :)
Spreading the links over different and independent switches is precisely what you want, to avoid a single point of failure and also to break the 1G barrier, unless you want to spend big bucks on a stackable, proprietary solution. Don't forget that a stack is also a single point of failure: it's a bus that all the switches connect to, and a single switch can kill it through software bugs or hardware failure.

Regarding the need to restart connections: balance-rr uses the same MAC on each enslaved interface, so there is no ARP magic going on.

Even if there were, I'm sure ceph is smart enough to restart a broken connection and recover from a temporary (and occasional) hiccup in the network.

Those comments on Wikipedia are probably more about interrupted browser downloads, which we don't care about in our case.
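For anyone wanting to replicate this, a minimal balance-rr bond on a Debian/Proxmox node might look like the sketch below in /etc/network/interfaces (the interface names eth0-eth2, the bridge name and the address are assumptions, adjust to your hardware); each enslaved NIC can then be cabled to a different switch:

```
auto bond0
iface bond0 inet manual
    slaves eth0 eth1 eth2
    bond_mode balance-rr
    bond_miimon 100

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10
    netmask 255.255.255.0
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0
```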
 

symmcom

Renowned Member
Oct 28, 2012
Calgary, Canada
www.symmcom.com
Don't forget replication. By default you have 2x replication, so the data gets broken up and evenly distributed, and then, based on the default crush map, each osd will replicate its data to an osd in a different chassis.
You are very right on this one. With a 3 node setup, 2x replication is sufficient. Even though replication is happening, I think it is still faster than full-node replication.
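For reference, the replica count of a pool can be checked and changed with the standard ceph CLI (the pool name "rbd" below is only an example):

```shell
# show the current replication factor of the pool
ceph osd pool get rbd size

# set 2x replication; min_size 1 keeps I/O going with a single copy left
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1
```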

Also on the switches it was also my understanding that you needed to be all on the same switch, but I was thinking maybe jinjer knew something I didn't.
That's what I was thinking. I have never set up multi-switch link aggregation, and almost every article I found about balance-rr says you need to put all links on the same switch. But I followed jinjer's way of separate switches for balance-rr, and it seems to be working except for some kinks here and there.

In order to LACP across switches I had to buy special switches that join together into a logical chassis and allow me to create a virtual LAG across two separate switches.
Ya, that's the stackable switches you are talking about. They are expensive to build a network on, and as you said they still act as one.

I will keep pushing the boundaries of balance-rr over separate switches and see where I hit a wall with it.
Thanks to jinjer I got to know a brilliant idea for physical switch redundancy. :)
 

jinjer

Active Member
Oct 4, 2010
:)

To bring this thread back on topic, I'd like to know what happens when a ceph node resets and comes back online. Does the cluster maintain some sort of "map" of the modified blocks for each node (file), or does it start a full-scale resync of the KVM machine images? I am told glusterfs will do a full-scale (rsync-like) resync of each file. Is ceph "smarter"?

Also, what sort of aggregate iops can one expect from a proxmox node towards the ceph cluster (in your 3 node configuration)? I am currently looking at replacing my current NFS-exported ZFS storage with something more scalable. I need to make decisions soon :)
 

ymmot04

Active Member
May 26, 2011
Chicago, United States
tomfoster.us
Assuming you have multiple mons, or your single mon doesn't go down: as soon as the node goes down, its OSDs will be marked "down" and that IO will be redirected. By default there is a 5 minute timeout before the OSDs are marked out and the data starts to migrate. If the node comes back online shortly after the 5 minute mark, the replication stops and you will get HEALTH_OK fairly quickly, since it only needs to update the replicas missed during that 5 minute window. If you know you are going to reboot, you can also set the "noout" flag on the cluster, which prevents it from marking the disks out. Just don't forget to turn it back off.
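The noout workflow described above can be sketched with the standard ceph CLI (the 5 minute down-to-out timeout corresponds to the mon osd down out interval option, 300 seconds by default on the releases current at the time, tunable in ceph.conf):

```shell
# before a planned reboot: stop the cluster from marking OSDs out
ceph osd set noout

# ...reboot the node and let its OSDs rejoin...

# afterwards: restore normal out-marking behaviour
ceph osd unset noout
```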

The results on my system were interesting. I later realized the results below were not accurate, as I was able to produce similar numbers from another VM simultaneously on the same Proxmox host. I've got 36 OSDs, each on a 7.2K spinner with the journal on the same disk. I didn't have time to completely finish testing, but my initial results from a VM were:

10/25/2013, 1 node, 32 threads

Read
Block Size (Bytes) | Throughput (MBps) | IOPS per disk | Total IOPS
4096               | 14.243            | 101.2835556   | 3646.208
131072             | 188.085           | 41.79666667   | 1504.68
4194304            | 434.386           | 3.016569444   | 108.5965

Write
Block Size (Bytes) | Throughput (MBps) | IOPS per disk | Total IOPS
4096               | 9.02              | 64.14222222   | 2309.12
131072             | 286.494           | 63.66533333   | 2291.952
4194304            | 429.53            | 2.982847222   | 107.3825

After I realized this, I did another quick test directly from the Proxmox OS and got:

Read
Block Size (Bytes) | Throughput (MBps) | IOPS per disk | Total IOPS
4096               | 0                 | 0             | 0
131072             | 0                 | 0             | 0
4194304            | 1165.182          | 8.091541667   | 291.2955


I didn't have time to run it through the full barrage of read and write tests, but I realized that for some reason my access to the cluster from the VM was being limited, even though the host machine was able to achieve much higher results.
 

jinjer

Active Member
Oct 4, 2010
203
6
38
With 4MB blocks you're probably limited by the network speed (10G ethernet?).
KVM seems to incur a 2x penalty, but perhaps you're running into a default iops limit for KVM (which is a good thing, as it prevents a single KVM from bringing down the cluster).
How many servers are you using to host those 36 disks/OSDs?

I would concentrate more on the 64k-128k operations, as those are the ones causing major problems (in our shop). That means temporary tables for mysql databases, and also running an average mail spool.

Webserver loads tend to be more read-only, with a lower average block size.
 

ymmot04

Active Member
May 26, 2011
Chicago, United States
tomfoster.us
Yes, 10G ethernet. I've got 3 nodes, 12 disks per node. Each node also has 2 SSDs being used for OS/mon.

You might be right that my 10G interface on Proxmox could be a bottleneck. Beyond that point, the IO Aggregator (which connects all the blades together) has a 40G trunk to the primary switch, which in turn has 10G connections to each Ceph node, giving the Ceph cluster 30G of combined bandwidth.

I just did another set of tests directly on one Proxmox host. These went to a 2x replication pool as well, so they should be real-world numbers, since replication between the OSDs will be happening. Results below:

Read
Block Size (Bytes) | Throughput (MBps) | IOPS per disk | Total IOPS
4096               | 48.027            | 341.5253333   | 12294.912
131072             | 535.739           | 119.0531111   | 4285.912
4194304            | 885.099           | 6.146520833   | 221.27475

Write
Block Size (Bytes) | Throughput (MBps) | IOPS per disk | Total IOPS
4096               | 22.034            | 156.6862222   | 5640.704
131072             | 328.495           | 72.99888889   | 2627.96
4194304            | 616.88            | 4.283888889   | 154.22

To test the theory that my 10G link on Proxmox is being saturated, I will try running the test from two nodes at once and combining the results.
 

ymmot04

Active Member
May 26, 2011
34
0
26
Chicago, United States
tomfoster.us
Running from two nodes at once netted a 10-20% increase in most tests. Part of this could be because I started them about a second apart. I assume reduced load on the NICs/cables, and thus less packet loss, probably helps as well.
 

symmcom

Renowned Member
Oct 28, 2012
Calgary, Canada
www.symmcom.com
@ymmot04, We gotta share notes for CEPH+Proxmox!! :)

My main goal currently is to get the max I/O performance out of the hardware I have. Although the specs are not the best in the world, I feel like I am not getting the performance I should be. Benchmarks from Proxmox VMs show low I/O speed. What tests did you run to get the benchmarks above?
 

symmcom

Renowned Member
Oct 28, 2012
1,087
38
68
Calgary, Canada
www.symmcom.com
Don't forget about me re: sharing notes :)

How can I forget. :) Your tips on balance-rr and multi-switch setups are priceless. Today I am going to do the final reconfiguration of the setup with some upgraded Intel 1 Gigabit NICs, recabling and reorganizing both the proxmox and ceph clusters. I gotta do something about this low I/O for CEPH: either I get some form of confirmation that this is the max speed I am going to get and live with it, or I increase bandwidth/performance. Starting with a rework of the network layout seems the logical way to go. I will keep everybody posted.

What benchmarking tools do you guys run to measure I/O, iops and bandwidth, both within a node and across nodes?
 

mir

Famous Member
Apr 14, 2012
Copenhagen, Denmark
@ymmot04, We gotta share notes for CEPH+Proxmox!! :)
May I suggest creating some new forums? I would suggest:
1) PVE and storage, e.g. ceph, glusterfs, sheepdog and zfs
2) PVE and networking, e.g. network architecture, switches, nics etc.
3) PVE nodes and hardware, e.g. motherboards, disks, RAM and CPU (kind of like my own PVE build)
 

symmcom

Renowned Member
Oct 28, 2012
1,087
38
68
Calgary, Canada
www.symmcom.com
May I suggest creating some new forums? I would suggest:
1) PVE and storage, e.g. ceph, glusterfs, sheepdog and zfs
2) PVE and networking, e.g. network architecture, switches, nics etc.
3) PVE nodes and hardware, e.g. motherboards, disks, RAM and CPU (kind of like my own PVE build)

I vote Yes!
 

ymmot04

Active Member
May 26, 2011
Chicago, United States
tomfoster.us
Sounds like a great idea to me. That way we can see how others are configuring their systems and what works well.

As far as benchmarking for Ceph goes, you want to use the RADOS bench tool. It is included in ceph-common and thus already installed on Proxmox.

I recommend first creating a test pool on ceph with appropriate PG size. (See the Customize Ceph section I wrote for appropriate PG sizing)

sudo ceph osd pool create 2xtest 4096

**See the attached excel spreadsheet for the block sizes I have been testing with. Make sure to update the disk count in the top left to reflect how many disks are in your cluster. After the test, enter your average throughput in the "Throughput MBps" column and it will generate your IOPS.
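The spreadsheet's IOPS math is just throughput divided by block size. As a quick sketch, the same conversion in the shell (the example values are taken from the 36-disk read table earlier in the thread):

```shell
# rados bench reports throughput in MB/s; the spreadsheet turns that into
# IOPS as throughput * 1024 * 1024 / block size, then divides by the
# number of OSDs for a per-disk figure.
mbps=14.243; block=4096; disks=36
awk -v t="$mbps" -v b="$block" -v d="$disks" 'BEGIN {
    total = t * 1024 * 1024 / b
    printf "total IOPS=%.0f, per-disk IOPS=%.2f\n", total, total / d
}'
# prints: total IOPS=3646, per-disk IOPS=101.28
```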

**You must do a write of a particular block size before you can do a read (seq); also, if you wait too long you may need to do another write before you can read.

This is the format for the testing command:

rados -p test bench -b *blockSize* *secondsToRun* *seq/write* -t *numberOfThreads* -c ceph.conf -k ceph.client.admin.keyring --no-cleanup

"blockSize" is in bytes and you can use the sizes listed in my spreadsheet for a good spread between bandwidth and IO. "secondsToRun" I usually set to 30-60 seconds. "seq" is read and "write" is... write. The number of threads should match the number of cores on the box you are testing from (that gives best results from what I have seen). You must also copy your ceph.conf and keyring from ceph into the working directory of the server you are testing from. I test directly from a proxmox host because my VMs seem to be limited in throughput as I explained earlier.

Example tests:

rados -p test bench -b 4194304 60 write -t 32 -c ceph.conf -k ceph.client.admin.keyring --no-cleanup

rados -p test bench -b 4194304 60 seq -t 32 -c ceph.conf -k ceph.client.admin.keyring --no-cleanup

It might also be a good idea to purge your cache before doing the read for best accuracy, although it didn't make much of a difference in my case.

echo 3 > /proc/sys/vm/drop_caches
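Two small, optional extras (a sketch; the pool name "test" follows the commands above): flush dirty pages with sync before dropping caches, and remove the benchmark objects that --no-cleanup leaves behind once you are done:

```shell
# make the cache drop effective: write out dirty pages first
sync
echo 3 > /proc/sys/vm/drop_caches

# newer ceph releases can purge the objects left by rados bench --no-cleanup
rados -p test cleanup
```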
 

Attachments

  • Ceph Iops Calculator Template.zip
    10 KB

symmcom

Renowned Member
Oct 28, 2012
Calgary, Canada
www.symmcom.com
Great info ymmot04!!

I use RADOS bench for testing iops, though I did not use some of the --arguments you used. I have a separate pool just for read/write testing. But my read test never worked: I can write just fine, but every time I tried to read with the seq option it said I had to write something before reading. Also, the echo to purge the cache never worked. No doubt something is not right on my end. After I am done reorganizing the network topology a bit for balance-rr, I am going to try the tests again. Thanks for the calculator template!!
 

symmcom

Renowned Member
Oct 28, 2012
Calgary, Canada
www.symmcom.com
Did you use the --no-cleanup option when you did a write? If not, the data would be removed and you would have nothing to read.

LOL, that explains it. No, I did not use --no-cleanup, so right after the write finished it deleted the data and the read test said there was nothing to read. I knew I was missing something. :) Thanks again!
 

symmcom

Renowned Member
Oct 28, 2012
1,087
38
68
Calgary, Canada
www.symmcom.com
@jinjer,
balance-rr.png
Here is a pic of how I am trying to set up balance-rr for the Proxmox+CEPH clusters. The top part shows both switches connected separately, with nothing connecting them to each other. In this setup, Proxmox pvestatd randomly loses its connection and the GUI shows the node as offline along with all its VMs, although nothing is really offline. After a few secs pvestatd comes back. While the GUI shows offline, I can access all VMs and can SSH without any issue.

In the 2nd setup, where I connected both switches to each other, everything works just fine without any issue.

In both setups there does not seem to be any reduction in iperf benchmark performance, except that in the first setup, every few consecutive tests, it would show I/O dropping significantly. Any idea?
 

udo

Famous Member
Apr 22, 2009
Ahrensburg; Germany
I'm wondering about the speed of ceph when resyncing large files (i.e. kvm images) in the event of a failure of one of the nodes.

Say a node in a ceph cluster is rebooted, taking away images for a few minutes.

glusterfs will take a read-back of the whole vm storage to resync the images.

Is ceph the same?

This is where drbd shines; however, it's only a 2 node solution.

Hi,
I've just expanded our ceph cluster (3 nodes) from 24 to 38 hdds. The resync runs at more than 400 MB/s:
Code:
ceph health
HEALTH_WARN 9 pgs backfill; 22 pgs backfilling; 31 pgs stuck unclean; recovery 152685/17523530 degraded (0.871%);  recovering 124 o/s, 496MB/s
The OSDs use a separate 10Gb network.

Udo
 

jinjer

Active Member
Oct 4, 2010
@jinjer,
balance-rr.png
Here is a pic of how I am trying to set up balance-rr for the Proxmox+CEPH clusters. The top part shows both switches connected separately, with nothing connecting them to each other. In this setup, Proxmox pvestatd randomly loses its connection and the GUI shows the node as offline along with all its VMs, although nothing is really offline. After a few secs pvestatd comes back. While the GUI shows offline, I can access all VMs and can SSH without any issue.

In the 2nd setup, where I connected both switches to each other, everything works just fine without any issue.

In both setups there does not seem to be any reduction in iperf benchmark performance, except that in the first setup, every few consecutive tests, it would show I/O dropping significantly. Any idea?
With interconnected switches you're exercising the STP algorithm on the switches, as they see the same MAC address both on one of their own ports and on a port of the other switch.

I would try a few tests between two nodes without a switch in the middle, to make sure the whole balance-rr thing is working properly (and that the drivers for the cards are ok).
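A quick way to validate the bond itself is a multi-stream iperf run between two directly cabled nodes (a sketch; the address is a placeholder for node A's bond0 IP, and with 3x1G balance-rr you would hope to see close to 3 Gbit/s):

```shell
# on node A: listen
iperf -s

# on node B: run 4 parallel streams for 30 seconds against node A
iperf -c 10.0.0.1 -P 4 -t 30
```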
 
