Ceph large file (images) resync

jinjer · Oct 27, 2013

symmcom said:
Am I facing this issue because I am trying to split balance-rr bonding across 3 separate switches? Should I put all links back on a single switch for balance-rr and set up LACP?

I see no problem in your setup. It's actually an old, well-kept little secret.

Separating the links over different, independent switches is precisely what you want in order to avoid a single point of failure and also break the 1G barrier, unless you want to spend big bucks on a stackable, proprietary solution. Don't forget that the stack is also a single point of failure: it's a bus that all switches connect to, and a single switch can kill it through a software bug or a hardware failure.

Regarding the need to restart connections, balance-rr uses the same MAC on each enslaved interface. There is no arp magic going on.

Even if there was, I'm sure ceph is smart enough to restart a broken connection and recover from a temporary (and occasional) hiccup in the network.

The comments on Wikipedia are probably more related to interrupted browser downloads, which we don't care about in our case.
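For anyone who wants to check the same-MAC point on a running node, here is a minimal sketch. It assumes the bond is called `bond0` (a guess, not something stated in the thread) and reads the status file the Linux bonding driver exposes under `/proc/net/bonding/`. The helper names `bond_mode`, `bond_slaves` and `bond_slave_macs` are made up for illustration.

```shell
# Hypothetical helpers; the bond name "bond0" is an assumption.
BOND_STATUS="${BOND_STATUS:-/proc/net/bonding/bond0}"

# Print the bonding mode; balance-rr shows up as "load balancing (round-robin)".
bond_mode() {
    sed -n 's/^Bonding Mode: //p' "$BOND_STATUS"
}

# Print the names of the enslaved interfaces, one per line.
bond_slaves() {
    sed -n 's/^Slave Interface: //p' "$BOND_STATUS"
}

# Print the MAC currently in use on each slave, read from sysfs.
bond_slave_macs() {
    for nic in $(bond_slaves); do
        printf '%s %s\n' "$nic" "$(cat "/sys/class/net/$nic/address")"
    done
}
```

If the bond is really running in balance-rr, `bond_slave_macs` should list the same address next to every interface.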

wahmed · Oct 28, 2013

ymmot04 said:
Don't forget replication. By default you have 2x replication groups so it gets broken up and evenly distributed, then based on the default crush map each osd will replicate its data to an osd in a different chassis.

You are quite right on this one. With a 3-node setup, 2x replication is sufficient. Even though there is replication happening, I think it is still faster than full node replication.

ymmot04 said:
Also on the switches it was also my understanding that you needed to be all on the same switch, but I was thinking maybe jinjer knew something I didn't.

That's what I was thinking. I have never set up multi-switch link aggregation, and almost every article I found about balance-rr says you need to put all links on the same switch for it. But I followed jinjer's way of using separate switches for balance-rr and it seems to be working, except for some kinks here and there.

ymmot04 said:
In order to LACP across switches I had to buy special switches that join together into a logical chassis and allow me to create a virtual LAG across two separate switches.

Ya, those are stackable switches you are talking about. They are expensive to build a network on and, as you said, they still act as one.

I will keep pushing the boundary of balance-rr over separate switches and see where I hit the wall with it.
Thanks to jinjer I got to know a brilliant idea for physical switch redundancy.

jinjer · Oct 29, 2013

To bring this thread back on topic, I'd like to know what happens when a ceph node resets and comes back online. Does the cluster maintain some sort of "map" where modified blocks for each node (file) are kept, or does it start a full-scale resync of the KVM machine images? I am told glusterfs will do a full-scale (rsync-like) resync of each file. Is ceph "smarter"?

Also, what sort of aggregate IOPS can you expect from any Proxmox node towards the ceph cluster (in your 3-node configuration)? I am currently looking at replacing my current NFS-exported ZFS storage with something more scalable. I need to make decisions soon.

ymmot04 · Oct 29, 2013

Assuming you have multiple mons, or your 1 mon doesn't go down, as soon as the node goes down the OSDs will be marked "down" and that IO will be redirected. By default, there is a 5 minute timeout before the OSDs are marked out and the data starts to migrate. If shortly after the 5 minute mark the node comes back online, the replication stops and you will get HEALTH_OK fairly quickly since it only needs to update the missed replicas for that 5 minute period. If you know you are going to reboot, you can also mark the cluster with the "noout" flag which will prevent it from marking the disks out. Just don't forget to turn it back off.
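As a rough sketch of the "noout" routine around a planned reboot: `ceph osd set noout` and `ceph osd unset noout` are the standard commands, while the wrapper names `maintenance_begin` / `maintenance_end` and the `CEPH` override variable are just illustration.

```shell
# CEPH can be overridden (e.g. CEPH=echo) for a dry run; that override is an
# illustration aid, not something from the thread.
CEPH="${CEPH:-ceph}"

# Before the planned reboot: stop the cluster from marking down OSDs "out".
maintenance_begin() {
    "$CEPH" osd set noout
}

# After the node is back and its OSDs are up again: restore normal behaviour.
maintenance_end() {
    "$CEPH" osd unset noout
}
```

Run `maintenance_begin` on a node with admin credentials, reboot the Ceph node, wait for its OSDs to rejoin, then run `maintenance_end`.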

The results on my system were interesting. I later realized the results below were not accurate, as I was able to produce similar numbers from another VM running simultaneously on the same Proxmox host. I've got 36 OSDs, each on a 7.2K spinner with the journal on the same disk. I didn't have time to completely finish testing, but my initial results from a VM were:

10/25/2013, 1 Node, 32 Threads

Read

Block Size (Bytes)  Throughput (MBps)  IOPS Per disk  Total IOPS
4096                14.243             101.2835556    3646.208
131072              188.085            41.79666667    1504.68
4194304             434.386            3.016569444    108.5965

Write

Block Size (Bytes)  Throughput (MBps)  IOPS Per disk  Total IOPS
4096                9.02               64.14222222    2309.12
131072              286.494            63.66533333    2291.952
4194304             429.53             2.982847222    107.3825

After I realized this, I did another quick test directly from the proxmox os and got:

Read

Block Size (Bytes)  Throughput (MBps)  IOPS Per disk  Total IOPS
4096                -                  0              0
131072              -                  0              0
4194304             1165.182           8.091541667    291.2955

I didn't have time to run it through the full barrage of tests for both read and write, but I realized that for some reason my access to the cluster was being limited at the VM level, even though the host machine was able to achieve much higher results.

jinjer · Oct 29, 2013

With 4MB blocks you're probably being limited by the network speed (10G ethernet?).
KVM seems to incur a 2x penalty, but perhaps you're running into a default IOPS limit for KVM (which is good to prevent a single KVM guest from bringing down the cluster).
How many servers are you using to host those 36 disks/OSDs?

I would concentrate more on the 64k - 128k operations, as those are the ones causing major problems (in our shop). This means temporary tables for mysql databases and also running an average mail spool.

Webserver loads tend to be more read-only, with a lower average block size.

ymmot04 · Oct 29, 2013

Yes, 10G ethernet. I've got 3 nodes, 12 disks per node. Each node also has 2 SSDs being used for OS/mon.

You might be right that my 10G interface on Proxmox could be a bottleneck. After that point, the IO Aggregator (connects all the blades together) has a 40G trunk to the primary switch, which then has 10G connections to each Ceph node giving the Ceph cluster a 30G combined bandwidth.

I just did another set of tests directly on one Proxmox host. These are against a 2x replication pool as well, so they should be real world, since the replication between OSDs will be happening. Results below:

Read

Block Size (Bytes)  Throughput (MBps)  IOPS Per disk  Total IOPS
4096                48.027             341.5253333    12294.912
131072              535.739            119.0531111    4285.912
4194304             885.099            6.146520833    221.27475

Write

Block Size (Bytes)  Throughput (MBps)  IOPS Per disk  Total IOPS
4096                22.034             156.6862222    5640.704
131072              328.495            72.99888889    2627.96
4194304             616.88             4.283888889    154.22

To test the theory of my 10G link on Proxmox being saturated I will try again running the test from two nodes at once and combine the results.

ymmot04 · Oct 29, 2013

Running from two nodes at once netted a 10-20% increase in most tests. Part of this could be due to me starting them about a second apart. I assume reduced load on the NICs/cables and thus less packet loss probably helps as well.

wahmed · Oct 29, 2013

@ymmot04, We gotta share notes for CEPH+Proxmox!!

My main goal currently has been to get the max I/O performance out of the hardware I have. Although the specs are not the best in the world, I feel like I am not getting the performance I should be getting. Benchmarks from Proxmox VMs show low I/O speed. What tests did you run to get the benchmarks above?

jinjer · Oct 29, 2013

Don't forget about me re: sharing notes

Dell M series... sweet.

wahmed · Oct 29, 2013

jinjer said:
Don't forget about me re: sharing notes

How can i forget.

Your tips on balance-rr and multi-switch setups are priceless. Today I am going to do the final reconfiguration of the setup with some upgraded Intel 1 Gigabit NICs, recabling and reorganizing both the Proxmox and Ceph clusters. I gotta do something about this low I/O for CEPH. Either I get some form of confirmation that this is the max speed I am going to get and live with it, or I increase bandwidth/performance. Starting with a rework of the network layout seems to be the logical way to go. I will keep everybody posted.

What benchmarking tools do you guys run to measure I/O, IOPS and bandwidth, both within a node and across nodes?

mir · Oct 29, 2013

symmcom said:
@ymmot04, We gotta share notes for CEPH+Proxmox!!

May I suggest to create some new forums? I would suggest:
1) PVE and storage. eg ceph, glusterfs, sheepdog, and zfs
2) PVE and network. Eg. network architecture, switch, nics etc.
3) PVE nodes and hardware. Eg. motherboards, disks, RAM, and CPU. (kind of like my own PVE build)

wahmed · Oct 29, 2013

mir said:
May I suggest to create some new forums? I would suggest:
1) PVE and storage. eg ceph, glusterfs, sheepdog, and zfs
2) PVE and network. Eg. network architecture, switch, nics etc.
3) PVE nodes and hardware. Eg. motherboards, disks, RAM, and CPU. (kind of like my own PVE build)

I vote Yes!

ymmot04 · Oct 29, 2013

Sounds like a great idea to me. That way we can see how others are configuring their systems and what works well.

As far as benchmarking for Ceph, you want to use the RADOS testing tool. This is included in ceph-common and thus already installed on Proxmox.

I recommend first creating a test pool on ceph with appropriate PG size. (See the Customize Ceph section I wrote for appropriate PG sizing)

sudo ceph osd pool create 2xtest 4096

**See attached excel spreadsheet for the block sizes I have been testing with. Make sure to update the disks in the top left to reflect how many disks are in your cluster. After the test enter your average throughput in the "Throughput MBps" column and it will generate your IOPS.
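For those without the spreadsheet at hand, the conversion can be sketched in a few lines of shell. The formula is inferred from the numbers posted earlier in the thread (total IOPS = throughput in MBps × 1048576 / block size in bytes, per-disk IOPS = total / number of disks); treating "MBps" as 1048576 bytes per second is an assumption that happens to reproduce those tables. The function names are made up.

```shell
# Total IOPS from average throughput (MBps) and block size (bytes).
# Assumes 1 MB = 1048576 bytes, which matches the posted tables.
iops_total() {
    LC_ALL=C awk -v mbps="$1" -v bs="$2" \
        'BEGIN { printf "%.3f\n", mbps * 1048576 / bs }'
}

# IOPS per disk: total IOPS divided by the number of disks (OSDs).
iops_per_disk() {
    LC_ALL=C awk -v mbps="$1" -v bs="$2" -v disks="$3" \
        'BEGIN { printf "%.3f\n", mbps * 1048576 / bs / disks }'
}
```

For example, `iops_total 14.243 4096` prints 3646.208 and `iops_per_disk 14.243 4096 36` prints 101.284, in line with the 4K read row of the first VM test.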

**You must do a write of a particular block size before you can do a read (seq), also if you wait too long you may need to do another write before you can do a read.

This is the format for the testing command:

rados -p test bench -b *blockSize* *secondsToRun* *seq/write* -t *numberOfThreads* -c ceph.conf -k ceph.client.admin.keyring --no-cleanup

"blockSize" is in bytes and you can use the sizes listed in my spreadsheet for a good spread between bandwidth and IO. "secondsToRun" I usually set to 30-60 seconds. "seq" is read and "write" is... write. The number of threads should match the number of cores on the box you are testing from (that gives best results from what I have seen). You must also copy your ceph.conf and keyring from ceph into the working directory of the server you are testing from. I test directly from a proxmox host because my VMs seem to be limited in throughput as I explained earlier.

Example tests:

rados -p test bench -b 4194304 60 write -t 32 -c ceph.conf -k ceph.client.admin.keyring --no-cleanup

rados -p test bench -b 4194304 60 seq -t 32 -c ceph.conf -k ceph.client.admin.keyring --no-cleanup

It might also be a good idea to purge your cache before doing the read for best accuracy, although it didn't make much of a difference in my case.

echo 3 > /proc/sys/vm/drop_caches
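The write, cache drop and read steps above can be chained in a small wrapper, sketched below. It simply mirrors the command format from this post (flags may differ between Ceph releases). The function names, the `POOL` and `RUNNER` variables and the extra `sync` before dropping caches are my own additions for illustration; dropping caches needs root.

```shell
POOL="${POOL:-test}"   # pool name as used in the example commands above
RUNNER="${RUNNER:-}"   # set RUNNER=echo to print the commands instead of running them

# $1 = block size in bytes, $2 = seconds to run, $3 = write|seq, $4 = threads
rados_bench() {
    $RUNNER rados -p "$POOL" bench -b "$1" "$2" "$3" -t "$4" \
        -c ceph.conf -k ceph.client.admin.keyring --no-cleanup
}

# Flush dirty pages, then drop the page cache (requires root).
drop_caches() {
    $RUNNER sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
}

# Write first (a seq read needs existing data), drop caches, then read.
# $1 = block size in bytes, $2 = seconds to run, $3 = threads
bench_pair() {
    rados_bench "$1" "$2" write "$3" &&
        drop_caches &&
        rados_bench "$1" "$2" seq "$3"
}
```

Something like `RUNNER=echo; bench_pair 4194304 60 32` shows the three commands without touching the cluster; leave `RUNNER` empty to actually run them from the directory holding ceph.conf and the keyring.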

wahmed · Oct 30, 2013

Great info ymmot04!!

I use RADOS bench for testing IOPS, but did not use some of the arguments you have used. I have a separate pool just for read/write testing. But my read test never worked. I can do writes just fine, but every time I tried to read with the seq option it said I have to write something before reading. Also, the echo to purge the cache never worked. No doubt something is not right on my end. After I am done reorganizing the network topology a bit for balance-rr I am going to try the tests again. Thanks for the calculator template!!

ymmot04 · Oct 30, 2013

Did you use the --no-cleanup option when you did a write? If not, the data would be removed and you would have nothing to read.

wahmed · Oct 30, 2013

ymmot04 said:
Did you use the --no-cleanup option when you did a write? If not, the data would be removed and you would have nothing to read.

LOL, that explains it. No, I did not use --no-cleanup. So right after the write finished it deleted the data, and the read test said there was nothing to read. I knew I was missing something.

Thanks again !

wahmed · Oct 30, 2013

@jinjer,

Here is a pic of how I am trying to set up balance-rr for the Proxmox+CEPH clusters. The top part shows both switches connected separately, with nothing connecting them to each other. In this setup Proxmox pvestatd randomly loses connection and the GUI shows the node as offline along with all VMs, although nothing is really offline. After a few seconds pvestatd comes back. While the GUI shows offline, I can access all VMs and can SSH in without any issue.

In the 2nd setup, where I connected both switches to each other, everything works just fine without any issue.

In both setups there does not seem to be any reduction in iperf benchmark performance, except that in the first setup, every few consecutive tests would show I/O drop significantly. Any idea?

dietmar · Oct 30, 2013

symmcom said:
In this setup Proxmox pvestatd randomly loses connection and GUI shows node is offline

And does cman lose connections as well ('pvecm status')?

udo · Oct 31, 2013

jinjer said:
I'm wondering about the speed of ceph for resyncing of large files (i.e. kvm images) in the event of a failure of one of the nodes.

Say a node in a ceph cluster is rebooted, taking away images for a few minutes.

In glusterfs it will take a read-back of the whole vm storage to resync images.

Is ceph the same?

This is where drbd shines, however it's only a 2 node solution.

Hi,
I've just expanded our ceph cluster (3 nodes) to 38 HDDs (from 24). The resync runs at more than 400 MB/s:

Code:

ceph health
HEALTH_WARN 9 pgs backfill; 22 pgs backfilling; 31 pgs stuck unclean; recovery 152685/17523530 degraded (0.871%);  recovering 124 o/s, 496MB/s

The OSDs use a separate 10Gb network.

Udo

jinjer · Oct 31, 2013

symmcom said:
@jinjer,
View attachment 1782
Here is a pic of how I am trying to set up balance-rr for the Proxmox+CEPH clusters. The top part shows both switches connected separately, with nothing connecting them to each other. In this setup Proxmox pvestatd randomly loses connection and the GUI shows the node as offline along with all VMs, although nothing is really offline. After a few seconds pvestatd comes back. While the GUI shows offline, I can access all VMs and can SSH in without any issue.

In the 2nd setup, where I connected both switches to each other, everything works just fine without any issue.

In both setups there does not seem to be any reduction in iperf benchmark performance, except that in the first setup, every few consecutive tests would show I/O drop significantly. Any idea?

With interconnected switches, you're exercising the STP algorithm on the switches, as each one sees the same MAC address both on one of its own ports and on the port leading to the other switch.

I would try a few tests between two nodes without a switch in the middle, to make sure that the whole balance-rr thing is working properly (and the drivers for the cards are ok).
