Ceph large file (images) resync

Yes, 10G ethernet. I've got 3 nodes, 12 disks per node. Each node also has 2 SSDs being used for OS/mon.

You might be right that my 10G interface on Proxmox could be a bottleneck. Beyond that point, the IO Aggregator (which connects all the blades together) has a 40G trunk to the primary switch, which then has a 10G connection to each Ceph node, giving the Ceph cluster 30G of combined bandwidth.

I just ran another set of tests directly on one Proxmox host. These also go to a 2x replication pool, so they should be realistic, since replication between OSDs is happening. Results below:

[TABLE="width: 428"]
[TR]
[TD="colspan: 4"]Read[/TD]
[/TR]
[TR]
[TD]Block Size (Bytes)[/TD]
[TD]Throughput (MBps)[/TD]
[TD]IOPS Per disk[/TD]
[TD]Total IOPS[/TD]
[/TR]
[TR]
[TD="align: right"]4096[/TD]
[TD="align: right"]48.027[/TD]
[TD="align: right"]341.5253333[/TD]
[TD="align: right"]12294.912[/TD]
[/TR]
[TR]
[TD="align: right"]131072[/TD]
[TD="align: right"]535.739[/TD]
[TD="align: right"]119.0531111[/TD]
[TD="align: right"]4285.912[/TD]
[/TR]
[TR]
[TD="align: right"]4194304[/TD]
[TD="align: right"]885.099[/TD]
[TD="align: right"]6.146520833[/TD]
[TD="align: right"]221.27475[/TD]
[/TR]
[TR]
[TD="colspan: 4"]Write[/TD]
[/TR]
[TR]
[TD]Block Size (Bytes)[/TD]
[TD]Throughput (MBps)[/TD]
[TD]IOPS Per disk[/TD]
[TD]Total IOPS[/TD]
[/TR]
[TR]
[TD="align: right"]4096[/TD]
[TD="align: right"]22.034[/TD]
[TD="align: right"]156.6862222[/TD]
[TD="align: right"]5640.704[/TD]
[/TR]
[TR]
[TD="align: right"]131072[/TD]
[TD="align: right"]328.495[/TD]
[TD="align: right"]72.99888889[/TD]
[TD="align: right"]2627.96[/TD]
[/TR]
[TR]
[TD="align: right"]4194304[/TD]
[TD="align: right"]616.88[/TD]
[TD="align: right"]4.283888889[/TD]
[TD="align: right"]154.22[/TD]
[/TR]
[/TABLE]
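
For anyone wanting to sanity-check the table: the IOPS columns follow directly from throughput and block size. A minimal sketch, assuming the throughput figures are MiB/s (1048576 bytes) and that "per disk" means divided evenly across all 36 OSD disks (3 nodes x 12). The function names are made up for illustration.

```shell
# Total IOPS = throughput in bytes/s divided by block size in bytes.
#   $1 = throughput in MiB/s (assumed unit), $2 = block size in bytes
total_iops() {
    awk -v mibps="$1" -v bs="$2" 'BEGIN { printf "%.3f\n", mibps * 1048576 / bs }'
}

# Per-disk IOPS = total IOPS spread evenly over the number of disks.
#   $3 = number of disks
per_disk_iops() {
    awk -v mibps="$1" -v bs="$2" -v disks="$3" \
        'BEGIN { printf "%.3f\n", mibps * 1048576 / bs / disks }'
}

total_iops 48.027 4096         # 4k read row, prints 12294.912
per_disk_iops 48.027 4096 36   # same row per disk, prints 341.525
```

Both values match the 4k read row, which is what suggests the MiB/s reading of the throughput column.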

To test the theory that my 10G link on Proxmox is saturated, I will run the test again from two nodes at once and combine the results.
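
As a rough check on that theory, the best read figure can be converted into a share of the line rate. A sketch, assuming the throughput is in MiB/s and ignoring Ethernet/TCP overhead; `link_util` is a hypothetical helper, not part of any tool.

```shell
# Percent of a link's nominal rate used by a given payload throughput.
#   $1 = throughput in MiB/s (assumed unit), $2 = link speed in Gbit/s
link_util() {
    awk -v mibps="$1" -v gbit="$2" \
        'BEGIN { printf "%.1f\n", mibps * 1048576 * 8 / (gbit * 1e9) * 100 }'
}

link_util 885.099 10   # 4 MiB read result against one 10G port, prints 74.2
```

That is roughly three quarters of the port before protocol overhead, so the client link is at least a plausible limit.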

Here is my benchmark of CEPH from one of the Proxmox nodes using rados. The CEPH network is bonded at 2 Gbps.
iops-bench.PNG

The 4194304-byte block size read maxed out the bandwidth at 204.35 MB/s, so it seems the reorganized network is working fine. The 4096-byte read is the only number where my setup was able to keep up, at 47.7 MB/s. You have 36 OSDs; I have 6. Does the number of OSDs really make a significant difference?

Ran some hdparm benchmarks on the HDDs themselves. Here are the average numbers from 10 runs:
CEPH Node 1 : 131 MB/s
CEPH Node 2 : 134 MB/s
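
For reference, an average like this can be scripted. A sketch, assuming the numbers came from `hdparm -t` (buffered disk reads; the post does not say which flag was used) and with `/dev/sdX` as a placeholder for the real device. It needs root.

```shell
# Average the MB/sec figures from "Timing buffered disk reads" lines on stdin.
avg_hdparm() {
    awk '/Timing buffered disk reads/ { sum += $(NF - 1); n++ }
         END { if (n) printf "%.1f\n", sum / n }'
}

# Run hdparm -t several times against one device and average the results.
#   $1 = device (placeholder, e.g. /dev/sdX), $2 = number of runs (default 10)
bench_disk() {
    i=0
    while [ "$i" -lt "${2:-10}" ]; do
        hdparm -t "$1"
        i=$((i + 1))
    done | avg_hdparm
}

# Usage (as root): bench_disk /dev/sdX 10
```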
 
Good to see some numbers, symcomm. Maybe I should rename the thread.

All you care about is IOPS in the 64-128k range, in addition to the 4k range.

I think your tests are heavily skewed by in-memory cache. The write tests, on the other hand, seem to show the real thing, topping out at around 100 IOPS, which is what a single SATA 7.2k spinner can do on its own.
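
For context on that ~100 IOPS figure, the usual back-of-envelope estimate for a spinning disk is one random I/O per (average seek + half a rotation). A sketch; the 8.5 ms seek time is an assumed typical value for a 7.2k SATA drive, not something measured in this thread, and `disk_iops` is a made-up helper.

```shell
# Rough random IOPS for one spinning disk.
#   $1 = rotational speed in RPM, $2 = average seek time in ms (assumed)
disk_iops() {
    awk -v rpm="$1" -v seek="$2" 'BEGIN {
        rot_ms = 0.5 * 60000 / rpm      # average rotational latency: half a turn
        printf "%.0f\n", 1000 / (seek + rot_ms)
    }'
}

disk_iops 7200 8.5   # prints 79
```

That lands in the same ballpark as the figure quoted above.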

I would suggest you run a bonnie++ bench from inside a KVM Linux machine and try parallel tests across machines.

What you're interested in is the seeks/sec that bonnie gives you. That number is almost identical to the real IOPS you get in a "real world" synthetic test (database/mail/web load). Don't forget to run it with the -f switch (fast mode, which skips the per-character I/O tests).

I would aim for at least 100 seeks/sec across the pool of KVM machines. Anything less and your users will think your offering is sub-par. Hell... even Amazon gives you more than 50 on their instances.

I have a question for you Ceph gurus.
Why does Ceph insist on using one OSD per disk, instead of making one big RAID 0 device from a bunch of disks and getting some proper speed when accessing it?

All these multiple processes, with their caches and the need to sync with each other, only increase complexity, not to mention that resource requirements go through the roof. Perhaps the 100 IOPS bottleneck is really this: you can't write to the same OSD at a higher rate, and the Ceph "cluster" is not that smart at parallelizing small chunks of data.

I would suggest trying to build the Ceph cluster with a single (software) RAID 0 array per server, keeping the chunk and stripe sizes similar to the real workload.
 
I am in no way qualified as a "ceph guru"; a fresh apprentice, maybe. :)
But I think the one-OSD-per-HDD approach gives CEPH an edge. It does not have to deal with an extra layer of RAID management. CEPH treats the OSD as its minimum building block, which makes it far easier to set up. A physical node is nothing but a chassis to plug HDDs into. With Gluster, Napp-it, or DRBD you have to think machine to machine: when changing chassis, you have to move the entire block of HDDs or resync the whole set. Sure, the performance is not the greatest compared to the others, but it makes management easier. Maybe I am too lazy. :)
 
The theory is that a RAID controller (HW or SW) just adds unneeded complexity (and cost), and the amount of data you need to resync on failure gets very large. But using the RAID controller cache with single disks might be a good idea.
 
@ymmot04, @jinjer,
Could you please post some proper bonnie++ benchmarking commands to test both bandwidth and IOPS ?
Thanks!
the following should get you going:
Code:
mkdir /mnt/bonnie
chown nobody /mnt/bonnie
bonnie++ -f -n 384 -u nobody -d /mnt/bonnie

Also don't forget to use the [-r ram-size-in-MiB] option to tell bonnie how much physical RAM your node has (or you could be benchmarking the cache on your host, which is probably much bigger than your KVM machine knows).
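
A quick way to get that number, as a sketch: read MemTotal from /proc/meminfo and convert kB to MiB. Run it on the Proxmox host (not inside the guest) so you get the host's physical RAM. `mem_total_mib` is a made-up helper name.

```shell
# Print total RAM in MiB from a meminfo-style file (default: /proc/meminfo).
mem_total_mib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# On the host:
#   mem_total_mib
# Then pass the printed value to bonnie++ inside the VM, e.g. for 16384:
#   bonnie++ -f -n 384 -u nobody -d /mnt/bonnie -r 16384
```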
 
I'm a strong supporter of software RAID. Hardware RAID has little (and expensive) memory, a slow CPU and so on.
However, I don't buy the complexity argument. You're replacing a level of "simple" complexity (local RAID disk handling) with complexity that is orders of magnitude higher (i.e. remote OSDs, networking, connects/reconnects and the general unreliability of the environment).

I don't know what the complexity of the OSD/mon algorithms is, or how it scales with the size of an OSD versus the number of OSDs, but from "common sense" I would expect it to be much easier to handle fewer, loosely connected but bigger and faster blocks (i.e. RAID 0 or 10 arrays) than a myriad of small, slow single-disk "bricks" which are also loosely interconnected and need to be kept in sync across slow lines (i.e. the network).

Beats me... but it explains some of the ridiculous (imho) resource requirements I've read on the Ceph doc page.
 
Out of the box, CEPH can do the following things, some of which I have personally tried:
CEPH can be scaled from multi-rack to multiple remote locations. From one admin node you can see the entire cluster and all its nodes, down to a single HDD. The whole complexity of RAID can be ignored, minimizing management overhead. As long as an OSD is visible to the cluster, it is part of the cluster and part of all syncs. Only one node is needed to do almost all management work on the cluster.

I agree with jinjer's comment that fewer, bigger blocks are easier to handle than many smaller ones. True, in CEPH you are managing many, many HDDs, but it makes things that much easier. There is no need to worry about a RAID going bad; you just have to make sure the HDDs are running, regardless of which node or location they are in. jinjer, could you post a link to where you saw the resource requirements?

I may sound like a CEPH salesperson, but I am just in love with it. :)
 
Hi,
For reference, done inside a 1 GB VM with 1 core:
Code:
bonnie++ -f -n 384 -u nobody -d bonnie -r 1024
Using uid:65534, gid:65534.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
archiv      2G           191675  18 78441  11           207552  17  3236  45
Latency                       35066us     158ms              8302us   32120us
Version  1.96       ------Sequential Create------ --------Random Create--------
archiv         -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                384 17433  71 525564  99 19062  59 14178  56 175472  98 18429  68
Latency              2719ms     236us    4268us    8692ms    4163us    5786ms
1.96,1.96,archiv,1,1383388544,2G,,,,191675,18,78441,11,,,207552,17,3236,45,384,,,,,17433,71,525564,99,19062,59,14178,56,175472,98,18429,68,,35066us,158ms,,8302us,32120us,2719ms,236us,4268us,8692ms,4163us,5786ms
The latency is perhaps a problem?!

Udo
 
Udo, thank you for the numbers. Yes, the latency seems huge. Does the server feel slow and sluggish during the test?

The -r parameter is wrong. If you have a 1G KVM machine and 16G of RAM on the node, you should test with -r 16384; otherwise any unused RAM on the host node will act as cache, altering the results.

This mainly affects the raw throughput tests (i.e. the intelligent read/write/rewrite).
You can also play with the -n 384 parameter. It means that bonnie will create 384K files in the directory for testing metadata operations and so on. Usually that number is enough.
The seeks/sec seems a little high for a clustered fs. There is probably some caching there too. I would try increasing the -n parameter.
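
To compare seeks/sec across several runs, the CSV line at the end of the bonnie++ output is easier to parse than the table. A sketch, assuming the 1.96 CSV layout as seen in the run posted above, where seeks/sec is the 18th comma-separated field; other bonnie++ versions may order the fields differently. `bonnie_summary` is a hypothetical helper.

```shell
# Extract block write, block read (K/sec) and seeks/sec from bonnie++ 1.96 CSV lines.
# Field positions (10, 16, 18) are read off the 1.96 sample line, not taken from a spec.
bonnie_summary() {
    awk -F, '$1 == "1.96" { printf "%s write=%s read=%s seeks=%s\n", $3, $10, $16, $18 }'
}

echo '1.96,1.96,archiv,1,1383388544,2G,,,,191675,18,78441,11,,,207552,17,3236,45,384' |
    bonnie_summary
# prints: archiv write=191675 read=207552 seeks=3236
```

In practice you would pipe the last line of each bonnie++ run (for example via `tail -n 1`) into the function.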
 
Some final numbers for the CEPH+Proxmox cluster:
ceph-bench.png
After finalizing the balance-rr multi-switch setup, I ran several benchmarks. All tests had similar results. The image above shows one of the final results. The first benchmark measures the speed of the OSDs themselves, the 2nd and 3rd measure LAN bandwidth and write speed, and the 4th is a RADOS benchmark at 3 block sizes.
I would greatly appreciate any thoughts on the numbers, given the current setup of 2 CEPH nodes, 6 OSDs and 2 Gbps network connectivity.
 
Write speed looks very bad?

Yes, the 4MB write in particular is the slowest. It's true I don't have a 10Gb network backbone like ymmot04, but I should still get faster writes than this. I ran every test I could with the current setup and I don't think there is anything else to try on it. So the next step is to add a 3rd node and fill all bays with HDDs to bring the total OSD count to 36, then run identical tests and see what happens. I was told that with CEPH, the more OSDs you have, the faster it works.
 
I'm very interested in this thread as I'm about to do a similar test/setup with CEPH and bonding. I'm wondering whether network stability and speed still hold if I bond six 1Gb NICs using six "dumb" switches. Is "stability" an issue? (I read somewhere about out-of-order packet problems.) Or does the relationship become less linear as we bond that many NICs, or is the overhead loss negligible? If this essentially gives us everything for "free", then why isn't everyone doing it? I see a lot of references to managed LACP switches (which will not make a single iperf "n" times faster, where n is the number of NICs), but few references to this multi-"dumb"-switch approach. (That said, I found a nice Linux KVM reference on bonding which makes me feel better about the strategy: http://www.linux-kvm.org/page/HOWTO_BONDING )
 
I have read somewhere that using more than 3 NICs in a balance-rr setup is a waste of NICs, since out-of-order packets increase exponentially once you go past 3 NICs.
 
I ran my CEPH cluster with both LACP and balance-rr bonding setups. I am finding that although balance-rr with dumb switches has some big advantages, the packet loss issues outweigh the performance gain. Thanks to jinjer, I learned that balance-rr allows fault tolerance across switches, not just across NICs. Performance-wise I am not noticing any positive difference. After much thinking and debate, I have decided to go back to LACP bonding. Even though I will not get more than 1 Gbps of bandwidth, it seems like the least problematic way to get a stable platform.

It is also possible that I have missed some important setting, which is causing massive packet loss even though the switches are separated. I would be interested to know if anybody else has balance-rr bonding working flawlessly with ordinary gigabit NICs.

As mir pointed out above, a higher number of NICs is just going to cause a whole bunch of issues for balance-rr bonding. I suggest you stay away from it, or, if you do want to learn and measure performance in your own environment, do it on a completely separate test network and not the live/production platform. A 6-NIC setup might cause serious trouble that could take your network down and cause major downtime.

I will keep posting updates here on going back to LACP bonding.
 
Finally gave CephFS a try. I cloned an existing Win 7 VM which is stored on RBD and put it on CephFS. Below is the CrystalDisk benchmark. The virtual disk has writethrough cache enabled, and obviously some read caching is happening. I also added the benchmark for the same VM on RBD storage. It does not look like there is a huge difference, and the VM seems to be stable on CephFS.

cephfs-1.PNG
Benchmark for VM on CephFS Storage

rbd-1.png
Benchmark for VM on RBD Storage
 
Not wanting to go further off-topic on an off-topic thread, but after reading about the bonding issues it seems appropriate to share an alternative idea.

Trying to aggregate 1G ethernet is a pain, creates a mess of cables and might not even be the cheapest solution for high bandwidth networks.

You can get a dual-port 10Gbps Infiniband card off eBay for about the same cost as six 1Gb cards, or less.
I've seen 24-port 10G Infiniband switches for as little as $250.

Most of the IB gear sold as used is pulled from decommissioned or upgraded compute clusters. I have never received a defective used IB part. (Well, I once got a part that was smashed; the seller promptly replaced it.)
IB is only useful for local networks like the ones you would use for DRBD, CEPH or iSCSI; you are not able to give virtual servers interfaces on the IB network.
I regularly get over 600MB/sec live migration speeds, so it's useful just for the Proxmox cluster network too.

If 10G is not fast enough there are 20G IB cards/switches too. Cards are getting cheap; switches are still rather expensive, though.
20G Infiniband is "old technology"; 40G IB came out in 2010. I can't wait for people to start upgrading their 40G interconnects so those end up on eBay :p

You can bond IB interfaces, but only in active-backup mode. Dual-port cards are great for redundant connections to protect against switch failures or a bumped cable. Or use each port for a different network.

That's my 2 cents. Order in moderation to prevent demand from driving up prices ;)
 
This thread has now become the official HQ of Proxmox+CEPH discussion. :D

The thought of Infiniband has occurred to me, and I did look on eBay for some equipment. The price almost made me buy the whole lot. Don't panic, I'm not going to do that. :)
@e100, could you please post the model of the card, if you are using it with CEPH? Did you need to recompile the driver, or did it work out of the box?

I have just been waiting for somebody to tell me that this is all the speed I will get on my 1Gbps network and no amount of tweaking will increase it. I guess I have been trying to save some cash before I go spend $$ on eBay.

So, to all CEPH users: is this a normal benchmark for a 1Gbps network with 6 OSDs and 2 CEPH nodes?
iops-bench.PNG
 
I have used all of these:
Mellanox MHGA28-XTC (20Gbps dual port,pcie)
Mellanox MHEA28-XTC (10Gbps dual port, pcie)
Mellanox MHXL-CF128-T (10Gbps dual port, pci-x)

I have not had to compile anything to make those cards work. I suspect that will be true for nearly any card, since many compute clusters use IB and many run Linux.

We have 6 TopSpin 120 switches: 24 ports, 10Gbps, dual power supplies, built-in subnet manager.
If your switch does not have a subnet manager, or you are not using a switch (PC-to-PC connection), you need to run opensm on one or more of your nodes.

As for Ceph performance, I really have no idea whether your benchmarks are OK or not; I just started using CEPH last week.
I have 4 CEPH nodes, each with 4 SATA disks. One disk is used for journals and the other 3 are used for OSDs, for a total of 12 OSDs.
Over Infiniband, using the rados client, I can get 770MB/sec read speed (sequential).
But inside a KVM VM I only get 187MB/sec. This seems to be caused by KVM itself, which only has one IO thread.
With only one thread using rados I get about 200MB/sec, which seems to support the idea that the single IO thread in KVM is the limiting factor.
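
If anyone wants to reproduce the thread-count comparison, something like the following should do it. This is a sketch that only prints the commands rather than running them. The pool name `testpool` and the 60-second duration are placeholders, `rados_bench_plan` is a made-up helper, and this is not necessarily the exact invocation behind the numbers above.

```shell
# Print (not run) a rados bench write + sequential read pair for one pool.
#   $1 = pool name (placeholder), $2 = seconds per run, $3 = concurrent ops (-t)
rados_bench_plan() {
    pool=$1; secs=$2; threads=$3
    # Write first and keep the objects so the seq test has something to read.
    echo "rados bench -p $pool $secs write -t $threads --no-cleanup"
    echo "rados bench -p $pool $secs seq -t $threads"
    # Remove the benchmark objects afterwards.
    echo "rados -p $pool cleanup"
}

rados_bench_plan testpool 60 1    # single thread, comparable to one KVM IO thread
rados_bench_plan testpool 60 16   # 16 concurrent ops for comparison
```

Comparing the `seq` bandwidth at `-t 1` against a higher `-t` shows how much of the gap is down to concurrency rather than the network.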

I did get better performance using a different IB port/network for the CEPH public and private networks than when I had them all on one IB port/network.

Here are my CEPH benchmarks from within a Windows 2008 VM:
Code:
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :   186.845 MB/s
          Sequential Write :    71.298 MB/s
         Random Read 512KB :   137.421 MB/s
        Random Write 512KB :    57.416 MB/s
    Random Read 4KB (QD=1) :     2.945 MB/s [   718.9 IOPS]
   Random Write 4KB (QD=1) :     8.378 MB/s [  2045.3 IOPS]
   Random Read 4KB (QD=32) :    42.006 MB/s [ 10255.5 IOPS]
  Random Write 4KB (QD=32) :     8.145 MB/s [  1988.4 IOPS]

  Test : 1000 MB [F: 0.3% (0.1/32.0 GB)] (x5)
  Date : 2013/11/15 8:54:25
    OS : Windows Server 2008 R2 Server Standard Edition (full installation) SP1 [6.1 Build 7601] (x64)

Have you tried any tuning in CEPH?
I found that increasing "osd op threads" and "osd disk threads" seemed to help.

In ceph.conf:
Code:
[global]
  osd op threads = 8
  osd disk threads = 4

Someone will want to know how to configure Infiniband interfaces in /etc/network/interfaces.
These examples work on both Debian and Ubuntu.
Code:
auto ib0
iface ib0 inet static
        pre-up modprobe ib_ipoib
        address  192.168.99.99
        netmask  255.255.255.0
        pre-up echo connected > /sys/class/net/ib0/mode
        mtu 65520

If you want active-backup bonding:
Code:
auto ib0
iface ib0 inet manual
        pre-up modprobe ib_ipoib
        pre-up modprobe bonding
        bond-master bond3
        bond-primary ib0

auto ib1
iface ib1 inet manual
        pre-up modprobe ib_ipoib
        pre-up modprobe bonding
        bond-master bond3

auto bond3 
iface bond3 inet static
        address  192.168.99.99
        netmask  255.255.255.0
        pre-up modprobe ib_ipoib
        bond-slaves none
        bond_miimon 100
        bond_mode active-backup
        pre-up echo connected > /sys/class/net/ib0/mode
        pre-up echo connected > /sys/class/net/ib1/mode
        pre-up modprobe bonding
        mtu 65520
 