Ceph performance and latency

Hi Guys,

The latest fio git version can now use an RBD device directly, so you can benchmark from your Proxmox host and avoid the overhead of a VM:
http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html

(You just need to install librbd-dev, then build from the latest fio git: ./configure, make.)
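For example, a minimal sketch of building it and running a 4k random-write test straight against an RBD image (the pool "rbd" and image "test" are placeholders, adjust for your cluster):
Code:
# build fio from git with the rbd engine (librbd-dev must be installed first)
git clone git://git.kernel.dk/fio.git && cd fio
./configure && make
# 60 seconds of 4k random writes against an existing RBD image, no VM in the path
./fio --name=rbd-bench --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test \
      --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based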


Just one question: I'm going to build a full SSD cluster next year, and I'm looking for good and cheap 10Gb switches.
I see that Mellanox is selling good Ethernet and InfiniBand switches.
Ceph is going to support InfiniBand/RDMA soon. Is somebody already using InfiniBand switches (IP over InfiniBand)?
How much bandwidth can you reach with IPoIB?
What about Mellanox?
 
I did a benchmark about a week ago on one of my Ceph clusters. The benchmark was done from the host to see the actual I/O of the Ceph cluster. How does this compare with others?
http://forum.proxmox.com/threads/18580-CEPH-desktop-class-HDD-benchmark

The 1Gb network is your limiting factor. 26 HDDs should perform several times better than 8 HDDs, not only 50% better. Try 10Gb.
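For a rough sense of the ceiling:
Code:
# a 1 Gbit/s link tops out around 125 MB/s raw (~110 MB/s usable after TCP overhead),
# which one or two HDDs can already saturate; 10 Gbit/s raises that to ~1.25 GB/s
echo "$((1000/8)) MB/s"    # 1 Gbit/s
echo "$((10000/8)) MB/s"   # 10 Gbit/s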


Thanks for the fio update. I was just looking for tools to benchmark different workloads (especially random and mixed I/O).

I'm already using Mellanox 10G InfiniBand and upgrading to 20G InfiniBand soon. IPoIB works without any problems (http://pve.proxmox.com/wiki/Infiniband). Unfortunately I'm unable to pass InfiniBand through to KVM (IP routing works, but with bad performance). RDMA support for Ceph would be awesome (http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger)!
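For anyone else setting this up, the gist of that wiki config in shell form (ib0 is a placeholder; the wiki shows how to make it persistent in /etc/network/interfaces):
Code:
# switch IPoIB to connected mode (datagram mode caps the MTU at 2044)
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520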


Patrick
 
Unfortunately I'm unable to pass InfiniBand through to KVM (IP routing works, but with bad performance).

Do you mean IPoIB? I don't know if you are using the settings from the wiki, with MTU 65520, but that could explain the bad performance, because of fragmentation in the routers, which cannot handle more than MTU 1500.

But good to know that IPoIB is fast :) (It's also possible to do live migration with RDMA on the latest QEMU; I would like to implement that too.)

 
Hi spirit,

Yes, I meant IPoIB. I switched to MTU 1500 and writes were reduced to 180 MB/s and reads to 225 MB/s. Even cached reads could not exceed 225 MB/s. At MTU 65520 I achieve >1 GB/s cached reads.

For KVM IP routing I route via Ethernet to the host (at MTU 1500). iperf did not pass 2 Gbit/s. Maybe larger MTUs on Ethernet would perform better, but I would have to reconfigure other network components. OpenVZ works fine with IPoIB, but also with limited performance (1.5 Gbit/s instead of 7.8 Gbit/s). I'm going to try OpenVZ on CephFS for high availability.
I'm already running CephFS with FUSE without any problems. CephFS with one active MDS should be stable.
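If someone wants to reproduce the raw link numbers, the usual iperf pair looks like this (the address is a placeholder for the IPoIB IP):
Code:
# on one host
iperf -s
# on the other host, 4 parallel TCP streams over the IPoIB address
iperf -c 10.10.10.1 -P 4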


Patrick
 

OK, thanks!
I was thinking of using the same card for VMs -> internet, but it seems that IPoIB doesn't scale well with a low MTU.
But for Ceph storage it seems to be a good solution. I'll wait a little bit longer for the RDMA implementation :)
 
Hi,

I added 9 additional disks (24x 4TB in total) and the benchmarks are exactly the same! I guess I saturated my RAID controller (HP P410), SAS expander or similar. dd on one disk gives 150 MB/s, on 8 disks in parallel only 52 MB/s each. All disks are in one 2U disk enclosure (HP MSA60) connected with 1x SAS multi-lane.
Code:
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/test bs=1M count=1k oflag=direct & dd if=/dev/zero of=/var/lib/ceph/osd/ceph-1/test bs=1M count=1k oflag=direct & dd if=/dev/zero of=/var/lib/ceph/osd/ceph-2/test bs=1M count=1k oflag=direct & dd if=.....

# of disks:    1    2    3    4    5    6    7    8
MB/s each:   150  130  124  115   74   62   56   52
MB/s total:  150  260  372  460  370  372  392  416

At 52 MB/s per disk Ceph performs as expected: 3 nodes * 8 disks * 52 MB/s / (2 replicas * 2 for journal writes) => 312 MB/s
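For comparison, the cluster-level write path can also be measured with rados bench (the pool name "test" is just a placeholder; use a throwaway pool):
Code:
# 60 s of 4 MB object writes with 16 concurrent ops (the defaults), keeping the objects
rados bench -p test 60 write --no-cleanup
# then read them back sequentially
rados bench -p test 60 seq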

Could anyone run these benchmarks? Thanks!



Regards, Patrick
 

Those results seem really poor. You should be able to reach "number of lanes" x 3 Gbit/s (375 MB/s per lane) or "number of lanes" x 6 Gbit/s (750 MB/s per lane).
Otherwise something is wrong in your RAID controller or SAS expander.
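As a rough sanity check (assuming the usual 4-lane SFF-8088 cable between controller and enclosure):
Code:
# 4 lanes x ~300 MB/s usable per 3 Gbit/s lane (after 8b/10b encoding) = ~1.2 GB/s,
# far above the ~416 MB/s the 8 parallel dds actually delivered
echo "$((4 * 300)) MB/s"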
 
Yes, with 4 lanes there should not be a problem. Even at 3 Gbit/s it should be sufficient (1.5 GB/s). I also disabled the write-back cache and tested various schedulers and settings. I'm not entirely sure it is limited by the RAID controller/SAS expander. I would really appreciate it if you could run this benchmark on your system.
Here is a one-liner which runs 1 to N dds in parallel on the disks in /var/lib/ceph/osd/ceph-*:

Code:
for N_DISKS in $(seq 1 $(ls -d /var/lib/ceph/osd/ceph-* | wc -l)); do echo "### Benchmarking $N_DISKS disks"; for DISK in $(ls -d /var/lib/ceph/osd/ceph-* | head -n $N_DISKS); do dd if=/dev/zero of=$DISK/benchmark bs=1M count=1k oflag=direct & done; wait; done; rm /var/lib/ceph/osd/ceph-*/benchmark


Thanks!
 

Sorry, I can't test here, I don't have free disks ;)

Have you tried to do a benchmark with 'fio'?
You can do sequential/random read/write benchmarks, tune the queue depth, and test multiple devices at the same time.
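For example, a fio invocation in that spirit (paths, block size and queue depth are just example values; it writes test files onto the mounted OSD filesystems, one job per disk, like the dd one-liner above):
Code:
# sequential 1M writes, queue depth 4, two OSD filesystems in parallel
# (options before the first --name are global; add one --name/--directory pair per OSD)
fio --rw=write --bs=1M --size=1G --ioengine=libaio --iodepth=4 --direct=1 \
    --name=osd-0 --directory=/var/lib/ceph/osd/ceph-0 \
    --name=osd-1 --directory=/var/lib/ceph/osd/ceph-1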
 
Thanks for the fio hint. I was able to replicate the results for sequential write with fio. With an increased concurrency level (4 threads per disk) it achieved >1 GB/s sequential write. It's strange that for one disk a single thread is sufficient, but for more than 4 disks it requires multiple threads per disk. At least I can rule out controller/expander bottlenecks.

I tried increasing various Ceph settings to achieve a higher level of concurrency, but with no useful results. Peak performance never exceeds 400 MB/s, and as soon as the journal is full, performance drops significantly. It does not even get close to the theoretical maximum for my setup.

I guess I'll wait for Ceph Giant with RDMA, a (stable) LevelDB backend and the new RHEL 3.10 kernel, and hope that performance magically increases.


~ Patrick
 
The 3.10 kernel should come really soon in Proxmox now that RHEL 7 is released.
You should really use it for Ceph; don't use 2.6.32. There are some features in that kernel which improve Ceph performance a lot.

I've been looking since rhel7 was released and there is no information (that I could find) from the OpenVZ developers on when they will start with the new kernel.

Shouldn't there already be a beta kernel or something?

Serge
 

For OpenVZ, there is no patch yet. I know the OpenVZ team has been working on it for months and was waiting for the final RHEL 7 release.
So it should be ready soon.
 
Hi,
short update.

Due to hints in this thread (and others on the web) I have now changed the Ceph config with these parameters:
Code:
[osd]
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
osd_op_threads = 4
osd_disk_threads = 4

As a hint: the last two parameters can be injected on the fly with a command like this (use the right admin socket):
Code:
## control values with

# ceph --admin-daemon /var/run/ceph/ceph-osd.28.asok config show | grep thread

## change value with

# ceph tell osd.* injectargs '--osd_op_threads 4'
# ceph tell osd.* injectargs '--osd_disk_threads 4'
Additionally, we expanded/reorganized our Ceph cluster to 5 nodes with 12x 4TB each (60 OSDs), which also improved the performance.

And I had to learn that measurements inside a VM vary a lot; you have to take 5 measurements to average the value.

The "old" Ceph nodes still have high fragmentation (up to 20%); I have to defragment those disks and will post performance values after that.
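For reference, XFS fragmentation can be checked and reduced online like this (device and mount point are examples; better to run it while the OSD is not busy):
Code:
# read-only report of the fragmentation factor
xfs_db -c frag -r /dev/sdd1
# online defragmentation of the mounted OSD filesystem, verbose, limited to 2 hours
xfs_fsr -v -t 7200 /var/lib/ceph/osd/ceph-1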

Udo
 
This will cause Ceph to mount the disks with inode64; add this to the ceph.conf global section:
Code:
osd mount options xfs = rw,noatime,inode64

With the above in my config, Ceph mounts the OSDs like this:
Code:
/dev/sdd1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime,attr2,delaylog,inode64,noquota)

I also forgot to mention that this only helps if your OSD is larger than 1TB:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F


Maybe a stupid question:

Is it possible to apply these options without rebooting?
 
Sure: /etc/init.d/ceph stop osd; umount /var/lib/ceph/osd/ceph-*; /etc/init.d/ceph start osd

"ceph start osd" will then mount the disks with the new options.
 

Hi,
stopping the OSDs may only work if noup is set?! But you can also simply do a remount:
Code:
mount -o remount,rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M /dev/sdd1 /var/lib/ceph/osd/ceph-1
Udo
 
