Ceph performance and latency

Hi Guys,

The latest fio git version can now use an RBD device directly, so you can benchmark from your Proxmox host and avoid the overhead of a VM:
http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html

(You just need to install librbd-dev, then build from the latest fio git: ./configure, make.)
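For example, a minimal sketch of building it and running a 4k random-write test straight against an RBD image (the pool "rbd" and image "test" are placeholders, adjust for your cluster):
Code:
# build fio from git with the rbd engine (librbd-dev must be installed first)
git clone git://git.kernel.dk/fio.git && cd fio
./configure && make
# 60 seconds of 4k random writes against an existing RBD image, no VM in the path
./fio --name=rbd-bench --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test \
      --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based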


Just one question: I'm going to build a full SSD cluster next year, and I'm looking for good and cheap 10Gb switches.
I see that Mellanox is selling good Ethernet and InfiniBand switches.
Ceph is going to support InfiniBand/RDMA soon. Is somebody already using InfiniBand switches (IP over InfiniBand)?
How much bandwidth can you reach with IPoIB?
What about Mellanox?
 
I did a benchmark about a week ago on one of my Ceph clusters. The benchmark was done from the host to see the actual I/O of the Ceph cluster. How does this compare with others?
http://forum.proxmox.com/threads/18580-CEPH-desktop-class-HDD-benchmark

The 1Gb network is your limiting factor. 26 HDDs should perform several times better than 8 HDDs, not only 50% better. Try 10Gb.
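For a rough sense of the ceiling:
Code:
# a 1 Gbit/s link tops out around 125 MB/s raw (~110 MB/s usable after TCP overhead),
# which one or two HDDs can already saturate; 10 Gbit/s raises that to ~1.25 GB/s
echo "$((1000/8)) MB/s"    # 1 Gbit/s
echo "$((10000/8)) MB/s"   # 10 Gbit/s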


Thanks for the fio update. I was just looking for tools to benchmark different workloads (especially random and mixed I/O).

I'm already using Mellanox 10G InfiniBand and upgrading to 20G InfiniBand soon. IPoIB works without any problems (http://pve.proxmox.com/wiki/Infiniband). Unfortunately I'm unable to pass InfiniBand through to KVM (IP routing works, but with bad performance). RDMA support for Ceph would be awesome (http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger)!
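For anyone else setting this up, the gist of that wiki config in shell form (ib0 is a placeholder; the wiki shows how to make it persistent in /etc/network/interfaces):
Code:
# switch IPoIB to connected mode (datagram mode caps the MTU at 2044)
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520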


Patrick
 
Unfortunately I'm unable to pass InfiniBand through to KVM (IP routing works, but with bad performance).

Do you mean IPoIB? I don't know if you are using the settings from the wiki, with MTU 65520, but that could explain the bad performance, because of fragmentation in the routers, which cannot handle more than MTU 1500.

But good to know that IPoIB is fast :) (It's also possible to do live migration with RDMA on the latest QEMU; I would like to implement that too.)

 
Hi spirit,

Yes, I meant IPoIB. I switched to MTU 1500 and writes were reduced to 180 MB/s and reads to 225 MB/s. Even cached reads could not exceed 225 MB/s. At MTU 65520 I achieve >1 GB/s cached reads.

For KVM IP routing I route via Ethernet to the host (at MTU 1500). iperf did not pass 2 Gbit/s. Maybe larger MTUs on Ethernet would perform better, but I would have to reconfigure other network components. OpenVZ works fine with IPoIB, but also with limited performance (1.5 Gbit/s instead of 7.8 Gbit/s). I'm going to try OpenVZ on CephFS for high availability.
I'm already running CephFS with FUSE without any problems. CephFS with one active MDS should be stable.
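If someone wants to reproduce the raw link numbers, the usual iperf pair looks like this (the address is a placeholder for the IPoIB IP):
Code:
# on one host
iperf -s
# on the other host, 4 parallel TCP streams over the IPoIB address
iperf -c 10.10.10.1 -P 4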


Patrick
 

OK, thanks!
I was thinking of using the same card for VMs -> internet, but it seems that IPoIB doesn't scale well with a low MTU.
But for Ceph storage it seems to be a good solution. I'll wait a little bit longer for the RDMA implementation :)
 
Hi,

I added 9 additional disks (24x 4TB in total) and the benchmarks are exactly the same! I guess I saturated my RAID controller (HP P410), SAS expander or similar. dd on one disk gives 150 MB/s, on 8 disks in parallel only 52 MB/s each. All disks are in one 2U disk enclosure (HP MSA60) connected with 1x SAS multi-lane.
Code:
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/test bs=1M count=1k oflag=direct & dd if=/dev/zero of=/var/lib/ceph/osd/ceph-1/test bs=1M count=1k oflag=direct & dd if=/dev/zero of=/var/lib/ceph/osd/ceph-2/test bs=1M count=1k oflag=direct & dd if=.....

# of disks:    1    2    3    4    5    6    7    8
MB/s each:   150  130  124  115   74   62   56   52
MB/s total:  150  260  372  460  370  372  392  416

At 52 MB/s per disk Ceph performs as expected: 3 nodes * 8 disks * 52 MB/s / (2 replicas * 2 for journal writes) => 312 MB/s
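For comparison, the cluster-level write path can also be measured with rados bench (the pool name "test" is just a placeholder; use a throwaway pool):
Code:
# 60 s of 4 MB object writes with 16 concurrent ops (the defaults), keeping the objects
rados bench -p test 60 write --no-cleanup
# then read them back sequentially
rados bench -p test 60 seq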

Could anyone run these benchmarks? Thanks!



Regards, Patrick
 

Those results seem really poor. You should be able to reach "number of lanes" x 3 Gbit/s (375 MB/s per lane) or "number of lanes" x 6 Gbit/s (750 MB/s per lane).
Otherwise something is wrong in your RAID controller or SAS expander.
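As a rough sanity check (assuming the usual 4-lane SFF-8088 cable between controller and enclosure):
Code:
# 4 lanes x ~300 MB/s usable per 3 Gbit/s lane (after 8b/10b encoding) = ~1.2 GB/s,
# far above the ~416 MB/s the 8 parallel dds actually delivered
echo "$((4 * 300)) MB/s"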
 
Yes, with 4 lanes there should not be a problem. Even at 3 Gbit/s it should be sufficient (1.5 GB/s). I also disabled the write-back cache and tested various schedulers and settings. I'm not entirely sure it is limited by the RAID controller/SAS expander. I would really appreciate it if you could run this benchmark on your system.
Here is a one-liner which runs 1 to N dds in parallel on the disks in /var/lib/ceph/osd/ceph-*:

Code:
for N_DISKS in $(seq 1 $(ls -d /var/lib/ceph/osd/ceph-* | wc -l)); do echo "### Benchmarking $N_DISKS disks"; for DISK in $(ls -d /var/lib/ceph/osd/ceph-* | head -n $N_DISKS); do dd if=/dev/zero of=$DISK/benchmark bs=1M count=1k oflag=direct & done; wait; done; rm /var/lib/ceph/osd/ceph-*/benchmark


Thanks!
 

Sorry, I can't test here, I don't have free disks ;)

Have you tried to do a benchmark with 'fio'?
You can do sequential/random read/write benchmarks, tune the queue depth, and test multiple devices at the same time.
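For example, a fio invocation in that spirit (paths, block size and queue depth are just example values; it writes test files onto the mounted OSD filesystems, one job per disk, like the dd one-liner above):
Code:
# sequential 1M writes, queue depth 4, two OSD filesystems in parallel
# (options before the first --name are global; add one --name/--directory pair per OSD)
fio --rw=write --bs=1M --size=1G --ioengine=libaio --iodepth=4 --direct=1 \
    --name=osd-0 --directory=/var/lib/ceph/osd/ceph-0 \
    --name=osd-1 --directory=/var/lib/ceph/osd/ceph-1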
 
Thanks for the fio hint. I was able to replicate the results for sequential write with fio. With an increased concurrency level (4 threads per disk) it achieved >1 GB/s sequential write. It's strange that for one disk a single thread is sufficient, but for more than 4 disks it requires multiple threads per disk. At least I can rule out controller/expander bottlenecks.

I tried increasing various Ceph settings to achieve a higher level of concurrency, but with no useful results. Peak performance never exceeds 400 MB/s, and as soon as the journal is full, performance drops significantly. It does not even get close to the theoretical maximum for my setup.

I guess I'll wait for Ceph Giant with RDMA, a (stable) LevelDB backend and the new RHEL 3.10 kernel, and hope that performance magically increases.


~ Patrick
 
The 3.10 kernel should come really soon in Proxmox now that RHEL 7 is released.
You should really use it for Ceph; don't use 2.6.32. There are some features in that kernel which improve Ceph performance a lot.

I've been looking since rhel7 was released and there is no information (that I could find) from the OpenVZ developers on when they will start with the new kernel.

Shouldn't there already be a beta kernel or something?

Serge
 

For OpenVZ, there is no patch yet. I know the OpenVZ team has been working on it for months and was waiting for the final RHEL 7 release.
So it should be ready soon.
 
Hi,
short update.

Due to hints in this thread (and others on the web) I have now changed the Ceph config with these parameters:
Code:
[osd]
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
osd_op_threads = 4
osd_disk_threads = 4

As a hint: the last two parameters can be injected on the fly with a command like this (use the right admin socket):
Code:
## control values with

# ceph --admin-daemon /var/run/ceph/ceph-osd.28.asok config show | grep thread

## change value with

# ceph tell osd.* injectargs '--osd_op_threads 4'
# ceph tell osd.* injectargs '--osd_disk_threads 4'
Additionally, we expanded/reorganized our Ceph cluster to 5 nodes with 12x 4TB each (60 OSDs), which also improved the performance.

And I had to learn that measurements inside a VM vary a lot; you have to take 5 measurements to average the value.

The "old" Ceph nodes still have high fragmentation (up to 20%); I have to defragment those disks and will post performance values after that.
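For reference, XFS fragmentation can be checked and reduced online like this (device and mount point are examples; better to run it while the OSD is not busy):
Code:
# read-only report of the fragmentation factor
xfs_db -c frag -r /dev/sdd1
# online defragmentation of the mounted OSD filesystem, verbose, limited to 2 hours
xfs_fsr -v -t 7200 /var/lib/ceph/osd/ceph-1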

Udo
 
This will cause Ceph to mount the disks with inode64; add this to the ceph.conf global section:
Code:
osd mount options xfs = rw,noatime,inode64

With the above in my config, Ceph mounts the OSDs like this:
Code:
/dev/sdd1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime,attr2,delaylog,inode64,noquota)

I also forgot to mention that this only helps if your OSD is larger than 1TB:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F


Maybe a stupid question:

Is it possible to apply these options without rebooting?
 
Sure: /etc/init.d/ceph stop osd; umount /var/lib/ceph/osd/ceph-*; /etc/init.d/ceph start osd

"ceph start osd" will then mount the disks with the new options.
 

Hi,
stopping the OSDs may only work if noup is set?! But you can also simply do a remount:
Code:
mount -o remount,rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M /dev/sdd1 /var/lib/ceph/osd/ceph-1
Udo
 
