Ceph performance and latency

udo
Hi,
I run a 4-node Ceph cluster with PVE and I'm not really happy with the performance yet.

I have done some benchmarks with rados bench and the same (or similar) tests with fio inside a VM.
Especially the latency gets much worse with multiple threads.

Is this only with my config, or do other people see the same effect?

Commands used for benchmarking:
Code:
rados bench -p test 60 write --no-cleanup -t N # N=1-16
# clear all buffers with "echo 3 > /proc/sys/vm/drop_caches" on the pve-node, and all ceph-nodes
rados bench -p test 60 seq --no-cleanup -t N # N=1-16

# inside the VM, writing with
fio --max-jobs=1 --numjobs=1 --readwrite=write --blocksize=4M --size=5G --direct=1 --name=fiojob
# also clearing all buffers (VM, PVE, all ceph-nodes)
# reading with
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob
# up to
fio --max-jobs=16 --numjobs=16 --readwrite=write --blocksize=4M --size=512M --direct=1 --name=fiojob
The results are shown in the attached chart: ceph_performance.png

Any comments, or comparable values?

Udo
 
Hi,
I just moved the VM from Ceph to DRBD-SAS storage to compare the fio output.

It shows that the problem is not on the VM side.
OK, the caching of the RAID controller "tunes" the read values a little bit.
fio_drbd_ceph.png (attached)
Udo
 
Hi Udo,
I'd like to do the same testing, but I need more info on how you do it.

1- Do you run all the code on the host and inside the VM?

2- How did you create the .png chart?


I have a 4-node Ceph setup to test versus a DRBD setup using SAS + recent Dell hardware.

Some of the Ceph nodes use 3ware cards and the others have SATA attached to the motherboard.
 
Hi Rob,
thanks!
In the VM I run the fio jobs:
Code:
cd /mnt
fio --max-jobs=1 --numjobs=1 --readwrite=write --blocksize=4M --size=5G --direct=1 --name=fiojob
echo 3 > /proc/sys/vm/drop_caches
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob
rm fiojob.*
fio --max-jobs=2 --numjobs=2 --readwrite=write --blocksize=4M --size=5G --direct=1 --name=fiojob
echo 3 > /proc/sys/vm/drop_caches
fio --max-jobs=2 --numjobs=2 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob
rm fiojob.*
fio --max-jobs=4 --numjobs=4 --readwrite=write --blocksize=4M --size=5G --direct=1 --name=fiojob
echo 3 > /proc/sys/vm/drop_caches
fio --max-jobs=4 --numjobs=4 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob
rm fiojob.*
fio --max-jobs=8 --numjobs=8 --readwrite=write --blocksize=4M --size=2G --direct=1 --name=fiojob
echo 3 > /proc/sys/vm/drop_caches
fio --max-jobs=8 --numjobs=8 --readwrite=read --blocksize=4M --size=2G --direct=1 --name=fiojob
rm fiojob.*
fio --max-jobs=16 --numjobs=16 --readwrite=write --blocksize=4M --size=1G --direct=1 --name=fiojob
echo 3 > /proc/sys/vm/drop_caches
fio --max-jobs=16 --numjobs=16 --readwrite=read --blocksize=4M --size=1G --direct=1 --name=fiojob
On each "echo 3..." I do the same on the pve-host (and osds, if it's on ceph-storage).

The VM is a Debian guest:
Code:
boot: c
bootdisk: ide0
cores: 1
ide0: ceph_pve:vm-499-disk-1,size=12G
ide2: none,media=cdrom
memory: 8192
name: perftest
net0: virtio=32:1A:01:5D:5C:2B,bridge=vmbr99
ostype: l26
sockets: 1
virtio0: d_sas_r0:vm-499-disk-1,size=32G
The measurements are done on the 32 GB disk, formatted with ext4:
Code:
/dev/vda1 on /mnt type ext4 (rw,relatime,user_xattr,barrier=1,data=ordered)
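(For reference, the test filesystem was prepared roughly like this; a sketch, assuming the virtio disk shows up as /dev/vda inside the VM and has been partitioned first:)
Code:
mkfs.ext4 /dev/vda1
mount /dev/vda1 /mnt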
For rados bench I use an extra pool (run from the monitor node (PVE host)):
Code:
# first create test-pool
ceph osd pool create test 1700 1700

rados bench -p test 60 write --no-cleanup -t 1
# my osd-nodes are ceph-02 - ceph-05:
echo 3 > /proc/sys/vm/drop_caches; for i in 2 3 4 5; do ssh ceph-0$i "echo 3 > /proc/sys/vm/drop_caches"; done

rados bench -p test 60 seq --no-cleanup -t 1
# repeat with -t 2 up to -t 16

# remove benchmark-data from pool
rados -p test cleanup benchmark_data
The PNG is created with a LibreOffice chart: https://polarzone.de/owncloud/public.php?service=files&t=ad2306d50680f2e16bb541f020d2cfe3
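If it helps, the numbers for the spreadsheet can be pulled out of saved rados bench output roughly like this (a sketch; the exact summary labels can differ between Ceph versions):
Code:
rados bench -p test 60 write --no-cleanup -t 4 | tee bench_write_t4.log
grep -E "Bandwidth \(MB/sec\)|Average Latency" bench_write_t4.log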

Udo
 
I have a three node cluster with four 4TB disks on each node. 10G Infiniband network.
My performance is not all that great, about what one would expect from a single SATA disk.
I needed lots of expandable/redundant storage; it does not need to be fast, and CEPH is working well for that.

Using cache=writeback with ceph disks makes a huge difference on write performance (3x increase) for me.
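For reference, a sketch of how that looks in a VM config (disk and storage names here are just examples, following the config posted earlier in this thread):
Code:
# /etc/pve/qemu-server/<vmid>.conf
virtio0: ceph_pve:vm-499-disk-1,cache=writeback,size=32G
# or via the CLI
qm set <vmid> -virtio0 ceph_pve:vm-499-disk-1,cache=writeback,size=32G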

By default when making OSDs in Proxmox it formats them using xfs. I wonder if ext4 would perform better.
Maybe CEPH performs better with many more OSDs and our smaller clusters are simply too small to let it shine.

I have not tried tuning any CEPH options like:
osd disk threads
osd op threads
or mount options like inode64 for xfs (which should improve performance)

Maybe some tuning would go a long way to making performance better.
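For reference, a sketch of where such options would live in ceph.conf (values are placeholders, not tested recommendations):
Code:
[osd]
    osd disk threads = 2
    osd op threads = 4
    osd mount options xfs = rw,noatime,inode64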
 
Hi e100,
I have a three node cluster with four 4TB disks on each node. 10G Infiniband network.
My performance is not all that great, about what one would expect from a single SATA disk.
and the latency?
...
Using cache=writeback with ceph disks makes a huge difference on write performance (3x increase) for me.
For me, read speed is more important right now.
By default when making OSDs in Proxmox it formats them using xfs. I wonder if ext4 would perform better.
Maybe CEPH performs better with many more OSDs and our smaller clusters are simply too small to let it shine.
Yes - someone told me that 6+ OSD nodes perform better... but there are also people with only 3-4 nodes who get good performance (OK, with SAS disks instead of SATA and so on).
I have not tried tuning any CEPH options like:
osd disk threads
osd op threads
or mount options like inode64 for xfs (which should improve performance)
Hmm, Ceph automatically mounts them with
Code:
/dev/sdd1 on /var/lib/ceph/osd/ceph-10 type xfs (rw,noatime)
If I mount the disk first, Ceph unmounts it and mounts it again...
So it's strange that Ceph itself doesn't use the optimal values...

Udo
 
This will cause CEPH to mount the disks with inode64; put it in the ceph.conf global section:
Code:
osd mount options xfs = rw,noatime,inode64

With the above in my config, CEPH mounts the OSDs like this:
Code:
/dev/sdd1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime,attr2,delaylog,inode64,noquota)

I also forgot to mention that this only helps if your OSD is larger than 1TB:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F
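To actually pick up new mount options the OSDs have to be restarted/remounted; a rough sketch using the init scripts of that Ceph generation (restart one OSD at a time so the cluster stays healthy, osd.0 is just an example):
Code:
service ceph restart osd.0
mount | grep /var/lib/ceph/osd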
 
Regarding XFS: I have noticed a huge degradation in performance lately when using XFS on virtual disks. See below:

Storage: ZFS (RAID10)
Using fio with this test file:
Code:
# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern


[iometer]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=80
direct=1
size=4g
ioengine=libaio
# IOMeter defines the server loads as the following:
# iodepth=1 Linear
# iodepth=4 Very Light
# iodepth=8 Light
# iodepth=64 Moderate
# iodepth=256 Heavy
iodepth=64
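Assuming the job file above is saved as e.g. iometer.fio (the file name is mine), it is run from a directory on the disk under test with:
Code:
fio iometer.fio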

Disk: virtio1: omnios:vm-144-disk-2,size=20G (i.e. cache=none)

XFS:
read: iops=1469
write: iops=366

EXT4:
read: iops=9075
write: iops=2263
 
Wow!
inode64 is a huge performance boost!! (Yes, I use 4 TB disks.)
I can't compare directly yet, because some backup jobs are running and the Ceph cluster isn't as calm as it was during the first test (on the weekend)... But the read speed with one thread has now doubled! The write speed is three times faster, and the latency is roughly halved!

That doesn't look too bad.

Thanks

Udo
 
After making that change, 2 of 3 nodes did not restart. Something else may have caused the issue, but I want to give that warning until the issue is solved on our cluster.
 
So 2 of 4 nodes are up.

I set this to get /etc/pve to mount: pvecm e 2

Then I try to restart Ceph and get the following: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.0.asok': (17) File exists
Code:
....
=== mon.0 === 
Starting Ceph mon.0 on ceph4-ib...
2014-05-21 08:01:55.561777 7fea0c910780 -1 asok(0x2feae00) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.0.asok': (17) File exists
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
2014-05-21 08:01:55.561949 7fea0c910780 -1 failed to create new leveldb store
failed: 'ulimit -n 32768;  /usr/bin/ceph-mon -i 0 --pid-file /var/run/ceph/mon.0.pid -c /etc/ceph/ceph.conf --cluster ceph '
Starting ceph-create-keys on ceph4-ib...

Any clues on fixing that?
 
Hi,
it looks like the mon is already running?
Look with
Code:
netstat -an | grep 6789 | grep -i listen
Perhaps kill the old one?
You can also start the mon in the foreground to get more error messages (in this example mon b):
Code:
ceph-mon -i b -d -c /etc/ceph/ceph.conf
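For example, a quick check whether an old mon process still holds the store lock and the admin socket (a sketch):
Code:
ps aux | grep [c]eph-mon
# if a stale ceph-mon shows up, stop it (kill <pid>) before starting the mon again;
# the leftover /var/run/ceph/ceph-mon.0.asok is then recreated on the next start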
Udo
 
Hi,

I'm new to this community. I started building a Ceph/Proxmox cluster 2 months ago: 3 nodes, each with
- P410 RAID controller with writeback cache (single-disk RAID0)
- 5x 4 TB HGST NAS drives (upgrading to 12x each soon)
- 20 Gbit/s Infiniband

Currently used for 20 TB of weather archive data (large files, sequential write, random read).

I followed some threads on the ceph-users mailing list (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/002929.html) and am now using the following settings:
Code:
osd mkfs options xfs = "-f -i size=2048"
osd mount options xfs = "rw,noatime,nobarrier,logbsize=256k,logbufs=8,inode64,allocsize=4M"
osd op threads = 8
osd max backfills = 1
osd recovery max active = 1
filestore max sync interval = 100
filestore min sync interval = 50
filestore queue max ops = 2000
filestore queue max bytes = 536870912
filestore queue committing max ops = 2000
filestore queue committing max bytes = 536870912

- nobarrier should ONLY be used with a monitored BBU and no relearning cycle
- inode64 reduced write latency to almost 1/3 (HDDs at >70% utilisation)
- allocsize=4M is supposed to reduce fragmentation (currently >30%! benchmarks pending)
- the backfill and recovery limits reduce load during rebuilds
- I did not properly benchmark the filestore settings yet
- I also use the noop scheduler, nr_requests=1024, read_ahead_kb=1024 (see the sysfs sketch below)
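For reference, the scheduler/queue settings from the last point are applied per OSD disk via sysfs; a sketch (/dev/sdb is a placeholder, and the values do not survive a reboot without e.g. a udev rule or an rc.local entry):
Code:
echo noop > /sys/block/sdb/queue/scheduler
echo 1024 > /sys/block/sdb/queue/nr_requests
echo 1024 > /sys/block/sdb/queue/read_ahead_kb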

Current rados write benchmarks
Concurrency: 1 | 2 | 4 | 8 | 16 | 32 | 64
Avg write latency (ms): 25 | 37 | 65 | 129 | 246 | 463 | 999
Avg write bandwidth (MB/s): 168 | 215 | 244 | 246 | 259 | 276 | 255

I will continue to test different settings and hopefully improve performance. More HDDs will arrive soon.


Regards Patrick
 
Hi Patrick,
and your read benchmark (with cleared buffers)? Because of your BBU you also measure the cache when writing - OK, that's also the case in real life.

I will try your mount options (except nobarrier).

Udo
 
Hi,

Same config, read (with cleared caches) and write. Replication factor: 1/2. Ceph usage: 20384 GB data, 40815 GB used, 14972 GB / 55787 GB avail

Concurrency: 1 | 2 | 4 | 8 | 16 | 32 | 64
Avg write latency (ms): 25 | 37 | 65 | 129 | 246 | 463 | 999
Avg write bandwidth (MB/s): 168 | 215 | 244 | 246 | 259 | 276 | 255
Avg read latency (ms): 78 | 79 | 93 | 117 | 185 | 307 | 600
Avg read bandwidth (MB/s): 51 | 101 | 172 | 227 | 346 | 416 | 423
Avg read latency, cached (ms): - | - | - | - | - | 93 | 186
Avg read bandwidth, cached (MB/s): - | - | - | - | - | 1369 | 1366

Cached reads saturate 10 Gbit/s Infiniband (I need to configure 20 Gbit/s). In my opinion the read/write bandwidth is quite poor. 15 disks with replication factor 2 and journal "should" write >450 MB/s (15 * 120 MB/s / 2 / 2). I will run some low-level benchmarks on the filesystem to identify potential bottlenecks.


Patrick
 
I reverted my settings and benchmarked every step.

Concurrency level 32. Ceph version 0.80.1. 15x 4 TB HGST NAS drives, 3 nodes, HP P410 with 512 MB + BBU (single-drive RAID0), 10 Gbit/s Infiniband.

Step | Write MB/s | Write lat (ms) | Read MB/s | Read lat (ms)
default* | 224 | 570 | 193 | 662
inode64 | 214 | 595 | 173 | 735
noop scheduler | 220 | 582 | 182 | 696
read_ahead_kb=1024 | 219 | 580 | 369 | 346
read_ahead_kb=4096 | 200 | 638 | 439 | 289
1 op thread | 209 | 610 | 475 | 268
8 op threads | 207 | 616 | 480 | 265
filestore settings | 209 | 610 | 454 | 279
logbsize=256k,logbufs=8 | 243 | 519 | 450 | 282
allocsize=4M | 246 | 519 | 432 | 295
nobarrier | 263 | 486 | 513 | 247

Some observations:
- (*) The baseline (default) benchmark might be wrong; I was using inode64 previously.
- The first few benchmark seconds are about twice as fast as the average, most likely until my BBU caches are full.
- I reran some tests and they tend to vary by up to 10%.

My conclusions:
- inode64 should be enabled for >1TB drives
- read_ahead_kb can be used to increase sequential read, but will reduce random read IO
- higher logbsize and logbufs perform better (a combined mount-option example follows below)
- allocsize=4M does not impact performance and might be a good idea to prevent fragmentation (object size will never exceed 4 MB, and xfs' dynamic allocsize might allocate larger chunks, resulting in fragmentation)
- nobarrier should be more useful for random write IO. Only disable barriers with a monitored BBU and no relearn cycle!
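Based on this table, the helpful XFS options could be combined into one line like the sketch below (nobarrier deliberately left out; only add it with a monitored BBU and no relearn cycle):
Code:
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,logbufs=8,allocsize=4M"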

Next steps:
- Get more disks and compare results
- Random IO benchmarks
- Identify bottlenecks

Performance is still far from perfect.


Patrick
 
Hello Patrick, welcome to the community!

I am using the exact same disks in our first CEPH cluster. We have 12, no RAID card, just a plain SATA connection.
This is set up on our "test" cluster, which we run in our office. It operates various virtual servers for internal office use, like our Samba servers with many TB of files.

Performance is OK but not great.

Friday my assistant and I were discussing how we could put CEPH on our production cluster.
We are thinking of breaking one DRBD volume on six servers, pulling the disks, inserting 34 HGST 4TB NAS disks, and building a CEPH cluster.
All the servers have Areca RAID card with 2G or 4G battery backed cache.
Based on the information you have provided it looks like we would have pretty good performance.

Wonder if I could convince the boss to let me replace 100+ disks and switch to CEPH.....
The new Areca 1883 cards that support 8G cache would also be great.
 
