Ceph performance and latency

udo
Hi,
I run a 4-node Ceph cluster with PVE and I'm not really happy with the performance yet.

I have done some benchmarks with rados bench and the same (or similar) tests with fio inside a VM.
Especially the latency gets much worse with multiple threads.

Is this only with my config, or do other people see the same effect?

Commands used for benchmarking:
Code:
rados bench -p test 60 write --no-cleanup -t N # N=1-16
# clear all buffers with "echo 3 > /proc/sys/vm/drop_caches" on the pve-node, and all ceph-nodes
rados bench -p test 60 seq --no-cleanup -t N # N=1-16

# inside the VM, writing with
fio --max-jobs=1 --numjobs=1 --readwrite=write --blocksize=4M --size=5G --direct=1 --name=fiojob
# also clearing all buffers (VM, PVE, all ceph-nodes)
# reading with
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob
# up to
fio --max-jobs=16 --numjobs=16 --readwrite=write --blocksize=4M --size=512M --direct=1 --name=fiojob
The results are shown in the attached chart: ceph_performance.png

Any comments, or comparable values?

Udo
 
Hi,
I just moved the VM from Ceph to DRBD-SAS storage to compare the fio output.

It shows that the problem is not on the VM side.
OK, the caching of the RAID controller "tunes" the read values a little bit.
fio_drbd_ceph.png (attached)
Udo
 
Hi Udo,
I'd like to do the same testing, but I need more info on how you do it.

1- Do you run all the code on the host and inside the VM?

2- How did you create the .png chart?


I have a 4-node Ceph setup to test versus a DRBD setup using SAS + recent Dell hardware.

Some of the Ceph nodes use 3ware cards and the others have SATA attached to the motherboard.
 
Hi Rob,
thanks!
In the VM I run the fio jobs:
Code:
cd /mnt
fio --max-jobs=1 --numjobs=1 --readwrite=write --blocksize=4M --size=5G --direct=1 --name=fiojob
echo 3 > /proc/sys/vm/drop_caches
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob
rm fiojob.*
fio --max-jobs=2 --numjobs=2 --readwrite=write --blocksize=4M --size=5G --direct=1 --name=fiojob
echo 3 > /proc/sys/vm/drop_caches
fio --max-jobs=2 --numjobs=2 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob
rm fiojob.*
fio --max-jobs=4 --numjobs=4 --readwrite=write --blocksize=4M --size=5G --direct=1 --name=fiojob
echo 3 > /proc/sys/vm/drop_caches
fio --max-jobs=4 --numjobs=4 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob
rm fiojob.*
fio --max-jobs=8 --numjobs=8 --readwrite=write --blocksize=4M --size=2G --direct=1 --name=fiojob
echo 3 > /proc/sys/vm/drop_caches
fio --max-jobs=8 --numjobs=8 --readwrite=read --blocksize=4M --size=2G --direct=1 --name=fiojob
rm fiojob.*
fio --max-jobs=16 --numjobs=16 --readwrite=write --blocksize=4M --size=1G --direct=1 --name=fiojob
echo 3 > /proc/sys/vm/drop_caches
fio --max-jobs=16 --numjobs=16 --readwrite=read --blocksize=4M --size=1G --direct=1 --name=fiojob
On each "echo 3..." I do the same on the pve-host (and osds, if it's on ceph-storage).

The VM is a Debian guest:
Code:
boot: c
bootdisk: ide0
cores: 1
ide0: ceph_pve:vm-499-disk-1,size=12G
ide2: none,media=cdrom
memory: 8192
name: perftest
net0: virtio=32:1A:01:5D:5C:2B,bridge=vmbr99
ostype: l26
sockets: 1
virtio0: d_sas_r0:vm-499-disk-1,size=32G
The measurements are done on the 32 GB disk, formatted with ext4:
Code:
/dev/vda1 on /mnt type ext4 (rw,relatime,user_xattr,barrier=1,data=ordered)
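(For reference, the test filesystem was prepared roughly like this; a sketch, assuming the virtio disk shows up as /dev/vda inside the VM and has been partitioned first:)
Code:
mkfs.ext4 /dev/vda1
mount /dev/vda1 /mnt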
For rados bench I use an extra pool (run from the monitor node (PVE host)):
Code:
# first create test-pool
ceph osd pool create test 1700 1700

rados bench -p test 60 write --no-cleanup -t 1
# my osd-nodes are ceph-02 - ceph-05:
echo 3 > /proc/sys/vm/drop_caches; for i in 2 3 4 5; do ssh ceph-0$i "echo 3 > /proc/sys/vm/drop_caches"; done

rados bench -p test 60 seq --no-cleanup -t 1
# repeat with -t 2 up to -t 16

# remove benchmark-data from pool
rados -p test cleanup benchmark_data
The PNG is created with a LibreOffice chart: https://polarzone.de/owncloud/public.php?service=files&t=ad2306d50680f2e16bb541f020d2cfe3
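If it helps, the numbers for the spreadsheet can be pulled out of saved rados bench output roughly like this (a sketch; the exact summary labels can differ between Ceph versions):
Code:
rados bench -p test 60 write --no-cleanup -t 4 | tee bench_write_t4.log
grep -E "Bandwidth \(MB/sec\)|Average Latency" bench_write_t4.log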

Udo
 
I have a three node cluster with four 4TB disks on each node. 10G Infiniband network.
My performance is not all that great, about what one would expect from a single SATA disk.
I needed lots of expandable/redundant storage; it does not need to be fast, and CEPH is working well for that.

Using cache=writeback with ceph disks makes a huge difference on write performance (3x increase) for me.
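For reference, a sketch of how that looks in a VM config (disk and storage names here are just examples, following the config posted earlier in this thread):
Code:
# /etc/pve/qemu-server/<vmid>.conf
virtio0: ceph_pve:vm-499-disk-1,cache=writeback,size=32G
# or via the CLI
qm set <vmid> -virtio0 ceph_pve:vm-499-disk-1,cache=writeback,size=32G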

By default when making OSDs in Proxmox it formats them using xfs. I wonder if ext4 would perform better.
Maybe CEPH performs better with many more OSDs and our smaller clusters are simply too small to let it shine.

I have not tried tuning any CEPH options like:
osd disk threads
osd op threads
or mount options like inode64 for xfs (which should improve performance)

Maybe some tuning would go a long way to making performance better.
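For reference, a sketch of where such options would live in ceph.conf (values are placeholders, not tested recommendations):
Code:
[osd]
    osd disk threads = 2
    osd op threads = 4
    osd mount options xfs = rw,noatime,inode64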
 
Hi e100,
I have a three node cluster with four 4TB disks on each node. 10G Infiniband network.
My performance is not all that great, about what one would expect from a single SATA disk.
and the latency?
...
Using cache=writeback with ceph disks makes a huge difference on write performance (3x increase) for me.
For me, read speed is more important right now.
By default when making OSDs in Proxmox it formats them using xfs. I wonder if ext4 would perform better.
Maybe CEPH performs better with many more OSDs and our smaller clusters are simply too small to let it shine.
Yes - someone told me that 6+ OSD nodes perform better... but there are also people with only 3-4 nodes who get good performance (OK, with SAS disks instead of SATA and so on).
I have not tried tuning any CEPH options like:
osd disk threads
osd op threads
or mount options like inode64 for xfs (which should improve performance)
Hmm, Ceph automatically mounts them with
Code:
/dev/sdd1 on /var/lib/ceph/osd/ceph-10 type xfs (rw,noatime)
If I mount the disk first, Ceph unmounts it and mounts it again...
So it's strange that Ceph itself doesn't use the optimal values...

Udo
 
This will cause CEPH to mount the disks with inode64; put it in the ceph.conf global section:
Code:
osd mount options xfs = rw,noatime,inode64

With the above in my config, CEPH mounts the OSDs like this:
Code:
/dev/sdd1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime,attr2,delaylog,inode64,noquota)

I also forgot to mention that this only helps if your OSD is larger than 1TB:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F
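To actually pick up new mount options the OSDs have to be restarted/remounted; a rough sketch using the init scripts of that Ceph generation (restart one OSD at a time so the cluster stays healthy, osd.0 is just an example):
Code:
service ceph restart osd.0
mount | grep /var/lib/ceph/osd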
 
Regarding XFS: I have noticed a huge degradation in performance lately when using XFS on virtual disks. See below:

Storage: ZFS (RAID10)
Using fio with this test file:
Code:
# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern


[iometer]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=80
direct=1
size=4g
ioengine=libaio
# IOMeter defines the server loads as the following:
# iodepth=1 Linear
# iodepth=4 Very Light
# iodepth=8 Light
# iodepth=64 Moderate
# iodepth=256 Heavy
iodepth=64
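Assuming the job file above is saved as e.g. iometer.fio (the file name is mine), it is run from a directory on the disk under test with:
Code:
fio iometer.fio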

Disk: virtio1: omnios:vm-144-disk-2,size=20G (i.e. cache=none)

XFS:
read: iops=1469
write: iops=366

EXT4:
read: iops=9075
write: iops=2263
 
Wow!
inode64 is a huge performance boost!! (Yes, I use 4 TB disks.)
I can't compare directly yet, because some backup jobs are running and the Ceph cluster isn't as calm as it was during the first test (on the weekend)... But the read speed with one thread has now doubled! The write speed is three times faster, and the latency is roughly halved!

That doesn't look too bad.

Thanks

Udo
 
After making that change, 2 of 3 nodes did not restart. Something else may have caused the issue, but I want to give that warning until the issue is solved on our cluster.
 
So 2 of 4 nodes are up.

I set this to get /etc/pve to mount: pvecm e 2

Then I try to restart Ceph and get the following: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.0.asok': (17) File exists
Code:
....
=== mon.0 === 
Starting Ceph mon.0 on ceph4-ib...
2014-05-21 08:01:55.561777 7fea0c910780 -1 asok(0x2feae00) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-mon.0.asok': (17) File exists
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
IO error: lock /var/lib/ceph/mon/ceph-0/store.db/LOCK: Resource temporarily unavailable
2014-05-21 08:01:55.561949 7fea0c910780 -1 failed to create new leveldb store
failed: 'ulimit -n 32768;  /usr/bin/ceph-mon -i 0 --pid-file /var/run/ceph/mon.0.pid -c /etc/ceph/ceph.conf --cluster ceph '
Starting ceph-create-keys on ceph4-ib...

Any clues on fixing that?
 
Hi,
it looks like the mon is already running?
Look with
Code:
netstat -an | grep 6789 | grep -i listen
Perhaps kill the old one?
You can also start the mon in the foreground to get more error messages (in this example mon b):
Code:
ceph-mon -i b -d -c /etc/ceph/ceph.conf
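For example, a quick check whether an old mon process still holds the store lock and the admin socket (a sketch):
Code:
ps aux | grep [c]eph-mon
# if a stale ceph-mon shows up, stop it (kill <pid>) before starting the mon again;
# the leftover /var/run/ceph/ceph-mon.0.asok is then recreated on the next start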
Udo
 
Hi,

I'm new to this community. I started building a Ceph/Proxmox cluster 2 months ago: 3 nodes, each with
- P410 RAID controller with writeback cache (single-disk RAID0)
- 5x 4 TB HGST NAS drives (upgrading to 12x each soon)
- 20 Gbit/s Infiniband

Currently used for 20 TB of weather archive data (large files, sequential write, random read).

I followed some threads on the ceph-users mailing list (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/002929.html) and am now using the following settings:
Code:
osd mkfs options xfs = "-f -i size=2048"
osd mount options xfs = "rw,noatime,nobarrier,logbsize=256k,logbufs=8,inode64,allocsize=4M"
osd op threads = 8
osd max backfills = 1
osd recovery max active = 1
filestore max sync interval = 100
filestore min sync interval = 50
filestore queue max ops = 2000
filestore queue max bytes = 536870912
filestore queue committing max ops = 2000
filestore queue committing max bytes = 536870912

- nobarrier should ONLY be used with a monitored BBU and no relearning cycle
- inode64 reduced write latency to almost 1/3 (HDDs at >70% utilisation)
- allocsize=4M is supposed to reduce fragmentation (currently >30%! benchmarks pending)
- the backfill and recovery limits reduce load during rebuilds
- I did not properly benchmark the filestore settings yet
- I also use the noop scheduler, nr_requests=1024, read_ahead_kb=1024 (see the sysfs sketch below)
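For reference, the scheduler/queue settings from the last point are applied per OSD disk via sysfs; a sketch (/dev/sdb is a placeholder, and the values do not survive a reboot without e.g. a udev rule or an rc.local entry):
Code:
echo noop > /sys/block/sdb/queue/scheduler
echo 1024 > /sys/block/sdb/queue/nr_requests
echo 1024 > /sys/block/sdb/queue/read_ahead_kb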

Current rados write benchmarks
Concurrency: 1 | 2 | 4 | 8 | 16 | 32 | 64
Avg write latency (ms): 25 | 37 | 65 | 129 | 246 | 463 | 999
Avg write bandwidth (MB/s): 168 | 215 | 244 | 246 | 259 | 276 | 255

I will continue to test different settings and hopefully improve performance. More HDDs will arrive soon.


Regards Patrick
 
Hi Patrick,
and your read benchmark (with cleared buffers)? Because of your BBU you also measure the cache when writing - OK, that's also the case in real life.

I will try your mount options (except nobarrier).

Udo
 
Hi,

Same config, read (with cleared caches) and write. Replication factor: 1/2. Ceph usage: 20384 GB data, 40815 GB used, 14972 GB / 55787 GB avail

Concurrency: 1 | 2 | 4 | 8 | 16 | 32 | 64
Avg write latency (ms): 25 | 37 | 65 | 129 | 246 | 463 | 999
Avg write bandwidth (MB/s): 168 | 215 | 244 | 246 | 259 | 276 | 255
Avg read latency (ms): 78 | 79 | 93 | 117 | 185 | 307 | 600
Avg read bandwidth (MB/s): 51 | 101 | 172 | 227 | 346 | 416 | 423
Avg read latency, cached (ms): - | - | - | - | - | 93 | 186
Avg read bandwidth, cached (MB/s): - | - | - | - | - | 1369 | 1366

Cached reads saturate 10 Gbit/s Infiniband (I need to configure 20 Gbit/s). In my opinion the read/write bandwidth is quite poor. 15 disks with replication factor 2 and journal "should" write >450 MB/s (15 * 120 MB/s / 2 / 2). I will run some low-level benchmarks on the filesystem to identify potential bottlenecks.


Patrick
 
I reverted my settings and benchmarked every step.

Concurrency level 32. Ceph version 0.80.1. 15x 4 TB HGST NAS drives, 3 nodes, HP P410 with 512 MB + BBU (single-drive RAID0), 10 Gbit/s Infiniband.

Step | Write MB/s | Write lat (ms) | Read MB/s | Read lat (ms)
default* | 224 | 570 | 193 | 662
inode64 | 214 | 595 | 173 | 735
noop scheduler | 220 | 582 | 182 | 696
read_ahead_kb=1024 | 219 | 580 | 369 | 346
read_ahead_kb=4096 | 200 | 638 | 439 | 289
1 op thread | 209 | 610 | 475 | 268
8 op threads | 207 | 616 | 480 | 265
filestore settings | 209 | 610 | 454 | 279
logbsize=256k,logbufs=8 | 243 | 519 | 450 | 282
allocsize=4M | 246 | 519 | 432 | 295
nobarrier | 263 | 486 | 513 | 247

Some observations:
- (*) The baseline (default) benchmark might be wrong; I was using inode64 previously.
- The first few benchmark seconds are about twice as fast as the average, most likely until my BBU caches are full.
- I reran some tests and they tend to vary by up to 10%.

My conclusions:
- inode64 should be enabled for >1TB drives
- read_ahead_kb can be used to increase sequential read, but will reduce random read IO
- higher logbsize and logbufs perform better (a combined mount-option example follows below)
- allocsize=4M does not impact performance and might be a good idea to prevent fragmentation (object size will never exceed 4 MB, and xfs' dynamic allocsize might allocate larger chunks, resulting in fragmentation)
- nobarrier should be more useful for random write IO. Only disable barriers with a monitored BBU and no relearn cycle!
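Based on this table, the helpful XFS options could be combined into one line like the sketch below (nobarrier deliberately left out; only add it with a monitored BBU and no relearn cycle):
Code:
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,logbufs=8,allocsize=4M"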

Next steps:
- Get more disks and compare results
- Random IO benchmarks
- Identify bottlenecks

Performance is still far from perfect.


Patrick
 
Hello Patrick, welcome to the community!

I am using the exact same disks in our first CEPH cluster. We have 12, no RAID card, just a plain SATA connection.
This is set up on our "test" cluster, which we run in our office. It operates various virtual servers for internal office use, like our Samba servers with many TB of files.

Performance is OK but not great.

Friday my assistant and I were discussing how we could put CEPH on our production cluster.
We are thinking of breaking one DRBD volume on six servers, pulling the disks, inserting 34 HGST 4TB NAS disks, and building a CEPH cluster.
All the servers have Areca RAID card with 2G or 4G battery backed cache.
Based on the information you have provided it looks like we would have pretty good performance.

Wonder if I could convince the boss to let me replace 100+ disks and switch to CEPH.....
The new Areca 1883 cards that support 8G cache would also be great.
 
