I am in the process of testing a new cluster. For this cluster we have 4 nodes, each with the same configuration:
2 x Xeon E5620 processors
32 GB RAM
160 GB SSD for Proxmox VE
3 x 4 TB WD Black WD4003FZEX disks for CEPH
2 x Intel Gigabit NICs, one for the main IP and one for the storage network
I have created the Ceph cluster and configured all nodes as monitors. Each disk is added as an OSD, which gives 12 OSDs in total. The Ceph pool has a size of 3 and pg_num of 512.
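For reference, a pool with these parameters can be created roughly like this (the pool name "rbd" here is just an example, replace it with the actual pool name):
Code:
# create a replicated pool with 512 placement groups (pg_num and pgp_num)
ceph osd pool create rbd 512 512
# keep 3 copies of every object
ceph osd pool set rbd size 3
# sanity check the settings
ceph osd pool get rbd pg_num
ceph osd pool get rbd size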
I created one KVM guest for benchmarking, and I think the performance is poor:
Code:
Run status group 0 (all jobs):
WRITE: io=5120.0MB, aggrb=15889KB/s, minb=15889KB/s, maxb=15889KB/s, mint=329962msec, maxt=329962msec
Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=47242KB/s, minb=47242KB/s, maxb=47242KB/s, mint=110977msec, maxt=110977msec
I ran the benchmarks with:
Code:
fio --max-jobs=1 --numjobs=1 --readwrite=write --blocksize=4M --size=5G --direct=1 --name=fiojob
And:
Code:
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob
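For comparison, it may also be worth measuring raw cluster throughput from one of the hosts with rados bench, to separate Ceph performance from the KVM/VirtIO layer; a rough sketch, again assuming the pool is called "rbd":
Code:
# 60 second write test, keep the objects so they can be read back
rados bench -p rbd 60 write --no-cleanup
# 60 second sequential read test against the objects written above
rados bench -p rbd 60 seq
# remove the benchmark objects afterwards
rados -p rbd cleanup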
In between the write and the read I ran:
Code:
echo 3 > /proc/sys/vm/drop_caches
on all hosts and in the guest, as per instructions I found in other topics.
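(For completeness: it is generally recommended to flush dirty pages with sync before dropping the caches, so the full sequence would be:)
Code:
# flush dirty pages to disk first, then drop page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches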
After this I tried different options in ceph.conf; I added the following:
Code:
osd mkfs options xfs = "-f -i size=2048"
osd mount options xfs = "rw,noatime,logbsize=256k,logbufs=8,inode64,al$
osd op threads = 8
osd max backfills = 1
osd recovery max active = 1
filestore max sync interval = 100
filestore min sync interval = 50
filestore queue max ops = 10000
filestore queue max bytes = 536870912
filestore queue committing max ops = 2000
filestore queue committing max bytes = 536870912
I added these settings under the [osd] section.
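To check whether a running OSD has actually picked up these values, the admin socket can be queried; a sketch for osd.0 (run on the node that hosts it):
Code:
# show the active configuration of osd.0 and filter for the filestore settings
ceph daemon osd.0 config show | grep filestore
# or the same via the admin socket path directly
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep filestore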
Unfortunately this only helped a little, and only with reads:
Code:
Run status group 0 (all jobs):
WRITE: io=5120.0MB, aggrb=15548KB/s, minb=15548KB/s, maxb=15548KB/s, mint=337206msec, maxt=337206msec
And:
Code:
Run status group 0 (all jobs):
READ: io=5120.0MB, aggrb=51013KB/s, minb=51013KB/s, maxb=51013KB/s, mint=102775msec, maxt=102775msec
The main issue is that the KVM guest feels sluggish and does not perform well at all.
Questions:
1. Is this performance expected for the current config?
2. If 1 = no, what could be the problem? What can we do to get better performance, apart from adding more OSDs?
3. If 1 = yes, what would be the expected performance if we expand this cluster to 8 nodes with the same config? Double the current performance (which would still be poor write performance!) or more?