CEPH poor performance (4 nodes) -> config error?

yavuz

Renowned Member
Jun 22, 2014
I am in the process of testing a new cluster. For this cluster we have 4 nodes, each with the same configuration:
2 x Xeon E5620 processors
32 GB RAM
160 GB SSD for Proxmox VE
3 x 4 TB WD Black WD4003FZEX disks for CEPH
2 x Intel Gigabit NICs, one for the main IP and one for the storage network

I have created the Ceph cluster and configured all nodes as monitors. Each disk is added as an OSD, for a total of 12 OSDs. The Ceph pool has a size of 3 and pg_num is 512.
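As a rough sanity check of the PG count, here is the rule of thumb from the Ceph docs, assuming all 12 OSDs serve this one pool:

Code:
# pg_num ~= (number of OSDs * 100) / pool size
#         = (12 * 100) / 3 = 400  -> rounded up to the next power of two = 512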

I created one KVM guest for benchmarking, and I think the performance is poor:

Code:
Run status group 0 (all jobs):
  WRITE: io=5120.0MB, aggrb=15889KB/s, minb=15889KB/s, maxb=15889KB/s, mint=329962msec, maxt=329962msec

Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=47242KB/s, minb=47242KB/s, maxb=47242KB/s, mint=110977msec, maxt=110977msec


I ran the benchmarks with:
Code:
fio --max-jobs=1 --numjobs=1 --readwrite=write --blocksize=4M --size=5G --direct=1 --name=fiojob

And:


Code:
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob

In between the write and the read I ran:
Code:
echo 3 > /proc/sys/vm/drop_caches

On all hosts and the guest, as per instructions I found in other threads.
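Something like the following clears the caches on all nodes in one go (node1..node4 are placeholder hostnames for my setup):

Code:
# drop page cache, dentries and inodes on each Ceph node
for host in node1 node2 node3 node4; do
    ssh root@$host 'sync; echo 3 > /proc/sys/vm/drop_caches'
done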

After this I tried different options in ceph.conf; I added:
Code:
         osd mkfs options xfs = "-f -i size=2048"
         osd mount options xfs = "rw,noatime,logbsize=256k,logbufs=8,inode64,al$
         osd op threads = 8
         osd max backfills = 1
         osd recovery max active = 1
         filestore max sync interval = 100
         filestore min sync interval = 50
         filestore queue max ops = 10000
         filestore queue max bytes = 536870912
         filestore queue committing max ops = 2000
         filestore queue committing max bytes = 536870912

I added these options under the [osd] section.
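As far as I know, options like these can also be injected into the running OSDs for testing, without a restart (example option; some settings only take effect after an OSD restart):

Code:
# inject a single option into all running OSDs (lasts until restart)
ceph tell osd.* injectargs '--osd_op_threads 8'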

Unfortunately this only helped a little with reads:
Code:
Run status group 0 (all jobs):
  WRITE: io=5120.0MB, aggrb=15548KB/s, minb=15548KB/s, maxb=15548KB/s, mint=337206msec, maxt=337206msec

And:
Code:
Run status group 0 (all jobs):
   READ: io=5120.0MB, aggrb=51013KB/s, minb=51013KB/s, maxb=51013KB/s, mint=102775msec, maxt=102775msec


Main issue: the KVM guest feels sluggish and does not perform well at all.

Questions:
1. Is this performance expected for the current config?
2. If 1=no, what could be the problem? What can we do to get better performance other than adding more OSDs?
3. If 1=yes, what would be the expected performance if we expand this cluster to 8 nodes with the same config? Double the current performance (which would still be poor write performance!) or more?
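If it helps with answering, I can also benchmark the pool directly with rados bench, outside the VM layer (pool name is a placeholder):

Code:
# 60-second sequential write benchmark, keeping the objects for the read test
rados bench -p <poolname> 60 write --no-cleanup
# 60-second sequential read benchmark using the objects written above
rados bench -p <poolname> 60 seq
# remove the benchmark objects afterwards
rados -p <poolname> cleanup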
 

Hi,

Do you use kernel 3.10?

- Also, for writes, don't forget that every write is done twice (journal + data), so it's better to have a dedicated SSD for the journal.
- And for the network, try to use a dedicated link for cluster replication: Ceph public network (VM -> Ceph) + cluster network (Ceph -> Ceph); a minimal sketch is below.
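A minimal sketch of such a split in ceph.conf, with placeholder subnets:

Code:
[global]
         # network the clients/VMs talk to (placeholder subnet)
         public network = 10.10.10.0/24
         # dedicated network for OSD replication traffic (placeholder subnet)
         cluster network = 10.10.20.0/24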

Also, direct I/O writes to Ceph are known to be slow; it's better to use qemu writeback cache without direct I/O.
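On Proxmox VE that would mean setting cache=writeback on the VM disk, roughly like this (VM ID, storage and disk names are placeholders):

Code:
# enable writeback caching on an existing RBD-backed disk of VM 100
qm set 100 -virtio0 ceph_storage:vm-100-disk-1,cache=writeback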
 
hi,

first of all, don't expect very high speeds from a small cluster with a single-threaded benchmark.
Then you should keep in mind that with a replica count of 3 you only get 1/3 of the aggregate disk speed, and because the journal writes go to the same OSDs you lose another 1/2, so you already have only 1/6 of the theoretical throughput of the 12 disks.
And Ceph, the network, etc. will cost another few percent on top.

Also, a 1 Gbit network is not too fast...
So never expect more than 70-100 MB/s for your setup.
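A rough back-of-the-envelope calculation, assuming ~150 MB/s sequential per WD Black (an assumption, not a measured value):

Code:
# 12 disks * ~150 MB/s           ~= 1800 MB/s raw sequential throughput
# / 3 (replica count)            ~=  600 MB/s
# / 2 (journal on the same OSD)  ~=  300 MB/s
# a single 1 Gbit client link caps this at ~110 MB/s anyway,
# and real-world overhead brings it down to the 70-100 MB/s range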
What is more important are the 4k random reads/writes; that's where Ceph is good.
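For example, a small random-I/O test inside the guest could look like this (sizes, queue depth and job counts are just examples):

Code:
fio --name=randwrite --readwrite=randwrite --blocksize=4k --iodepth=32 --ioengine=libaio --direct=1 --size=1G --numjobs=4 --group_reporting
fio --name=randread --readwrite=randread --blocksize=4k --iodepth=32 --ioengine=libaio --direct=1 --size=1G --numjobs=4 --group_reporting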
 
