ceph performance seems very slow

blackpaw

A common post on the forums, it seems, but my case is unique! :) Probably not really ...


3 Proxmox/Ceph nodes, one of which is just a NUC used for quorum purposes, with no OSDs or VMs

The underlying filesystem is ZFS, so I'm using
journal dio = 0
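i.e. something like this in ceph.conf - the section placement is a guess on my part; ZFS has no O_DIRECT support, so the filestore journal can't use direct I/O there:

Code:
[osd]
        # ZFS lacks O_DIRECT, so direct I/O on the filestore journal has to be off
        journal dio = 0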

2 OSDs on two nodes
- 3TB Western Digital Reds
- SSD for Cache and Log

OSD nodes: 2 x 1Gb NICs in balance-rr, directly connected. iperf gives 1.8 Gb/s.
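The bond itself is a plain Linux bond; roughly this kind of /etc/network/interfaces stanza, with interface names and addresses made up for illustration:

Code:
auto bond0
iface bond0 inet static
        address 10.10.10.1
        netmask 255.255.255.0
        bond-slaves eth1 eth2
        bond-mode balance-rr
        bond-miimon 100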

Original tests were with a ZFS log and cache on SSD.

Using dd in a guest, I got seq writes of 12 MB/s.
I also tried with the Ceph journal on an SSD and journal dio on, which did improve things, with guest writes up to 32 MB/s.
Seq reads are around 80 MB/s.
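The exact dd invocation was nothing special - roughly along these lines inside the guest, with the file name, size and direct-I/O flags being illustrative rather than the exact flags I used:

Code:
# sequential write, bypassing the guest page cache
dd if=/dev/zero of=/root/ddtest bs=1M count=1024 oflag=direct
# sequential read of the same file
dd if=/root/ddtest of=/dev/null bs=1M iflag=direct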

The same tests run against glusterfs give much better results, sometimes by an order of magnitude.

Ceph Benchmarks
Code:
rados -p test bench -b 4194304 60 write -t 32 -c /etc/pve/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --no-cleanup

 Total time run:         63.303149
Total writes made:      709
Write size:             4194304
Bandwidth (MB/sec):     44.800 

Stddev Bandwidth:       28.9649
Max bandwidth (MB/sec): 96
Min bandwidth (MB/sec): 0
Average Latency:        2.83586
Stddev Latency:         2.60019
Max latency:            11.2723
Min latency:            0.499958


rados -p test bench -b 4194304 60 seq  -t 32 -c /etc/pve/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --no-cleanup
Total time run:        25.486230
Total reads made:     709
Read size:            4194304
Bandwidth (MB/sec):    111.276 

Average Latency:       1.14577
Max latency:           3.61513
Min latency:           0.126247


ZFS
- SSD LOG
- SSD Cache

-----------------------------------------------------------------------
CrystalDiskMark 3.0.3 x64 (C) 2007-2013 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :   186.231 MB/s
          Sequential Write :     7.343 MB/s
         Random Read 512KB :   157.589 MB/s
        Random Write 512KB :     8.330 MB/s
    Random Read 4KB (QD=1) :     3.934 MB/s [   960.4 IOPS]
   Random Write 4KB (QD=1) :     0.165 MB/s [    40.4 IOPS]
   Random Read 4KB (QD=32) :    23.660 MB/s [  5776.3 IOPS]
  Random Write 4KB (QD=32) :     0.328 MB/s [    80.1 IOPS]

  Test : 1000 MB [C: 38.6% (24.7/63.9 GB)] (x5)
  Date : 2014/11/26 18:46:51
    OS : Windows 7 Professional N SP1 [6.1 Build 7601] (x64)


ZFS
- SSD Cache (No LOG)
Ceph
- SSD Journal

-----------------------------------------------------------------------
CrystalDiskMark 3.0.3 x64 (C) 2007-2013 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :   198.387 MB/s
          Sequential Write :    23.643 MB/s
         Random Read 512KB :   155.883 MB/s
        Random Write 512KB :    18.940 MB/s
    Random Read 4KB (QD=1) :     3.927 MB/s [   958.7 IOPS]
   Random Write 4KB (QD=1) :     0.485 MB/s [   118.5 IOPS]
   Random Read 4KB (QD=32) :    23.482 MB/s [  5733.0 IOPS]
  Random Write 4KB (QD=32) :     2.474 MB/s [   604.0 IOPS]

  Test : 1000 MB [C: 38.8% (24.8/63.9 GB)] (x5)
  Date : 2014/11/26 22:16:06
    OS : Windows 7 Professional N SP1 [6.1 Build 7601] (x64)



Gluster Benchmarks

Code:
ZFS
- SSD LOG
- SSD Cache

-----------------------------------------------------------------------
CrystalDiskMark 3.0.3 x64 (C) 2007-2013 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :   682.756 MB/s
          Sequential Write :    45.236 MB/s
         Random Read 512KB :   555.918 MB/s
        Random Write 512KB :    44.922 MB/s
    Random Read 4KB (QD=1) :    11.900 MB/s [  2905.2 IOPS]
   Random Write 4KB (QD=1) :     1.764 MB/s [   430.6 IOPS]
   Random Read 4KB (QD=32) :    26.159 MB/s [  6386.4 IOPS]
  Random Write 4KB (QD=32) :     2.915 MB/s [   711.6 IOPS]

  Test : 1000 MB [C: 38.6% (24.7/63.9 GB)] (x5)
  Date : 2014/11/26 21:35:47
    OS : Windows 7 Professional N SP1 [6.1 Build 7601] (x64)
  
  
ZFS
- SSD Cache (No LOG)

-----------------------------------------------------------------------
CrystalDiskMark 3.0.3 x64 (C) 2007-2013 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :   729.191 MB/s
          Sequential Write :    53.499 MB/s
         Random Read 512KB :   625.833 MB/s
        Random Write 512KB :    45.738 MB/s
    Random Read 4KB (QD=1) :    12.780 MB/s [  3120.1 IOPS]
   Random Write 4KB (QD=1) :     2.667 MB/s [   651.1 IOPS]
   Random Read 4KB (QD=32) :    27.777 MB/s [  6781.4 IOPS]
  Random Write 4KB (QD=32) :     3.823 MB/s [   933.4 IOPS]

  Test : 1000 MB [C: 38.6% (24.7/63.9 GB)] (x5)
  Date : 2014/11/26 23:29:07
    OS : Windows 7 Professional N SP1 [6.1 Build 7601] (x64)


It almost seems that ceph is managing to disable the ZFS log & cache altogether.
 
Hi Spirit, thanks for the reply.

I experimented with ceph on ext4/xfs a few weeks back and got similar results. I switched to gluster on zfs for the flexibility that zfs gives: amazing pool management and a superior SSD log/cache. When I saw that ceph could be installed on it, I gave it a fling.

TBH, while the benchmarks look crap, actual VM performance seems very similar to gluster. Many of our test and VDI VMs can be run with writeback on, which covers a multitude of sins. I haven't tried it with our AD or SQL server yet.

I do wonder if the benchmarks aren't accurately reflecting usage.

I also suspect that my SSDs aren't up to scratch for journal and cache - Samsung 128GB 840 EVO.


Even with the inferior performance it is tempting - reliability is very good. It's fascinating to kill one of the OSDs while running multiple VMs, a build and a disk copy - it doesn't even hiccup, no real change in performance either. Bring it back up and it heals very quickly. Gluster would be brought to its knees by that.
 
>>I switched to gluster on zfs for the flexibility that zfs gives, amazing pool managemnent and a superior SSD log/cache
So, you have ceph log overhead + zfs log overhead ?


>>I also suspect that my SSD's aren't up to scratch for journal and cache - Samsung 128GB 840 EVO

for the journal, you need to be sure that the ssd can handle dsync writes fast

http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/


an intel dc s3500 can reach around 10000 iops, but I have tested some crucial m550s that only managed 300 iops ....
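the test in that article is basically hammering the ssd with small synchronous writes and watching the rate - roughly this, with the device name as a placeholder (it writes to the raw device, so it is destructive):

Code:
# destructive: overwrites the start of the device
dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync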
 
>>I switched to gluster on zfs for the flexibility that zfs gives, amazing pool managemnent and a superior SSD log/cache
>>So, you have ceph log overhead + zfs log overhead?


No, I disabled the ZFS log. It was getting very little use; it's possible it was actually slowing write performance.

The ZFS cache does seem to be helping though - reads are actually faster than the disk's transfer rate.


>>for the journal, you need to be sure that the ssd can handle dsync writes fast
>>http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/


Destructive tests unfortunately, so I can't really run it on the current ones :)
I'll definitely look into some new ones.


Thanks.
 
Well, I have similar results. These days I'm trying to decide between ceph and gluster. I have a three-node cluster with just two storage nodes, connected through a 10 Gbps switch. I dedicated two Intel SSDs to ceph with xfs and two SSDs to gluster on zfs on each server. I created a Debian VM with a virtio drive and writeback cache, gave it 512 MB of RAM, and ran the following tests inside it. It runs on one of the two storage nodes, as the third node does not have a 10 Gbit card yet.
First fio:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

Ceph: read 2937 write 980 IOPS
Gluster: read 1993 write 667

bonnie++ -m test-box -x 3 -s 2048 -r 512
I got results showing gluster is almost twice as fast as ceph; these are the middle of three results:
random seeks: Ceph 3371 Gluster 6391
sequential output: Ceph 253033 Gluster 424969
rewrite: Ceph: 107086 Gluster: 171343

I also tried the rados benchmark as mentioned above:
Total time run: 60.943011
Total writes made: 4090
Write size: 4194304
Bandwidth (MB/sec): 268.448

Stddev Bandwidth: 125.82
Max bandwidth (MB/sec): 420
Min bandwidth (MB/sec): 0
Average Latency: 0.475624
Stddev Latency: 0.390042
Max latency: 2.63557
Min latency: 0.097821

Can I run some other tests which would show me which is better in terms of performance? I also tried fencing and active-backup bonding and turning off switches and interfaces, and it breaks both of them.
 
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
Here are my results with 3 OSDs on small consumer SSDs (crucial m550):

mix randread/randwrite
-----------------
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

read: 4145 write: 1489

randread
-----------
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --filename=/opt/test --bs=4k --iodepth=64 --size=4G --readwrite=randread


read: 13000 iops

randwrite
-----------
fio --ioengine=libaio --direct=1 --invalidate=1 --name=test --filename=/opt/test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite

randwrite: 400 iops

(yes, this crucial disk sucks with O_DSYNC)



with an intel s3500 ssd I'm around 13000 iops read/write (qemu bottleneck).
 
Forgot to say,

I can reach around 120000 iops randread / 40000 iops randwrite 4k, with 6 intel s3500 OSDs, when benching at the host side.
The write bottleneck is OSD CPU.

(there are limits in qemu currently, which is why you can't reach more than 13000-20000 iops with 1 VM).
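if you want to reproduce the host-side numbers, fio has an rbd engine that talks to librbd directly and skips qemu - a sketch, with pool, image and client names as placeholders, assuming a fio build with rbd support:

Code:
# create a throwaway 10 GB image, then let fio drive it through librbd
rbd create bench --pool test --size 10240
fio --ioengine=rbd --clientname=admin --pool=test --rbdname=bench \
    --bs=4k --iodepth=64 --rw=randread --runtime=60 --time_based --name=rbd-bench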
 
Spirit, I can see you have much better results with just ordinary drives. Have you done any tuning on the ceph pool? Or have you compared ceph to gluster in terms of performance?
 
>>Spirit, I can see you have much better results with just ordinary drives. Have you done any tuning on the ceph pool? Or have you compared ceph to gluster in terms of performance?

My results are only with SSDs.


Here is my SSD tuning (for the giant ceph version):
Code:
        debug lockdep = 0/0
        debug context = 0/0
        debug crush = 0/0
        debug buffer = 0/0
        debug timer = 0/0
        debug journaler = 0/0
        debug osd = 0/0
        debug optracker = 0/0
        debug objclass = 0/0
        debug filestore = 0/0
        debug journal = 0/0
        debug ms = 0/0
        debug monc = 0/0
        debug tp = 0/0
        debug auth = 0/0
        debug finisher = 0/0
        debug heartbeatmap = 0/0
        debug perfcounter = 0/0
        debug asok = 0/0
        debug throttle = 0/0


        osd_op_threads = 5
        filestore_op_threads = 4




        osd_op_num_threads_per_shard = 1
        osd_op_num_shards = 25
        filestore_fd_cache_size = 64
        filestore_fd_cache_shards = 32


        ms_nocrc = true
        ms_dispatch_throttle_bytes = 0


        cephx sign messages = false
        cephx require signatures = false


[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring
         osd_client_message_size_cap = 0
         osd_client_message_cap = 0
         osd_enable_op_tracker = false
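A note on applying these: as far as I know the sharding and thread settings only take effect when the OSDs restart, but the debug levels and the op tracker can also be flipped on a running cluster, for example (assuming the admin keyring is present on the node):

Code:
ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'
ceph tell osd.* injectargs '--osd_enable_op_tracker false'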
 
Sorry, I meant ordinary SSD drives ;)
I made some of the config changes you posted and the results are slightly better from fio:
read: 3675 write: 1228

But they are almost the same for the bonnie benchmark. I just don't know which test is closer to real life for providing storage for VMs.
 
>>Sorry, I meant ordinary SSD drives ;)
>>I made some of the config changes you posted and the results are slightly better from fio:
>>read: 3675 write: 1228

>>But they are almost the same for the bonnie benchmark. I just don't know which test is closer to real life for providing storage for VMs.

Those are pretty low results for SSD drives.

What is the I/O scheduler for the disk inside your VM? I'm using the noop scheduler.
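to check and switch it inside a Linux guest (assuming a virtio disk showing up as vda):

Code:
# the scheduler shown in brackets is the active one
cat /sys/block/vda/queue/scheduler
# switch to noop for the current boot only
echo noop > /sys/block/vda/queue/scheduler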
 
That's with an SSD for the main disk, spirit? Will those tuning settings be appropriate for a spinner as well?


Nodes are also proxmox servers - would those settings interfere with VM hosting?
 
I added the following to my ceph.conf and restarted.

Code:
         debug lockdep = 0/0
         debug context = 0/0
         debug crush = 0/0
         debug buffer = 0/0
         debug timer = 0/0
         debug journaler = 0/0
         debug osd = 0/0
         debug optracker = 0/0
         debug objclass = 0/0
         debug filestore = 0/0
         debug journal = 0/0
         debug ms = 0/0
         debug monc = 0/0
         debug tp = 0/0
         debug auth = 0/0
         debug finisher = 0/0
         debug heartbeatmap = 0/0
         debug perfcounter = 0/0
         debug asok = 0/0
         debug throttle = 0/0

         ms_nocrc = true
         ms_dispatch_throttle_bytes = 0

         cephx sign messages = false
         cephx require signatures = false

It seems to have helped quite a bit:

Reads are better than the disk max (the ZFS cache, one presumes) and writes are close to the disk max.

Code:
Total time run:         62.523359
Total writes made:      815
Write size:             4194304
Bandwidth (MB/sec):     52.141

Stddev Bandwidth:       31.4692
Max bandwidth (MB/sec): 96
Min bandwidth (MB/sec): 0
Average Latency:        2.44287
Stddev Latency:         1.97723
Max latency:            13.1796
Min latency:            0.506066


Total reads made:     815
Read size:            4194304
Bandwidth (MB/sec):    155.912

Average Latency:       0.813669
Max latency:           2.26243
Min latency:           0.10141

Benchmarks inside the VM are similarly improved, though seq writes are maxing out at 23 MB/s.
 
