Ceph - Bad performance in qemu-guests

raoro

Hello everyone,

first of all I want to say thank you to each and everyone in this community!
I've been a long-time reader (and user of pve) and have gotten so much valuable information from this forum!

Right now, though, the deployment of our Ceph cluster is giving me some trouble.
We were using DRBD, but since we are expanding and there are more nodes in the pve-cluster, we decided to switch to Ceph.

The 3 Ceph server nodes are connected via a 6*GbE LACP bond with jumbo frames over two stacked switches, and the Ceph traffic is on a separate VLAN.
Currently there are 9 OSDs (3*15K SAS with BBWC per host).
The journal is 10GB per OSD, on LVM volumes of an SSD RAID1.
pg_num and pgp_num are set to 512 for the pool.
Replication is 3 and the CRUSH-Map is configured to distribute the requests over the 3 hosts.

The performance of the rados benchmarks is good:
rados -p test bench 60 write -t 8 --no-cleanup
Code:
Total time run:         60.187142
Total writes made:      1689
Write size:             4194304
Bandwidth (MB/sec):     112.250 

Stddev Bandwidth:       48.3496
Max bandwidth (MB/sec): 176
Min bandwidth (MB/sec): 0
Average Latency:        0.28505
Stddev Latency:         0.236462
Max latency:            1.91126
Min latency:            0.053685
rados -p test bench 60 seq -t 8
Code:
Total time run:        30.164931
Total reads made:      1689
Read size:             4194304
Bandwidth (MB/sec):    223.969 

Average Latency:       0.142613
Max latency:           2.78286
Min latency:           0.003772
rados -p test bench 60 rand -t 8
Code:
Total time run:        60.287489
Total reads made:      4524
Read size:             4194304
Bandwidth (MB/sec):    300.162 

Average Latency:       0.106474
Max latency:           0.768564
Min latency:           0.003791

What puzzles me are the "Min bandwidth (MB/sec): 0" and "Max latency: 1.91126" in the write benchmark.

I've modified the Linux autotuning TCP buffer limits and the rx/tx ring parameters of the network cards (all Intel), which increased the bandwidth but didn't help with the latency of small IO.
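The tuning was along these lines (example values only, adjust for your NICs and RAM):
Code:
# /etc/sysctl.d/ceph-net.conf - larger TCP autotuning buffer limits
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# larger rx/tx rings on the Intel NICs
ethtool -G eth0 rx 4096 tx 4096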

For example in a wheezy-kvm-guest:
Code:
dd if=/dev/zero of=/tmp/test bs=512 count=1000 oflag=direct,dsync
512000 bytes (512 kB) copied, 9.99445 s, 51.2 kB/s

dd if=/dev/zero of=/tmp/test bs=4k count=1000 oflag=direct,dsync
4096000 bytes (4.1 MB) copied, 10.0949 s, 406 kB/s

I also put flashcache in front of the OSDs, but that didn't help much. Since there is 1GB of cache on the RAID controller in front of the OSDs anyway, I wonder why this is so slow in the guests.
Compared to the raw performance of the SSDs and the OSDs this is really bad...
Code:
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-2/test bs=512 count=1000 oflag=direct,dsync
512000 bytes (512 kB) copied, 0.120224 s, 4.3 MB/s

dd if=/dev/zero of=/var/lib/ceph/osd/ceph-2/test bs=4k count=1000 oflag=direct,dsync
4096000 bytes (4.1 MB) copied, 0.137924 s, 29.7 MB/s


dd if=/dev/zero of=/mnt/ssd-test/test bs=512 count=1000 oflag=direct,dsync
512000 bytes (512 kB) copied, 0.147097 s, 3.5 MB/s

dd if=/dev/zero of=/mnt/ssd-test/test bs=4k count=1000 oflag=direct,dsync
4096000 bytes (4.1 MB) copied, 0.235434 s, 17.4 MB/s

Running fio directly via rbd from a node gives the expected results, but also shows some serious outliers:
Code:
rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.3-1-gaad9
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/13271KB/0KB /s] [0/3317/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=849098: Mon Mar 23 20:08:25 2015
  write: io=2048.0MB, bw=12955KB/s, iops=3238, runt=161874msec
    slat (usec): min=37, max=27268, avg=222.48, stdev=326.17
    clat (usec): min=13, max=544666, avg=7937.85, stdev=11891.77
     lat (msec): min=1, max=544, avg= 8.16, stdev=11.88

Thanks for reading so far :)
I know this is my first post, but I have really run out of options here and would really appreciate your help.

My questions are:
Why is the performance in the guests so much worse?
What can we do to enhance this for Linux as well as Windows guests?

Thanks for reading this big post. I hope we can have a good discussion with a useful outcome for everyone, since this is, from my point of view, a common issue for quite a few users.
 
Re: Ceph - Bad performance with small IO

Hi,
Latency is a problem with Ceph... but there are some things you can tune.

Which version of Ceph do you use? Since Firefly, rbd_cache is enabled by default, and it should be, because rbd_cache speeds up small IOs where possible (it coalesces small IOs into fewer, bigger ones).

Do you use a bigger read-ahead cache (4096) inside the VM? Very important!!

Are your measurements (much) better if you disable scrubbing ("ceph osd set noscrub" + "ceph osd set nodeep-scrub")? In that case, there are settings to minimize the scrubbing impact.
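For example, something like this in the [osd] section (just example values, not a recommendation for every cluster):
Code:
osd max scrubs = 1
osd scrub load threshold = 0.5
osd deep scrub interval = 1209600    # deep scrub only every 14 days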

BTW, I had bad experiences with file-based journaling on LVM! If you have Intel SSDs (DC S3700) you should try those for journaling.

Udo

EDIT: In my config I switched from XFS to ext4 on the OSDs, and latency is approx. 50% lower.
 
Re: Ceph - Bad performance with small IO

Hi Udo,

thanks for your fast reply!

I use ceph giant - version 0.87.1.
The Journals are symlinked to LVM-Volumes on Crucial M500 in RAID1.
Is this then still file-based?

The read ahead cache is already set to a higher value. If you mean:
Code:
blockdev --getra /dev/vda
32768

I tested the "noscrub" + "nodeep-scrub" - settings but performance is about the same.

The rbd_cache seems to be active:
Code:
ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show | grep rbd_cache

  "rbd_cache": "true",
  "rbd_cache_writethrough_until_flush": "true",
  "rbd_cache_size": "33554432",
  "rbd_cache_max_dirty": "25165824",
  "rbd_cache_target_dirty": "16777216",
  "rbd_cache_max_dirty_age": "1",
  "rbd_cache_max_dirty_object": "0",
  "rbd_cache_block_writes_upfront": "false",

The OSDs are formatted with XFS because I read everywhere that this is what's recommended.
Would like to go with btrfs but not until it is considered stable... :-/
 
Re: Ceph - Bad performance with small IO

Hi Udo,

thanks for your fast reply!

I use ceph giant - version 0.87.1.
ok
The Journals are symlinked to LVM-Volumes on Crucial M500 in RAID1.
Is this then still file-based?
No, in this case it's block-device (partition-based) journaling, just with the LVM layer between Ceph and the block device. Are you sure the Crucials will hold up well over a long time (you don't have TRIM)?
Can you add another LV on the SSD and test the performance?
Code:
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
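The gist of the test there: small sync writes straight to the device. Use a spare test LV for it (the write is destructive - the LV name below is just an example):
Code:
dd if=/dev/zero of=/dev/vg_ssd/journal-test bs=4k count=100000 oflag=direct,dsync

fio --filename=/dev/vg_ssd/journal-test --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=4 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test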
The read ahead cache is already set to a higher value. If you mean:
Code:
blockdev --getra /dev/vda
32768
no, I mean
Code:
echo 4096 > /sys/block/vda/queue/read_ahead_kb
Give this a try - you will see a huge difference on reads.
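To make it stick across reboots, a small udev rule inside the VM should work (a sketch - adjust the device match for your guests):
Code:
# /etc/udev/rules.d/99-readahead.rules
SUBSYSTEM=="block", KERNEL=="vd[a-z]", ATTR{queue/read_ahead_kb}="4096"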
I tested the "noscrub" + "nodeep-scrub" - settings but performance is about the same.

The rbd_cache seems to be active:
Code:
ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show | grep rbd_cache

  "rbd_cache": "true",
  "rbd_cache_writethrough_until_flush": "true",
  "rbd_cache_size": "33554432",
  "rbd_cache_max_dirty": "25165824",
  "rbd_cache_target_dirty": "16777216",
  "rbd_cache_max_dirty_age": "1",
  "rbd_cache_max_dirty_object": "0",
  "rbd_cache_block_writes_upfront": "false",
looks good.
The OSDs are formated with xfs because I read everywhere that this is recommended?
Would like to go with btrfs but not until it is considered stable... :-/
XFS is the standard, but have you checked your fragmentation (we had up to 20%)?
If you use the right settings, you shouldn't have trouble with fragmentation:
Code:
osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
filestore_xfs_extsize = true
filestore_xfs_extsize is a new parameter (it arrived after we switched to ext4 - it should work fine, but I don't have experience with it).

Udo
 
Re: Ceph - Bad performance with small IO

Hi

I use ceph giant - version 0.87.1.

Have you updated the client too? When I updated my Firefly client to the Giant client, the performance got better.
 
Re: Ceph - Bad performance with small IO

The client is also Giant and there was some improvement, but the performance of small IO is still really bad.
For example an ATTO Benchmark in a Windows guest:
ceph-bench-windows-guest.png

It seems to hit a limitation?

I've been reading Sebastian Han's blog extensively - lots of good information there!
Got lots of my Ceph knowledge and inspiration from him! :)

Looking at the numbers of the M550 I can understand your concerns, but the M500 is different.
I did that exact test before choosing them as journal SSDs. Note: numjobs=4
Code:
journal-test: (groupid=0, jobs=4): err= 0: pid=654178: Tue Mar 24 12:20:43 2015
  write: io=4419.8MB, bw=75404KB/s, iops=18850, runt= 60012msec
    clat (usec): min=68, max=138432, avg=209.26, stdev=1273.33
     lat (usec): min=68, max=138432, avg=209.59, stdev=1273.33
    clat percentiles (usec):
     |  1.00th=[   98],  5.00th=[  108], 10.00th=[  112], 20.00th=[  118],
     | 30.00th=[  124], 40.00th=[  133], 50.00th=[  141], 60.00th=[  147],
     | 70.00th=[  157], 80.00th=[  169], 90.00th=[  189], 95.00th=[  201],
     | 99.00th=[  262], 99.50th=[  588], 99.90th=[15168], 99.95th=[16320],
     | 99.99th=[59648]
    bw (KB  /s): min= 1009, max=32624, per=25.10%, avg=18922.50, stdev=9588.83
    lat (usec) : 100=1.37%, 250=97.53%, 500=0.57%, 750=0.07%, 1000=0.04%
    lat (msec) : 2=0.05%, 4=0.02%, 10=0.03%, 20=0.30%, 50=0.02%
    lat (msec) : 100=0.02%, 250=0.01%
  cpu          : usr=1.88%, sys=10.24%, ctx=2263458, majf=0, minf=109
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1131284/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4419.8MB, aggrb=75403KB/s, minb=75403KB/s, maxb=75403KB/s, mint=60012msec, maxt=60012msec

Increasing the read_ahead_cache did not really improve things, and I would like to increase the performance for non-Linux guests as well:
Code:
dd if=/dev/vda of=/dev/null bs=512 count=1000 iflag=direct
512000 bytes (512 kB) copied, 1.02483 s, 500 kB/s

echo 4096 > /sys/block/vda/queue/read_ahead_kb

dd if=/dev/vda of=/dev/null bs=512 count=1000 iflag=direct
512000 bytes (512 kB) copied, 0.950503 s, 539 kB/s

Obviously I didn't use the right settings for XFS -.-
Code:
xfs_db -c frag -r /dev/sdb1
actual 177046, ideal 127871, fragmentation factor 27.78%
This is just one OSD; the others are all around 25%.
But does this really affect performance that much?
There is plenty of space available, and writes should hit the SSD journals first and then the cache of the SAS controller.
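I'll defragment the OSDs with xfs_fsr anyway; roughly like this, one OSD at a time:
Code:
xfs_fsr -v /var/lib/ceph/osd/ceph-2     # defragments the mounted XFS filesystem
xfs_db -c frag -r /dev/sdb1             # re-check the fragmentation factor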

My current mount-options are:
Code:
osd mount options xfs = "rw,noatime,nobarrier,logbsize=256k,logbufs=8,inode64"
I'll try your proposed ones with filestore_xfs_extsize and report back.

Thanks so far for your input and suggestions udo and phildefer!
 
Re: Ceph - Bad performance with small IO

Just finished defragmenting the OSDs and remounting them with the new mount options.
Now the fragmentation factor is at maximum 0.34% over all OSDs.
Also injected the new setting filestore_xfs_extsize.
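(Injected with something like this - syntax from memory, so double-check it:)
Code:
ceph tell osd.* injectargs '--filestore_xfs_extsize=true'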

But still no improvement :-(

Just to make sure, below are the tunings I have already made to Ceph,
on top of the default settings.
Is there maybe something wrong in there?
Code:
        osd recovery max active = 1
        osd max backfills = 1
        osd_disk_threads = 4
        osd_op_threads = 4
        osd target transaction size = 50
        osd mkfs options xfs = "-f -i size=2048"
        osd mount options xfs = "rw,noatime,nobarrier,logbsize=256k,logbufs=8,delaylog,inode64"

        filestore_xfs_extsize = true
        filestore max sync interval = 30
        filestore min sync interval = 29
        filestore xattr use omap = true
        filestore flusher = false
        filestore queue max ops = 10000
        filestore queue max bytes = 536870912
        filestore queue committing max ops = 2000
        filestore queue committing max bytes = 536870912

What puzzles me are the limits the ATTO benchmark hits.
Is there some way to tune the librbd access that pve uses?
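What I had in mind is something along these lines in a [client] section of ceph.conf on the pve nodes - not sure these are the right knobs or values, though:
Code:
[client]
    rbd cache = true
    rbd cache size = 67108864               # 64 MB instead of the default 32 MB
    rbd cache max dirty = 50331648
    rbd cache writethrough until flush = true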
 
Re: Ceph - Bad performance with small IO

Hi,
"osd_disk_threads = 4" means 4 threads for housekeeping (scrubbing) - I would leave this value at 1 (I did the same a while ago).
"filestore xattr use omap = true" is AFAIK only needed with ext4 (and cephfs?!).

Are you sure that "filestore min sync interval = 29" is a good idea? I use the default min/max of 0.01/10.

Most of the settings I leave at their defaults...

Udo
 
Re: Ceph - Bad performance with small IO

I reverted the settings to the defaults and there is no big difference.
The min/max sync intervals were an experiment. I read about this on the ceph-users mailing list,
but obviously it didn't help much - it was for a much bigger cluster.

The really strange thing is that with fio benchmarks run directly with ioengine=rbd on one of the nodes I get about 1600 IOPS, while in one of the qemu guests it's just about 110.
That's more than a factor of 10!
Are we missing something here?

Guest:
Code:
fio --filename=/tmp/test --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=32 --runtime=60 --time_based --group_reporting --name=iotest

iotest: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=32
fio-2.2.3-1-gaad9
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/480KB/0KB /s] [0/120/0 iops] [eta 00m:00s]
iotest: (groupid=0, jobs=1): err= 0: pid=27389: Tue Mar 24 17:15:41 2015
  write: io=26876KB, bw=458668B/s, iops=111, runt= 60002msec
    clat (msec): min=4, max=2200, avg= 8.92, stdev=31.90
     lat (msec): min=4, max=2200, avg= 8.92, stdev=31.90
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[    6], 10.00th=[    6], 20.00th=[    7],
     | 30.00th=[    7], 40.00th=[    8], 50.00th=[    8], 60.00th=[    8],
     | 70.00th=[    8], 80.00th=[    9], 90.00th=[   10], 95.00th=[   12],
     | 99.00th=[   31], 99.50th=[   56], 99.90th=[  249], 99.95th=[  523],
     | 99.99th=[ 2212]
    bw (KB  /s): min=    1, max=  718, per=100.00%, avg=470.16, stdev=162.94
    lat (msec) : 10=92.90%, 20=5.34%, 50=1.19%, 100=0.34%, 250=0.13%
    lat (msec) : 500=0.03%, 750=0.01%, 1000=0.03%, >=2000=0.01%
  cpu          : usr=0.12%, sys=0.63%, ctx=13454, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=6719/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=26876KB, aggrb=447KB/s, minb=447KB/s, maxb=447KB/s, mint=60002msec, maxt=60002msec

Disk stats (read/write):
  vda: ios=0/20185, merge=0/6763, ticks=0/59140, in_queue=59124, util=100.00%

Host:
Code:
fio rbd.fio
rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.3-1-gaad9
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/5475KB/0KB /s] [0/1368/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=929589: Tue Mar 24 17:21:30 2015
  write: io=387752KB, bw=6437.3KB/s, iops=1609, runt= 60236msec
    slat (usec): min=39, max=12393, avg=196.80, stdev=259.87
    clat (usec): min=122, max=3107.3K, avg=18213.42, stdev=85664.74
     lat (msec): min=1, max=3107, avg=18.41, stdev=85.66
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    5], 40.00th=[    5], 50.00th=[    6], 60.00th=[    7],
     | 70.00th=[    8], 80.00th=[    9], 90.00th=[   15], 95.00th=[   42],
     | 99.00th=[  359], 99.50th=[  611], 99.90th=[ 1467], 99.95th=[ 1614],
     | 99.99th=[ 1795]
    bw (KB  /s): min=  242, max=17984, per=100.00%, avg=7601.46, stdev=5125.56
    lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.59%, 4=23.04%, 10=60.30%, 20=8.63%, 50=2.79%
    lat (msec) : 100=1.71%, 250=1.70%, 500=0.56%, 750=0.43%, 1000=0.04%
    lat (msec) : 2000=0.19%, >=2000=0.01%
  cpu          : usr=12.10%, sys=1.64%, ctx=250972, majf=0, minf=3941
  IO depths    : 1=0.1%, 2=0.4%, 4=1.6%, 8=9.3%, 16=79.3%, 32=9.3%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.9%, 8=1.6%, 16=2.5%, 32=2.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=96938/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=387752KB, aggrb=6437KB/s, minb=6437KB/s, maxb=6437KB/s, mint=60236msec, maxt=60236msec

Disk stats (read/write):
    dm-0: ios=30/1304, merge=0/0, ticks=10/74, in_queue=84, util=2.85%, aggrios=428/259124, aggrmerge=0/774, aggrticks=64/6555, aggrin_queue=6563, aggrutil=100.00%
  sdd: ios=428/259124, merge=0/774, ticks=64/6555, in_queue=6563, util=100.00%

Content of rbd.fio:
Code:
[global]
ioengine=rbd
clientname=admin
pool=test
rbdname=fio_test
invalidate=0    # mandatory
rw=randwrite
bs=4k
sync=1
runtime=60

[rbd_iodepth32]
iodepth=32
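One difference I notice between the two runs: the guest job reports ioengine=sync, so despite iodepth=32 only one IO is ever in flight there (see the "IO depths: 1=100.0%" line), while the rbd job really queues 32. To compare like with like I'll also try an async engine inside the guest, something like:
Code:
fio --filename=/tmp/test --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k \
    --numjobs=1 --iodepth=32 --runtime=60 --time_based --group_reporting --name=iotest-aio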
 
Re: Ceph - Bad performance with small IO

Here is my SSD config tuning:

Code:
[global]

         filestore_xattr_use_omap = true

         debug_lockdep = 0/0
         debug_context = 0/0
         debug_crush = 0/0
         debug_buffer = 0/0
         debug_timer = 0/0
         debug_filer = 0/0
         debug_objecter = 0/0
         debug_rados = 0/0
         debug_rbd = 0/0
         debug_journaler = 0/0
         debug_objectcatcher = 0/0
         debug_client = 0/0
         debug_osd = 0/0
         debug_optracker = 0/0
         debug_objclass = 0/0
         debug_filestore = 0/0
         debug_journal = 0/0
         debug_ms = 0/0
         debug_monc = 0/0
         debug_tp = 0/0
         debug_auth = 0/0
         debug_finisher = 0/0
         debug_heartbeatmap = 0/0
         debug_perfcounter = 0/0
         debug_asok = 0/0
         debug_throttle = 0/0
         debug_mon = 0/0
         debug_paxos = 0/0
         debug_rgw = 0/0
         osd_op_threads = 5
         osd_op_num_threads_per_shard = 1
         osd_op_num_shards = 25
         #osd_op_num_sharded_pool_threads = 25
         filestore_op_threads = 4

         ms_nocrc = true
         filestore_fd_cache_size = 64
         filestore_fd_cache_shards = 32
         cephx sign messages = false
         cephx require signatures = false

         ms_dispatch_throttle_bytes = 0
         throttler_perf_counter = false


[osd]
         osd_client_message_size_cap = 0
         osd_client_message_cap = 0
         osd_enable_op_tracker = false

Disabling debug and cephx, and tuning the sharding, really helps.
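For cephx, that means roughly these lines in the [global] section on all nodes - and note that everything (mons, osds, and the qemu guests) needs a restart afterwards:
Code:
auth cluster required = none
auth service required = none
auth client required = none
cephx sign messages = false
cephx require signatures = false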

Also, please test your SSD for journal use with O_DSYNC:
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
Consumer SSD drives are pretty shitty for this.
 
Re: Ceph - Bad performance with small IO

Here's the config of the example Linux guest:
Code:
balloon: 1024
boot: dcn
bootdisk: virtio0
cores: 4
ide2: none,media=cdrom
memory: 4096
name: dios
net0: virtio=82:65:63:AF:2E:CF,bridge=vmbr0
onboot: 1
ostype: l26
sockets: 1
tablet: 0
vga: qxl
virtio0: ceph_images:vm-100-disk-1,size=15G

And for the example Windows guest:
Code:
bootdisk: virtio0
cores: 2
ide2: none,media=cdrom
memory: 2048
name: avmc
net0: virtio=0F:0E:6E:EF:69:AD,bridge=vmbr11
ostype: win7
sockets: 1
tablet: 0
unused0: drbd-venus-kvm:vm-111-disk-1
virtio0: ceph_images:vm-111-disk-1,size=32G

Thanks for posting your config spirit. I'll have a look into that.

Is "ms_nocrc = true" safe?

The Crucial M500 actually performs quite well:
Code:
dd if=randfile of=/dev/vg_ssd/test bs=4k count=100000 oflag=direct,dsync
409600000 bytes (410 MB) copied, 10.2335 s, 40.0 MB/s
 
Re: Ceph - Bad performance with small IO

Really some good insights in this thread: http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-August/042498.html

But disabling cephx left me with an unusable cluster.
Had to revert the settings to get back to a working state.
Do I need to shutdown the cluster completely to get the settings working?

I changed the thread title to "Ceph - Bad performance in guests",
because the Ceph performance obviously isn't that bad when tested with fio,
but in the qemu guests, even with virtio, it is worse by a factor of 10?!
 
Re: Ceph - Bad performance with small IO

Code:
virtio0: ceph_images:vm-100-disk-1,size=15G

=> use "cache=writeback" for your virtual disks
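In pve you can set that in the GUI (edit the disk on the Hardware tab and pick the cache mode) or on the CLI, something like:
Code:
qm set 100 --virtio0 ceph_images:vm-100-disk-1,cache=writeback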
 
Re: Ceph - Bad performance with small IO

Yeah well, that sure increases performance.
The hosts are connected to UPSs, but how safe is it?
Does this use the rbd cache or the RAM of the host?
Probably the same thing, but just to clarify.
 
Re: Ceph - Bad performance with small IO

Thanks phildefer for the link and clarifying the rbd-cache question.

Changing the cache-setting to writeback and tuning debug and sharding really helped a lot:

ceph-bench-windows-guest-writeback.png

Thanks to everyone! :)

@spirit or maybe someone else can answer:
To further enhance performance I would also like to disable cephx as the cluster runs in a safe network, but the last time I tried to disable it, it left me with an unusable cluster.

Do I need to shutdown the ceph-cluster completely in order for this to work?
 
Re: Ceph - Bad performance with small IO

To further enhance performance I would also like to disable cephx as the cluster runs in a safe network, but the last time I tried to disable it, it left me with an unusable cluster.

Do I need to shutdown the ceph-cluster completely in order for this to work?

Yes, and restart the qemu guests too.
 
Re: Ceph - Bad performance with small IO

I was having a similar issue until recently, running a PoC on Hammer. I suspect it's the SSDs not being able to keep up with the direct I/O requirements Ceph has. I haven't been able to figure out how to disable direct I/O in Ceph, but taking the SSDs out of the equation (i.e. no SSD journals) improved things quite a bit. Still not the performance level I'm after, but at least it's a step in the right direction.

With Crucial M500 or Intel 530 OSD journals I was averaging about 3-5 MB/s.

With journals on the OSDs themselves (i.e. sda1 (10-20G) = journal; sda2 (about 4TB) = data) I'm getting 150-200MB/s, and 250MB/s during Ceph benchmarks, using 12x Seagate 4TB SATA with a 2x replica pool.

My hardware: Supermicro 24-drive chassis with crappy AMD 2346 CPUs and 16GB RAM. I'm upgrading these to X5570 CPUs and much more RAM, and I'm looking into Samsung 850 PRO or Intel S3700 SSDs for journals as well. It would be great to get some perspective on what folks out there are using.

Thanks.


