Tuning performance in VM with scheduler

mir

Hi all,

Inspired by a thread about Ceph and which scheduler to use in a VM, I made a quick test using fio. The results are striking, so I hope others can reproduce them.
My storage is based on ZFS.

Test file used:
Code:
# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern
[iometer]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=80
direct=1
size=4g
ioengine=libaio
# IOMeter defines the server loads as the following:
# iodepth=1    Linear
# iodepth=4    Very Light
# iodepth=8    Light
# iodepth=64    Moderate
# iodepth=256    Heavy
iodepth=64

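If you want to reproduce this, something like the following should work inside the VM - a sketch only, assuming the job file above is saved as iometer.fio and the virtual disk is vda (adjust to your device):
Code:
# show the current and available schedulers for the virtual disk
cat /sys/block/vda/queue/scheduler
# switch to noop (takes effect immediately, does not survive a reboot)
echo noop > /sys/block/vda/queue/scheduler
# run the IOMeter-style job from above
fio iometer.fio
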
Results
Code:
                 NFS                       iSCSI
CFQ       r: 4537   w: 1130       r:  6927   w: 1733
NOOP      r: 7484   w: 1874       r: 11454   w: 2874
It seems the scheduler has a big impact on performance when dealing with file systems that have sophisticated native caching.
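If the gain holds up for you, one way to make the scheduler persistent in a Debian guest (assuming the old single-queue block layer and GRUB, which is an assumption on my part) is the elevator= kernel parameter:
Code:
# /etc/default/grub - set the default I/O scheduler for all block devices
GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=noop"

# then regenerate the GRUB config and reboot
update-grub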
 
I now have a test cluster running on the latest Proxmox nodes, which have Intel SSD drives and are connected with a 10 Gb switch. One pool is Gluster on ZFS based on two striped drives, and the second one is a Ceph pool based on two OSD drives on two servers. The VM is the latest Debian with writeback cache and raw format on Gluster, running on one of the storage nodes.

These are my results

a) Ceph
noop:     r/w 4215 / 1058
deadline: r/w 4212 / 1055
cfq:      r/w 4214 / 1052

b) Gluster
noop:     r/w 4928 / 1235
deadline: r/w 4206 / 1051
cfq:      r/w 5059 / 1262

I also ran a bonnie benchmark, and it showed almost twice the performance for Gluster. Maybe I have some misconfiguration somewhere ...
 
The cache is writeback, as I wrote, and the mount options are as follows:
/dev/disk/by-uuid/e4027256-ecf7-4257-ac5a-30c228d2f74a on / type ext4 (rw,relatime,errors=remount-ro,user_xattr,barrier=1,data=ordered)

I made no changes to the VM except the scheduler.
 
So after setting barrier=0 and rebooting just to be sure, the results on Gluster with CFQ are 4648 and 1161, so it is even lower than with barriers in my case.
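In case anyone wants to repeat that test, a sketch of how barriers can be turned off for the ext4 root (the UUID is the one from the mount output above; remount or reboot afterwards):
Code:
# /etc/fstab - add barrier=0 to the ext4 mount options for /
UUID=e4027256-ecf7-4257-ac5a-30c228d2f74a  /  ext4  errors=remount-ro,barrier=0  0  1

# or change it at runtime without a reboot
mount -o remount,barrier=0 /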
 
How do you get Giant on Proxmox?
Edit /etc/apt/sources.list.d/ceph.list:

deb http://ceph.com/debian-giant wheezy main


#apt-get update
#apt-get dist-upgrade

on each node

then,

/etc/init.d/ceph restart mon
on each monitor node


then

/etc/init.d/ceph restart osd
on each OSD node
 
Thanks spirit, much appreciated. Is it as simple as that to upgrade from Firefly to Giant?
- upgrade
- restart monitors
- restart OSDs

No other procedures/gotchas?

You mentioned "For ceph with ssd drives, you really should try giant.". Is it a similar improvement for spinners + ssd journals?

I'm only running a small setup at the moment: two OSD spinners + two SSD journal partitions on top of ZFS with a ZFS cache. Read performance is good, but write is marginal.
 
NB. Sorry, one last question - is Giant compatible with the Proxmox Ceph management tools and UI?
 
Thanks spirit, much appreciated. Is it as simple as that to upgrade from Firefly to Giant?
- upgrade
- restart monitors
- restart OSDs

No other procedures/gotchas?


No, it's really simple. Just check that the Ceph health is OK (through the GUI or with # ceph -w) between each daemon restart.
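So per node the whole sequence looks roughly like this - just a sketch of what is described above, monitors before OSDs:
Code:
# on each node, after switching ceph.list to debian-giant
apt-get update && apt-get dist-upgrade

# restart the monitor first and wait until the cluster is healthy again
/etc/init.d/ceph restart mon
ceph health

# then restart the OSDs, again checking health in between
/etc/init.d/ceph restart osd
ceph health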


You mentioned "For ceph with ssd drives, you really should try giant.". Is it a similar improvement for spinners + ssd journals?
I'm only running a small setuup at the moment, two osd spinners + two ssd journal partitions on top of ZFS with a ZFS cache. Read performance is good, but write is marginal.


The major improvement with Giant is that the OSD daemons can use more cores to scale, because before there was a big lock. So it only matters when you need a lot of IOs (i.e. with SSDs).
For writes it could maybe help too, because for me the write bottleneck is the CPU on the OSD node.


About ZFS, I'm not sure it's well tested with Ceph. Do you use ZFS only for the OSD spinner? If yes, have you disabled the ZIL on it? (There is already a journal in Ceph.)

 


The major improvement with Giant is that the OSD daemons can use more cores to scale, because before there was a big lock. So it only matters when you need a lot of IOs (i.e. with SSDs).
For writes it could maybe help too, because for me the write bottleneck is the CPU on the OSD node.


So at worst, no worse :) and possibly better. I was also wanting to experiment with cache tiering in Giant, as it seemed more mature.


About ZFS, I'm not sure it's well tested with Ceph. Do you use ZFS only for the OSD spinner?

Yes - only the spinner is managed by ZFS, as a directory mount with the ZFS ARC and an L2ARC. It helps a lot with read performance: 200 MB/s with it, 70 MB/s without.

If yes, have you disabled the ZIL on it? (There is already a journal in Ceph.)

I don't think you can disable the in-memory ZIL in ZFS, but I have disabled the SSD ZIL (SLOG). I use the SSD partition for the Ceph journal.

I had to set it all up manually using ceph-osd, but it's not difficult and a useful learning exercise.
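Roughly, the ZFS side of it looks like this - a sketch only, with example pool and device names (sdb = spinner, sdc1/sdc2 = SSD partitions), not my actual ones:
Code:
# pool on the spinner, SSD partition as L2ARC (cache) - no SLOG added
zpool create osdpool /dev/sdb
zpool add osdpool cache /dev/sdc1

# the OSD data directory lives on the pool
zfs create -o mountpoint=/var/lib/ceph/osd/ceph-0 osdpool/osd-0

# the Ceph journal goes on the second SSD partition, via /etc/ceph/ceph.conf:
#   [osd.0]
#   osd journal = /dev/sdc2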
 
I was also wanting to experiment with cache tiering in Giant, as it seemed more mature.
Yes, it has been improved in Giant, but I think we should wait for the next Hammer release for it to be really good.
Also note that Ceph tiering works with a 4 MB object size (so one small 4k read promotes the full object into the SSD tier), so the ZFS granularity is better for this.
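For reference, the basic commands to put an SSD pool in front of an existing pool as a writeback cache tier look like this (pool names are just examples, and you still need to set the cache sizing / hit-set options on the cache pool):
Code:
ceph osd tier add rbdpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay rbdpool cachepool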
 

Yes, it has been improved in Giant, but I think we should wait for the next Hammer release for it to be really good.

It will be interesting to test with.


Also note that Ceph tiering works with a 4 MB object size (so one small 4k read promotes the full object into the SSD tier), so the ZFS granularity is better for this.

Yah, I have no idea how that would work with VM images.

thanks.
 
