Is an SSD cache worthwhile for VM disk performance?

Using bcache to create an SSD cache for a single desktop PC is certainly worthwhile.

But I'm wondering if the same applies for VM hosting, which I imagine has rather different IO patterns.


I'm testing a two node/brick glusterfs replicate setup, with the Proxmox nodes also being gluster nodes. Performance is OK, but to my surprise the hard disk read/write performance is holding things up, not the network.

I have a 60GB SSD partition spare on both nodes, and we'd be looking at 7 VMs on each node: one node dedicated to server VMs (AD, SQL Server, Terminal Server), the other node to Windows developer VMs. So a lot of random reads/writes within large files on both nodes, but once started, not a lot of large sequential reads/writes. No real directory

Now that I write it out, it seems a good candidate for caching. I did play with dm-cache, which had good results until I managed to destroy the filesystem. dm-cache is a fiddly pain in the ass to manage - no simple flush command! Writeback was the best, but is dangerous; writethrough gave excellent read results but actually reduced write performance.

bcache is definitely much simpler and more robust in that regard, but I'd have to pull in the bcache tools from an external repo or build them, which I don't like to do - keep our servers pristine!
 
I'm not sure you could build bcache for a 2.6-based kernel. Your easiest options are flashcache (which requires remounting your filesystems through the cache) or EnhanceIO, which can be enabled with filesystems online:
https://github.com/stec-inc/EnhanceIO
https://wiki.archlinux.org/index.php/EnhanceIO

This is a benchmark between some caching solutions:
http://lkml.iu.edu//hypermail/linux/kernel/1306.1/01246.html

Another option is to mount ext4 with an external journal on a small SSD, which would provide great write performance, although I'm not sure about reads:
http://www.raid6.com.au/posts/fs_ext4_external_journal/
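Roughly along these lines (a sketch - /dev/sdb1 as the SSD journal partition and /dev/sda3 as the data partition are just placeholder devices; the journal device must use the same block size as the filesystem):
Code:
# create an external journal device on the SSD partition (block size must match the data fs)
mke2fs -O journal_dev -b 4096 /dev/sdb1
# create the ext4 filesystem with its journal on the SSD
mkfs.ext4 -b 4096 -J device=/dev/sdb1 /dev/sda3
# or move the journal of an existing (unmounted) ext4 filesystem:
# tune2fs -O ^has_journal /dev/sda3 && tune2fs -j -J device=/dev/sdb1 /dev/sda3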

 
I should have mentioned I'm using Kernel 3.10.

Thanks, the benchmark comparison is interesting reading. Normally I'd be looking at a writethrough or passthrough cache, hence EnhanceIO would be the better choice - but perhaps writeback is OK when the dirty data is stored on an SSD for replay even after crashes?

EnhanceIO is a fork of flashcache. Not sure if it's still being actively developed?

One of the things I like about bcache is that it ignores large sequential reads and writes, such as backups, disk clones etc.
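For reference, that bypass threshold is a sysfs tunable once a bcache device exists - a sketch, assuming the device shows up as bcache0:
Code:
# requests larger than sequential_cutoff bypass the SSD and go straight to the backing disk
cat /sys/block/bcache0/bcache/sequential_cutoff    # defaults to 4.0M
# raise or lower the threshold, e.g. to 1 MB:
echo 1M > /sys/block/bcache0/bcache/sequential_cutoff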

I have tried ext4 with an external SSD journal - it does speed up small (<1GB) writes, but overall performance is not hugely improved.

Thanks.
 

Haven't tried EnhanceIO yet, but some people have experienced data loss with its writeback mode after unscheduled reboots/crashes. Its writethrough mode seems very safe though, and installation should be easy. (Maybe it could be combined with an SSD-based journal for write caching? - just an idea.)

It's not developed under this name anymore; STEC Inc. has been sold to HGST and they turned it into HGST ServerCache, if I understand correctly.
http://www.hgst.com/software/HGST-server-cache

If you try it with Proxmox, please share your experience (installation, general performance, benchmarks if possible).
 
I gave EnhanceIO a spin today.

Installation - trivially easy. Install build-essential, dkms, git and the kernel headers. Instructions from the source repo:

https://github.com/stec-inc/EnhanceIO/blob/master/Install.txt

Setup is by far easier and less error-prone than with dm-cache. I love that it sets up udev rules so the cache comes back automatically on reboot - with dm-cache and flashcache you have to write your own init scripts. Being able to cache existing partitions on the fly with no prep is *extremely* useful. Overall it feels much safer than fiddling with dm-cache, and status and stats are easy to extract.
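For the record, roughly the commands involved (a sketch - the by-id paths and the cache name vmcache are placeholders, and the flags are as I understand the eio_cli tool):
Code:
# create a writethrough (wt) cache in front of an existing, mounted partition
eio_cli create -d /dev/disk/by-id/ata-HDD_EXAMPLE-part1 \
               -s /dev/disk/by-id/ata-SSD_EXAMPLE-part1 \
               -p lru -m wt -c vmcache
# check status and hit/miss statistics
eio_cli info
cat /proc/enhanceio/vmcache/stats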

Performance - tricky to measure, and I'm using it to speed up gluster disks, which complicates things. If you have suggestions for benchmarks I'd appreciate them.

I ran tests several times so as to populate the cache. Initial reads were limited by SATA disk I/O (150 MB/s), but subsequent reads would be up to 400 MB/s with a simple dd read (along the lines of the sketch below).
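Something like this, with the cached partition as a placeholder /dev/sdb1; the page cache is dropped first so the numbers reflect the disk/SSD cache rather than RAM:
Code:
# flush dirty pages and drop the Linux page cache so reads actually hit the device
sync && echo 3 > /proc/sys/vm/drop_caches
# sequentially read 4 GB from the cached partition and report throughput
dd if=/dev/sdb1 of=/dev/null bs=1M count=4096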

I was trying CrystalDiskMark inside a VM. Read performance increased by about 300%; raw write performance actually dropped, even with writeback enabled. But random read/write was greatly improved.

OTOH, actual application usage varied: a long build process still took around 10 minutes, and Eclipse (Java IDE) startup time stayed the same.


Interesting article here:

http://www.sebastien-han.fr/blog/2014/10/06/ceph-and-enhanceio/

It suggests using it in writethrough mode in combination with an external SSD journal for improved write performance.

Also, I never thought of using the /dev/disk/by-id links - much safer than /dev/sda etc.
 
Hmmph - my previous post was actually written yesterday but I neglected to press the post button :(

I'll try it with ext4 and an external journal today.

The gluster people heavily push XFS as the underlying filesystem, but in every test I have done for my use case it is slower. And while you can use an external journal with XFS, the tools and options are so limited as to make it a no-go for me.
 
We use OpenZFS and are VERY happy with it.

Caching (L2ARC) and a log device (ZIL) for writes improve global performance a lot. You should try it.
We generally use 2 or 4 disks in RAID 1 with 2 SSDs, but it will work with just 1 SSD.

After some time running Windows VMs there are very few reads hitting the SATA disks, and my SSD cache is very active overall (120 or 240 GB disks).
I also have a setup with 15K SAS + 80GB of RAM for caching and performance is very good too (using the ZFS ARC for caching).
 

Tempting, but the memory requirements seem high - around 3GB for my 3TB of disks, and this on a server that only has 32GB.
 

Just go for SSD and limit the ARC to 512MB or so (options zfs zfs_arc_max).
We have some servers with 48GB of RAM and give the ARC only a GB or so.
SSD is the way to go; caching in RAM is really not needed unless you have a lot of frequently read data.
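A sketch of that tunable, assuming a 512MB cap (the value is in bytes):
Code:
# /etc/modprobe.d/zfs.conf - cap the ZFS ARC at 512 MB
options zfs zfs_arc_max=536870912
# or apply immediately without reloading the module:
# echo 536870912 > /sys/module/zfs/parameters/zfs_arc_max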
 
Good point, thanks.

You talked me into it, I'll give it a spin. Do you use the Debian repo from http://zfsonlinux.org/debian.html?

Have you found the write cache (ZIL) worthwhile?

Yes, using that one. I do a normal setup, resize the partition, then use most of the disk for the ZFS partition.
There is a blog post about setting up ZFS with Proxmox, from James or something like that, which explains the setup.

I use a ZIL and see very little activity on it, but that depends a lot on your workload.

You can split your SSD between ARC (L2ARC) and ZIL if you can't fit many disks - simply create 2 partitions.
For the ZIL it would be better to mirror it.
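A sketch of that split, with a hypothetical pool called tank and placeholder by-id partition names; mirroring the log device needs a second SSD:
Code:
# first SSD partition as a read cache (L2ARC), second as the log device (ZIL/SLOG)
zpool add tank cache /dev/disk/by-id/ata-SSD_EXAMPLE-part1
zpool add tank log   /dev/disk/by-id/ata-SSD_EXAMPLE-part2
# with two SSDs the log device can be mirrored:
# zpool add tank log mirror /dev/disk/by-id/ata-SSD1-part2 /dev/disk/by-id/ata-SSD2-part2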
 
You could also build a dedicated storage array to add shared storage to your cluster. If you do this, I would recommend a Solaris derivative or a FreeBSD-based OS. The latest FreeNAS, or FreeBSD since 9.2, supports the new in-kernel iSCSI provider, provided you activate experimental mode. The new in-kernel iSCSI provider is enterprise grade since it can add or remove LUNs without needing to restart the daemon. Support for this in Proxmox is currently being tested, so it should reach the git repository any time soon.
 
Beyond our budget and time at the moment.

I'm testing glusterfs on top of zfs for shared storage.
 
After changing my cluster network and storage network to use InfiniBand, I get these I/O numbers running fio with an IOMeter test setup (OmniOS, ZFS RAID 10):
Code:
iometer: (g=0): rw=randrw, bs=512-64K/512-64K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [m] [100.0% done] [15506K/3958K /s] [13.4K/3386  iops] [eta 00m:00s]
iometer: (groupid=0, jobs=1): err= 0: pid=22496
  Description  : [Emulation of Intel IOmeter File Server Access Pattern]
  read : io=3275.7MB, bw=41000KB/s, iops=13377 , runt= 81812msec
    slat (usec): min=0 , max=46283 , avg=12.02, stdev=48.17
    clat (usec): min=208 , max=588504 , avg=3801.72, stdev=5611.74
     lat (usec): min=243 , max=588513 , avg=3814.35, stdev=5612.02
    clat percentiles (usec):
     |  1.00th=[ 1864],  5.00th=[ 2224], 10.00th=[ 2288], 20.00th=[ 2448],
     | 30.00th=[ 2544], 40.00th=[ 2608], 50.00th=[ 2672], 60.00th=[ 2928],
     | 70.00th=[ 3216], 80.00th=[ 3568], 90.00th=[ 4640], 95.00th=[ 7648],
     | 99.00th=[24960], 99.50th=[37120], 99.90th=[67072], 99.95th=[101888],
     | 99.99th=[183296]
    bw (KB/s)  : min= 9289, max=181419, per=100.00%, avg=41084.77, stdev=28574.13
  write: io=840052KB, bw=10268KB/s, iops=3351 , runt= 81812msec
    slat (usec): min=5 , max=16522 , avg=14.50, stdev=50.20
    clat (usec): min=240 , max=189122 , avg=3843.48, stdev=4629.25
     lat (usec): min=252 , max=189133 , avg=3858.60, stdev=4629.82
    clat percentiles (usec):
     |  1.00th=[ 1928],  5.00th=[ 2256], 10.00th=[ 2320], 20.00th=[ 2512],
     | 30.00th=[ 2576], 40.00th=[ 2640], 50.00th=[ 2736], 60.00th=[ 3024],
     | 70.00th=[ 3280], 80.00th=[ 3664], 90.00th=[ 5152], 95.00th=[ 8096],
     | 99.00th=[24448], 99.50th=[35072], 99.90th=[61184], 99.95th=[66048],
     | 99.99th=[102912]
    bw (KB/s)  : min= 2362, max=45999, per=100.00%, avg=10290.16, stdev=7174.27
    lat (usec) : 250=0.01%, 500=0.02%, 750=0.03%, 1000=0.08%
    lat (msec) : 2=1.22%, 4=85.73%, 10=9.54%, 20=1.86%, 50=1.29%
    lat (msec) : 100=0.20%, 250=0.04%, 750=0.01%
  cpu          : usr=7.71%, sys=32.16%, ctx=923423, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=1094411/w=274205/d=0, short=r=0/w=0/d=0


Run status group 0 (all jobs):
   READ: io=3275.7MB, aggrb=40999KB/s, minb=40999KB/s, maxb=40999KB/s, mint=81812msec, maxt=81812msec
  WRITE: io=840052KB, aggrb=10268KB/s, minb=10268KB/s, maxb=10268KB/s, mint=81812msec, maxt=81812msec


Disk stats (read/write):
  vda: ios=1092519/273847, merge=0/84, ticks=4123236/1135340, in_queue=5260612, util=99.86%
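For anyone wanting to reproduce this, a job file along these lines should match the pattern in that output (fio ships a similar example named iometer-file-access-server; the size is an assumption):
Code:
# iometer.fio - emulate the Intel IOmeter file-server access pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern

[iometer]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=80
direct=1
size=4g
ioengine=libaio
iodepth=64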
 
Well things looked promising but ended up somewhat disappointing.

Set up two ZFS pools (1 disk, 1 cache, 1 log each), one mounted on each node, created a GlusterFS store on top of them, and added it as Proxmox storage.
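Roughly the following, with placeholder device and host names (one pool per node, then a two-brick replica volume across them):
Code:
# on each node: a pool with one data disk plus SSD partitions for cache (L2ARC) and log (ZIL)
zpool create tank /dev/disk/by-id/ata-HDD_EXAMPLE \
    cache /dev/disk/by-id/ata-SSD_EXAMPLE-part1 \
    log   /dev/disk/by-id/ata-SSD_EXAMPLE-part2
zfs create tank/gluster && mkdir /tank/gluster/brick
# on one node: create the replicated gluster volume across the two bricks and start it
gluster volume create gv0 replica 2 node1:/tank/gluster/brick node2:/tank/gluster/brick
gluster volume start gv0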

A quick dd test seemed OK.

Copied a VM image over to the new storage - it failed to start, with kernel crashes. Tried it with multiple VMs, all with the same problem.

Removed the log and cache devices, recopied the image - same problem.
 
Found the problem - ZFS does not support O_DIRECT. I had to set the KVM cache mode to Write Through.
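That can be set per disk in the GUI, or with something along these lines (the VM ID, bus and volume name here are placeholders):
Code:
# switch the VM's first virtio disk to writethrough caching (avoids O_DIRECT)
qm set 101 --virtio0 glusterzfs:101/vm-101-disk-1.qcow2,cache=writethrough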
 
In a test VM, read performance is good: 200-300MB/s after the cache is loaded.


Even with a writeback cache, write performance is abysmal: 25 MB/s, when I was getting 35 MB/s for uncached XFS and 120 MB/s for cached ext4.
 
