Ceph RBD Cache on Virtual Disks

iptgeek

Hi All

I have been testing KRBD vs standard RBD in KVM virtual machines. I fired up an LXC container last week, ran a quick throughput test, and it was much faster than all my VMs, so I did some digging and found some really weird results.

All thoughts welcome. I've used the iometer profile in fio in the KVM VMs.

The VM had a 32 GB disk, 1 CPU core and 512 MB of RAM.

Testing is on a 4 node all SSD cluster

I'll summarise the results - the full fio output is in the pastebin link at the bottom.

First tests are librbd.
I threw in one test with iothread enabled on standard librbd - it didn't make any noticeable difference.
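For context, with librbd the disk goes from QEMU straight to the cluster and the cache mode under test is simply the cache= property on the drive. A rough sketch of the relevant part of the QEMU command line - pool, image and user names are placeholders, not my actual setup:

qemu-system-x86_64 ... \
    -drive file=rbd:rbd/vm-100-disk-1:id=admin:conf=/etc/ceph/ceph.conf,format=raw,if=virtio,cache=none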

Results from LIBRBD cache=default(no cache) discard on

Jobs: 1 (f=1): [m(1)] [100.0% done] [125.5MB/31860KB/0KB /s] [26.4K/6495/0 iops] [eta 00m:00s]
read : io=5042.1MB, bw=172107KB/s, iops=25328, runt= 30004msec

The next one is weird - it goes against all the advice and recommendations. It's slower than no caching.

cache=writeback, discard on

Jobs: 1 (f=1): [m(1)] [100.0% done] [103.7MB/26661KB/0KB /s] [19.2K/4899/0 iops] [eta 00m:00s
read : io=4120.6MB, bw=140209KB/s, iops=19193, runt= 30094msec

As expected this is slower.

cache=writethrough discard on
Jobs: 1 (f=1): [m(1)] [100.0% done] [79111KB/21062KB/0KB /s] [11.8K/2985/0 iops] [eta 00m:00s]
read : io=2581.5MB, bw=88103KB/s, iops=10516, runt= 30003msec

The unsafe writeback option should look blisteringly quick - it doesn't.
cache=writeback (unsafe)
Jobs: 1 (f=1): [m(1)] [100.0% done] [105.3MB/27021KB/0KB /s] [19.5K/4979/0 iops] [eta 00m:00s]
read : io=4171.9MB, bw=142364KB/s, iops=19570, runt= 30002msec

Considering this is an all-SSD cluster, the write IOPS are pretty crap using librbd.

So the kernel (KRBD) mode tests are next.
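For anyone wanting to reproduce the KRBD side: the image is mapped by the kernel client on the host and the guest just sees an ordinary block device. Pool and image names below are placeholders (in Proxmox this should just be the krbd flag on the storage definition, if I remember correctly):

# on the hypervisor
rbd map rbd/vm-100-disk-1    # maps the image to something like /dev/rbd0
rbd showmapped               # confirm the mapping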

Pretty good improvement: with no cache, KRBD is already better than librbd with full caching.

cache=default (no cache), discard on
Jobs: 1 (f=1): [m(1)] [100.0% done] [118.2MB/30085KB/0KB /s] [25.9K/6352/0 iops] [eta 00m:00s]
read : io=5529.1MB, bw=188636KB/s, iops=28767, runt= 30019msec

Things start to get weird again. I would expect the next one to be a lot slower, but it's not - it's actually better.

cache=directsync discard on
Jobs: 1 (f=1): [m(1)] [100.0% done] [135.6MB/33683KB/0KB /s] [29.6K/7211/0 iops] [eta 00m:00s]
read : io=5519.4MB, bw=188352KB/s, iops=28700, runt= 30005msec

The next one is the type of jump in IOPS and throughput I would expect to see from caching - why is this not the case in the librbd tests?

cache=writeback discard on
Jobs: 1 (f=1): [m(1)] [100.0% done] [365.9MB/95586KB/0KB /s] [85.3K/21.4K/0 iops] [eta 00m:00s]
read : io=6553.9MB, bw=325181KB/s, iops=53259, runt= 20638msec

Tasty IOPS, and this is the caching mode recommended everywhere for Ceph RBD.
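(For anyone following along: in Proxmox the cache mode is just a per-disk option, so switching between these tests is a one-liner. VM ID, storage and volume names below are placeholders for whatever your setup uses.)

qm set 100 --scsi0 ceph-ssd:vm-100-disk-1,cache=writeback,discard=on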

writeback(unsafe) discard on
Jobs: 1 (f=1): [m(1)] [100.0% done] [457.2MB/115.4MB/0KB /s] [106K/26.5K/0 iops] [eta 00m:00s]
read : io=6553.9MB, bw=636786KB/s, iops=104295, runt= 10539msec

Noticeably faster, as you would expect - certainly not worth losing your data for.

cache=none, discard on
Jobs: 1 (f=1): [m(1)] [100.0% done] [134.6MB/34497KB/0KB /s] [29.4K/7266/0 iops] [eta 00m:00s]
read : io=5570.3MB, bw=190098KB/s, iops=29073, runt= 30005msec

This is better than all the librbd modes.

From these tests it appears that caching isn't doing anything when using librbd, and that the kernel rbd driver is absolutely screaming compared to librbd. I'd love to check iowait and CPU usage differences while running these tests - I'll do that some other time.
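For anyone who wants to grab those numbers while the benchmarks run, something as simple as iostat/pidstat on the hypervisor should do it (sysstat package; the 1-second interval and the process lookup are only examples - the QEMU process name may differ depending on how the VM was started):

# on the hypervisor, while fio runs inside the guest
iostat -x 1                                   # per-device stats plus %iowait in the CPU summary
pidstat -u 1 -p $(pidof qemu-system-x86_64)   # CPU usage of the QEMU process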


Any experts got any thoughts on this?

Thanks!


Full FIO results --> http://pastebin.com/PyBWu6GV
 
Always wondered what IDE would look like :)

Jobs: 1 (f=1): [m(1)] [100.0% done] [8711KB/1754KB/0KB /s] [811/203/0 iops] [eta 00m:00s]
read : io=256046KB, bw=8517.3KB/s, iops=801, runt= 30062msec
 
With RBD, cache=writeback only enables one thing: rbd_cache=true, which is specific to Ceph. (With RBD, only cache=none|writeback do anything; unsafe does nothing extra here.)

This helps only with sequential writes. It merges contiguous small writes into bigger ones and writes the resulting larger object to Ceph in one go.

And of course, this only works if you don't use a sync option in your fio job.

iothread will help mainly with multiple disks.
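For reference, that cache lives on the client side and can be tuned through the usual ceph.conf options - a minimal sketch (the sizes shown are, as far as I remember, the defaults):

[client]
rbd cache = true
# 32 MB cache, start flushing once 24 MB is dirty
rbd cache size = 33554432
rbd cache max dirty = 25165824
# behave as writethrough until the guest sends its first flush
rbd cache writethrough until flush = true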
 
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)

[iometer]

bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=80
direct=1
size=8g
ioengine=libaio
# IOMeter defines the server loads as the following:
# iodepth=1 Linear
# iodepth=4 Very Light
# iodepth=8 Light
# iodepth=64 Moderate
# iodepth=256 Heavy
iodepth=64
runtime=30
 

It'll not work with random writes. Try a sequential write - you should see the difference.
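If anyone wants to check that quickly, a sequential-write variant of the job above would look something like this (just a sketch - the 64k block size and the rest are reasonable values for illustration, not anything I have benchmarked):

[seqwrite]
rw=write
bs=64k
direct=1
size=8g
ioengine=libaio
iodepth=64
runtime=30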