ZFS: fio random read performance not scaling with iodepth

Detuner

New Member
Dec 3, 2020
Hi!
I've set out to figure out exactly how big the I/O performance drop in KVM is compared to host ZFS performance. I have a Supermicro platform with 2 x Xeon Gold 6226R and 128 GB DDR4 RAM. The storage is 2 x Intel D3-S4610 (SSDSC2KG480G8) in a ZFS mirror, pool ashift set to 13. It is a fresh install of PVE 6.3-2, with no other CTs/VMs running on this node except a single test VM (ZVol volblocksize=8k, VirtIO SCSI single, iothread=1, no cache). I created a separate dataset for ZFS benching with recordsize set to 8k, which I think makes for a fairer comparison with the 8k volblocksize ZVol.
The goal was to compare raw SSD read performance to native ZFS and to the VM ZVol. I tried to eliminate the ARC as much as possible, so I set zfs_arc_max to 4G, used a 16G benchmark datafile, and executed sync ; echo 3 > /proc/sys/vm/drop_caches before every bench run. I used fio for all the tests. The results for most workloads were as expected, but random reads show strange behaviour: they do not scale up with iodepth at all on host ZFS, yet scale perfectly well in the VM.
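For reference, the host-side setup looked roughly like this (the dataset name and paths are just examples):

Code:
# cap the ARC at 4 GiB; zfs_arc_max is specified in bytes
echo $((4 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max

# separate dataset for file-based benching, recordsize matched to the 8k ZVol volblocksize
# ("rpool/fiobench" is just an example name, mounted at the default /rpool/fiobench)
zfs create -o recordsize=8k rpool/fiobench

# 16G test datafile, well above the ARC cap
dd if=/dev/urandom of=/rpool/fiobench/generic.test bs=1M count=16384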
Typical fio job file:

[global]
bs=8k
iodepth=1
direct=1
ioengine=libaio
numjobs=1
name=RandRead1
rw=randread
runtime=90
[job1]
filename=./generic.test
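Each data point below was collected roughly like this, dropping caches and then sweeping the queue depth (paths are examples, and the job options mirror the job file above):

Code:
# sweep iodepth against the same 16G test file, dropping caches before every run
for qd in 1 2 4 8 16 32; do
    sync; echo 3 > /proc/sys/vm/drop_caches
    fio --name=RandRead --filename=./generic.test --rw=randread --bs=8k \
        --direct=1 --ioengine=libaio --numjobs=1 --runtime=90 --iodepth=$qd
done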

Results, raw SSD performance (MB/s@IOPS):
iodepth=1: 50.4@6.4k
iodepth=2: 95.4@12.2k
iodepth=4: 170@21.8k
iodepth=8: 280@35.7k
iodepth=16: 403@51.6k
iodepth=32: 455@58.2k

Host ZFS performance (MB/s@IOPS):
iodepth=1: 34.3@4.3k
iodepth=2: 38.9@4.9k
iodepth=4: 38.8@4.9k
iodepth=8: 37.9@4.8k
iodepth=16: 36.9@4.7k
iodepth=32: 38.8@4.9k

KVM ZVol performance (MB/s@IOPS):
iodepth=1: 28.1@3.6k
iodepth=2: 75@9.6k
iodepth=4: 127@16.3k
iodepth=8: 233@31.1k
iodepth=16: 412@52.8k
iodepth=32: 495@63.4k

Is there any explanation for this? Does it really mean that async reads are processed synchronously in ZFS?
 
Have you seen our ZFS Benchmark paper? https://forum.proxmox.com/threads/proxmox-ve-zfs-benchmark-with-nvme.80744/

Why are you benching randread performance? Reading is usually way faster than writing, so write tests are usually what gets benchmarked.

How did you test "Host ZFS" performance? The result is a bit odd and hints that something was missed.

I noticed a few things with your benchmarks:
  • I don't see the parameter sync=1
  • Filename: how do you benchmark the SSD directly? The path should be some /dev/sdX or /dev/disk/by-id/Your_SSD to write to the SSD directly without any file system in between. For the ZVol performance you would want to create a zvol of the right size ( zfs create -V 16G <pool>/<zvol> ) and then tell FIO to write directly to the zvol, which can be found at /dev/zvol/<pool>/<zvol>.
  • Use a longer runtime of about 10 min (600 sec) to reduce the impact of the caches present.
  • To reduce the impact of the ARC you can set "primarycache=metadata" as an option for the zvol with zfs set primarycache=metadata <pool>/<zvol>.
  • After that you can run a FIO benchmark inside a VM (a sketch of these steps follows right after this list).
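A minimal sketch of those steps, assuming a test zvol called rpool/fiotest (the name is only a placeholder):

Code:
# 16G test zvol
zfs create -V 16G rpool/fiotest

# keep data blocks out of the ARC, cache metadata only
zfs set primarycache=metadata rpool/fiotest

# run fio directly against the zvol block device, no file system in between
fio --name=RandRead --filename=/dev/zvol/rpool/fiotest --rw=randread \
    --bs=8k --direct=1 --ioengine=libaio --numjobs=1 --iodepth=16 --runtime=600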
 
Thanks a lot for your reply!
I'm benchmarking both reads and writes; I just got stuck a bit on these weird read results, where I totally did not expect any problems.
  • Direct SSD reads were benched with "filename=/dev/sda" in the fio job file.
  • The KVM ZVol results are what I see inside the VM: 8k randread from a 16G test file on the VM filesystem (ext4). The ext4 cache was eliminated by running "sync ; echo 3 > /proc/sys/vm/drop_caches" inside the VM before every bench run (and host caches were also dropped before every VM bench). A bit dirty, I agree, since it also measures ext4 performance, so here is some additional info: benching randread of the ZVol from the host OS ("filename=/dev/zvol/rpool/data/vm-2002-disk-0") gives about 39 MB/s @ 5k IOPS, scaling up to 620 MB/s @ 80k IOPS as iodepth increases to 32. Benching the VM's virtual disk ("filename=/dev/sda" inside the VM) gives 34.2 MB/s @ 4.3k IOPS, also scaling up to 556 MB/s @ 71k IOPS. Nothing unexpected really: some performance drop caused by virtualization and some more by the VM filesystem, but it still scales up nicely with iodepth.
  • I tried longer runtimes of up to 10 min; they don't significantly affect the results. I'm quite sure those weird results are not related to ARC issues, since I also monitor SSD usage with iostat and the ARC with arcstat while benching. I did separate bench runs with the test file fully in the ARC (zfs_arc_max set to 32G plus a few warmup runs, until I saw a constant 0% miss rate in arcstat and no reads in iostat) and got more interesting results: ZFS randread from the ARC gives a stable 690 MB/s @ 88.5k IOPS at any iodepth from 1 to 32, while ZVol randread gives 350 MB/s @ 44k IOPS, scaling up to 1370 MB/s @ 175k IOPS. So there is a similar performance issue even with 100% cached reads.
  • Performance starts to scale normally when I increase numjobs instead of iodepth (a numjobs-scaled job file is sketched right after this list).
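For completeness, the numjobs-scaled job differs from the one above only in numjobs/iodepth; group_reporting is added here just to aggregate the per-job output and was not part of the original job file:

Code:
[global]
bs=8k
iodepth=1
direct=1
ioengine=libaio
numjobs=32
group_reporting=1
name=RandRead32J
rw=randread
runtime=90
[job1]
filename=./generic.test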
 
iostat output during ZVol benching (bs=8k, numjobs=1, iodepth=32):
Code:
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda           26015.00    0.00    203.24      0.00     0.00     0.00   0.00   0.00    0.22    0.00   0.00     8.00     0.00   0.04 100.00
sdb           26094.00    0.00    203.87      0.00     0.00     0.00   0.00   0.00    0.22    0.00   0.00     8.00     0.00   0.04 100.00
zd0           59957.00    0.00    468.41      0.00     0.00     0.00   0.00   0.00    0.52    0.00  31.07     8.00     0.00   0.02 100.00
iostat output during FS benching (bs=8k, numjobs=1, iodepth=32):
Code:
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda           2510.00    0.00     19.61      0.00     0.00     0.00   0.00   0.00    0.16    0.00   0.00     8.00     0.00   0.40 100.00
sdb           2471.00    0.00     19.30      0.00     0.00     0.00   0.00   0.00    0.16    0.00   0.00     8.00     0.00   0.40 100.00
zd0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
iostat output during FS benching (bs=8k, numjobs=32, iodepth=1):
Code:
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda           37810.00    0.00    295.39      0.00     0.00     0.00   0.00   0.00    0.25    0.00   0.00     8.00     0.00   0.03 100.00
sdb           37861.00    0.00    295.84      0.00     0.00     0.00   0.00   0.00    0.25    0.00   0.00     8.00     0.00   0.03 100.00
zd0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
By the way, in the ZFS benchmark paper you mentioned above, every fio job definition has iodepth=1 and is scaled by numjobs only. I think I'm missing something important about either ZFS or fio.
 