Benchmarking ZFS storage performance

Koleon

Jun 9, 2023
Dear Proxmox community,

After several searches in the forum, I couldn't find much information regarding ZFS storage and its performance tuning. Thus, I'd like to start this thread to share best practices, tests, and tuning tips on how you design your data storage.
Recently, I built a home NAS/server, and to decide whether to use the local ZFS Pool Backend or to create a ZFS pool with datasets and use bind mounts, I ran the tests below.
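
For context, the two candidate setups look roughly like this - a sketch only; the storage name local-zfs, container ID 101, and the dataset/mount-point names are placeholders:
Code:
    # 1) Mount point backed by the PVE ZFS storage (PVE allocates the volume itself):
    pct set 101 -mp0 local-zfs:32,mp=/mnt/data

    # 2) Bind mount of a plain ZFS dataset into the same container:
    zfs create rpool/share
    pct set 101 -mp0 /rpool/share,mp=/mnt/share

With the first variant PVE manages the volume; with the second, the dataset lives outside PVE's storage management and is simply exposed to the container.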

The fio tests:
Code:
   # SEQ Write with 4 vCPU:
   sync; fio --filename=testfile-seq-160g --size=160G --direct=1 --rw=write --bs=1M --ioengine=libaio --numjobs=4 --iodepth=32 --name=seq-write-test --group_reporting --ramp_time=4

   # SEQ Read with 4 vCPU:
   sync; fio --filename=testfile-seq-160g --direct=1 --rw=read --bs=1M --ioengine=libaio --numjobs=4 --iodepth=32 --name=seq-read-test --group_reporting --readonly --ramp_time=4

   # Random write with 4 vCPU:
   sync; fio --filename=testfile-rand-4g --size=4G --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --numjobs=4 --iodepth=32 --name=rand-write-test --group_reporting --ramp_time=4 --time_based --runtime=300

   # Random read with 4 vCPU:
   sync; fio --filename=testfile-rand-4g --direct=1 --rw=randread --bs=4k --ioengine=libaio --numjobs=4 --iodepth=32 --name=rand-read-test --group_reporting --readonly --ramp_time=4

   # Mixed Random Read/Write database file with 8 vCPU:
   sync; fio --filename=database-testfile --size=4G --direct=1 --rw=randrw --bs=8k --ioengine=libaio --iodepth=32 --numjobs=8 --rwmixread=70 --name=db-mixed-rw-test --group_reporting --ramp_time=4

   # Multi-Threaded Application Simulation e.g. a data analytics tool with 16 vCPU:
   sync; fio --filename=testfile-seq-readwrite --size=160G --direct=1 --rw=readwrite --bs=64k --ioengine=libaio --iodepth=16 --numjobs=16 --name=multi-thread-app --group_reporting --ramp_time=4


Detailed ZFS pool information: RAID 10 (two mirrored vdevs) consisting of 4 HDDs plus an NVMe SSD partition as SLOG, created with the following command:
Code:
    # zpool create \
        -o ashift=12 \
        -O encryption=on -O keylocation=file:///root/zfs-pool.key -O keyformat=raw \
        -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
        -O compression=zstd-7 \
        -O normalization=formD \
        rpool mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd log /dev/nvme0n1p1
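
To confirm that the two mirror vdevs, the SLOG, and the pool-level properties actually ended up as intended, a couple of standard checks (nothing assumed beyond the pool name rpool):
Code:
    # Show the two mirrors and the log device:
    zpool status rpool

    # Confirm the compression, encryption and default recordsize settings:
    zfs get compression,encryption,recordsize rpool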


I ran the fio tests in an LXC container with two different storage configurations mounted:
1. a disk created with the Local ZFS module,
2. a bind-mounted dataset.

blocksize=128K       | SEQ Write | SEQ Read  | Random write 4GB | Random read 4GB | Mixed Random Read/Write 4GB database file | Multi-Threaded Application Simulation
Local ZFS module     | 484 MB/s  | 1581 MB/s | 12.4 MB/s        | 78.9 MB/s       | 29.8 MB/s; 12.8 MB/s                      | 434 MB/s; 434 MB/s
Bind-mounted dataset | 255 MB/s  | 1602 MB/s | 13.3 MB/s        | 82.6 MB/s       | 29.3 MB/s; 12.6 MB/s                      | 576 MB/s; 576 MB/s

EDIT: after @waltar's and @LnxBil's valuable comments I adjusted those tests and re-ran them to get more accurate results. Unfortunately, even with the parameter --size=10*$mb_memory I was still getting a much higher sequential read speed (about 1600 MB/s) for a 160 GB file, despite the host having only 16 GB of RAM. I guess it could be due to a combination of several ZFS features such as caching, prefetching, and striping, which are particularly effective for sequential workloads.
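
For anyone who wants to separate ARC effects from raw disk throughput, two knobs I came across while reading - a sketch, not something I have validated thoroughly on this pool: capping the ARC size and/or running the test on a dataset that only caches metadata.
Code:
    # Cap the ARC at e.g. 4 GiB for the duration of the test (runtime tunable):
    echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

    # Or benchmark on a throwaway dataset that only caches metadata, so reads hit the disks:
    zfs create -o primarycache=metadata rpool/bench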

Conclusion:
You might say that I could have expected such numbers, as there are many good resources about tuning ZFS performance around the internet, e.g. [1], [2], [3], [4].
However, I was mainly curious about the raw data and the comparison between the Local ZFS module and bind-mounted dataset performance.

So far, I'll stick with:
- For all LXC containers and VMs - a disk created with the Local ZFS module (with blocksize=128K).
- For file share applications (e.g. Nextcloud) - a bind-mounted dataset with recordsize=64K.
- For media apps (e.g. Plex) - a bind-mounted dataset with recordsize=512K or 1M and compression=zstd-12 (see the sketch below).
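
For the bind-mounted datasets, the properties from the list above would be set roughly like this - a sketch, the dataset names rpool/nextcloud and rpool/media are placeholders (for VM disks on the ZFS backend, the analogous knob is the storage's volblocksize):
Code:
    # File shares (e.g. Nextcloud data):
    zfs create -o recordsize=64K rpool/nextcloud

    # Large sequential media files (e.g. Plex library), heavier compression:
    zfs create -o recordsize=1M -o compression=zstd-12 rpool/media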

If you've performed any tests or you have any tips please share and comment.
 
It's common for a fileserver to have a RAM-to-space ratio of about 1:1000 (with a range of 1:500 to 1:2000), and it's not uncommon for the number of files (= inodes) to be on the order of a million files per TB of space (it could be much less, as on a virtualization storage, or much more, as on an application share).
You want to benchmark ZFS, but many of your numbers aren't filesystem I/O numbers - they are benchmark results served from the ZFS ARC cache.
If you want to know what your (or any) filesystem can really handle, your benchmark data size should be at least 10x the host RAM, multiply that number of GB by 1000 for the number of files, and test with job counts from 1 up to the number of cores of your host. E.g. with 64 GB RAM, always use at least 640 GB of test file(s) in total and benchmark 640,000 files, and with 24 cores, run from 1 to 24 jobs at the same time. Be aware that with a small number of jobs (1, ...) you may be 100% CPU limited depending on your storage, so you would sometimes be measuring CPU rather than I/O performance.
Benchmarking also depends on the use case: if you access the storage locally, as on a hypervisor, test locally; if you access it over NFS, SMB or another protocol, measure from another host, because that is the filesystem performance that is actually usable remotely.
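
As a rough sketch of that sizing rule for a 16 GB RAM / 4-core host (the directory /rpool/bench and the 160 GB total are only illustrative):
Code:
    # Keep the total data at >= 10x host RAM and sweep the job count from 1 to the core count:
    for jobs in 1 2 3 4; do
        fio --name=seq-write-$jobs --directory=/rpool/bench --size=$((160 / jobs))G \
            --rw=write --bs=1M --ioengine=libaio --iodepth=32 --direct=1 \
            --numjobs=$jobs --group_reporting
    done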
 
After several searches in the forum, I couldn't find much information regarding ZFS storage and its performance tuning.
Yes, that should be a sign. "It depends".


In addition to @waltar's comment: you also only benchmarked files and not zvols, which for most people in the PVE world is the major use case.
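
A quick way to include a zvol in such a benchmark would be something along these lines (a sketch; rpool/benchvol is a throwaway name):
Code:
    # Create a sparse test zvol and point fio at the block device it exposes:
    zfs create -s -V 160G rpool/benchvol
    fio --name=zvol-seq-write --filename=/dev/zvol/rpool/benchvol --rw=write --bs=1M \
        --ioengine=libaio --iodepth=32 --direct=1 --numjobs=4 --group_reporting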
 
Thank you, @waltar and @LnxBil, for the explanation and insight. I tried to learn more about this topic, but the more I learned, the more I realized it isn't easy to reproduce real-world scenarios. My initial post has been updated - I adjusted and re-ran those tests, yet I'm still getting a relatively high sequential read speed.
 
1581 MB/s on 4 spinning-rust disks? That is roughly 400 MB/s per disk, which is VERY implausible.

How did you create the file, and does it contain more than zeroes?
 
I guess ZFS does some magic. I wrote the file with the "SEQ Write with 4 vCPU" test, then used the same file for the sequential read test.
I double-checked it, and it contains more than zeroes.
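
For what it's worth, this is roughly how one could check that compression isn't serving the reads from thin air (rpool/data is a placeholder for whichever dataset holds the test file):
Code:
    ls -lh testfile-seq-160g           # logical size as written by fio
    du -h testfile-seq-160g            # blocks actually allocated after compression
    zfs get compressratio rpool/data   # compression ratio of the dataset holding the file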
 
