Benchmarking ZFS storage performance

Koleon

Jun 9, 2023
Dear Proxmox community,

After several searches in the forum, I couldn't find much information regarding ZFS storage and its performance tuning. Thus, I'd like to start this thread to share best practices, tests, and tuning tips on how you design your data storage.
Recently, I built a home NAS/server, and to decide whether to use the local ZFS Pool Backend or to create a ZFS pool with datasets and use bind mounts, I ran the tests below.
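
For context, the two candidate setups look roughly like this - a sketch only; the storage name local-zfs, container ID 101, and the dataset/mount-point names are placeholders:
Code:
    # 1) Mount point backed by the PVE ZFS storage (PVE allocates the volume itself):
    pct set 101 -mp0 local-zfs:32,mp=/mnt/data

    # 2) Bind mount of a plain ZFS dataset into the same container:
    zfs create rpool/share
    pct set 101 -mp0 /rpool/share,mp=/mnt/share

With the first variant PVE manages the volume; with the second, the dataset lives outside PVE's storage management and is simply exposed to the container.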

The fio tests:
Code:
   # SEQ Write with 4 vCPU:
   sync; fio --filename=testfile-seq-160g --size=160G --direct=1 --rw=write --bs=1M --ioengine=libaio --numjobs=4 --iodepth=32 --name=seq-write-test --group_reporting --ramp_time=4

   # SEQ Read with 4 vCPU:
   sync; fio --filename=testfile-seq-160g --direct=1 --rw=read --bs=1M --ioengine=libaio --numjobs=4 --iodepth=32 --name=seq-read-test --group_reporting --readonly --ramp_time=4

   # Random write with 4 vCPU:
   sync; fio --filename=testfile-rand-4g --size=4G --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --numjobs=4 --iodepth=32 --name=rand-write-test --group_reporting --ramp_time=4 --time_based --runtime=300

   # Random read with 4 vCPU:
   sync; fio --filename=testfile-rand-4g --direct=1 --rw=randread --bs=4k --ioengine=libaio --numjobs=4 --iodepth=32 --name=rand-read-test --group_reporting --readonly --ramp_time=4

   # Mixed Random Read/Write database file with 8 vCPU:
   sync; fio --filename=database-testfile --size=4G --direct=1 --rw=randrw --bs=8k --ioengine=libaio --iodepth=32 --numjobs=8 --rwmixread=70 --name=db-mixed-rw-test --group_reporting --ramp_time=4

   # Multi-Threaded Application Simulation e.g. a data analytics tool with 16 vCPU:
   sync; fio --filename=testfile-seq-readwrite --size=160G --direct=1 --rw=readwrite --bs=64k --ioengine=libaio --iodepth=16 --numjobs=16 --name=multi-thread-app --group_reporting --ramp_time=4


Detailed ZFS pool information: RAID 10 (two mirrored vdevs) consisting of 4 HDDs plus an NVMe SSD partition as SLOG, created with the following command:
Code:
    # zpool create \
        -o ashift=12 \
        -O encryption=on -O keylocation=file:///root/zfs-pool.key -O keyformat=raw \
        -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
        -O compression=zstd-7 \
        -O normalization=formD \
        rpool mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd log /dev/nvme0n1p1
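
To confirm that the two mirror vdevs, the SLOG, and the pool-level properties actually ended up as intended, a couple of standard checks (nothing assumed beyond the pool name rpool):
Code:
    # Show the two mirrors and the log device:
    zpool status rpool

    # Confirm the compression, encryption and default recordsize settings:
    zfs get compression,encryption,recordsize rpool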


I ran the fio tests in an LXC container with two different storage configurations mounted:
1. a disk created with the Local ZFS module,
2. a bind-mounted dataset.

blocksize=128K       | SEQ Write | SEQ Read  | Random write 4GB | Random read 4GB | Mixed Random Read/Write 4GB database file | Multi-Threaded Application Simulation
Local ZFS module     | 484 MB/s  | 1581 MB/s | 12.4 MB/s        | 78.9 MB/s       | 29.8 MB/s; 12.8 MB/s                      | 434 MB/s; 434 MB/s
Bind-mounted dataset | 255 MB/s  | 1602 MB/s | 13.3 MB/s        | 82.6 MB/s       | 29.3 MB/s; 12.6 MB/s                      | 576 MB/s; 576 MB/s

EDIT: after @waltar's and @LnxBil's valuable comments I adjusted those tests and re-ran them to get more accurate results. Unfortunately, even with the parameter --size=10*$mb_memory I was still getting a much higher sequential read speed (about 1600 MB/s) for a 160 GB file, despite the host having only 16 GB of RAM. I guess it could be due to a combination of several ZFS features such as caching, prefetching, and striping, which are particularly effective for sequential workloads.
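
For anyone who wants to separate ARC effects from raw disk throughput, two knobs I came across while reading - a sketch, not something I have validated thoroughly on this pool: capping the ARC size and/or running the test on a dataset that only caches metadata.
Code:
    # Cap the ARC at e.g. 4 GiB for the duration of the test (runtime tunable):
    echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

    # Or benchmark on a throwaway dataset that only caches metadata, so reads hit the disks:
    zfs create -o primarycache=metadata rpool/bench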

Conclusion:
You might say that I could have expected such numbers, as there are many good resources about tuning ZFS performance around the internet, e.g. [1], [2], [3], [4].
However, I was mainly curious about the raw data and the comparison between the Local ZFS module and bind-mounted dataset performance.

So far, I'll stick with:
- For all LXC containers and VMs - a disk created with the Local ZFS module (with blocksize=128K).
- For file share applications (e.g. Nextcloud) - a bind-mounted dataset with recordsize=64K.
- For media apps (e.g. Plex) - a bind-mounted dataset with recordsize=512K or 1M and compression=zstd-12 (see the sketch below).
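
For the bind-mounted datasets, the properties from the list above would be set roughly like this - a sketch, the dataset names rpool/nextcloud and rpool/media are placeholders (for VM disks on the ZFS backend, the analogous knob is the storage's volblocksize):
Code:
    # File shares (e.g. Nextcloud data):
    zfs create -o recordsize=64K rpool/nextcloud

    # Large sequential media files (e.g. Plex library), heavier compression:
    zfs create -o recordsize=1M -o compression=zstd-12 rpool/media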

If you've performed any tests or you have any tips please share and comment.
 
It's common for a fileserver to have a RAM-to-space ratio of about 1:1000 (with a range of 1:500 to 1:2000), and it's not uncommon for the number of files (= inodes) to be on the order of a million files per TB of space (it could be much less, as on a virtualization storage, or much more, as on an application share).
You want to benchmark ZFS, but many of your numbers aren't filesystem I/O numbers - they are benchmark results served from the ZFS ARC cache.
If you want to know what your (or any) filesystem can really handle, your benchmark data size should be at least 10x the host RAM, multiply that number of GB by 1000 for the number of files, and test with job counts from 1 up to the number of cores of your host. E.g. with 64 GB RAM, always use at least 640 GB of test file(s) in total and benchmark 640,000 files, and with 24 cores, run from 1 to 24 jobs at the same time. Be aware that with a small number of jobs (1, ...) you may be 100% CPU limited depending on your storage, so you would sometimes be measuring CPU rather than I/O performance.
Benchmarking also depends on the use case: if you access the storage locally, as on a hypervisor, test locally; if you access it over NFS, SMB or another protocol, measure from another host, because that is the filesystem performance that is actually usable remotely.
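
As a rough sketch of that sizing rule for a 16 GB RAM / 4-core host (the directory /rpool/bench and the 160 GB total are only illustrative):
Code:
    # Keep the total data at >= 10x host RAM and sweep the job count from 1 to the core count:
    for jobs in 1 2 3 4; do
        fio --name=seq-write-$jobs --directory=/rpool/bench --size=$((160 / jobs))G \
            --rw=write --bs=1M --ioengine=libaio --iodepth=32 --direct=1 \
            --numjobs=$jobs --group_reporting
    done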
 
After several searches in the forum, I couldn't find much information regarding ZFS storage and its performance tuning.
Yes, that should be a sign. "It depends".


In addition to @waltar's comment: you also only benchmarked files and not zvols, which for most people in the PVE world is the major use case.
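
A quick way to include a zvol in such a benchmark would be something along these lines (a sketch; rpool/benchvol is a throwaway name):
Code:
    # Create a sparse test zvol and point fio at the block device it exposes:
    zfs create -s -V 160G rpool/benchvol
    fio --name=zvol-seq-write --filename=/dev/zvol/rpool/benchvol --rw=write --bs=1M \
        --ioengine=libaio --iodepth=32 --direct=1 --numjobs=4 --group_reporting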
 
Thank you, @waltar and @LnxBil, for the explanation and insight. I tried to learn more about this topic, but the more I learned, the more I realized it isn't easy to reproduce real-world scenarios. My initial post has been updated - I adjusted and re-ran those tests, yet I'm still getting a relatively high sequential read speed.
 
1581 MB/s on 4 spinning-rust disks? That is roughly 400 MB/s per disk, which is VERY implausible.

How did you create the file, and does it contain more than zeroes?
 
I guess ZFS does some magic. I wrote the file with the "SEQ Write with 4 vCPU" test, then used the same file for the sequential read test.
I double-checked it, and it contains more than zeroes.
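
For what it's worth, this is roughly how one could check that compression isn't serving the reads from thin air (rpool/data is a placeholder for whichever dataset holds the test file):
Code:
    ls -lh testfile-seq-160g           # logical size as written by fio
    du -h testfile-seq-160g            # blocks actually allocated after compression
    zfs get compressratio rpool/data   # compression ratio of the dataset holding the file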
 
