While I have always been aware of ZFS write amplification, I had never really tested or investigated the exact amount occurring on our Proxmox server builds.
I recently wrote a post investigating the performance of the various local storage options (https://forum.proxmox.com/threads/quick-and-dirty-io-performance-testing.82846/). While I am still working on updating that post with ZFS performance metrics, I decided at the same time to measure the write amplification of fio commands sent to the ZFS pool.
The system specs are below (as I explained in the earlier post, this is a test system built from parts I had lying around):
Supermicro X11SLH-F
Xeon E3-1246 v3
32GB DDR3 ECC
Onboard Intel C226 chipset, 6 x SATA3 (6 Gbps)
1 x 300GB Intel 320 SSD (Proxmox installed to this drive: 31GB ext4 root partition, 8GB swap, remainder is the default lvm-thin)
5 x 120GB Intel 320 SSD (Used for ZFS and other storage options for testing)
The Intel 320 SSDs come in at around 2100 fsyncs/second as measured by pveperf.
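For anyone who wants to check their own drives, pveperf just takes a path on the filesystem to test (the mount point below is a placeholder; without an argument it tests the root filesystem):
pveperf /testpool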
All tests were run with ashift=13, since an ashift that is too high should have a negligible effect on write amplification, while one that is too low will absolutely cause tremendous undesired write amplification.
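For reference, the ashift is set at pool creation and cannot be changed afterwards; it looks roughly like this (a sketch with a placeholder pool name and device paths, not my exact commands):
zpool create -o ashift=13 testpool /dev/disk/by-id/ata-INTEL_SSDSA2CW120G3_SERIAL1   # single drive
zpool create -o ashift=13 testpool mirror /dev/disk/by-id/ata-INTEL_SSDSA2CW120G3_SERIAL1 /dev/disk/by-id/ata-INTEL_SSDSA2CW120G3_SERIAL2   # mirror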
I experimented with various ZFS block sizes, and the measurements of actual data written were taken from the SMART data of the physical SSDs (LBAs written before each test compared to after it).
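This is roughly how each measurement works, in case anyone wants to reproduce it (a sketch rather than my exact commands; the SMART attribute name, the 512-byte LBA size, and /dev/sdb are assumptions you may need to adjust for your drive model):
BEFORE=$(smartctl -A /dev/sdb | awk '/Total_LBAs_Written/ {print $10}')
# run the fio job inside the VM here (it writes a known amount of data, e.g. 1 GiB)
AFTER=$(smartctl -A /dev/sdb | awk '/Total_LBAs_Written/ {print $10}')
echo "scale=2; ($AFTER - $BEFORE) * 512 / (1024^3)" | bc   # write amplification = bytes that hit the SSD / 1 GiB written by fio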
I used fio to write to the virtual disk device directly from within a VM, so the guest OS partitioning and filesystem are removed from the equation for now; these are synthetic fio tests, not real-world operating measurements (I hope and intend to do tests with real-world workloads in the future).
The first tests were run against a single-drive ZFS pool and a ZFS mirror, both using the default 8k block size (the zvol volblocksize, which Proxmox exposes as the storage Block Size option).
As expected, the write amplification values were identical for the single drive and the mirror.
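If you want to confirm which block size a given VM disk actually ended up with, you can query the zvol directly (the dataset name below is a placeholder for your own pool and VM disk):
zfs get volblocksize testpool/vm-100-disk-0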
fio --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~5X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=1M --numjobs=1 --iodepth=1 --size=1G --name seq_read --filename=/dev/sdX (Write Amp ~2X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~6.15X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=8 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~4.38X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=64 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~2.8X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=8k --numjobs=1 --iodepth=64 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~2.15X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=512 --numjobs=1 --iodepth=64 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~14.8X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=16k --numjobs=1 --iodepth=64 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~2.13X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=32k --numjobs=1 --iodepth=64 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~2.06X)
Then the above tests were rerun with sync=disabled on the pool (set from the Proxmox host with zfs set sync=disabled <poolname>).
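For completeness, the current setting can be checked, and reverted once testing is done, like this (pool name is a placeholder):
zfs get sync testpool
zfs set sync=standard testpool   # back to the default behaviour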
fio --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~1X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=1M --numjobs=1 --iodepth=1 --size=1G --buffered=0 --name seq_read --filename=/dev/sdX (Write Amp ~1X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~1.55X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=8 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~1.44X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=64 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~1.44X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=8k --numjobs=1 --iodepth=64 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~1X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=512 --numjobs=1 --iodepth=64 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~5.44X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=16k --numjobs=1 --iodepth=64 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~1.31X)
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=32k --numjobs=1 --iodepth=64 --size=1G --buffered=0 --name XXX --filename=/dev/sdX (Write Amp ~1X)
The tests were repeated a third time with sync=standard on the pool and fio's default buffered I/O (--buffered=1, with --direct and --sync omitted, so sync=0 is implied).
fio --ioengine=libaio --rw=write --bs=4k --numjobs=1 --iodepth=1 --size=1G --buffered=1 --name XXX --filename=/dev/sdX (Write Amp ~1X)
fio --ioengine=libaio --rw=write --bs=1M --numjobs=1 --iodepth=1 --size=1G --buffered=1 --name seq_read --filename=/dev/sdX (Write Amp ~1X)
fio --ioengine=libaio --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --size=1G --buffered=1 --name XXX --filename=/dev/sdX (Write Amp ~1X)
fio --ioengine=libaio --rw=randwrite --bs=4k --numjobs=1 --iodepth=8 --size=1G --buffered=1 --name XXX --filename=/dev/sdX (Write Amp ~1X)
fio --ioengine=libaio --rw=randwrite --bs=4k --numjobs=1 --iodepth=64 --size=1G --buffered=1 --name XXX --filename=/dev/sdX (Write Amp ~1X)
fio --ioengine=libaio --rw=randwrite --bs=8k --numjobs=1 --iodepth=64 --size=1G --buffered=1 --name XXX --filename=/dev/sdX (Write Amp ~1X)
fio --ioengine=libaio --rw=randwrite --bs=512 --numjobs=1 --iodepth=64 --size=1G --buffered=1 --name XXX --filename=/dev/sdX (Write Amp ~4X)
fio --ioengine=libaio --rw=randwrite --bs=16k --numjobs=1 --iodepth=64 --size=1G --buffered=1 --name XXX --filename=/dev/sdX (Write Amp ~1X)
fio --ioengine=libaio --rw=randwrite --bs=32k --numjobs=1 --iodepth=64 --size=1G --buffered=1 --name XXX --filename=/dev/sdX (Write Amp ~1X)
I will be continuing these tests with two additional goals: ZFS RAIDZ1 and RAIDZ2 measurements, and real-world-usage write amplification values.
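For the planned RAIDZ runs, the pools would be created along these lines across the five test SSDs (again just a sketch; disk1 through disk5 stand in for the real /dev/disk/by-id paths):
zpool create -o ashift=13 testpool raidz1 disk1 disk2 disk3 disk4 disk5
zpool create -o ashift=13 testpool raidz2 disk1 disk2 disk3 disk4 disk5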
From what I have read, the best (lowest) write amplification we can expect from a ZFS volume when performing a sync write would be ~2X. Does anyone have any insight into whether I am understanding correctly what I have read about ZFS write amplification for sync writes?