Improve write amplification?

Ok, so I will do a 16K sync write test comparing 128bit AES vs 256bit AES. If it is padding to round up data to the key size, then 256bit AES should double the write amplification compared to 128bit.
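For reference, such a comparison could be set up with something along these lines (pool/zvol names and the fio parameters are just placeholders, not the final test setup):

Code:
# Two otherwise identical zvols, one with AES-128, one with AES-256:
zfs create -V 10G -o volblocksize=16k -o encryption=aes-128-gcm -o keyformat=passphrase rpool/test128
zfs create -V 10G -o volblocksize=16k -o encryption=aes-256-gcm -o keyformat=passphrase rpool/test256
# 16K sync random writes against one of them:
fio --name=aes-test --filename=/dev/zvol/rpool/test128 --rw=randwrite --bs=16k --sync=1 --direct=1 --ioengine=libaio --runtime=60 --time_based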
 
If you set logbias=throughput the ZIL won't store the complete sync write but only its metadata
Hi,
.... my fault, sorry. After reading some other materials, I found that the ZIL chain blocks will have less overhead if the volblocksize is bigger.

Thx. a lot!

Good luck / Bafta !
 
Ok, so I will do a 16K sync write test comparing 128bit AES vs 256bit AES. If it is padding to round up data to the key size, then 256bit AES should double the write amplification compared to 128bit.

Hi @Dunuin ,

Because you are in the testing phase, maybe you can take into consideration using ashift=13?
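For example, something like this on a scratch pool (pool and disk names are placeholders):

Code:
zpool create -o ashift=13 testpool mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B
zpool get ashift testpool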

Good luck / Bafta !
 
Both cases, if it is possible and you have time. I have seen cases where ashift=13 can be better on some particular SSD models.

Thx. a lot !
One interesting thing I saw in my spreadsheet:

                 Write amplification guest -> host    Write amplification host -> SSDs NAND
sync 4K write    9,48x                                1,25x
async 4K write   4,88x                                2,36x
That was for the 8-disk striped mirror pool with 16K volblocksize. Here it looks like doing async writes just moves the write amplification from the host to inside the SSD. So it halves the ZFS/virtio amplification, but in the end it's the same because the SSD writes roughly double that amount to the NAND.
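For anyone wanting to reproduce the host -> NAND numbers, the rough approach is to diff the SSD's SMART write counters around a test run; attribute names vary by model, so the greps below are just examples:

Code:
smartctl -A /dev/sda | grep -i -E 'Total_LBAs_Written|NAND'
# ... run the benchmark ...
smartctl -A /dev/sda | grep -i -E 'Total_LBAs_Written|NAND'
# W.A. host -> NAND = delta(NAND writes) / delta(host writes to the drive)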
 
Ok, so I will do a 16K sync write test comparing 128bit AES vs 256bit AES. If it is padding to round up data to the key size, then 256bit AES should double the write amplification compared to 128bit.
I tested it here:
16K sync writes/reads that are 50% compressible, read/written to an xfs partition on a zvol (volblocksize=8K) on a 4-disk striped mirror (ashift=12):
                       aes-256-gcm + lz4   aes-128-gcm + lz4   aes-256-gcm + no compression   no encryption + lz4
Write Performance:     8 MiB/s             8,09 MiB/s          7,78 MiB/s                     10,1 MiB/s
Read Performance:      29,2 MiB/s          31,9 MiB/s          29,8 MiB/s                     38,8 MiB/s
W.A. fio -> guest:     1,48 x              1,48 x              1,48 x                         1,48 x
W.A. guest -> host:    7,23 x              7,21 x              8,1 x                          3,67 x
W.A. host -> NAND:     1,13 x              1,15 x              1,15 x                         1,12 x
W.A. total:            12,09 x             12,25 x             13,78 x                        6,09 x
R.A. total:            0,5 x               0,5 x               1,0 x                          0,5 x

Still the same performance and doubled write amplification for both 128-bit and 256-bit AES, so it looks like padding isn't the problem. Can someone explain this? Maybe someone on the staff has a deeper understanding of ZFS and knows how encryption works on the block level?
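For reference, a 16K sync write test with 50% compressible data can be run with something like the following; the path, size, and runtime are placeholders, not necessarily the exact job used:

Code:
fio --name=16k-sync --filename=/mnt/test/fio.dat --size=4G --rw=randwrite --bs=16k --sync=1 --direct=1 --ioengine=libaio --refill_buffers --buffer_compress_percentage=50 --runtime=300 --time_based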
 
You said a big volblocksize gives you write amplification; didn't you also increase the block/cluster size in the guest to match?

I would also consider doing tests with a raw file on top of datasets. You get no Proxmox snapshots in the UI, but for now you are just testing performance and amplification.

Personally I am done with 4K blocks on virtualization for any new guest OS I install; I think it's just too inefficient.

Here is my understanding of how writes work with recordsize; volblocksize, however, is different to this.

The minimum amount of data written is the ashift size, typically 4K.
If it's a new record, then there is no write amplification when writing less than the recordsize, only rounding up to the nearest 4K. The write might even be smaller than the actual file size due to compression; compressed data can only be written in multiples of the ashift size.
If adding data to an existing record, the existing record has to be read first, either from ARC if it's there or from storage, and then the record is rewritten with the new "and" old data. So a read is required to do the write, plus extra data has to be written, a.k.a. amplification. Again, though, compression may mitigate it.
Synchronous writes, as you said, go to the ZIL first, so two copies are written. The behaviour changes a little with logbias=throughput even with no SLOG; adding a SLOG moves one of the writes away to the SLOG device. In practice, though, I find usually only small data is forced sync, with the exception of databases, and even then I don't have databases that constantly write.
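To illustrate the logbias and SLOG part, something along these lines (pool/dataset/device names are examples):

Code:
zfs set logbias=throughput rpool/data
zpool add rpool log /dev/nvme0n1p4    # optional separate SLOG device
zfs get logbias,sync rpool/data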

Bear in mind SSDs do their own housekeeping, and this alone causes write amplification: you write data, then the SSD moves it around for wear levelling, and this housekeeping adds to the SMART write stats.

I've been messing with a Windows guest today and had to deal with some unexpected, weird issues. After some sleep (as it pushed me back hours) I do plan to make a small virtual disk and test it with dataset vs. zvol, different record/block sizes, and the impact of the guest cluster size as well. My tests are primarily for performance, but I can keep an eye on the SMART data as I test.

Also a note on sync=disabled on ZFS: it is safer than you may think. Write ordering is preserved when it's important. ZFS has another hidden setting which is what disables all flush functions; sync=disabled doesn't do that, but it will address your concern about writing second copies to the ZIL.
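For reference, the per-dataset setting looks like this; the host-wide flush switch is presumably the zfs_nocacheflush module parameter (dataset name is an example):

Code:
zfs set sync=disabled rpool/data
# Module-wide switch that disables all cache flushes - not touched by sync=disabled:
cat /sys/module/zfs/parameters/zfs_nocacheflush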
 
You said a big volblocksize gives you write amplification; didn't you also increase the block/cluster size in the guest to match?
I used ext4 with "stripe-width" and xfs with "sw" to match the stripe width of the guest FS to the blocksize of the zvol. That showed no difference.
I would also consider doing tests with a raw file on top of datasets. You get no Proxmox snapshots in the UI, but for now you are just testing performance and amplification.
I will test that.
Personally I am done with 4K blocks on virtualization for any new guest OS I install; I think it's just too inefficient.
It's not that easy. At least with Linux it looks like I'm forced to use a 4K block size. I tried to increase the ext4 block size above 4K, but it told me that's not possible because the FS block size can't be greater than the page size of the RAM, and that is 4K. So I would need to switch to huge pages, and I'm not sure how to do that, or whether KVM or my physical hardware can do that at all.
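A quick way to check the page size limit mentioned above (output shown for a typical x86_64 system):

Code:
getconf PAGE_SIZE    # 4096 on a typical x86_64 system
# mkfs.ext4 warns that block sizes above the page size are not usable on most systems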
 
With ext4, use -C and also enable the bigalloc feature. Snip below from one such partition; this was originally a 4K block partition, and when I moved it to 64K clusters (alongside changing volblocksize to 64K) the performance improvement was astounding.

Code:
# tune2fs -l /dev/sdb1
tune2fs 1.44.5 (15-Dec-2018)
Filesystem volume name:   <none>
Last mounted on:          /home2
Filesystem UUID:          6fada182-ef45-416d-a5a0-7f85352561c2
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index sparse_super2 filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize bigalloc metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              128000
Block count:              131071728
Reserved block count:     0
Free blocks:              100263536
Free inodes:              125964
First block:              0
Block size:               4096
Cluster size:             65536
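For reference, a filesystem like the one above could be created with something along these lines (the device is a placeholder):

Code:
# 64K clusters on 4K blocks, matching the tune2fs output above:
mkfs.ext4 -b 4096 -C 65536 -O bigalloc /dev/sdb1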
 
With ext4, use -C and also enable the bigalloc feature. Snip below from one such partition; this was originally a 4K block partition, and when I moved it to 64K clusters (alongside changing volblocksize to 64K) the performance improvement was astounding.

Thanks, I will try that. Didn't see that cluster option in the ext4 manual.
 
Also look at the largefile stuff, sparse_super2 and flex_bg; these reduce inodes and keep them from being spread out, so sequential access is much more likely. I know these things mostly impact spindles, but with regard to write amplification I would expect fragmentation to increase it, so they still might be useful to you.
 
Also look at the largefile stuff, sparse_super2 and flex_bg; these reduce inodes and keep them from being spread out, so sequential access is much more likely. I know these things mostly impact spindles, but with regard to write amplification I would expect fragmentation to increase it, so they still might be useful to you.
Thanks. I created an ext4 with "mkfs.ext4 -b 4096 -O extent -O bigalloc -O has_journal -C 32k" on top of a 32K volblocksize zvol and ran two fio tests (32K random read/write, one sync and one async), and there was only a minimal write amplification change compared to a default ext4 on an 8K volblocksize zvol. I need to do some more tests, but it looks like clustering will only help with large files.

Did you run into any problems with this? mkfs warned me that bigalloc is still work in progress, and the ext4 wiki also mentions that this is experimental and that I should mount that ext4 with "nodelalloc" because there are known bugs that will corrupt your data when using delayed allocation.
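For reference, that mount option would look something like this (device and mount point are placeholders):

Code:
mount -t ext4 -o nodelalloc /dev/vdb1 /mnt/test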
 
It has been marked as experimental for a long time. I do personally use it on multiple machines where my workload is primarily large files (only on data partitions, not the whole OS), and I have never lost data or had filesystem instability when using it.

I expect they are stuck between a rock and a hard place: they need lots of people to use it before they can consider it non-experimental, but people won't want to use an experimental filesystem feature.

I don't use it on any commercial servers, only personal ones.

Sorry to hear it didn't have any meaningful effect on write amplification; it did help me a lot with performance, but I never analysed it for write amplification.
 
@Dunuin, do you have any tests on NVMe M.2/U.2 drives?

Most (if not all...) drives by default come with a 512-byte sector size LBA namespace for backwards compatibility, and you have to manually format them to 4096.

Since ZFS defaults to 4096 (ashift 12) I wonder how much this affects write amplification.

Here you can see it comes with Formatted LBA Size: 512
Code:
smartctl -a /dev/nvme0n1
...
Namespace 1 Formatted LBA Size: 512
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

To format it (this loses all data!):
Code:
apt install nvme-cli
nvme format --lbaf=1 /dev/nvme0n1

After that you get

Code:
smartctl -a /dev/nvme0n1
...
Namespace 1 Formatted LBA Size: 4096
 
