Improve write amplification?

Ok, so I will do a 16K sync write test comparing 128bit AES vs 256bit AES. If it is padding to round up data to the key size, then 256bit AES should double the write amplification compared to 128bit.
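For reference, such a comparison could be set up with something along these lines (pool/zvol names and the fio parameters are just placeholders, not the final test setup):

Code:
# Two otherwise identical zvols, one with AES-128, one with AES-256:
zfs create -V 10G -o volblocksize=16k -o encryption=aes-128-gcm -o keyformat=passphrase rpool/test128
zfs create -V 10G -o volblocksize=16k -o encryption=aes-256-gcm -o keyformat=passphrase rpool/test256
# 16K sync random writes against one of them:
fio --name=aes-test --filename=/dev/zvol/rpool/test128 --rw=randwrite --bs=16k --sync=1 --direct=1 --ioengine=libaio --runtime=60 --time_based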
 
If you set logbias=throughput the ZIL won't store the complete sync write but only its metadata
Hi,
.... my fault, sorry. After reading some other materials, I found that the ZIL chain blocks will have less overhead if the volblocksize is bigger.

Thx. a lot!

Good luck / Bafta !
 
Ok, so I will do a 16K sync write test comparing 128bit AES vs 256bit AES. If it is padding to round up data to the key size, then 256bit AES should double the write amplification compared to 128bit.

Hi @Dunuin ,

Because you are in the testing phase, maybe you can take into consideration using ashift=13?
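For example, something like this on a scratch pool (pool and disk names are placeholders):

Code:
zpool create -o ashift=13 testpool mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B
zpool get ashift testpool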

Good luck / Bafta !
 
Both cases, if it is possible and you have time. I have seen cases where ashift=13 can be better on some particular SSD models.

Thx. a lot !
One interesting thing I saw in my spreadsheet:

                 Write amplification guest -> host    Write amplification host -> SSDs NAND
sync 4K write    9,48x                                1,25x
async 4K write   4,88x                                2,36x
That was for the 8-disk striped mirror pool with 16K volblocksize. Here it looks like doing async writes just moves the write amplification from the host to inside the SSD. So it halves the ZFS/virtio amplification, but in the end it's the same because the SSD writes roughly double that amount to the NAND.
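For anyone wanting to reproduce the host -> NAND numbers, the rough approach is to diff the SSD's SMART write counters around a test run; attribute names vary by model, so the greps below are just examples:

Code:
smartctl -A /dev/sda | grep -i -E 'Total_LBAs_Written|NAND'
# ... run the benchmark ...
smartctl -A /dev/sda | grep -i -E 'Total_LBAs_Written|NAND'
# W.A. host -> NAND = delta(NAND writes) / delta(host writes to the drive)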
 
Ok, so I will do a 16K sync write test comparing 128bit AES vs 256bit AES. If it is padding to round up data to the key size, then 256bit AES should double the write amplification compared to 128bit.
I tested it here:
16K sync writes/reads that are 50% compressible, read/written to an xfs partition on a zvol (volblocksize=8K) on a 4-disk striped mirror (ashift=12):
                       aes-256-gcm + lz4   aes-128-gcm + lz4   aes-256-gcm + no compression   no encryption + lz4
Write Performance:     8 MiB/s             8,09 MiB/s          7,78 MiB/s                     10,1 MiB/s
Read Performance:      29,2 MiB/s          31,9 MiB/s          29,8 MiB/s                     38,8 MiB/s
W.A. fio -> guest:     1,48 x              1,48 x              1,48 x                         1,48 x
W.A. guest -> host:    7,23 x              7,21 x              8,1 x                          3,67 x
W.A. host -> NAND:     1,13 x              1,15 x              1,15 x                         1,12 x
W.A. total:            12,09 x             12,25 x             13,78 x                        6,09 x
R.A. total:            0,5 x               0,5 x               1,0 x                          0,5 x

Still the same performance and doubled write amplification for both 128-bit and 256-bit AES, so it looks like padding isn't the problem. Can someone explain this? Maybe someone on the staff has a deeper understanding of ZFS and knows how encryption works on the block level?
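For reference, a 16K sync write test with 50% compressible data can be run with something like the following; the path, size, and runtime are placeholders, not necessarily the exact job used:

Code:
fio --name=16k-sync --filename=/mnt/test/fio.dat --size=4G --rw=randwrite --bs=16k --sync=1 --direct=1 --ioengine=libaio --refill_buffers --buffer_compress_percentage=50 --runtime=300 --time_based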
 
You said a big volblocksize gives you write amplification; didn't you also increase the block/cluster size in the guest to match?

I would also consider doing tests with a raw file on top of datasets. You get no Proxmox snapshots in the UI, but for now you are just testing performance and amplification.

Personally I am done with 4K blocks on virtualization for any new guest OS I install; I think it's just too inefficient.

Here is my understanding of how writes work with recordsize; volblocksize, however, is different to this.

The minimum amount of data written is the ashift size, typically 4K.
If it's a new record, then there is no write amplification when writing less than the recordsize, only rounding up to the nearest 4K. The write might even be smaller than the actual file size due to compression; compressed data can only be written in multiples of the ashift size.
If adding data to an existing record, the existing record has to be read first, either from ARC if it's there or from storage, and then the record is rewritten with the new "and" old data. So a read is required to do the write, plus extra data has to be written, a.k.a. amplification. Again, though, compression may mitigate it.
Synchronous writes, as you said, go to the ZIL first, so two copies are written. The behaviour changes a little with logbias=throughput even with no SLOG; adding a SLOG moves one of the writes away to the SLOG device. In practice, though, I find usually only small data is forced sync, with the exception of databases, and even then I don't have databases that constantly write.
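To illustrate the logbias and SLOG part, something along these lines (pool/dataset/device names are examples):

Code:
zfs set logbias=throughput rpool/data
zpool add rpool log /dev/nvme0n1p4    # optional separate SLOG device
zfs get logbias,sync rpool/data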

Bear in mind SSDs do their own housekeeping, and this alone causes write amplification: you write data, then the SSD moves it around for wear levelling, and this housekeeping adds to the SMART write stats.

I've been messing with a Windows guest today and had to deal with some unexpected, weird issues. After some sleep (as it pushed me back hours) I do plan to make a small virtual disk and test it with dataset vs. zvol, different record/block sizes, and the impact of the guest cluster size as well. My tests are primarily for performance, but I can keep an eye on the SMART data as I test.

Also a note on sync=disabled on ZFS: it is safer than you may think. Write ordering is preserved when it's important. ZFS has another hidden setting which is what disables all flush functions; sync=disabled doesn't do that, but it will address your concern about writing second copies to the ZIL.
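For reference, the per-dataset setting looks like this; the host-wide flush switch is presumably the zfs_nocacheflush module parameter (dataset name is an example):

Code:
zfs set sync=disabled rpool/data
# Module-wide switch that disables all cache flushes - not touched by sync=disabled:
cat /sys/module/zfs/parameters/zfs_nocacheflush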
 
You said a big volblocksize gives you write amplification; didn't you also increase the block/cluster size in the guest to match?
I used ext4 with "stripe-width" and xfs with "sw" to match the stripe width of the guest FS to the blocksize of the zvol. That showed no difference.
I would also consider doing tests with a raw file on top of datasets. You get no Proxmox snapshots in the UI, but for now you are just testing performance and amplification.
I will test that.
Personally I am done with 4K blocks on virtualization for any new guest OS I install; I think it's just too inefficient.
It's not that easy. At least with Linux it looks like I'm forced to use a 4K block size. I tried to increase the ext4 block size above 4K, but it told me that's not possible because the FS block size can't be greater than the page size of the RAM, and that is 4K. So I would need to switch to huge pages, and I'm not sure how to do that, or whether KVM or my physical hardware can do that at all.
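A quick way to check the page size limit mentioned above (output shown for a typical x86_64 system):

Code:
getconf PAGE_SIZE    # 4096 on a typical x86_64 system
# mkfs.ext4 warns that block sizes above the page size are not usable on most systems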
 
With ext4, use -C and also enable the bigalloc feature. Snip below from one such partition; this was originally a 4K block partition, and when I moved it to 64K clusters (alongside changing volblocksize to 64K) the performance improvement was astounding.

Code:
# tune2fs -l /dev/sdb1
tune2fs 1.44.5 (15-Dec-2018)
Filesystem volume name:   <none>
Last mounted on:          /home2
Filesystem UUID:          6fada182-ef45-416d-a5a0-7f85352561c2
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index sparse_super2 filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize bigalloc metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              128000
Block count:              131071728
Reserved block count:     0
Free blocks:              100263536
Free inodes:              125964
First block:              0
Block size:               4096
Cluster size:             65536
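For reference, a filesystem like the one above could be created with something along these lines (the device is a placeholder):

Code:
# 64K clusters on 4K blocks, matching the tune2fs output above:
mkfs.ext4 -b 4096 -C 65536 -O bigalloc /dev/sdb1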
 
With ext4, use -C and also enable the bigalloc feature. Snip below from one such partition; this was originally a 4K block partition, and when I moved it to 64K clusters (alongside changing volblocksize to 64K) the performance improvement was astounding.

Thanks, I will try that. Didn't see that cluster option in the ext4 manual.
 
Also look at the largefile stuff, sparse_super2 and flex_bg; these reduce inodes and keep them from being spread out, so sequential access is much more likely. I know these things mostly impact spindles, but with regard to write amplification I would expect fragmentation to increase it, so they still might be useful to you.
 
Also look at the largefile stuff, sparse_super2 and flex_bg; these reduce inodes and keep them from being spread out, so sequential access is much more likely. I know these things mostly impact spindles, but with regard to write amplification I would expect fragmentation to increase it, so they still might be useful to you.
Thanks. I created an ext4 with "mkfs.ext4 -b 4096 -O extent -O bigalloc -O has_journal -C 32k" on top of a 32K volblocksize zvol and ran two fio tests (32K random read/write, one sync and one async), and there was only a minimal write amplification change compared to a default ext4 on an 8K volblocksize zvol. I need to do some more tests, but it looks like clustering will only help with large files.

Did you run into any problems with this? mkfs warned me that bigalloc is still work in progress, and the ext4 wiki also mentions that this is experimental and that I should mount that ext4 with "nodelalloc" because there are known bugs that will corrupt your data when using delayed allocation.
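For reference, that mount option would look something like this (device and mount point are placeholders):

Code:
mount -t ext4 -o nodelalloc /dev/vdb1 /mnt/test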
 
It has been marked as experimental for a long time. I do personally use it on multiple machines where my workload is primarily large files (only on data partitions, not the whole OS), and I have never lost data or had filesystem instability when using it.

I expect they are stuck between a rock and a hard place: they need lots of people to use it before they can consider it non-experimental, but people won't want to use an experimental filesystem feature.

I don't use it on any commercial servers, only personal ones.

Sorry to hear it didn't have any meaningful effect on write amplification; it did help me a lot with performance, but I never analysed it for write amplification.
 
@Dunuin, do you have any tests on NVMe M.2/U.2 drives?

Most (if not all...) drives by default come with a 512-byte sector size LBA namespace for backwards compatibility, and you have to manually format them to 4096.

Since ZFS defaults to 4096 (ashift 12) I wonder how much this affects write amplification.

Here you can see it comes with Formatted LBA Size: 512
Code:
smartctl -a /dev/nvme0n1
...
Namespace 1 Formatted LBA Size: 512
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

To format it (this loses all data!):
Code:
apt install nvme-cli
nvme format --lbaf=1 /dev/nvme0n1

After that you get

Code:
smartctl -a /dev/nvme0n1
...
Namespace 1 Formatted LBA Size: 4096
 
