Improve write amplification?

Dunuin · Sep 6, 2020

Hi,

I've got a total write amplification of around 18. Can someone give me a hint on how to improve this?

If I run iostat on the host and sum up the writes of all zvols, all the VMs combined are writing around 1 MB/s of data.
The zvols are stored on a raidz1 consisting of 5x Intel S3710 SSDs (sdc to sdg), and all the SSDs combined are writing around 10 MB/s of data.
I use smartctl to monitor the host writes and NAND writes of each drive, and for every 1 GB of data written to an SSD, the SSD writes around 1.8 GB to the NAND.
So in the end the 1 MB/s of real data from the guests multiplies up to 18 MB/s written to the flash.
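
The calculation described above can be sketched as follows. This is only a rough illustration: `sum_write_rate` is a hypothetical helper, the sample rows are trimmed from the iostat report in this post (idle devices left out), and the 1.8 NAND factor is the smartctl figure quoted above.

```python
# Sketch: derive the amplification factors from an iostat report.
# SAMPLE is trimmed from the report in this post (idle devices omitted).
SAMPLE = """\
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdg             129.56         0.81      1918.89        484    1151332
sdf             128.64         0.81      1918.76        488    1151256
sdd             129.78         0.83      1917.61        500    1150564
sde             129.89         0.81      1917.23        488    1150340
sdc             130.13         0.87      1916.58        520    1149948
zd0               0.69         0.00         8.03          0       4820
zd16              0.58         0.00         6.45          0       3868
zd32             13.13         0.89       278.59        536     167156
zd48              0.62         0.00         6.90          0       4140
zd64              0.58         0.00         6.53          0       3920
zd112             0.10         0.01         0.53          8        320
zd400            51.87         0.16       717.30         96     430380
zd416             0.58         0.00         6.32          0       3792
zd432             0.58         0.00         6.39          0       3832
zd448             0.67         0.00         8.11          0       4868
zd464             0.60         0.00         6.36          0       3816
"""

POOL_DISKS = {"sdc", "sdd", "sde", "sdf", "sdg"}  # raidz1 members
NAND_FACTOR = 1.8  # NAND writes per host write, as read from smartctl


def sum_write_rate(report, wanted):
    """Sum the kB_wrtn/s column for every device accepted by wanted()."""
    total = 0.0
    for line in report.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # not a device row
        try:
            rate = float(fields[3])  # kB_wrtn/s column
        except ValueError:
            continue  # header row
        if wanted(fields[0]):
            total += rate
    return total


guest_kbs = sum_write_rate(SAMPLE, lambda dev: dev.startswith("zd"))
pool_kbs = sum_write_rate(SAMPLE, lambda dev: dev in POOL_DISKS)
guest_to_pool = pool_kbs / guest_kbs
guest_to_nand = guest_to_pool * NAND_FACTOR

print(f"zvol writes: {guest_kbs:.2f} kB/s, pool writes: {pool_kbs:.2f} kB/s")
print(f"guest -> pool: {guest_to_pool:.2f}x, guest -> NAND: {guest_to_nand:.2f}x")
```

With these rows the sketch gives roughly 9.1x from the zvols to the pool and roughly 16x down to the NAND, in the same range as the rounded figures in the post.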

Code:

root@Hypervisor:/var/log# iostat 600 2
Linux 5.4.60-1-pve (Hypervisor)         09/06/2020      _x86_64_        (16 CPU)

...

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.66    0.00    5.90    0.02    0.00   89.42

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1           0.00         0.00         0.00          0          0
nvme0n1           0.00         0.00         0.00          0          0
sdg             129.56         0.81      1918.89        484    1151332
sdb               4.94         0.00        25.70          0      15417
sdh               0.00         0.00         0.00          0          0
sdf             128.64         0.81      1918.76        488    1151256
sda               4.95         0.05        25.70         32      15417
sdd             129.78         0.83      1917.61        500    1150564
sde             129.89         0.81      1917.23        488    1150340
sdc             130.13         0.87      1916.58        520    1149948
md0               0.00         0.00         0.00          0          0
md1               4.06         0.05        25.13         32      15080
dm-0              4.06         0.05        25.13         32      15080
dm-1              4.06         0.05        29.87         32      17920
dm-2              0.00         0.00         0.00          0          0
zd0               0.69         0.00         8.03          0       4820
zd16              0.58         0.00         6.45          0       3868
zd32             13.13         0.89       278.59        536     167156
zd48              0.62         0.00         6.90          0       4140
zd64              0.58         0.00         6.53          0       3920
zd80              0.00         0.00         0.00          0          0
zd96              0.00         0.00         0.00          0          0
zd112             0.10         0.01         0.53          8        320
zd128             0.00         0.00         0.00          0          0
zd144             0.00         0.00         0.00          0          0
zd160             0.00         0.00         0.00          0          0
zd176             0.00         0.00         0.00          0          0
zd192             0.00         0.00         0.00          0          0
zd208             0.00         0.00         0.00          0          0
zd224             0.00         0.00         0.00          0          0
zd240             0.00         0.00         0.00          0          0
zd256             0.00         0.00         0.00          0          0
zd272             0.00         0.00         0.00          0          0
zd288             0.00         0.00         0.00          0          0
zd304             0.00         0.00         0.00          0          0
zd320             0.00         0.00         0.00          0          0
zd336             0.00         0.00         0.00          0          0
zd352             0.00         0.00         0.00          0          0
zd368             0.00         0.09         0.00         56          0
zd384             0.00         0.00         0.00          0          0
zd400            51.87         0.16       717.30         96     430380
zd416             0.58         0.00         6.32          0       3792
zd432             0.58         0.00         6.39          0       3832
zd448             0.67         0.00         8.11          0       4868
zd464             0.60         0.00         6.36          0       3816

Host Config:
Pool is a raidz1 of 5 SSDs without LOG or cache device. Atime is deactivated for the pool, compression is set to LZ4, no deduplication, ashift of 12, sync is standard. On the pool is an encrypted dataset, which contains all of the zvols. All partitions are aligned to 1 MB. The SSDs report a logical sector size of 4k.

VM config:
Storage controller is VirtIO SCSI single. All virtual disks have discard, IO thread and SSD emulation enabled. Cache mode is set to "no cache". Format is raw and the block size should be the 8k default.

Guest config:
Virtio guest is installed. Virtual drives are formatted as ext4 and mounted with the "noatime" and "nodiratime" options. fstrim runs once a week via cron. /tmp is mounted via ramfs.

Is there anything I did wrong?

18 MB/s is 568 TB per year, which is really a lot of data because the VMs are just plain Debian installs idling without heavy use. Only 3 VMs have real applications running (Zabbix, Graylog, Emby). I chose 5x S3710 SSDs because they have a combined TBW of 18,000 TB and should last some time, but saving some writes would be nice nonetheless.
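
For reference, the yearly figure can be reproduced with a few lines. This is a back-of-the-envelope sketch with hypothetical helper names; it assumes the write rate stays constant and that the rated TBW can be compared directly with the 18 MB/s NAND figure, which is a simplification.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def tb_per_year(mb_per_s):
    """Convert a sustained write rate in MB/s to TB written per year."""
    return mb_per_s * SECONDS_PER_YEAR / 1_000_000


def years_until_rated_tbw(rated_tbw, mb_per_s):
    """Years until the rated endurance (in TB) is used up at this rate."""
    return rated_tbw / tb_per_year(mb_per_s)


yearly_tb = tb_per_year(18)
lifetime_years = years_until_rated_tbw(18_000, 18)

print(f"{yearly_tb:.0f} TB per year, rated endurance lasts about {lifetime_years:.0f} years")
```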

I only found one optimization that made a big difference: changing the cache mode from "no cache" to "unsafe", so the VMs can't do sync writes, which cuts the write amplification in half. But that isn't really a good thing if something crashes and some DBs get corrupted or something like that.

Dunuin · Sep 7, 2020

If I run "fdisk -l" in the guests, all the QEMU hard disks show a logical sector size of 512B, but the zvols are 8K and the real disks of the pool are ashift 12. Could that cause the write amplification? I wasn't able to find a way to change how KVM handles the sector size of the virtual drives. Or is it normal that 512B sectors are shown but 8K is really used internally?

Dunuin · Sep 8, 2020

I've read somewhere that raidz1 isn't good for write amplification and that a ZFS raid10 might be better, but there were no numbers. Has someone tested both and can compare them? I only find ZFS performance benchmarks that don't mention write amplification. I initially thought that raidz1 with 5 drives should result in a much better write amplification because only 25% more data is stored on the drives and not 100% more.

Right now I'm using 5x 200GB SSDs as raidz1 but I'm only using 100GB of space. If raid10 is better for write amplification I could use raid10 with 1 spare until I need more than 400GB.
I don't know what the internal SSD write amplification was with just two mirrored SSDs, but the translation from guest filesystem to host filesystem had a write amplification of about 7. With raidz1 the same amount of data from the guests causes a write amplification from guest to host of 10.

Any idea what would be the best disk setup to optimize write amplification?

Dunuin · Sep 10, 2020

The raidz1 pool with the 5 SSDs has a default blocksize of 8k. Could that cause more write amplification? Is it better to use a higher number like 16k or 32k if the SSDs are using 4k LBA?

Dunuin · Sep 12, 2020

I tested 4 SSDs as a striped mirror ZFS pool (like raid10) and the write amplification was even worse...

raidz1 of 5 SSDs: 808 kB/s writes from VMs -> 6913 kB/s writes to pool = 8.56x write amplification (without additional write amplification from SSD)

ZFS striped mirror of 4 SSDs (raid10) with standard sync writes: 883 kB/s writes from VMs -> 10956 kB/s writes to pool = 12.41x write amplification (without additional write amplification from SSD)

ZFS striped mirror of 4 SSDs (raid10) with most sync writes deactivated: 707 kB/s writes from VMs -> 4990 kB/s writes to pool = 7.06x write amplification (without additional write amplification from SSD)
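
The three factors above are simply pool writes divided by guest writes. A minimal sketch for comparing such measurements side by side (the dictionary labels are only illustrative, the numbers are the ones from this post):

```python
# (writes from VMs in kB/s, writes to pool in kB/s) per tested layout
measurements = {
    "raidz1, 5 SSDs": (808, 6913),
    "striped mirror, 4 SSDs, standard sync writes": (883, 10956),
    "striped mirror, 4 SSDs, most sync writes disabled": (707, 4990),
}

# Guest-to-pool write amplification, excluding the SSD-internal factor.
factors = {
    layout: round(pool_kbs / guest_kbs, 2)
    for layout, (guest_kbs, pool_kbs) in measurements.items()
}

# Print the layouts from lowest to highest amplification.
for layout, factor in sorted(factors.items(), key=lambda item: item[1]):
    print(f"{factor:6.2f}x  {layout}")
```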

raidz1 pool (sdc/sdd/sde/sdf/sdg):

Code:

root@Hypervisor:~# iostat 3600 2
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1           0.00         0.28         0.00       1024          0
nvme1n1           0.00         0.29         0.00       1032          0
sdf             119.93         2.72      1382.60       9780    4977356
sdc             121.43         2.70      1383.36       9716    4980080
sde             121.05         2.85      1381.88      10276    4974760
sdg             120.96         2.80      1382.17      10088    4975820
sdd             121.26         2.77      1383.42       9976    4980308
zd0               0.81         1.28         8.18       4624      29440
zd16              0.51         0.00         5.78          0      20812
zd32              0.00         0.00         0.00          0          0
zd48              0.00         0.00         0.00          0          0
zd64              0.00         0.00         0.00          0          0
zd80              0.56         0.00         6.42          0      23108
zd96              0.54         0.00         6.22          0      22408
zd112             0.56         0.00         6.45          0      23224
zd128             0.04         0.27         0.00        976          0
zd144             2.41         0.00        23.25          0      83708
zd160             0.00         0.00         0.00          0          0
zd176            26.24         0.00       326.52          0    1175480
zd192             0.00         0.00         0.00          0          0
zd208             0.00         0.00         0.00          0          0
zd224             0.00         0.00         0.00          0          0
zd240            13.21         2.57       328.63       9236    1183052
zd256             0.00         0.00         0.00         12          0
zd272             0.00         0.00         0.00          0          0
zd288             0.00         0.00         0.00          0          0
zd304             0.00         0.00         0.00          0          0
zd320             2.64         0.00        53.82          8     193752
zd336             0.62         0.00         6.88          4      24784
zd352             0.00         0.00         0.00          0          0
zd368             0.00         0.00         0.00          0          0
zd384             0.00         0.00         0.00          0          0
zd400             0.59         0.00         6.38          0      22952
zd416             1.04         0.00        12.84          0      46224
zd432             1.59         0.01        16.37         20      58948
zd448             0.00         0.00         0.00          0          0

zfs "Raid10" (sdc/sdd/sde/sdg):

Code:

root@Hypervisor:~# iostat 3600 2
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1           0.00         0.00         0.00          0          0
nvme1n1           0.00         0.00         0.00          4          0
sdf               0.00         0.00         0.00          0          0
sdc              77.40        13.24      2786.42      47680   10031128
sde              71.34        13.17      2691.64      47400    9689908
sdg              71.49        13.01      2691.64      46836    9689908
sdd              77.68        13.66      2786.42      49160   10031128
zd160             0.50         0.19         5.33        668      19180
zd192             0.00         0.00         0.00          0          0
zd16              0.59         0.00         6.31          0      22720
zd48              0.00         0.00         0.00          0          0
zd96              1.68         0.01        17.12         28      61640
zd208             0.00         0.00         0.00          0          0
zd224             0.57         0.00         6.02         12      21668
zd288             0.00         0.00         0.00          0          0
zd80              0.57         0.00         6.13          0      22056
zd304             0.00         0.00         0.00          0          0
zd352             2.39         0.02        23.32         72      83936
zd384             0.00         0.00         0.00          0          0
zd144             3.10         0.88        63.93       3176     230160
zd272             0.00         0.00         0.00          0          0
zd320             1.00         0.11        12.02        396      43264
zd32              0.00         0.00         0.00          0          0
zd400             0.58         0.04         6.29        128      22628
zd128             0.00         0.00         0.00          0          0
zd112             0.76         0.00         8.32          8      29956
zd368             0.00         0.00         0.00          0          0
zd0               0.61         0.28         6.44       1012      23180
zd64              0.00         0.00         0.00          0          0
zd336            26.63        10.51       329.33      37848    1185604
zd256             0.00         0.00         0.00          0          0
zd176            13.70         0.77       392.81       2756    1414112
zd416             0.00         0.00         0.00          0          0

zfs "Raid10" with "cache mode = unsafe" (ignoring syncwrites) for the 2 VMs doing most writes:

Code:

root@Hypervisor:~# iostat 1200 2
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdf               0.00         0.00         0.00          0          0
sdc              60.49        66.98      1271.80      80380    1526156
sde              55.57        68.84      1223.11      82612    1467732
sdg              55.53        68.84      1223.11      82608    1467732
sdd              60.40        66.17      1271.80      79404    1526156
zd160             0.51         0.00         5.38          0       6460
zd192             0.00         0.00         0.00          0          0
zd16              0.98        31.05         6.85      37256       8224
zd48              0.00         0.00         0.00          0          0
zd96              1.72         0.02        18.17         20      21804
zd208             0.00         0.00         0.00          0          0
zd224             0.56         0.00         5.93          0       7120
zd288             0.00         0.00         0.00          0          0
zd80              0.57         0.00         6.00          0       7204
zd304             0.00         0.00         0.00          0          0
zd352             2.28         0.00        22.20          0      26636
zd384             0.00         0.00         0.00          0          0
zd144             3.00         0.00        53.93          0      64716
zd272             0.00         0.00         0.00          0          0
zd320             1.01         0.00        11.91          0      14292
zd32              0.00         0.00         0.00          0          0
zd400             0.57         0.00         6.26          0       7512
zd128             0.00         0.00         0.00          0          0
zd112             0.74         0.00         8.14          0       9764
zd368             0.00         0.00         0.00          0          0
zd0               0.62         0.00         6.83          0       8192
zd64              0.00         0.00         0.00          0          0
zd336            60.93         5.44       243.25       6524     291900
zd256             0.00         0.00         0.00          0          0
zd176            79.80       124.27       312.94     149124     375524
zd416             0.00         0.00         0.00          0          0

Any ideas?

Dunuin · Sep 20, 2020

I have now changed the volblocksize of all zvols from 8k to 32k and set the cache mode to "unsafe" for the 2 most write-heavy VMs, and now my write amplification is down to about 3.8x.

789.7 kB/s writes to zvols -> 3002 kB/s writes to raidz1 = 3.8x write amplification from guest to host:

Code:

# iostat 3600 2
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1           0.00         0.00         0.00          0          0
nvme0n1           0.00         0.00         0.00          0          0
sda               6.51         0.18        32.29        640     116248
sdb               6.48         0.00        32.29          0     116248
md0               0.00         0.00         0.00          0          0
md1               5.24         0.18        31.55        640     113588
sdc              49.20        29.21       600.65     105156    2162344
sdf              47.85        25.64       600.68      92304    2162460
sdd              48.53        24.73       600.14      89012    2160520
sdg              48.61        25.70       600.73      92524    2162624
sde              48.76        26.40       600.29      95024    2161044
sdh               0.00         0.00         0.00          0          0
dm-0              5.24         0.18        31.55        640     113588
dm-1              5.21         0.06        42.35        220     152468
dm-2              0.03         0.12         0.00        420          0
zd1264           81.56        13.88       323.51      49984    1164636
zd1280           77.03         1.97       307.44       7088    1106772
zd80              0.62         0.09         6.44        332      23192
zd112             2.36         0.00        23.01          8      82848
zd64              0.77         0.01         8.31         20      29900
zd128             0.55         0.00         6.12          0      22028
zd32              0.97         0.20        11.68        728      42040
zd160             3.72        97.77        51.84     351968     186620
zd48              0.57         0.02         6.40         76      23036
zd176             0.58         0.19         6.53        692      23492
zd192             0.60         0.24         6.61        868      23796
zd208             0.54         0.37         5.67       1348      20424
zd96              1.65         2.60        15.82       9364      56952
zd224             0.95         0.51         9.61       1824      34592
zd0               0.02         0.05         0.71        188       2552
zd16              0.00         0.00         0.00          0          0
zd144             0.00         0.00         0.00          0          0
zd240             0.00         0.00         0.00          0          0
zd256             0.00         0.00         0.00          0          0
zd272             0.00         0.00         0.00          0          0
zd288             0.00         0.00         0.00          0          0
zd304             0.00         0.00         0.00          0          0
zd320             0.00         0.00         0.00          0          0
zd336             0.00         0.00         0.00          0          0
zd352             0.00         0.00         0.00          0          0
zd368             0.00         0.00         0.00          0          0
zd384             0.00         0.00         0.00          0          0
zd400             0.00         0.00         0.00          0          0

Is it important for write amplification that the guest filesystem's block size/cluster size matches the 32k volblocksize of the zvol?

apoc · Sep 20, 2020

Interesting to read your observation/research results. I did a quick Google search and have come across this:
https://www.ixsystems.com/community/threads/small-file-writes-causing-high-disk-writes.54440/
Maybe it helps (I found the "near zero writes" to zil interesting...)
All the best

Dunuin · Sep 20, 2020

Dunuin said:
I initially thought that raidz1 with 5 drives should result in a much better write amplification because only 25% more data is stored on the drives and not 100% more.

Right now I'm using 5x 200GB SSDs as raidz1 but I'm only using 100GB of space. If raid10 is better for write amplification I could use raid10 with 1 spare until I need more than 400GB.
I don't know what the internal SSD write amplification was with just two mirrored SSDs, but the translation from guest filesystem to host filesystem had a write amplification of about 7. With raidz1 the same amount of data from the guests causes a write amplification from guest to host of 10.

The raidz1 pool with the 5 SSDs has a default blocksize of 8k. Could that cause more write amplification? Is it better to use a higher number like 16k or 32k if the SSDs are using 4k LBA?

I found this chart that showed that the default 8k volblocksize was indeed a problem. For raidz1 with an ashift of 12 (4K LBA) you need at least:
-for 3 disks a volblocksize of 4x LBA = 16K
-for 4 disks a volblocksize of 3x or 16x LBA = 12K (be aware, not 2^n) or 64K
-for 5 disks a volblocksize of 8x LBA = 32K
-for 6 disks a volblocksize of 5x or 8x LBA = 20K (be aware, not 2^n) or 32K
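
The values in this list can be cross-checked with a small calculation. This is a sketch of how raidz sizes an allocation as far as I understand it (data sectors, plus parity per stripe row, padded to a multiple of parity + 1 sectors). It ignores compression and metadata, and the function name is made up for illustration.

```python
def raidz_allocated_bytes(volblocksize, ndisks, nparity=1, ashift=12):
    """Estimate the raw space one block of volblocksize bytes occupies on raidz."""
    sector = 1 << ashift
    # Number of data sectors, rounded up.
    data_sectors = -(-volblocksize // sector)
    # Every stripe row of (ndisks - nparity) data sectors needs nparity parity sectors.
    rows = -(-data_sectors // (ndisks - nparity))
    total = data_sectors + nparity * rows
    # Allocations are padded to a multiple of (nparity + 1) sectors.
    pad_to = nparity + 1
    total = -(-total // pad_to) * pad_to
    return total * sector


for ndisks, vbs_k in [(3, 16), (4, 12), (4, 64), (5, 8), (5, 32), (6, 20), (6, 32)]:
    allocated = raidz_allocated_bytes(vbs_k * 1024, ndisks)
    factor = allocated / (vbs_k * 1024)
    print(f"{ndisks} disks, volblocksize {vbs_k}K: {allocated // 1024}K on disk ({factor:.2f}x)")
```

For 5 disks this gives 16K on disk for an 8K block (2.0x, the same as a mirror) but 40K for a 32K block (1.25x), which matches the 20% parity loss expected from a 5 disk raidz1.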

So, as a hint for all others using raidz1:
If you don't change the default volblocksize (Proxmox GUI -> Datacenter -> Storage -> YourPool -> Block Size), all virtual disks are automatically created with a volblocksize of 8K, which will in all cases (with ashift 12 or higher) use 100% more space, so you won't save any space compared to a mirror and just lose performance. It would be nice if someone could add that to the ZFS article on the wiki. It was really hard for me to realize that and you can't change the volblocksize after creation.

tburger said:
Interesting to read your observation/research results. I did a quick Google search and have come across this:
https://www.ixsystems.com/community/threads/small-file-writes-causing-high-disk-writes.54440/
Maybe it helps (I found the "near zero writes" to zil interesting...)
All the best

90% of the data written to my pool comes from 2 VMs doing almost nothing but sync writes (logs stored in an Elasticsearch DB and metrics stored in a MySQL DB), and sync writes double the write amplification. I fixed this by setting "cache mode=unsafe" for these two VMs. This way all sync writes are ignored and handled as async writes. That is really bad for data integrity, but I have a UPS and weekly backups of all VMs to 2 NASs. So if something goes wrong (kernel crash, hardware failure) I can just restore a backup and would only lose up to 7 days of data, which I don't mind, because losing a week of metrics and logs isn't a problem here. I just want to be sure that I always have a working backup so I don't need to set up and configure everything again.

I also tried to "extend the transaction length to 30 seconds" like in your linked post. That did indeed lower the write amplification but isn't really great for data safety, so I removed it again. Normally all async writes are collected and written to the pool every 5 seconds. Increasing that to 30 seconds reduces write amplification, because it is more efficient to store one big chunk of data than several smaller chunks, but it also makes the drive slower, and if something happens you lose 6 times the data. I would use that on consumer SSDs, but my new enterprise SSDs should be fine with small chunks.

And I found out that the recordsize of a parent dataset is simply ignored for zvols. A zvol always uses its own volblocksize, and the recordsize doesn't matter. Recordsize only applies to everything else stored on a dataset, i.e. files, not to zvols.

What I still haven't found out is how KVM handles the virtualization, and whether it would make sense to set the guest's block allocation size to the same value as the volblocksize of the zvol. I could imagine that write amplification increases if the guest OS writes a lot of 512B chunks while the zvol is really using a 32K volblocksize.

effgee · Feb 9, 2021

I have a bit of experience with write amplification on Proxmox using KVM/QEMU, at least with Windows VMs.
The default volblocksize of 8k is very inefficient, and 4k even more so.
I have a set of 6 x HDD in a RAIDZ1 config, and in my testing so far, a 4k volblocksize (to match NTFS 4k sectors for performance) gives me almost +100% write amplification compared to a 128k volblocksize.

In my testing, writing a single 21 gig file to each volume, the stats look similar to this:

volblocksize 4K
written 34.8G

volblocksize 128K
written 21.7G

Benchmarks for performance are almost identical between the two settings, at least in CrystalDiskMark. So in my case, I will be using the 128k volblocksize for my Windows VM images.

I read a post on reddit explaining more, but I can't find it right now. They had even suggested a 1 MB blocksize.

chrcoluk · Aug 11, 2021

Interesting data. Yes, it's for sure a good idea to match the guest block size to the zvol, with the possible exception of 4k: the problem with 4k, assuming you have 4k sectors (ashift 12), is that you then won't get any compression. Small block sizes in general are, I think, a problem with ZFS as the host filesystem. So I am considering not creating any 4k block filesystems on guests again. ext4 in guests can use larger allocation units with the -C cluster size flag (which needs the bigalloc feature), Windows supports larger block sizes, and so do UFS/ZFS in a guest, although I wouldn't put ZFS on top of ZFS.

I also considered the impact of using a dataset instead of a zvol. I was originally in favour of using datasets, but thinking about the issues with the way recordsize works, I now think it's a bad idea to use datasets for VM disks.

Some data for you guys to consume.

500 gig ext4 partition on guest with 4k blocks and just under 100 gig used, on a zvol with the default 8k: the disk image used just over 200 gig.
500 gig ext4 partition on guest with 64k clusters and the same data copied to it, so just under 100 gig used, on a zvol with 64k volblocksize: disk image usage was around 80-90 gig.

That's on a mirror vdev, so the amplification is not unique to raidz.

I think 16k is probably the absolute smallest block size I will consider as a starting point, but 64k has proven to not be too damaging to random i/o so that might be what I use as standard moving forward.

Regarding children of datasets: a child dataset inherits the value of its parent, and provided you never change it, it will also pick up any changes made to the parent. However, if you set an override, it becomes its own boss and uses its own value.
Volumes will of course use their own blocksize, as they don't have a recordsize.

leesteken · Aug 11, 2021

Just curious: do all these measurements and experiences in this thread include setting the logical and/or "physical" sector size of the virtual drives to 4k as well? Are those settings having any effect in reducing read and write amplification?

Dunuin · Aug 11, 2021

I did a lot of benchmarking over the last few days. What I have learned so far:

Everything is using ashift of 12.

1.) raidz has a lower write amplification than any mirror or striped mirror combination. I think that is because the redundancy overhead is smaller. Any mirror/striped mirror loses 50% of its raw capacity to redundancy, so everything is written twice. A raidz1 with 3 disks and a volblocksize of 16K only has 33% parity loss, and a 5 disk raidz1 with a volblocksize of 32K only 20% parity loss. So writing 1 GiB to a mirror or striped mirror will write 2 GiB, but a 3 disk raidz1 will write 1.5 GiB and a 5 disk raidz1 only 1.25 GiB.

2.) A striped mirror will not give you more IOPS for single sync writes. If just one process is sync writing to the pool at a time, a 2 disk mirror will give you the same IOPS (or even considerably more, because of less overhead) as an 8 disk striped mirror. So striping won't make a single writer faster, it just increases your bandwidth. You do get more IOPS with striping, but only for workloads that can be parallelized. I did dozens of benchmarks and for sync writes a single mirror always got more IOPS than a 4 disk or 8 disk striped mirror. Only for async benchmarks that run several fio tests in parallel did I see an IOPS increase for striped mirrors.

3.) using native ZFS encryption will double ZFS's write amplification (see here)

4.) using sync writes will double ZFS's write amplification (because data is first written to the ZIL on the disks and later again to its final position)

5.) ext4 can't handle small writes well. If you sync write 4K blocks using ext4, that creates a factor 5 write amplification inside the guest. XFS is better at that: here the guest only gets a factor 3 write amplification. That's because both are journaling filesystems but XFS's journaling is more efficient. For bigger writes (like 32K) there is no difference. So if you have a lot of small writes you might want to use XFS and not ext4. If you don't care about data integrity you can also disable journaling for ext4, and then you get double the IOPS and less than half the write amplification (see here).

6.) using big volblocksizes is bad because any write that is smaller than the volblocksize will amplify terribly. Let's say I have a volblocksize of 32K and do some 4K writes. For every 4K random write the pool will read a complete 32K block, combine it with the 4K of new data and write that 32K block again. So 1 GiB of 4K writes will force ZFS to read 8 GiB of data and write 8 GiB of data. So you get a lot of write amplification and also a big read overhead. Reads have the same problem: for every 4K block that you try to read from that pool, 32K will be read. So you always get a factor 8 read amplification too.

7.) using small volblocksizes is bad too. With small volblocksizes like 4K or 8K, ZFS can't really make use of block level compression, so it needs to write/read more data and the read/write amplification will be higher. It also looks like ZFS creates a lot of overhead like metadata, checksums, ... And the smaller your volblocksize is, the heavier this overhead weighs compared to the data. Let's say for example there is a (fictional) 8K of metadata per block. If you now want to write 4K of data, ZFS will write 4K data + 8K metadata and you get an additional write amplification of factor 3. But if you use a 32K blocksize it is 32K data + 8K metadata, so only a write amplification of factor 1.25.

8.) sync writes are always terrible for small writes... Here is my total write amplification for 4K sync random writes/reads for different pool configurations:

| Metric | 5 disk raidz1 (32K volblocksize) | 8 disk striped mirror (16K volblocksize) | 2 disk mirror (4K volblocksize) | 4 disk striped mirror (8K volblocksize) |
|---|---|---|---|---|
| Total write amplification from fio in guest to NAND in SSD | 44.88x | 61x | 56.84x | will add later |
| Total read amplification | 8x | 4x | 1x | will add later |
| Additional reads created by writes (as multiple of writes) | 16.09x | 7.13x | 0x | will add later |

And here the same for 16K sync random writes/reads:

| Metric | 5 disk raidz1 (32K volblocksize) | 8 disk striped mirror (16K volblocksize) | 2 disk mirror (4K volblocksize) | 4 disk striped mirror (8K volblocksize) |
|---|---|---|---|---|
| Total write amplification from fio in guest to NAND in SSD | 13.79x | 20.12x | 17.09x | 17.16x |
| Total read amplification | 2x | 0.94x | 1x | 1x |
| Additional reads created by writes (as multiple of writes) | 1.59x | 0.75x | 0.03x | 0.53x |

So basically no matter what pool layout you use, you get horrible write amplification... and this was already with enterprise SSDs that can cache sync writes to optimize the write pattern. I don't want to know how bad that would look with consumer SSDs.

9.) keep the bandwidth of your SATA controller in mind... I for example have 10 SATA ports on my mainboard's chipset, but that chipset only has a 4 GB/s connection to the CPU, and that connection is also used for other stuff like USB and so on. Now let's say I'm doing 4K sync random writes with a write amplification of 61x and additional reads of 7.13x. Let's be optimistic and say that there is no other USB communication and that this 4 GB/s link between CPU and chipset has no protocol overhead...
4000 MB/s / 68.13 = 58.71 MB/s. So no matter how fast my SSDs are or how many SSDs I use, all guests combined can never write more than 58.71 MB/s, because these 58.71 MB/s of writes inside the guests are amplified to 3581 MB/s of writes + 419 MB/s of reads on the host, so the link between CPU and chipset is the bottleneck. So if you really want to increase the performance of the pool and you have write amplification as bad as mine, basically your only options are to lower the write amplification or to buy additional HBAs to increase the bandwidth between disks and CPU. But I for example have no free PCIe slots and I would need to add 2 HBAs.

10.) I wasn't able to see any difference between formatting an ext4 or xfs partition inside the guest with or without a stripe width. Write amplification was basically always the same. I used "stripe-width" for ext4 and "sw" for xfs to match the volblocksize. So if I, for example, had a 16K volblocksize, I set "stripe-width" or "sw" so that the guest's filesystem should write 4x 4K.
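
As a small illustration of how that value is derived, here is a sketch in Python. The helper name is hypothetical, and it assumes a 4K filesystem block size inside the guest; it only computes the number of filesystem blocks per volblocksize, it does not run any mkfs command.

```python
def stripe_width_in_fs_blocks(volblocksize_bytes, fs_block_bytes=4096):
    """Number of guest filesystem blocks that make up one zvol block."""
    if volblocksize_bytes % fs_block_bytes:
        raise ValueError("volblocksize should be a multiple of the filesystem block size")
    return volblocksize_bytes // fs_block_bytes


for vbs_k in (8, 16, 32):
    blocks = stripe_width_in_fs_blocks(vbs_k * 1024)
    print(f"{vbs_k}K volblocksize -> {blocks} x 4K blocks")
```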

By the way, I already optimized things as well as I can. So I already did stuff like disabling atime for ZFS and the guests' filesystems, using fstrim instead of discard in fstab and so on. Without that the write amplification and overhead would be even higher...

So does anyone have an idea what I could try next to reduce the write amplification?

I'm really out of ideas...
If I can't find a solution I will just use a 5 disk raidz1 with 32K volblocksize for my regular VMs and a LUKS encrypted single SSD with LVM thin for my DB-heavy VMs like Zabbix and Graylog. The server is writing 900GB per day just by idling, and nearly all of that is caused by storing logs/metrics to DBs and wear leveling...
But I would really like to stick with ZFS for everything for data safety.

guletz · Aug 11, 2021

Dunuin said:
It was really hard for me to realize that and you can't change the volblocksize after creation.

Hi @Dunuin

If I remember correctly (I used this a long time ago), for such cases you could create a new zfs dataset (from the shell), add this dataset as a new zfs storage, and set the desired zvol block size on it. Then move your vHDD to this new zfs storage, where it gets the new zvol block size.

Now I have many separate zfs datastores (only one zpool per node), each of them with a different zvol block size (VM case). For CT cases, I also have some other datastores with different recordsize (only 16k, and default=128k), more or less like this:

vm-16k,vm-32k,vm-64k
ct-16k,ct-default

Also I think if you restore a backup (VM with 8k) on another zfs datastore, the volblocksize will be the value defined there (16k/32k for example).

Good luck /bafta!

Dunuin · Aug 11, 2021

guletz said:
Also I think if you restore a backup (VM with 8k) on another zfs datastore, the volblocksize will be the value defined there (16k/32k for example).

Yes, I also found that out in the meantime. Shutting down the VM, creating a backup, overwriting the VM by restoring the backup, and starting the VM again works fine to change the volblocksize.

guletz · Aug 11, 2021

Dunuin said:
Let's say for example there is a (fictional) 8KB of metadata. If you now want to write 4K of data, ZFS will write 4K data + 8K metadata and you get an additional write amplification of factor 3. But if you use a 32K blocksize it is only 32K data + 8K metadata, so only a write amplification of factor 1,25.

Hi again!

This is not true, it is worse... because ZFS stores extra copies of metadata (up to a total of 3 copies). But you can lower this by using redundant_metadata=most instead of the default redundant_metadata=all.
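
To put numbers on this, here is a small Python sketch that extends the (fictional) 8K-of-metadata example from the quote with a number of metadata copies. The function and its parameters are hypothetical, and the model ignores compression, parity and padding; it only shows why extra metadata copies hurt small blocks much more than big ones.

```python
def metadata_write_amp(data_kib, metadata_kib=8, metadata_copies=1):
    """Write amplification from metadata alone:
    (data + copies * metadata) / data. The 8K metadata size is fictional."""
    return (data_kib + metadata_kib * metadata_copies) / data_kib


for copies in (1, 2, 3):
    for block_kib in (4, 32):
        amp = metadata_write_amp(block_kib, metadata_copies=copies)
        print(f"{copies} metadata copies, {block_kib}K block: {amp:.2f}x")
```

With one copy this gives the factors 3 and 1,25 from the quote; every additional copy adds another 2x for the 4K block but only 0,25x for the 32K block.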

Good luck / Bafta!

Dunuin · Aug 11, 2021

guletz said:
Hi again!

This is not true, it is worse... because ZFS stores extra copies of metadata (up to a total of 3 copies). But you can lower this by using redundant_metadata=most instead of the default redundant_metadata=all.

Good luck / Bafta!

Thanks. Will read a bit about it later and report after some tests.

guletz · Aug 11, 2021

.... and again

Thanks for your time and time... and time that you have spent, my congratulations. IMHO, it is the best article that I have read about write amplification regarding ZFS/SSD.

Some opinions about your hard work for all of us (me included):

Indeed, sync writes are very hard to avoid/optimize. I think in most cases 4k sync I/O is an exception, because most DBs use more than 4k (16K for MySQL and clones, 8K for PostgreSQL, and so on). Even so, for your test case (4k) you can do better:

"If logbias is set to 'latency' (the default) then there is no change from the current implementation. If the logbias property is set to 'throughput' then intent log blocks will be allocated from the main pool instead of any separate intent log devices (if present). Also data will be written immediately to spread the write load thus making for quicker subsequent transaction group commits to the pool."

So logbias=throughput + LOG device will lower the write amplification for sync write IO! (logbias=throughput on the data to stop the ZIL from writing twice)

Good luck / Bafta !

vesalius · Aug 11, 2021

If you haven't read through these, here are a couple of links to pore over. No clue, but maybe zvols are part of the problem?

https://github.com/openzfs/zfs/issues/11407
https://serverfault.com/questions/9...on-nvme-ssd-in-raid1-to-avoid-rapid-disk-wear

Dunuin · Aug 11, 2021

guletz said:
.... and again

Thanks for your time and time... and time that you have spent, my congratulations. IMHO, it is the best article that I have read about write amplification regarding ZFS/SSD.

Thanks. Nice to hear that someone else finds it useful too.

guletz said:
Indeed, sync writes are very hard to avoid/optimize.

I think that is true. At least I wasn't able to bring the write amplification down to more reasonable values, except by not using encryption, but I like all my data to be encrypted so that isn't really an option for me. Does someone know why ZFS native encryption causes double the writes? It's encrypted after the compression so it shouldn't result in more data. And it would be unsafe to first write unencrypted data to the disk and later replace it with the encrypted version. So I don't get why the encryption doubles the write amplification. Or will a 256bit AES key force the data to be at least 256bit (=32K) too, so that when encrypting a 16K block of data, 16K of zeros or random data will be added to match the 256bit key?

guletz said:
I think in most cases 4k sync I/O is an exception, because most DBs use more than 4k (16K for MySQL and clones, 8K for PostgreSQL, and so on). Even so, for your test case (4k) you can do better:

Yes, the 4k was more to see the absolute worst case. Write amplification for 8k and 16k are way better compared to 4k but still terrible.

guletz said:
"If logbias is set to 'latency' (the default) then there is no change from the current implementation. If the logbias property is set to 'throughput' then intent log blocks will be allocated from the main pool instead of any separate intent log devices (if present). Also data will be written immediately to spread the write load thus making for quicker subsequent transaction group commits to the pool."

So logbias=throughput + LOG device will lower the write amplification for sync write IO! (logbias=throughput on the data to stop the ZIL from writing twice)

I'm not sure if a SLOG would really lower the write amplification:
1 sync write to 2 disk mirror with SLOG = 1 write to SLOG SSD + 1 write to 1st SSD + 1 write to 2nd SSD = 3 writes.
I wasn't able to find a clear answer on how the ZIL works if you don't have a SLOG. The ZIL will be somewhere on the SSDs and there is no fixed place, so it will be written between the normal ZFS data structures (creating holes later and causing data fragmentation) but without causing additional metadata to be written, so it's faster to write an intent log block than a complete data-structure commit with all its metadata and calculations. But I wasn't able to find out if the intent log blocks get mirrored too. So it could look like one of these two options:
a.) 1 sync write to 2 disk mirror without SLOG = 1 write to 1st SSD's ZIL + 1 write to 2nd SSD's ZIL + 1 write to 1st SSD + 1 write to 2nd SSD = 4 writes.
b.) 1 sync write to 2 disk mirror without SLOG = 1 write to 1st or 2nd SSD's ZIL + 1 write to 1st SSD + 1 write to 2nd SSD = 3 writes.
If option b were true, there would be no difference in total write amplification across all SSDs. But in both cases the SLOG would increase the performance, because every SSD only needs to do one write instead of 2 before reporting that sync write as finished.
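
The three cases can be tallied in a short Python sketch. It is just bookkeeping of the scenarios as I described them above, not a statement about how ZFS actually places ZIL blocks; the scenario and device names are made up.

```python
from collections import Counter


def count_writes(scenario):
    """Per-device write counts for one sync write to a 2 disk mirror."""
    # The final data write always lands on both halves of the mirror.
    writes = Counter({"ssd1": 1, "ssd2": 1})
    if scenario == "slog":            # intent log goes to a separate SLOG SSD
        writes["slog"] += 1
    elif scenario == "zil_mirrored":  # option a: ZIL block written to both SSDs
        writes["ssd1"] += 1
        writes["ssd2"] += 1
    elif scenario == "zil_single":    # option b: ZIL block written to one SSD only
        writes["ssd1"] += 1
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return writes


for scenario in ("slog", "zil_mirrored", "zil_single"):
    w = count_writes(scenario)
    print(f"{scenario}: total={sum(w.values())}, busiest device={max(w.values())}")
```

The total only differs for option a, but the writes on the busiest device drop from 2 to 1 as soon as a SLOG takes over the intent log.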

I also thought about logbias=throughput, and from what I read this should make things even worse.
With the default logbias=latency, the complete sync write will first be written to the ZIL (only if the block is smaller than or equal to 64K; if it's bigger it will be handled like with logbias=throughput), and because writing to the ZIL doesn't have all the overhead of a data-structure transaction, it will be faster and result in lower latency.
If you set logbias=throughput, the ZIL won't store the complete sync write but only its metadata. The data of the sync write itself won't be stored in the ZIL; instead ZFS will do a real (and slow) data-structure commit. This data-structure commit will be done instantly and out of sequence, so it won't be accumulated and done as one big optimized commit every 5 seconds.
So in theory logbias=latency should cause less fragmentation and less write amplification, because stuff stored in the ZIL will be written to the data structure when a normal data-structure commit is written, so the data gets optimized and written together with all the other sync or async writes.

vesalius said:
If you haven't read through these, here are a couple of links to pore over. No clue, but maybe zvols are part of the problem?

https://github.com/openzfs/zfs/issues/11407
https://serverfault.com/questions/9...on-nvme-ssd-in-raid1-to-avoid-rapid-disk-wear

Thanks, I will read that.

guletz · Aug 11, 2021

Dunuin said:
Or will a 256bit AES key force the data to be at least 256bit (=32K) too, so that when encrypting a 16K block of data, 16K of zeros or random data will be added to match the 256bit key?

Hi @Dunuin

I can only guess that this is the case (padding).

Good luck / Bafta !
