Reducing ZFS write amplification

kyprijan

New Member
Jun 8, 2023
Hi, I have the following equipment:

4 × 2 TB HDD
2 × 1 TB consumer Samsung SSD
1 × 256 GB consumer SSD

The two 1 TB SSDs are combined into a mirror with the system installed on them.
The four HDDs are combined into RAID10, with the 256 GB SSD attached to that pool as L2ARC cache.

When I did this build I did not know about the phenomenon called "write amplification". In three months the wearout reached 4%, even though the server load was minimal. Over the long run that is not much and you can live with it, but I would like to improve the situation. The system is constantly writing something, so what if I install PVE on the HDD RAID10 with the SSD as L2ARC, and use the SSD mirror as storage for virtual machines?

Will this setup lose performance? Will the SSD storage be as efficient as in the current build?

Yes, I know that you can turn off atime and redirect logs to RAM, but that will not help much to reduce write amplification.
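
(For reference, disabling atime is a single command. The pool name rpool below is the Proxmox installer default, so adjust it if yours differs; child datasets inherit the setting.)

# Disable access-time updates pool-wide; children inherit the property.
zfs set atime=off rpool
# Verify the setting and where it is inherited from:
zfs get atime rpool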
 
The Proxmox installation itself does not need anything fast and does indeed write logs and graphs constantly. Running it on (very small) HDDs will be fine and you don't need cache or anything fancy.
The I/O of the VMs and CTs depends very much on what's running inside them. Enterprise SSDs with PLP are preferred since they can handle the many IOPS and (sync) writes properly.
I did disable both pve-ha services on my single node to reduce the writes and a lot of not-very-useful log messages.
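
On my standalone node that looked roughly like this (only do this if you do not use HA at all):

# Stop and disable the HA local resource manager and cluster resource manager:
systemctl disable --now pve-ha-lrm.service pve-ha-crm.service
# They can be brought back later with:
# systemctl enable --now pve-ha-lrm.service pve-ha-crm.service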
 
I did a lot of testing over the years and never got ZFS write amplification down significantly.
- try to avoid encryption if possible (it doubles write amplification, for whatever reason)
- try to avoid CoW on top of CoW
- try to avoid nested filesystems
- don't use consumer SSDs without PLP, as these can't cache sync writes, so the SSD can't optimize the writes for less wear
- a raidz1/2/3 isn't great as VM storage (fewer IOPS and problems with padding overhead), but the total write amplification will be lower, as not everything has to be written twice (a 5-disk raidz1 only writes an additional +25% of parity data instead of +100% for a full copy of everything)
- the biggest problem is small random sync writes (so try to avoid running databases); see the monitoring sketch after this list
- write amplification ... amplifies ... so every small bit of data that you avoid writing will save tons of wear
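
A rough way to see how much of your workload really is small sync writes is the histogram output of zpool iostat; the pool name tank is just a placeholder here:

# Request size histogram with separate sync/async read/write columns, refreshed every 60 s:
zpool iostat -r tank 60
# Latency histograms can also hint at sync-write pressure:
zpool iostat -w tank 60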

So there isn't really much you can do except try to avoid writes in the first place (disable logging and so on). If you are fine with the performance, I would just continue using them, and as soon as one of the disks fails I would get a pair of proper (mixed-workload enterprise) SSDs that can handle those writes. Write amplification isn't really a big concern anymore when your 1 TB enterprise SSD is rated to survive 20,750 TB of writes instead of, for example, just the 360 TB that a 1 TB consumer QLC SSD is rated for.
 
I was running a database with heavy writes. Due to ZFS write amplification, the SSD was just about ready to be replaced after one year of use.
Nothing really helped to reduce it.
I had to stop using ZFS for the main database and used ext4 on LVM instead. Since then I use ZFS only for containers other than databases.


This week I started the same replicated database on FreeBSD. To my surprise, with recordsize set to 8 KB (recommended for databases), there is almost no write amplification at all.
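
In case it helps someone, matching the recordsize to the database page size looks like this; tank/db is just a placeholder dataset, and the property only affects data written after the change:

# 8K matches e.g. the PostgreSQL page size; adjust for your database.
zfs set recordsize=8k tank/db
zfs get recordsize tank/db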

The number of MB written each minute is comparable to the UFS system. I run the same database twice on the same server so I can compare.
If I set vfs.zfs.txg.timeout to 60 (on Linux: echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout), the amount written on ZFS is even lower than with the traditional UFS + SU file system.
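
To make that setting survive a reboot (assuming OpenZFS loaded as a kernel module on Linux, and the usual sysctl mechanism on FreeBSD), something like this should work. Keep in mind that a longer txg interval also means up to that many seconds of async writes can be lost on a crash:

# Linux: apply at runtime and persist via a module option
echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout
echo "options zfs zfs_txg_timeout=60" >> /etc/modprobe.d/zfs.conf

# FreeBSD: apply at runtime and persist in /etc/sysctl.conf
sysctl vfs.zfs.txg.timeout=60
echo "vfs.zfs.txg.timeout=60" >> /etc/sysctl.conf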

You can watch the amount of data written very easily with the iostat program (on FreeBSD: iostat -w 60 -I and zpool iostat -v 60; on Linux: iostat 60).
And to be sure, check the amount of data written using a SMART disk diagnostic tool.
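
For the SMART side, smartctl shows lifetime writes; the exact attribute name varies by vendor (on many Samsung SATA SSDs it is Total_LBAs_Written, on NVMe drives it is "Data Units Written"), so treat the device names below as examples:

# SATA SSD: look for a lifetime-writes attribute such as Total_LBAs_Written
smartctl -A /dev/sda
# NVMe SSD: the health log reports "Data Units Written" (units of 512,000 bytes)
smartctl -a /dev/nvme0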

So I do believe there must be some bug in ZFS on Linux. Maybe there is an easy workaround, but most people do not notice this, as it only really becomes a problem when you have write-intensive containers. In any case, some decrease in performance must always be present.
 
Hello,

What filesystem is running inside the guest? The worst cases of write amplification come from using a CoW filesystem on top of a CoW filesystem.