ZFS striped/mirror - volblocksize

fanix

New Member
Aug 23, 2021
Hi, I'm new to Proxmox VE and need your advice about "ZFS striped/mirror - volblocksize".
I have already read a lot in this forum about guest blocksize and volblocksize, comparing IOPS performance, write amplification and padding/parity overhead.
But that is all related to RAIDZ setups. I will use 1 SSD for the Proxmox VE rpool and 4 x 960 GB enterprise SSDs for a zpool configured as striped mirrors.
I will have 1 Windows Server domain controller and about 3 Linux machines, including database workloads.
So if I understand correctly, there is no parity in a striped-mirror setup.
I should use different settings for the Windows Server dataset and other settings for the Linux-based machines.
I want to reach fast IOPS and "reasonable" write amplification for these SSDs.
What is the recommended volblocksize?
Can I use a really high value, like 128K for example?

Any advice from experienced users is appreciated.

--
anything is possible
 
I'm doing a lot of benchmarks right now and already finished the 4-disk striped-mirror runs. I would go with an 8K volblocksize (or 16K if you plan to extend the pool later to 6 or 8 SSDs). Write and read amplification gets really terrible as soon as you write/read something that is smaller than your volblocksize. But very small volblocksizes are also bad, because the metadata and journal of ZFS and of the guest's filesystem are a fixed size, so the data-to-metadata ratio gets very poor and that causes write amplification too. So if you plan to run things like databases, I would choose the volblocksize as small as possible, or at least not bigger than the blocksize your database uses (for example 16K for MySQL, 8K for Postgres, etc.).

Sync writes show really terrible write amplification even with the smallest possible volblocksize and enterprise SSDs. I measured factor 58 for 4K writes, factor 16.56 for 16K writes, factor 14.72 for 32K writes and factor 4.64 for 4096K writes.

Also keep in mind that striping disks won't make your pool faster for a single request. Single-threaded latency will even be slightly worse compared to a simple mirror; the striped mirror just handles more parallel IOPS and more bandwidth.
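For reference, the volblocksize PVE uses for newly created VM disks comes from the storage configuration, so once you have decided on a value it can be set roughly like this (a sketch only; the storage ID "tank-vm", pool "tank" and disk name are just example names, and the setting only affects disks created afterwards, existing zvols keep the volblocksize they were created with):

Code:
# set the default volblocksize for new zvols on this ZFS storage
pvesm set tank-vm --blocksize 16k
# check what an already existing virtual disk was created with
zfs get volblocksize tank/vm-100-disk-0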
 
Right, I understand your points. I was thinking about a 16K volblocksize for the dataset where the Linux MySQL databases will live, like you wrote. But what about the Windows Server dataset volblocksize: does it make sense to set it bigger, like 128K? IOPS would be worse there, but would I get rid of the write amplification? I mean, if the Windows Server VM will host shared files, it is better to set a higher volblocksize, if I understand correctly. I already read your threads with the tests and appreciate your work, thanks for it.
 
I think there just isn't a perfect configuration. Small volblocksizes will be bad for big sequential writes, big volblocksizes will be terrible for small random writes. Very small volblocksizes will be bad because the metadata-to-data ratio gets terrible and the data can't be compressed that well. But doing a write/read that is smaller than your volblocksize is also very bad. So as long as you have mixed workloads, something will always be bad.
Another problem is that you can't easily set the volblocksize individually. The WebUI just doesn't allow you to, for example, use a 16K volblocksize for a zvol that should store a MySQL DB and a 1024K volblocksize for another zvol that should store CCTV recordings. You could manually create a zvol using the CLI, rename it and attach it to a VM; that way you could set individual volblocksizes. But as soon as you migrate that VM to another server or restore it from a backup, that individual volblocksize won't be used anymore and the global blocksize that you defined for the whole pool's storage will be used instead.
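A rough sketch of that manual CLI route (pool "tank", storage ID "tank-vm", VM ID 100 and the disk name are only example values):

Code:
# create the zvol yourself with the volblocksize you want
zfs create -V 50G -o volblocksize=16k tank/vm-100-disk-1
# let PVE pick it up as an unused disk, then attach it to the VM
qm rescan --vmid 100
qm set 100 --scsi1 tank-vm:vm-100-disk-1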
With LXC it is easier because containers use datasets. Because PVE has no global option to set the recordsize, migrating/restoring an LXC shouldn't change the recordsize, and the defaults from the parent pool/dataset should be inherited. With datasets there is also no fixed blocksize. A zvol with a 128K volblocksize will always read/write a full 128K block, even if you just want to access/store a 4K file. A dataset with a 128K recordsize is able to write undersized records: if you for example write a 1 MB file, it will be written as 8x 128K records, but if you only want to store a 4K file, it won't need a full 128K record and will write something like a 4K undersized record instead. So using bigger recordsizes with datasets isn't as problematic as using bigger volblocksizes for zvols.
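For datasets that per-guest tuning is just a ZFS property (the dataset names below are only examples following PVE's container-volume naming):

Code:
# recordsize is an upper limit per dataset, not a fixed block size
zfs set recordsize=16k tank/subvol-101-disk-0    # e.g. a container with a MySQL database
zfs set recordsize=1M  tank/subvol-102-disk-0    # e.g. a container storing large recordings
zfs get recordsize tank/subvol-101-disk-0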

Here is an example of how bad the read/write amplification and read overhead can get:

This is 1024 MB of random data written by fio as 4K random sync writes, with caching disabled on the host and in the guest. Fio was run inside a Debian VM using virtio SCSI, on an ext4-formatted virtual disk that was stored on a 5-disk raidz1 with ashift=12 and volblocksize=32K.
1.) Fio writes 1024 MB, and because of the journaling and metadata of ext4, which are always a fixed size, the guest has written 5255 MB to the virtual disk and read 19 MB from that virtual disk.
2.) This gets amplified again on the host, because ZFS adds parity and 3 copies of the metadata. Any sync write is also written twice because of the ZIL. Native encryption somehow also doubles all writes, and virtio virtualization can add overhead too. So these 5255 MB of writes and 19 MB of reads from the guest get amplified to 37,760 MB of writes and 16,480 MB of reads on the host.
3.) Then you have the internal write amplification of the SSDs. Here the 37,760 MB from the host get amplified to 45,960 MB that are actually written to the NAND chips of the SSDs.
So in total you just want to write 1024 MB, but this causes a chain reaction that writes 45,960 MB and reads 16,480 MB to do it. These gigabytes of reads would be totally avoidable: because the volblocksize was 32K and I wrote 4K blocks, each write needed to read a full 32K block, merge it with the 4K of data and write it again. Had I used a 4K volblocksize instead, or done 32K writes, there wouldn't have been any read overhead at all while writing.
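Written out, the chain from those numbers multiplies roughly like this (rounded):

Code:
guest filesystem (ext4 journal + metadata):        5255 / 1024  ≈ 5.1x
host (parity, metadata copies, ZIL, encryption):  37760 / 5255  ≈ 7.2x
SSD internal write amplification:                 45960 / 37760 ≈ 1.2x
total:                          5.1 * 7.2 * 1.2 ≈ 45x  (45960 / 1024 ≈ 45)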
 
That's insane. So I will try to stick to a 16K volblocksize and will see how long the SSDs sustain such a workload. Thank you for your thoughts.
 
At least you already have enterprise SSDs. My example was also done with enterprise SSDs, but many people here still use consumer SSDs because of the price. Those don't have power-loss protection and therefore can't safely cache sync writes in the SSD's RAM cache, so writes can't be optimized and the SSD's internal write amplification will be far worse. Let's say it would be factor 10 instead of factor one-point-something without power-loss protection: then those 37,760 MB wouldn't be amplified to 45,960 MB but to 377,600 MB.
Something like ZFS on top of ZFS would be terrible too.
It's really insane how easily the write amplification can grow, because the individual factors don't add up, they multiply.
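With the numbers from above, swapping the roughly 1.2x internal factor of the enterprise SSD for the assumed factor 10 of a consumer SSD without PLP would look like this:

Code:
enterprise SSD:   45,960 / 1024 ≈  45x of the original data written to the NAND
consumer SSD:    377,600 / 1024 ≈ 369x of the original data written to the NAND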
 
ZFS on BSD has a flag to not force a sync on the physical device whilst still having sync in all of the filesystem layers, so I guess that could make consumer SSDs imitate enterprise ones, if Linux/Proxmox can do the same thing.

You can, to a degree, mitigate the weaknesses of consumer SSDs: manual overprovisioning, a UPS, etc. But I learnt something new when Dunuin revealed that enterprise SSDs effectively treat sync writes like async writes (caching them in RAM, protected by PLP) to gain performance.

A quick check of the OpenZFS docs seems to indicate that 'zil_nocacheflush' does that. If I understand it right, sync writes would still use the ZIL and bypass the ZFS dirty cache (so sync writes still jump the queue), but ZFS would no longer force the storage device to do an immediate flush for ZIL writes, allowing it to use its internal cache like enterprise devices do.
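For what it's worth, if your OpenZFS build exposes that tunable as a Linux module parameter, it would be set like any other ZFS module parameter. This is an untested sketch: check that the parameter actually exists on your system first, and keep in mind the data-loss concern discussed in the next post for drives without PLP.

Code:
# check whether the tunable exists in your OpenZFS build
ls /sys/module/zfs/parameters/ | grep -i nocacheflush
# enable it at runtime
echo 1 > /sys/module/zfs/parameters/zil_nocacheflush
# make it persistent across reboots
echo "options zfs zil_nocacheflush=1" >> /etc/modprobe.d/zfs.conf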
 
But the question is what happens if the (non-redundant) PSU dies? If it dies, an enterprise SSD with power-loss protection will rapidly write the SSD's RAM cache to the NAND and won't lose data. A consumer SSD without PLP will just lose everything in its RAM. If I tell ZFS not to flush the physical devices, the ZIL won't help if the SSD handles sync writes as async writes and reports to ZFS that the data was written while in reality it was only cached in volatile RAM, will it?
 

The question is how much redundancy is enough, and that is a decision everyone has to make for themselves. If you are paranoid to the point that you want a cast-iron guarantee, then it's not a good decision to make. But looking at the odds (a power cut maybe once every 5-10 years, a properly configured box with access to the UPS battery level and auto-shutdown configured when it reaches a certain level), I would probably be OK with it personally on my own stuff if I needed the performance or was bothered by the write amplification. I would still use enterprise kit for any employer or customers, though.

You could just as easily ask "what if the capacitors in the SSD fail?".

As for the flag: it wouldn't cache in RAM. The ZIL would still send writes directly to the device, bypassing the dirty cache (RAM cache). It would make the device behave the same as the enterprise SSD behaviour you described.

Bear in mind the protection is a little over-hyped; it's not going to do much for any unwritten data in RAM (async writes), which for the majority of people will be far, far more than what's sitting in the device's own cache.

In case you're curious: I don't and won't run this configuration, as I don't have a heavy write load that needs to cheat for performance, and I am not concerned about writes since I only have occasional short bursty writes.
 
One more thing I am thinking about: is there any significant difference between a striped mirror of 4 SSDs and 2 separate mirrored vdevs of 2 SSDs each? Any advantages related to IOPS or write amplification?
 
Here are some benchmarks I did (including 2 disk mirror vs 4 disk striped mirror):
[Attached benchmark charts: async performance comparison, sync performance comparison, write amplification comparison]

So it looks like two individual mirrors would be better for very small files and sync writes.
 
According to your tests I will try to set up 2 mirrored vdevs (the 1st for the MS Windows server and the 2nd for the Linux database servers).
If there is no IOPS penalty as a tradeoff, it should be better from a write-amplification point of view.
Does that make sense?
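For reference, the two layouts being compared would be created roughly like this (device names are placeholders; in practice use /dev/disk/by-id/... paths, and the pool and storage IDs are just example names):

Code:
# option A: one pool made of two striped mirrors (4 SSDs)
zpool create -o ashift=12 tank mirror sda sdb mirror sdc sdd

# option B: two independent mirror pools
zpool create -o ashift=12 tank-win mirror sda sdb
zpool create -o ashift=12 tank-db  mirror sdc sdd
# register each as its own PVE storage with its own default blocksize
pvesm add zfspool tank-win-vm --pool tank-win
pvesm add zfspool tank-db-vm  --pool tank-db --blocksize 16k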
 
