ashift, volblocksize, clustersize, blocksize

Dunuin

Hi,

I used this setup...

pool (ashift=12) -> zvol (volblocksize=8k, raw) -> ext4 (blocksize=512, no clustersize)

...and got a write amplification of about 10 from virtual drive writes on guest to physical drive writes on the host.
I researched and found this chart showing the parity/padding overhead in percent for different numbers of drives in a raidz1 and different volblocksizes.

The chart shows that a raidz1 of 5 drives (at ashift=12) with the Proxmox default 8k volblocksize (like my setup now) causes an overhead of +100%. So in this case raidz1 isn't saving any space compared to just using a ZFS mirror.
At a volblocksize of 32K the overhead for 5 drives (at ashift=12) drops to just 25%.
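For reference, here is a rough sketch of how I understand that padding overhead comes about: every block of volblocksize bytes is split into data sectors of 2^ashift bytes, raidz1 adds one parity sector per up to (disks-1) data sectors, and the allocation is padded up to a multiple of 2 sectors. Treat this as my assumption based on the chart, not an official formula:
Code:
#!/bin/bash
# Rough raidz1 allocation overhead for one volblocksize (ignores compression).
raidz1_overhead() {
    local disks=$1 ashift=$2 volblocksize=$3
    local data=$(( volblocksize / (1 << ashift) ))
    local parity=$(( (data + disks - 2) / (disks - 1) ))  # ceil(data / (disks-1))
    local total=$(( data + parity ))
    total=$(( (total + 1) / 2 * 2 ))                      # pad to a multiple of 2 sectors
    echo "$(( (total - data) * 100 / data ))% overhead"
}
raidz1_overhead 5 12 8192    # 5 disks, ashift=12, 8K  -> 100% overhead
raidz1_overhead 5 12 32768   # 5 disks, ashift=12, 32K -> 25% overhead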

So I should recreate the zvols with a volblocksize of 32k instead of 8k.
I did this by:

1.) Proxmox GUI -> datacenter -> storage -> MyPool -> Changing "Block Size" from "8k" to "32k"
2.) This only works for newly created zvols, so I backed up my VM and added a new virtual harddrive of the same size to it
3.) I changed boot order to CD first, mounted a Debian Live CD ISO and booted that Live Debian
4.) I used "sudo lsblk" to find out the old (sda) and new (sdb) virtual harddrive
5.) I used "sudo dd conv=sparse if=/dev/sda of=/dev/sdb bs=32K status=progress" to copy the whole contents (including partition tables, bootloader, ...) from the old virtual harddisk with a volblocksize of 8k to the new virtual harddisk with a volblocksize of 32k.
6.) I detached the old virtual harddrive from the VM (because UUIDs are the same on both drives now) and set the boot order to boot first from the new virtual harddisk
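To verify the change you can compare the old and the new zvol on the host; the disk names below are just placeholders for whatever your zvols are called:
Code:
zfs get volblocksize,used,referenced MyPool/vm-100-disk-0   # old 8k zvol
zfs get volblocksize,used,referenced MyPool/vm-100-disk-1   # new 32k zvol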

Looks like everything works fine: "zfs list" shows the new zvol using only 1/3 of the old size in "USED" and 1/2 in "REFER". Write amplification also looks a little (about 10%) lower.

Do I need to change the blocksize and/or cluster size of the ext4 partitions inside the VMs too? Or is it no problem that the VM writes 512B blocks while the zvol uses 32K and the physical harddisks 4K, because KVM does some virtualization magic and converts it somehow?

What would be the right blocksize/clustersize for a virtual ext4 partition?

Also, the chart doesn't take native encryption or lz4 compression into account, both of which I use. Should I use an even higher volblocksize like 64k instead of 32k so encryption/compression work better, or is 32k fine too?
 
I had a similar problem. I wrote about this here: https://www.reddit.com/r/zfs/comments/opu43n/zvol_used_size_far_greater_than_volsize/
I wrote this from the perspective of exporting a ZVOL with an NTFS filesystem via iSCSI from Ubuntu server. But the same thoughts apply when you have an ext4 volume for a VM guest in Proxmox. Hope this is useful!

Also this guy has benchmarked file-backed virtual machines: https://www.reddit.com/r/zfs/comments/bz5ya2/zfs_kvm_ntfs_benchmarking_various_record_sizes/ Again with NTFS, but the same lessons should apply. And although this is based on files while Proxmox stores VMs on ZVOLs, I read that recordsize on datasets is pretty much equivalent to ZVOL volblocksize. So maybe this is also useful for you.
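To make the dataset/zvol distinction concrete, this is roughly how the two properties are checked and set (pool and dataset names are made up):
Code:
zfs get recordsize tank/vm-files            # dataset property, can be changed later (affects new writes only)
zfs get volblocksize tank/vm-100-disk-0     # zvol property, fixed at creation time
zfs create -o recordsize=64k tank/vm-files2                  # new dataset for file-backed VMs
zfs create -V 32G -o volblocksize=16k tank/vm-101-disk-0     # new zvol with a 16k volblocksize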

In general the default 8k volblocksize is quite suboptimal for raidz and is probably catching a lot of people by surprise. I read here that the default will be changed from 8k to 16k in the future: https://www.reddit.com/r/zfs/comments/opu43n/zvol_used_size_far_greater_than_volsize/h68208o/
Something worth thinking about: could Proxmox change the default sooner, or provide a better UI for this?

My current plan is to move my VMs to a 16k volblocksize (that's 4 sectors of 4k each with ashift=12) and my iSCSI shares hosting large files to a volblocksize of 128k (that's 32 sectors of 4k each with my ashift=12 pool).

I have a question: how do you monitor the write amplification?
 
My current plan is to move my VMs to a 16k volblocksize (that's 4 sectors of 4k each with ashift=12) and my iSCSI shares hosting large files to a volblocksize of 128k (that's 32 sectors of 4k each with my ashift=12 pool).
Meanwhile I have created something like 9 ZFS pools. The easiest way to change the volblocksize is to back up the VMs, destroy them and restore them from backup after setting the new volblocksize for that pool (Datacenter -> Storage -> YourZFSStorage -> Edit -> Block Size). That way all zvols are freshly created from the backup using the new volblocksize.
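On the CLI that workflow looks roughly like this; VMID, storage names and the backup path are just examples:
Code:
vzdump 100 --storage local --mode stop     # back up the VM
# now set Datacenter -> Storage -> YourZFSStorage -> Edit -> Block Size to the new value
qm destroy 100                             # removes the VM and its old zvols
qmrestore /var/lib/vz/dump/vzdump-qemu-100-<timestamp>.vma.zst 100 --storage YourZFSStorage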

I have a question: how do you monitor the write amplification?
There are two types of write amplification you can measure. First there is the write amplification from guest to host, caused by the virtualization of disks, mixed blocksizes and ZFS itself. To measure it I run iostat 600 2 (from the "sysstat" package) on the host and, at the same time, inside the guest. The second output of that command (be patient, it only appears after 10 minutes and until then it looks like nothing is happening) shows how much data got written to every disk in the last 10 minutes. Inside the VM you sum up all data written to the virtual disks of that VM. On the host you do the same and sum up all data written to the drives that are part of your pool. Now you can compare them: if the VM only wrote 1GB to its virtual disks but ZFS on the host wrote 10GB to the physical disks, you get a write amplification of factor 10.
To make it work best you should stop all other VMs and force some heavy writes inside the VM that you are measuring (for example create a 20GB file of random data so ZFS can't compress it well: dd if=/dev/urandom of=/var/tmp/writetest.txt bs=1M count=20480; a single huge bs=20G read from /dev/urandom would come back short, so smaller blocks are needed here).
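Putting the measurement together, a minimal sketch (assuming the pool sits on sda+sdb on the host and the VM has a single virtual disk):
Code:
# run on the host and inside the guest at the same time ("sysstat" package)
iostat -m 600 2      # ignore the 1st report (totals since boot), use the 2nd (last 10 minutes)
# host:  sum the MB_wrtn column of the pool members, e.g. sda + sdb
# guest: sum MB_wrtn of the virtual disk(s), e.g. sda
# write amplification = host MB written / guest MB written:
echo "scale=2; 10240 / 1024" | bc          # e.g. 10 GiB on host vs 1 GiB in guest -> 10.00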

The harder part is the write amplification inside your SSDs. Because of the way SSDs work they will always create some write amplification, especially if they have no powerloss protection: without it they can't cache sync writes and have to write the data down as it comes in, without any of the write optimizations that could reduce wearout. This all happens inside the SSD's firmware, so the only way to see what is happening is to check the SMART stats.
My Intel S3710 SSDs for example have these two SMART attributes (run smartctl -a /dev/sdX to check the SMART stats):
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1371216
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       2735478

"Host_Writes_32MiB = 1371216" is how much data the host had send (over the lifetime) to the SSD to get written down. It is counted in 32MiB units so in total this is 32MiB * 1371216 = 43878912 MiB = 41.84 TiB.
For "NAND_Writes_32MiB = 2735478" it is the same but this counts how much data was written to the actual NAND flash chips of the SSD. So 32MiB * 2735478 = 87535296 MiB = 83.48 TiB written to the NAND flash. If you compare the 42 TiB of data that should be written and the 83 TiB of data that actually got written you see that there is a write amplification of around factor 2 over the lifetime of the SSD.
If you want more recent values you could write down the values of Host_Writes_32MiB and NAND_Writes_32MiB and do that a week later again. Subtract the first measurement from the second on and you got how much data was written within a week. Here I see a write amplification of about factor 3.
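A small sketch of that lifetime calculation using the two Intel-specific attributes from above (other vendors use different attribute names, if they expose NAND writes at all):
Code:
# the raw values are in the 10th column of "smartctl -A"
HOST=$(smartctl -A /dev/sda | awk '/Host_Writes_32MiB/ {print $10}')
NAND=$(smartctl -A /dev/sda | awk '/NAND_Writes_32MiB/ {print $10}')
echo "host writes: $(( HOST * 32 / 1024 )) GiB, NAND writes: $(( NAND * 32 / 1024 )) GiB"
echo "lifetime write amplification: $(echo "scale=2; $NAND / $HOST" | bc)"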

But SMART attributes aren't standardized, so you need to be lucky that your SSD manufacturer implemented something like "NAND_Writes_32MiB". My other Intel S3700 SSDs for example only report Host_Writes_32MiB and not NAND_Writes_32MiB, so there is no way to see how high their write amplification is.

And write amplification always multiplies. So if there is a write amplification of factor 10 from guest to host and factor 3 from host to the SSD's NAND, the total is 10 * 3 and not 10 + 3. The total write amplification would be factor 30, and every 1GB of data written inside the VM would cause 30GB of data to be written to the SSD's NAND flash.
 
So what is the recommended blocksize for the storage config with DC NVMes in a mirror setup then?

ashift=12, volblocksize 4k, ext4 with whatever the system default is. Most users don't have the time or knowledge to fiddle with their blocksize.
 
So what is the recommended blocksize for the storage config with DC NVMes in a mirror setup then?

ashift=12, volblocksize 4k, ext4 with whatever the system default is. Most users don't have the time or knowledge to fiddle with their blocksize.
That's why I see at least one person every week starting a new thread like "my storage is too small", not realizing that creating VMs without optimizing the volblocksize first usually wastes terabytes or even dozens of terabytes of storage space. Especially if some kind of raidz is used, where 8K is always too low unless ashift=9 is used.

For a mirror/striped mirror I'm not sure what the best volblocksize would be. I totally understand the calculations behind raidz1/2/3 and how padding overhead is created on the block level (there is a great post about that from the head engineer of OpenZFS), but I wasn't able to find any information on how that works with mirrors or striped mirrors.

My prediction would be that 4K is optimal for a plain mirror with ashift=12. 4K ext4 in the guest, a 4K virtio SCSI controller (if you change it from the default 512B), 4K volblocksize for the zvol, 4K for the pool and 4K for the physical disks sounds like the optimum because the blocksizes aren't mixed at all. But in that case ZFS compression shouldn't help, because there is no smaller allocation unit to compress down to, so compression would be useless.
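As a sketch, an all-4K stack on a mirror could look roughly like this; the pool/zvol names are placeholders, and the guest-visible 4K sector size is plain QEMU and an assumption on my side, since as far as I know Proxmox doesn't expose it in the GUI:
Code:
zpool create -o ashift=12 tank mirror /dev/sdX /dev/sdY      # 4K sectors on the pool
zfs create -V 32G -o volblocksize=4k tank/vm-100-disk-0      # 4K zvol
# inside the guest: 4K ext4 blocks (the default on most distros anyway)
#   mkfs.ext4 -b 4096 /dev/sda1
# plain QEMU can also present the disk with 4K logical/physical sectors:
#   -device scsi-hd,drive=drive0,logical_block_size=4096,physical_block_size=4096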

What I'm more interested in is what happens if you use a stripe of 2, 3 or 4 mirrors. With ZFS there is no fixed stripe width like with a traditional raid.
 
I have exactly the same questions regarding (striped) mirrors in ZFS. 8k seems to work fine with recent flash hardware in a mirror setup, but I don't know how to optimize it further because I couldn't find useful references.
 
Is there a reason why Proxmox cannot automatically determine the best settings, like Ubuntu and other OSs do? I never had to tweak these settings on any of my other ZFS systems.

Seeing a 4TB image eat up 7TB+ is really, really annoying, especially after working with it for quite a while and only realizing it too late.
I also see there are hundreds of threads with the same issue, the newest opened just this month.

So again, why can't this be fixed / determined correctly by PVE?
 
The problem is that it is not that easy. There are millions of combinations that each need different values. Keep in mind that you can do stuff like striping a raidz1 of 20 disks with a raidz2 of 4 disks and a 4-disk mirror; it's hard to guess what a good volblocksize for such a pool would be. But yes, for a simple mirror, striped mirror or raidz1/2/3 created through the webUI it would be possible to set a usable volblocksize by default.
Ashift is hard again, as it depends on the disks you are using, and all SSDs lie about the sector size they internally use. You can't choose a good ashift if you are working with wrong data.

And then it really depends on your workload. Let's for example say you are running a simple 9-disk raidz3 with ashift=12. With the default 8K volblocksize you would lose 75% of the capacity. You would need a 128K volblocksize to only lose 38% of the capacity. But 128K will be absolutely terrible when you are running any DBs. For a PostgreSQL DB you would still want to run with an 8K volblocksize, even if that means losing a lot of capacity, as the overhead and performance would be unusable otherwise.
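Extending the same back-of-the-envelope estimate to raidz2/3 reproduces those numbers; again this is my assumption about the allocation rule (parity per stripe of up to (disks - parity) data sectors, padded to a multiple of parity+1), not an official formula:
Code:
# share of one block's allocation lost to parity + padding on raidz with p parity disks
raidz_lost() {
    local disks=$1 p=$2 ashift=$3 vbs=$4
    local data=$(( vbs / (1 << ashift) ))
    local stripes=$(( (data + disks - p - 1) / (disks - p) ))   # ceil(data / (disks-p))
    local total=$(( data + stripes * p ))
    total=$(( (total + p) / (p + 1) * (p + 1) ))                # pad to a multiple of p+1
    echo "$(( (total - data) * 100 / total ))% of the allocation lost"
}
raidz_lost 9 3 12 8192     # 9-disk raidz3, ashift=12, 8K   -> 75% lost
raidz_lost 9 3 12 131072   # 9-disk raidz3, ashift=12, 128K -> 38% lost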
 
All my testing has been done on either single drive ZFS or mirrors.

Increasing the volblocksize decreases metadata overhead and improves compression, but it also increases waste when working with small files.

Only a few days ago I made a change on a Windows guest.

With a 4k NTFS cluster size on a 64k volblocksize it couldn't reach bare-metal write speeds on the SSDs.

The usage excluding snapshots was as follows (LZ4 compression):

30.7G referenced 23.6G written. 1.33x compressratio.

After bumping the NTFS cluster size to 64k, write speeds increased notably:

38.5G referenced 21.7G written. 1.76x compressratio.

I wasn't expecting the guest usage to go up by as much as 8 gigs, but the metadata and compression savings outweighed it, so there is physically less data written. Plus I now have matching guest and host block sizes.
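Those numbers come straight from the zvol properties on the host, e.g. (zvol name is made up):
Code:
zfs get referenced,written,compressratio rpool/data/vm-102-disk-0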

I discovered my DC P4600 SSD is fastest at 16k bare metal (internally it has a 16k page size). So for that drive I would use at least 16k for both the NTFS cluster size and the volblocksize.

I believe the reason why the NTFS side locally lost 8 gigs is that it's the OS drive, and the SxS stuff generates thousands of tiny ASCII files. But that also means they are highly compressible.
 
