So what I still fail to understand is how this all applies to PVE.
My main problem is that I could not find out how a normal VM, let's say a Windows guest or an nginx webserver, mostly writes (i.e. which block sizes, and how random vs. sequential the writes are).
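For lack of real data, here is how I have been trying to estimate it myself; the pool name, device and drive letter are just examples from my setup, so take this as a rough sketch rather than a proper measurement:

```
# Inside a Windows guest: show the NTFS cluster size (default 4K, "Bytes Per Cluster")
fsutil fsinfo ntfsinfo C:

# Inside a Linux guest: show the ext4 block size (usually 4K)
tune2fs -l /dev/sda1 | grep "Block size"

# On the PVE host: watch the request size histogram of the pool
# while the guest is busy ("rpool" is just my pool name)
zpool iostat -r rpool 5
```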
Assuming these both mostly read/write in 64k chunks and PVE users mostly use mirrors, here is how I would interpret the OpenZFS docs:
- sector alignment of guest FS is crucial
The PVE GUI takes care of that.
- most guest FSes use a default block size of 4-8 KB, so:
- Larger volblocksize can help with mostly sequential workloads and will gain compression efficiency
To gain compression efficiency and make better use of the ARC cache, 64k would be the best setting for a Windows guest or a Linux nginx server.
- Smaller volblocksize can help with random workloads and minimize IO amplification, but will use more metadata (e.g. more small IOs will be generated by ZFS)
We could avoid that completely by adding an additional disk with 16k volblocksize to the guest, just to store its MySQL DB (which writes in 16k pages); see the sketch after this list.
- and may have worse space efficiency (especially on RAIDZ and DRAID)
This mostly does not apply to PVE, because you guys strongly recommend mirrors to begin with.
- It’s meaningless to set volblocksize less than guest FS’s block size or ashift
Makes sense.
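To make the extra-disk idea from above concrete, this is roughly how I would do it; the storage name "local-zfs", VMID 100 and the sizes are just placeholders, and I am not sure this is the intended way, so please correct me if not:

```
# Check the pool's ashift and what an existing disk currently uses
zpool get ashift rpool
zfs get volblocksize,compressratio rpool/data/vm-100-disk-0

# Temporarily switch the storage's block size, add a dedicated DB disk,
# then switch back (volblocksize is fixed at creation and cannot be changed later)
pvesm set local-zfs --blocksize 16k
qm set 100 --scsi1 local-zfs:32     # new 32 GB disk just for the MySQL data dir
pvesm set local-zfs --blocksize 64k # back to the value used for the normal disks
```

If I understand the storage plugin correctly, only newly created disks pick up the blocksize; existing zvols keep whatever volblocksize they were created with.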
Are there any errors in my thinking above?
If they are true, I think PVE could use 16k as the safe default (in case a user forgets to put the DB on an additional disk; for most DBs that would not even be needed), or go with the slightly riskier default of 64k to get better performance and compression for most users. Maybe even set it depending on whether the user has RAIDZ or a mirror.
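And just to show what I mean by "default": as far as I understand, this is the knob that would have to change, either per storage in /etc/pve/storage.cfg or by the installer. The entry below is only an example based on my pool:

```
# /etc/pve/storage.cfg - example zfspool entry, with 16k volblocksize for newly created disks
zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        blocksize 16k
```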