default block size 8k

That change is not part of any tagged release.
We'll keep an eye on it. We may change our default once it is part of the release we package.
 
On a related note: why does Proxmox warn about using 4k? How does it waste space, when the (mirrored) pool uses 4k? I see this message when cloning VMs:
Warning: volblocksize (4096) is less than the default minimum block size (8192). To reduce wasted space a volblocksize of 8192 is recommended.
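For anyone who wants to check the values involved, something like this should work (pool and zvol names are just examples from my setup):
Code:
# pool-wide sector size (ashift)
zpool get ashift rpool

# volblocksize of one of the cloned VM disks
zfs get volblocksize rpool/data/vm-100-disk-0

# the storage-level default is the "blocksize" option in /etc/pve/storage.cfg
cat /etc/pve/storage.cfg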
 
Warning: volblocksize (4096) is less than the default minimum block size (8192). To reduce wasted space a volblocksize of 8192 is recommended.
I would also like to know that. Three things come to mind:
1.) ZFS was initially made for Solaris with 8K in mind. Maybe it was optimized/designed for 8K, and the default can't be lowered without breaking backwards compatibility?
2.) ZFS stores multiple copies of metadata. With 4K the data-to-metadata ratio would be even worse.
3.) On 4K-sector disks this would waste space, and compression can't help (there is no point compressing a 4K block of data when you can only read/write full 4K sectors anyway).
 
So what I still fail to understand is how this all applies to PVE.
My main problem is that I could not find out how a normal VM, let's say a Windows guest or an nginx webserver, mostly writes.
Assuming they both mostly read/write 64k and PVE users mostly use mirrors, here is how I would interpret the OpenZFS docs:

  • sector alignment of guest FS is crucial
PVE GUI takes care of that.
  • most guest FSes use a default block size of 4-8KB, so:
    • Larger volblocksize can help with mostly sequential workloads and will gain compression efficiency
To gain compression efficiency and make better use of the ARC cache, 64k would be the best setting for a Windows guest or a Linux nginx host

    • Smaller volblocksize can help with random workloads and minimize IO amplification, but will use more metadata (e.g. more small IOs will be generated by ZFS)
We could completely avoid that by adding an additional 16k disk to the guest to store its MySQL DB (which uses 16k pages); see the sketch after this list.
    • and may have worse space efficiency (especially on RAIDZ and DRAID)
Mostly does not apply to PVE, because you guys strongly recommend mirrors to begin with

    • It’s meaningless to set volblocksize less than guest FS’s block size or ashift
Makes sense.
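Something like the following is what I have in mind for the extra DB disk. This is only a rough sketch with made-up pool, VM and storage names; in practice you would probably do it via a second zfspool storage entry rather than a manual zvol:
Code:
# hypothetical: a dedicated zvol for the guest's MySQL data,
# matching InnoDB's 16k page size (pool/VM names are placeholders)
zfs create -V 32G -o volblocksize=16k rpool/data/vm-100-disk-1

# or define a second zfspool storage with its own default block size
# in /etc/pve/storage.cfg and create the DB disk on that storage:
#   zfspool: zfs-16k
#           pool rpool/data
#           blocksize 16k
#           content images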


Do I have any thinking errors in the statements above?

If they are true, I think PVE could use 16k as a safe default (in case the user forgets to put the DB on an additional disk, and most DBs don't even need one), or go with the slightly riskier default of 64k to get better performance and compression for most users. Maybe even set it according to whether the user has RAIDZ or a mirror.
 
Having a larger volblocksize but small reads/writes from the guest means wasted RMW cycles (e.g., random 4k/8k writes on a 64k volblocksize volume have a large overhead). You really need to tune your volumes and the guests according to your workload; there is no one-size-fits-all solution.

Mostly does not apply to PVE, because you guys strongly recommend mirrors to begin with

I wish this were true - take a look around the forum and you will see tons of threads from people using raidz wondering why their space usage goes through the roof ;)
 
Thank you Fabian for your response, it is highly appreciated.
You really need to tune your volumes and the guests according to your workload; there is no one-size-fits-all solution
Is there some rough rule of thumb, or a way to find out what workload a guest has?
For example, I have some Windows guests for specific software; they are mostly idle, and the only writes they ever see are Windows Updates. But I have no idea whether Windows Update mostly writes 4k or 64k.

Or if I set up a Debian VM as an NGINX web host or something, what workload would that guest mostly see?
I wish this were true
:)
 
Code:
zpool iostat -r
can at least tell you which write sizes end up on the ZFS layer. You can compare the amount written there, on the physical disk, and on the virtual disk inside the VM to see the write amplification at each step.
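For example, roughly like this (device names are placeholders, adapt them to your setup):
Code:
# request-size histogram on the ZFS layer
zpool iostat -r rpool

# writes that actually reached the physical disk (host side)
grep sda /proc/diskstats                   # 10th field = sectors written
smartctl -a /dev/sda | grep -i written     # if the drive exposes a written counter

# writes issued by the guest - run inside the VM
grep vda /proc/diskstats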
 
As far as I understand, Linux and Windows both use 4K block sizes by default. You can change that, but it is not that easy. Win10/11, for example, will always format the system disk with a 4K cluster size, won't offer you a way in the GUI to change that, and you can't change the cluster size without reformatting.
And on Linux, ext4/xfs can't use a block size that is bigger than the RAM page size, which is usually 4K unless you increase it, which might cause other problems.

And then there is the QEMU virtual disk, which reports a 512B logical/physical block size to the guest. That sounds really bad, but I actually wasn't able to see a big performance improvement by changing it from 512B/512B to 512B/4K.
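If someone wants to verify those numbers on their own guests, roughly like this (device paths and drive letters are just examples):
Code:
# Linux guest: filesystem block size and RAM page size
tune2fs -l /dev/vda1 | grep "Block size"
getconf PAGESIZE

# Windows guest: NTFS cluster size ("Bytes Per Cluster")
fsutil fsinfo ntfsinfo C:

# logical/physical sector size the virtual disk reports to the guest
lsblk -o NAME,LOG-SEC,PHY-SEC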
 
Code:
zpool iostat -r
can at least tell you which write sizes end up on the ZFS layer. You can compare the amount written there, on the physical disk, and on the virtual disk inside the VM to see the write amplification at each step.
Thanks a lot. I see surprisingly high async 4k and 8k ind counts, similar to 64k or 128k. So at least for my setup, 8k seems to be the optimal value. The PVE defaults make perfect sense :)
 
Looks like 4K would be the winner here on the thin-client with 5 VMs (3x Debian, 1x HAOS, 1x OPNsense) and 3 LXCs (Debian):
Code:
root@j3710:~# zpool iostat -r

VMpool        sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0
4K          4.01M      0  1.81M      0   672K      0  42.4M      0   927K      0      0      0
8K          2.67M   142K  3.16M     29   946K  51.9K  10.6M  4.99M  1.04M  89.5K      0      0
16K          156K   245K  44.1K     47   167K   119K  4.20M  2.65M  21.8K   177K      0      0
32K          209K   296K  1.83M    373   258K   153K  6.21M  1.19M  32.7K   179K  4.01M      0
64K         80.3K   471K   364K    944   110K   236K  1018K  1.43M  25.4K   250K  2.96M      0
128K        16.2K   405K  1.84M     78  12.0K   197K    482   246K  6.85K   212K  1.21M      0
256K            0      0      0      0      0      0      0      0      0      0   384K      0
512K            0      0      0      0      0      0      0      0      0      0  82.1K      0
1M              0      0      0      0      0      0      0      0      0      0  27.7K      0
2M              0      0      0      0      0      0      0      0      0      0  21.5K      0
4M              0      0      0      0      0      0      0      0      0      0  16.8K      0
8M              0      0      0      0      0      0      0      0      0      0  6.29K      0
16M             0      0      0      0      0      0      0      0      0      0  2.10K      0
----------------------------------------------------------------------------------------------

rpool         sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0
4K          33.8K      0   914K      0  1.21K      0  16.2M      0  23.9K      0      0      0
8K          6.98K      0    218      0  3.34K     11  3.65M  2.29M  6.17K  1.41K      0      0
16K         13.3K      0      4      0  66.0K     32  2.73M  1.04M  36.1K  1.88K      0      0
32K         16.4K      6    302      0  37.4K    937  1.47M   531K  17.5K  7.57K   886K      0
64K         8.61K     28    690      0  17.4K  1.47K   608K   390K  21.0K  36.1K   534K      0
128K        5.68K      4  20.0K      0   106K     94   121K  75.1K   107K  8.46K   323K      0
256K            0      0      0      0      0      0      0      0      0      0   178K      0
512K            0      0      0      0      0      0      0      0      0      0  74.2K      0
1M              0      0      0      0      0      0      0      0      0      0  17.3K      0
2M              0      0      0      0      0      0      0      0      0      0  1.05K      0
4M              0      0      0      0      0      0      0      0      0      0    294      0
8M              0      0      0      0      0      0      0      0      0      0     52      0
16M             0      0      0      0      0      0      0      0      0      0      6      0
----------------------------------------------------------------------------------------------
Does anyone know how to interpret "agg" and "ind"?
 
This is my mostly Debian VMs host:
Code:
rpool         sync_read    sync_write    async_read    async_write      scrub         trim   
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0
4K           349K      0  3.25M      0  19.7K      0  33.0M      0  1.93M      0      0      0
8K           301K  1.95K  7.55M      0  60.5K  1.02K  7.42M  9.72M  2.29M   185K      0      0
16K         2.40K  4.88K  3.31M      0    849  2.59K  1.80M  13.7M  25.7K   340K      0      0
32K         2.83K  3.75K  1.52M      0  2.63K  3.12K  4.53M  8.79M  34.2K   167K   158K      0
64K           941  2.23K  1.71M      1    466  3.49K   501K  7.92M  19.6K   172K   156K      0
128K            3    463  2.33M      0      4  1.16K      2  1.25M   155K   206K   143K      0
256K            0      0      0      0      0      0      0      0      0      0   129K      0
512K            0      0      0      0      0      0      0      0      0      0   120K      0
1M              0      0      0      0      0      0      0      0      0      0   105K      0
2M              0      0      0      0      0      0      0      0      0      0  72.7K      0
4M              0      0      0      0      0      0      0      0      0      0  32.4K      0
8M              0      0      0      0      0      0      0      0      0      0  8.84K      0
16M             0      0      0      0      0      0      0      0      0      0  1.67K      0
----------------------------------------------------------------------------------------------

and this is my mostly Windows VMs host:

Code:
rpool         sync_read    sync_write    async_read    async_write      scrub         trim   
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0
4K          8.04M      0  8.71M      0   335K      0  97.7M      0   447K      0      0      0
8K          9.96M  82.9K  16.4M      4   406K  29.0K  18.1M  36.1M   290K   134K      0      0
16K         14.6K   252K  3.66M      7  2.15K  73.4K  2.51M  34.2M  25.4K   302K      0      0
32K          120K   296K  2.46M     51  22.9K  82.8K  28.7M  19.9M   106K   479K      0      0
64K         3.54K   329K  5.78M    670  3.31K   100K  1.26M  33.9M  38.6K  1.54M      0      0
128K        1.41K   337K  17.2M    233  4.55K   120K   214K  7.85M   462K  4.48M      0      0
256K            0      0      0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------
 
"zpool iostat -r" is the wrong way to look at it as it doesn't account for IOP and the metadata overhead of ZFS itself, which makes every block update under about a 64k size have about the same cost. I've been benchmarking VMs filesystems on underlying ZFS since OpenSolaris 10 days, and 64k volblocksize or recordsize has always consistently been the sweet spot. It also ends up taking negligible more memory. Most filesystems use extents and so a worst case scenario of explosive memory usage due to small updates across many arbitrary 64k blocks - virtually never happens.
 
