ZFS Block Size recommendation for IO optimisation?

Mar 7, 2022
What is the recommended "Block Size" for ZFS when running Windows Server 2019/2022 guests (KVM)?

The default NTFS cluster size is 4K; the guidance for Exchange and SQL Server is 64K. I don't know what the guidance for file servers is. These are Server 2019 and possibly 2022 guests.


I have 8 NVMe drives running in a ZFS-based RAID10.

Bash:
zpool status
  pool: PRX01-ZFSRaid10
 state: ONLINE
config:

        NAME                               STATE     READ WRITE CKSUM
        PRX01-ZFSRaid10                    ONLINE       0     0     0
          mirror-0                         ONLINE       0     0     0
            nvme-WUS4BB038D7P3E3_A05746A1  ONLINE       0     0     0
            nvme-WUS4BB038D7P3E3_A05746C9  ONLINE       0     0     0
          mirror-1                         ONLINE       0     0     0
            nvme-WUS4BB038D7P3E3_A067AA2C  ONLINE       0     0     0
            nvme-WUS4BB038D7P3E3_A057469C  ONLINE       0     0     0
          mirror-2                         ONLINE       0     0     0
            nvme-WUS4BB038D7P3E3_A05746C7  ONLINE       0     0     0
            nvme-WUS4BB038D7P3E3_A0557E82  ONLINE       0     0     0
          mirror-3                         ONLINE       0     0     0
            nvme-WUS4BB038D7P3E3_A05746B0  ONLINE       0     0     0
            nvme-WUS4BB038D7P3E3_A05746D6  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 ONLINE       0     0     0
          mirror-0                                            ONLINE       0     0     0
            ata-INTEL_SSDSC2KG240G8_BTYG026209JP240AGN-part3  ONLINE       0     0     0
            ata-INTEL_SSDSC2KG240G8_PHYG0064006A240AGN-part3  ONLINE       0     0     0

The NVMe drives' sector size is 4K.
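For reference, this is how the sector size can be verified (the device name is just an example):

Bash:
# logical/physical sector size as seen by the kernel
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/nvme0n1
# LBA format the drive is currently formatted with (nvme-cli)
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"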


1. Question: If I set the NTFS cluster size to 64K, should I set the ZFS block size to
  • 16K? (4K sector size × 4 RAID-0 mirrors)
  • 64K? (same as the NTFS cluster size, which gets written as 4× 16K chunks to the NVMe drives and split into 4K sectors on each NVMe)
My inkling is that the 64K option is correct.

2. Is there any guidance for Windows file servers (think a couple of megabytes per Excel file and the odd ISO) and log servers such as Grafana?
 
Yeah, bump the volblocksize to 64K to match NTFS; that will also give you better compression than the stock 8K.

Make sure ashift is 4k (ashift=12).
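If you want to verify what the existing pools were created with (pool names taken from the zpool status output above):

Bash:
# ashift=12 means 2^12 = 4096-byte allocation units
zpool get ashift PRX01-ZFSRaid10 rpool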
 
Yeah, bump the volblocksize to 64K to match NTFS; that will also give you better compression than the stock 8K.
Just keep in mind that PVE will use the same volblocksize for all zvols you create or restore from a backup. If you are sure that all workloads are using a 64K blocksize or higher, that should be fine. But performance will be horrible as soon as some workload wants to read/write 4K to 32K blocks. So using something like PostgreSQL with its 8K blocksize, MySQL with its 16K blocksize and most Linux filesystems with their 4K blocksize should be terribly slow.
By the way, I tried to install Win10/Win11 with a 64K cluster size for the system partition... that is really a pain because the Windows installer always uses a 4K cluster size with no GUI option to change it. But maybe that is easier with Windows Server 2019/2022 and custom install ISOs.
Make sure ashift is 4k (ashift=9).
Ashift=12 would be 4k. Make sure not to use "ashift=9" unless you use HDDs with a physical/logical sector size of 512B/512B.
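The per-storage blocksize is set on the PVE storage definition, and volblocksize itself can only be set when a zvol is created; a sketch (the storage ID and zvol name here are examples):

Bash:
# blocksize used for newly created zvols on a ZFS storage
pvesm set PRX01-ZFSRaid10 --blocksize 64k
# check what an existing zvol was created with
zfs get volblocksize PRX01-ZFSRaid10/vm-100-disk-0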
 
2. Is there any guidance for Windows file servers (think a couple of megabytes per Excel file and the odd ISO) and log servers such as Grafana?
That depends on the guest settings. There is no "best way" as a rule of thumb. Determine your guest blocksize, create the zvol with the correct volblocksize, align the partitions in your guest on this boundary and use the previously determined blocksize. Test if the performance is good.
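As a rough sketch of that workflow for a Linux/ext4 guest (device, names and sizes are examples):

Bash:
# inside the guest: filesystem block size
tune2fs -l /dev/sda1 | grep 'Block size'
# on the host: create the zvol with a matching volblocksize
zfs create -V 100G -o volblocksize=64k PRX01-ZFSRaid10/vm-100-disk-1
zfs get volblocksize PRX01-ZFSRaid10/vm-100-disk-1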
 

Ashift=12 would be 4k. Make sure not to use "ashift=9" unless you use HDDs with a physical/logical sector size of 512B/512B.
Edited the post, thanks for correcting my mistake.
 
So using something like PostgreSQL with its 8K blocksize, MySQL with its 16K blocksize and most Linux filesystems with their 4K blocksize should be terribly slow. [...]
That depends on the guest settings. There is no "best way" as a rule of thumb. Determine your guest blocksize, create the zvol with the correct volblocksize, align the partitions in your guest on this boundary and use the previously determined blocksize. Test if the performance is good.

That leaves the following question: as far as performance is concerned, would you rather

Option 1: use 4 disks per ZFS RAID10 pool (1× 8K and 1× 64K pool)
Option 2: use nvme-cli to create separate namespaces for the 16K pool and the 64K pool (and use them for separate ZFS RAID10s that each span 8 namespaces)
Option 3: stick all 8 NVMe drives into a single ZFS RAID10 pool and create separate storage objects under the Datacenter > Storage view with 16K, 32K and 64K blocksizes

The way I understand it, the rule is: blocksize / number of mirror vdevs >= 4K (the sector size).

So if I use 4 disks (that leaves a stripe of 2 mirrors, which would accommodate >=8K).
If I use 8 disks (that leaves a stripe of 4 mirrors, which would accommodate a >=16K blocksize).
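Spelling out that arithmetic (this just illustrates my rule of thumb with 4K sectors, dividing the blocksize evenly across the mirror vdevs):

Bash:
# 64K block across 4 mirror vdevs -> 16K per vdev, i.e. 4x 4K sectors
echo $(( 64 * 1024 / 4 ))   # 16384
# 16K block across 4 mirror vdevs -> exactly one 4K sector per vdev
echo $(( 16 * 1024 / 4 ))   # 4096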

My workloads (as far as space requirements are concerned) are 30% Exchange, 50% SQL and large files, 10% general office file server, 5% "Drive C" and TMP files, and 5% random Linux software (mostly monitoring, time series and log aggregation).


By the way, I tried to install Win10/Win11 with a 64K cluster size for the system partition... that is really a pain because the Windows installer always uses a 4K cluster size with no GUI option to change it. But maybe that is easier with Windows Server 2019/2022 and custom install ISOs.
[...]
You need to do it on install using Diskpart.

Basically, open the command prompt in the installer (AFAIK it is Shift+F10):

Code:
diskpart
list disk
select disk #
list partition
select partition #
format fs=ntfs unit=<ClusterSize>
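For example, with a 64K cluster size the last command would be something like "format fs=ntfs unit=64K quick" (the quick flag skips the full format; disk and partition numbers depend on your setup).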
 
Hey @Wolff, what did you end up doing?
 
I have done some new testing: ext4 guest on a ZFS stripe host.

HDD spindles (as is the case most of the time for my tests; since nearly everyone nowadays only cares about NAND, I hope my testing is useful to others who also use spindles).
Tested using a ZFS stripe on the host (no redundancy; use case is large scratch storage).
Defaults for sync and caching, nocache and io_uring in QEMU, 4K ashift.
VirtIO SCSI single controller.
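For what it's worth, those disk/controller settings roughly correspond to something like this on the PVE side (VM ID, storage and volume names are just examples):

Bash:
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 tank:vm-100-disk-0,cache=none,aio=io_uring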
Compared LZ4 to no compression. The intended use case is incompressible data, although inodes still compress, which affects things a little on random I/O; fio data seems to be either incompressible or compress very poorly, as the ratio always ended up at 1.00x with LZ4 at the end of every test (it started around 8-14x right after the newfs command, thanks to inode compression).
Testing 16k zvol and 64k zvol.
Testing 4K, 16K and 64K ext4 cluster sizes; the latter two require the bigalloc ext4 flag. 64K clusters were not tested on the 16K zvol.
Only looking at write speeds and I/O delay, as reads can mostly be dealt with in the ARC, and read demand is much lower for the intended workload.
2-minute test length aimed at exhausting the ZFS dirty cache on a 2 GiB file loop; the VM had only 512 MiB of RAM, also to exhaust its own page cache.
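A fio invocation along these lines reproduces that kind of test (the parameters shown are only illustrative):

Bash:
fio --name=randrw4k --filename=/mnt/scratch/fio.dat --size=2G \
    --rw=randrw --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=120 --time_based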
ext4 on the guest with the following creation and mount flags (bigalloc only used when testing the larger cluster sizes):
create=flex_bg,bigalloc,dir_index,extent,sparse_super2 -E lazy_itable_init=0,lazy_journal_init=0,num_backup_sb=1,packed_meta_blocks=1,discard -T largefile4 -m 0
mount=noatime,lazytime,nouser_xattr,nobarrier,noacl,discard,auto_da_alloc,commit=5
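Written out as full commands, that is roughly the following (device, mount point and the 16K cluster size are examples; bigalloc and -C are dropped for the 4K runs):

Bash:
mkfs.ext4 -O flex_bg,bigalloc,dir_index,extent,sparse_super2 \
  -E lazy_itable_init=0,lazy_journal_init=0,num_backup_sb=1,packed_meta_blocks=1,discard \
  -T largefile4 -m 0 -C 16384 /dev/sdb1
mount -o noatime,lazytime,nouser_xattr,nobarrier,noacl,discard,auto_da_alloc,commit=5 /dev/sdb1 /mnt/scratch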

With a 4K I/O size and a large I/O queue on a random read/write workload, the 16K zvol is better. There is not much difference between 4K and 16K ext4 cluster sizes without LZ4, but with LZ4 the 16K ext4 clusters consistently deliver over double the throughput.
4K ext4 on the 16K zvol is the slowest by far; all the 64K zvol tests sit in between at around 40-70% faster, and the 16K zvol with 16K ext4 is about 120% faster. For that combination LZ4 vs. none had little effect, but on all the 64K zvol tests LZ4 yielded an improvement of around 15% in throughput, with a 25% regression in I/O delay. Best result: 16K ext4, 16K zvol, no compression.

Same test but with 256K I/O: throughput is much faster overall, however the 16K zvol only got a 5x increase while the 64K zvol averaged a 21-22x increase. In this test 4K was the fastest ext4 cluster size, but only by a small amount. LZ4 won out here overall; the fastest result over repeated runs was 4K ext4 and the 64K zvol with LZ4 compression, but only slightly. The ext4 cluster size is almost irrelevant in this test.

8K random write I/O at a queue depth of 1: not much to say here, there were no clearly bad or clearly good results across the combinations. If being picky, the 16K zvol was about 10% slower with LZ4, but recovered with no compression. Moving on.

1 MiB sequential writes at low queue depth: the 16K zvol is a disaster, about 12-13% of the 64K zvol's throughput, alongside huge latency. LZ4 edges it on the lower ext4 cluster size and no compression edges it on the larger ext4 cluster size, but all the 64K zvol tests are within 10% of each other. Best one over repeated runs: 4K ext4, 64K zvol, no compression.

Same as above but with 256K I/O. Perhaps not unexpectedly, the 16K zvol is still very bad, just with a slightly smaller gap: about 16% of the 64K zvol's throughput on 4K ext4 and about 25-28% on 16K ext4. Fastest: 16K ext4, 64K zvol, LZ4.

I wish I had tested ZLE, but I didn't realise fio data was incompressible until after a couple of testing runs were done. The tests where the 64K zvol was better, particularly the sequential ones, were much, much better; the 16K zvol had its biggest and only meaningful win on 4K random read/write, about 30% faster than the 64K zvol. Personally it is awkward because the workload is neither all random nor all sequential, it is a bit of both. But given the huge advantage on sequential, it will be a 64K zvol, probably alongside 64K, maybe 16K ext4 clusters (64K didn't have any meaningful throughput advantage, but it usually had lower I/O wait on the host without being more than margin-of-error slower than 16K; I assume that might be write-amplification related. Also, these tests were run in perfect empty-drive conditions, and 64K should get a bigger advantage as fragmentation builds up over time). I am going to do ZLE tests on the 64K zvol for the 4K random read/write to see how that behaves before I decide on compression, as 4K random read/write on the 64K zvol had a noticeable effect on I/O delay.

Did the ZLE tests; the results were bad, worse than both no compression and LZ4 for the 4K random read/write. I will go with 16K ext4 and a 64K zvol with LZ4 for this workload. When running mkfs the benefits of compression are really evident, and I expect that benefit from inodes to be noticeable in real-world workloads even with incompressible data.
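For completeness, the resulting zvol setup would look something like this (dataset name and size are examples; volblocksize can only be set at creation time, compression can be changed later):

Bash:
zfs create -V 200G -o volblocksize=64k -o compression=lz4 tank/scratch-vm-disk
zfs get volblocksize,compression,compressratio tank/scratch-vm-disk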
 