Adding ZFS pool with ashift=12; which block size option?

Dunuin

Famous Member
Jun 30, 2020
5,984
1,386
144
Germany
Why would I want to do that in the first place? One thing that is not debatable (at least I am under that impression) is the disk/ashift relationship: 512B/512B sectors → ashift=9 (2^9 = 512), and 512B/4096B or 4096B/4096B sectors → ashift=12 (2^12 = 4096).
If a different ashift is used, padding issues start.
Writing with a higher blocksize to a lower one is always fine, just not the other way round. You shouldn't lose any performance or capacity when using ashift 12 with a 512B/512B-sector disk. If you read https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz you will see that padding overhead is the result of a bad relation between volblocksize, ashift and your number of data-bearing disks. So it's no problem to go from ashift 9 to ashift 12, 13 or 14; you just also need to increase your volblocksize by a factor of 8, 16 or 32. Whether that is useful or not depends on your workload and how high you can go with your volblocksize.
Let's say your workload contains a Postgres DB that is reading/writing 8K blocks. For a raidz1/2/3 you would want to use an ashift of 9, because every disk combination with an ashift of 12 or above (see the spreadsheet shown in the blog post) would result in a volblocksize of at least 16K if you don't want to lose too much space to padding overhead. On the other hand, a 4-disk striped mirror would be totally fine with an ashift of 12. With an ashift of 13 only a 2-disk mirror would be useful, and an ashift of 11 would be fine for an 8-disk striped mirror.

In general, using ashift=9 is preferable, as long as all your disks allow it, as it will increase the range you can choose the volblocksize from. But one problem with it is upgradeability. Ashift can only be set once, at creation of the pool. If you choose an ashift of 9 there, you will be limited to 512B/512B physical/logical sector HDDs, and these get rarer and rarer until at some point in the future they completely disappear from the market. If 512B/4K HDDs are the only thing left to buy (or at least the only ones you can afford), you would need to destroy that pool and recreate it with an ashift of 12. So many people just directly use an ashift of 12, even when only using 512B/512B disks, so they can easily replace the disks later with anything they have lying around.
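The scaling rule above can be sketched in a few lines of Python (my own illustration, not an official ZFS formula): the smallest useful volblocksize for a striped mirror is one block per mirror vdev, i.e. 2^ashift times the number of mirror vdevs, rounded up to a power of two; raising ashift from 9 to 12/13/14 therefore scales that minimum by a factor of 8/16/32.

```python
def min_volblocksize(ashift: int, mirror_vdevs: int) -> int:
    """Smallest power-of-two volblocksize that still gives every
    mirror vdev at least one full 2^ashift block to write."""
    needed = (1 << ashift) * mirror_vdevs
    size = 1 << ashift
    while size < needed:
        size *= 2
    return size

# ashift 9 vs 12 on a 4-disk striped mirror (2 mirror vdevs):
print(min_volblocksize(9, 2))   # 1024 (1K)
print(min_volblocksize(12, 2))  # 8192 (8K) -> a factor of 8 bigger
print(min_volblocksize(13, 1))  # 8192 (8K) -> with ashift 13, only a plain 2-disk mirror stays at 8K
```

This also reproduces the examples from the post: an 8-disk striped mirror (4 mirror vdevs) with ashift 11 gives 2048 x 4 = 8K, which still matches an 8K workload.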
Or you could answer yes to my example, which relies on my actual configuration: 4 disks in raid10, which means two mirrors. Ashift is 9 (i.e. 512B per disk), 512B x 2 = 1024B, right? I've never seen anyone using this. OK, this is the minimum, but is it the optimum?
In my case, for instance, all VMs have an underlying filesystem of 4K (compression on). What is the math afterwards to calculate the theoretical (at least) value for the blocksize of the zvols those VMs will be installed on? I know it has something to do with reads and writes of 4K files when the blocksize is 1K. For instance, with some calculations you see the multiplication/duplication/quad-plication (not even a word) happening during writing and reading, and you can judge whether that number (1K here) would be good or bad and increase it. So with all the above happening, should I use
1K, 2K, 4K, 8K (which is the default), or 16K?
Yep, 1K would be the minimum volblocksize then, but that wouldn't make much sense to use. For deduplication you don't want the volblocksize to be too high. For block-level compression you don't want the volblocksize to be too low. For workloads with big files you don't want the volblocksize to be too low, as your data-to-metadata ratio will get worse. For workloads with a lot of smaller files you want the volblocksize to be lower than most of the small files. So it's really hard to choose a good volblocksize, because it depends on so many factors and most people don't fully understand their own workload. If you optimize your volblocksize for one thing, you always make it worse for something else. So most of the time it's more useful to choose something in the middle as a compromise, especially because PVE only allows you to set the volblocksize globally for the ZFS storage, for all virtual disks (see my feature request). I would go with a 4K volblocksize with a 4-disk striped mirror using ashift 9, as this is above the minimum useful volblocksize for your pool layout, and most filesystems are based on 4K blocks, so it's nice to match that.
PS: Thank you for the tip about SSDs, but I am aware of that.
Also, about the link explaining the different raidz levels in comparison with the disks being used, compression on/off and volblocksizes: apart from the bold letters I can't get it, and to tell you the truth I don't right now. For you it is perfect, for me it is complicated (probably extra missing knowledge on my part).
That article is written by one of the creators of ZFS himself, and he goes deep into the details of how ZFS works on the block level, explaining with examples why there is padding overhead. You can even get the formula to calculate the optimum volblocksize for each raidz1/2/3 setup if you look at the spreadsheet linked in that article.
 

ieronymous

Member
Apr 1, 2019
222
10
23
42
Writing with a higher blocksize to a lower one is always fine, just not the other way round. You shouldn't lose any performance or capacity when using ashift 12 with a 512B/512B-sector disk. If you read https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz you will see that padding overhead is the result of a bad relation between volblocksize, ashift and your number of data-bearing disks. So it's no problem to go from ashift 9 to ashift 12, 13 or 14; you just also need to increase your volblocksize by a factor of 8, 16 or 32. Whether that is useful or not depends on your workload and how high you can go with your volblocksize.
Let's say your workload contains a Postgres DB that is reading/writing 8K blocks. For a raidz1/2/3 you would want to use an ashift of 9, because every disk combination with an ashift of 12 or above (see the spreadsheet shown in the blog post) would result in a volblocksize of at least 16K if you don't want to lose too much space to padding overhead. On the other hand, a 4-disk striped mirror would be totally fine with an ashift of 12. With an ashift of 13 only a 2-disk mirror would be useful, and an ashift of 11 would be fine for an 8-disk striped mirror.

In general, using ashift=9 is preferable, as long as all your disks allow it, as it will increase the range you can choose the volblocksize from. But one problem with it is upgradeability. Ashift can only be set once, at creation of the pool. If you choose an ashift of 9 there, you will be limited to 512B/512B physical/logical sector HDDs, and these get rarer and rarer until at some point in the future they completely disappear from the market. If 512B/4K HDDs are the only thing left to buy (or at least the only ones you can afford), you would need to destroy that pool and recreate it with an ashift of 12. So many people just directly use an ashift of 12, even when only using 512B/512B disks, so they can easily replace the disks later with anything they have lying around.
OK, that would be sufficient in the first place. So I was trying to make everything (at least theoretically) 1:1.
With an OS doing reads/writes at 4K (possibly you could change that somehow, but I don't care to go in that direction), we need the underlying storage to use 4K blocks as well.
With ashift=9 and always a raid10 level:
2 drives equal 512 x 1 (mirror) = 512B
4 drives equal 512 x 2 (mirrors) = 1024B (my case)
6 drives equal 512 x 3 (mirrors) = 1536B
8 drives equal 512 x 4 (mirrors) = 2048B
10 drives equal 512 x 5 (mirrors) = 2560B
12 drives equal 512 x 6 (mirrors) = 3072B
14 drives equal 512 x 7 (mirrors) = 3584B
16 drives equal 512 x 8 (mirrors) = 4096B = 4K. Bingo! That dictates the number of disks to use in order to make everything 1:1 as much as possible.
Now with an ashift of 12 (512B-native drives or not):
2 drives equal 4096 x 1 (mirror) = 4096B = 4K block. So here you need way bigger drives to achieve this, which are nonexistent in 2.5"/SAS3/10k.
4 drives equal 4096 x 2 (mirrors) = 8192B = 8K, which is above what the OS reads/writes.
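The table above can be reproduced with a short Python sketch (my own arithmetic, not ZFS code). I also round the raw result up to the next power of two, since ZFS only accepts power-of-two volblocksizes:

```python
def round_up_pow2(n: int) -> int:
    """Round n up to the next power of two."""
    p = 1
    while p < n:
        p *= 2
    return p

SECTOR = 512  # ashift=9
for drives in range(2, 18, 2):
    mirrors = drives // 2   # raid10: 2-way mirrors, striped
    raw = SECTOR * mirrors  # one sector per mirror vdev
    print(f"{drives:2d} drives: {raw:4d}B raw -> {round_up_pow2(raw):4d}B usable volblocksize")
```

Note that for 10 drives the raw 2560B rounds up to 4096B, so in practice a 10-drive and a 16-drive raid10 land on the same 4K minimum usable volblocksize.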

I don't have a problem with upgradeability afterwards, since the drives I'm using come more easily at 512/512 and I have a lot of spares. In the near future SSDs will be more reliable/cheaper and everything will change again.

For deduplication you don't want the volblocksize to be too high.
What is considered too high, even though I don't use it?

For blocklevel compression you don't want the volblocksize to be too low
It is used by default, so what is considered too low?

Despite what @guletz suggested for the user below
I changed the default 8k to 4k, to better match most of my VM's.
<<<<<Hi,

Bad idea ;) You have a raid10 (striped mirror), so any 4K VM block must be split across each mirror (4K). At minimum you will need to use 8K => 4K for the first mirror + 4K for the second mirror.>>>>>>

Probably 8K because in his case he uses an ashift of 12?

But what you said to me earlier in the same thread seems way more logical:
<<<<<<When using an ashift of 9 (512B sectors) I would use a 4K volblocksize. That way it matches the 4K blocksize most filesystems
are based on, and it's still 8 times your sector size, so it should be fine for block-level compression and a striped mirror of up to 16 disks.
It's never a problem to write/read with a bigger blocksize to/from a smaller blocksize, but a big problem to do the opposite.
So, for example, doing 16K block operations on a zvol with an 8K volblocksize is absolutely fine.
But doing an 8K operation on a 16K volblocksize zvol would be bad and cause double the overhead, so you just get half the performance.
Going lower with the volblocksize than needed, your overhead will go up because the data-to-metadata ratios get worse and compression
won't be as effective. But if your volblocksize is higher than the blocksize of your workload it's even worse.
So I would rather use a smaller than a bigger volblocksize.>>>>>>>>>>>>>>>>

Based on the above, my current default of 8K for zvols on a 4K read/write OS loses performance, because
"doing an 8K operation on a 16K volblocksize zvol would be bad and cause double the overhead, so you just get half the performance" =
doing 4K operations on an 8K volblocksize.


Would it have the same impact on performance to switch from ashift 9 / zvol blocksize 4K to ashift 12 / zvol blocksize 8K?
Would the answer of @guletz then make sense if the OS does reads/writes at 4K?
 

Dunuin

With ashift=9 and always a raid10 level:
2 drives equal 512 x 1 (mirror) = 512B
4 drives equal 512 x 2 (mirrors) = 1024B (my case)
6 drives equal 512 x 3 (mirrors) = 1536B
8 drives equal 512 x 4 (mirrors) = 2048B
10 drives equal 512 x 5 (mirrors) = 2560B
12 drives equal 512 x 6 (mirrors) = 3072B
14 drives equal 512 x 7 (mirrors) = 3584B
16 drives equal 512 x 8 (mirrors) = 4096B = 4K. Bingo! That dictates the number of disks to use in order to make everything 1:1 as much as possible.
In theory yes, but the volblocksize can only be a power of two, so only 512B, 1K, 2K, 4K, ... are possible.
What is considered too high, even though I don't use it?


It is used by default, so what is considered too low?
There are no fixed best numbers. Mathematically, deduplication ratios should get worse the bigger your volblocksize is, because it gets harder and harder to find exactly matching duplicate blocks. Let's use a book as a metaphor, with a syllable, a word and a sentence as the "volblocksize". You will find very many identical syllables in that book, a lot of identical words, but just a few identical sentences.
On the other hand, choosing a very small volblocksize will result in a very big deduplication table and therefore much more RAM usage. So you might want something in the middle, like a word instead of a syllable or a sentence, so at least often-used words like "the", "as" and so on can be deduplicated.

With block-level compression it's the other way round. If you just have a syllable, there is not much data to work with; you can't really make an "er" or "en" any shorter. The more data you give the compression algorithm to work with, the easier it can find phrases that can be compressed.
I haven't run that many benchmarks, but from what I've seen so far, a volblocksize of 8K or 16K should result in a good compression ratio.
Despite what @guletz suggested for the user below

<<<<<Hi,

Bad idea ;) You have a raid10 (striped mirror), so any 4K VM block must be split across each mirror (4K). At minimum you will need to use 8K => 4K for the first mirror + 4K for the second mirror.>>>>>>

Probably 8K because in his case he uses an ashift of 12?
Yep. An ashift of 12 defines that 4K is the smallest block ZFS can work with, no matter what your disk would support. So in the case of a 512B/512B disk with an ashift of 12, ZFS will always read/write at least a 4K block, i.e. it will always read/write at least 8x 512B sectors at the same time.
If you then have a 4-disk striped mirror, you want at least a volblocksize of 8K, so the 8K block of the zvol can be written in parallel to both mirrors as a 4K block each.
If your ashift is lower, ZFS can work with smaller blocks, so your volblocksize can be lower too, as you can then still split it up enough to result in at least one block for each mirror.
Would it have the same impact on performance to switch from ashift 9 / zvol blocksize 4K to ashift 12 / zvol blocksize 8K?

Would the answer of @guletz then make sense if the OS does reads/writes at 4K?
With a 4-disk striped mirror, both ashift=9 + volblocksize=4K and ashift=12 + volblocksize=8K should be fine. But the first one should indeed be better, because of two things:
1.) If your Linux guest doesn't use "huge pages", your kernel can only read/write 4K blocks to your RAM, and filesystems like ext4/xfs can't use a bigger blocksize than your RAM is working with. So you are basically limited to filesystems that read/write 4K blocks, and as already said, writing with a smaller blocksize to a bigger one is bad. So writing a 4K block to a zvol that can only work with 8K blocks (volblocksize=8K) should result in overhead. When writing a 4K block to a 4K volblocksize zvol you shouldn't get that additional overhead.
2.) You can't really make use of block-level compression when your volblocksize is too close to your ashift. An example:
You have ashift=12 (ZFS can only store 4K blocks) and you use an 8K volblocksize (so your zvols will only work with 8K blocks of data). Now you enable compression, and the data of an 8K block can be compressed to 75%, i.e. 6K. ZFS can only work with 4K blocks, and 6K of data won't fit in a 4K block. The next bigger thing ZFS can store is 2x 4K blocks, i.e. 8K. So ZFS will store those 6K of data in two full 4K blocks, and the result is that it still consumes the full 8K of capacity, so the compression isn't actually saving any space but is still consuming CPU resources. You would need a compression ratio of at least 50% so that the 8K of data of the zvol's block fits in a single 4K block.
Now let's say you still have a volblocksize of 8K but an ashift of 9 (so ZFS works with 512B blocks). If that 8K block is 75% compressible, ZFS can store those 6K of data as 12x 512B blocks, so you really only consume 6K and not 8K of capacity. So here it is good if the volblocksize is a large multiple (say, a factor of 8) of your ashift.
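The two compression examples can be checked with a small calculation (my own arithmetic, mirroring the numbers in the text): the space a compressed block really occupies is its compressed size rounded up to whole 2^ashift allocation units.

```python
import math

def allocated(compressed_bytes: int, ashift: int) -> int:
    """Bytes actually consumed on the pool: compressed size rounded up
    to a whole number of 2^ashift-sized blocks."""
    unit = 1 << ashift
    return math.ceil(compressed_bytes / unit) * unit

# An 8K zvol block that compresses to 75% of its size (6K of data):
print(allocated(6144, 12))  # 8192 -> with ashift=12 nothing is saved
print(allocated(6144, 9))   # 6144 -> with ashift=9 the full 2K saving is kept
```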

And there is another blocksize we haven't talked about yet. By default, virtio will present the virtual disk to the VM as a virtual 512B/512B logical/physical sector disk. You can set that to 512B/4K (but only by editing your VM's config file, not via the web UI), and native 4K (4K/4K) isn't supported by KVM at all. So if you have a 512B/512B physical disk, an ashift of 9, a volblocksize of 4K, virtio using the default 512B/512B, and an ext4 filesystem using 4K blocks in the guest, it would look something like this:
Ext4 writes 4K blocks to the virtual disk -> virtio writes 512B blocks to the zvol -> the zvol writes 4K blocks to the pool -> the pool writes 512B blocks to the physical disks. So in theory it should be bad that virtio writes 512B blocks to a 4K zvol, but running some benchmarks with virtio using 512B/512B vs 512B/4K, I wasn't able to see a noticeable difference in performance or write amplification. So I guess that's not a big problem there.

So you see, the question "What volblocksize should I use?" can't be answered with a simple rule. It totally depends on the pool layout, hardware, workload and so many other factors that you really need to understand in detail how ZFS works to make a good decision, as every benefit also comes with a downside. You want to choose a volblocksize that fits your actual workload, so the pool is good at the things your workload requires, and the resulting drawbacks hopefully aren't that important for your workload, so that the benefits more than compensate for the downsides. Otherwise it is more useful to choose something in the middle as a compromise that isn't great at anything but also not that bad.
 

ieronymous

Ext4 writes 4K blocks to the virtual disk -> virtio writes 512B blocks to the zvol -> the zvol writes 4K blocks to the pool -> the pool writes 512B blocks to the physical disks. So in theory it should be bad that virtio writes 512B blocks to a 4K zvol, but running some benchmarks with virtio using 512B/512B vs 512B/4K, I wasn't able to see a noticeable difference in performance or write amplification. So I guess that's not a big problem there.
Nice example there!

You can't really make use of block-level compression when your volblocksize is too close to your ashift.
So in my case I want (because I believe it is better):
volblock=4K and ashift=9 = 512B x 2 mirrors = 1K. I don't consider this to be close.
The way it is now: volblock=8K and ashift=9 = 512B x 2 mirrors = 1K. There is even more distance than in what I am trying to achieve above.

In both cases, am I writing with 4K (bigger) or 8K (even bigger) to...?
1K (smaller), or
a 512B surface (meaning, because of the raid10 disks, 512 x 2)?
I believe it is 4x 1K, since it sees the raid10 as one device. It is another thing if that 4x 1K is then going to be split into 4x 512B x 2 of data to each mirror afterwards, right?

If that 8K block is 75% compressible
Do I somehow set how much % the compression should be? Or is it something static, which is why you started with "if"?
 

Dunuin

Do I somehow set how much % the compression should be? Or is it something static, which is why you started with "if"?
You don't set that. The compression algorithm will try to compress each block as much as possible. How much it is compressible depends only on the chosen compression algorithm, the blocksize, and how compressible the data itself is. Some blocks won't be compressible at all (for example, blocks that are part of a zip file that has already been compressed) and some blocks will be highly compressible (like the zeros of an empty part of a partition). So every single block has its own compression ratio.
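A quick way to see this in action (a hypothetical demo, using Python's zlib as a stand-in for ZFS's lz4/zstd): the same 8K blocksize gives wildly different ratios depending on the block's content.

```python
import os
import zlib

block_size = 8192
zeros = bytes(block_size)             # like an empty region of a partition
random_data = os.urandom(block_size)  # like a chunk of an already-compressed zip file

print(len(zlib.compress(zeros)))        # a few dozen bytes: highly compressible
print(len(zlib.compress(random_data)))  # roughly 8200 bytes: effectively incompressible
```

The random block actually comes out slightly *bigger* than 8K after compression, which is why ZFS stores such blocks uncompressed.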
 
