Adding ZFS pool with ashift=12; which block size option?

mailinglists

Hi,

I have created a zpool with 4 HDDs (two mirrors) and ashift=12.
I am about to add it to Proxmox, and one of the options is block size.
The default is 8k. Should I set it to 4k?
(Doesn't ashift=12 mean we have 4k?)
 
By default is 8k
= volblocksize

Yes it is (only for VMs). The default is a very bad value for most use cases. Try starting with at least 16k-32k. You will get better compression ratios, a better ARC hit rate, and lower metadata overhead.
For CTs it is another story (variable ZFS block size).

ashift sets the minimum block size (ashift=12 means 4k) that ZFS will write to each disk of the pool. Actual writes can be larger.

Anyway, you can create different datasets with different volume block sizes. Then it is up to you to pick the value that fits the use case of each VM/CT.
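
To make the ashift relationship concrete, here is a minimal sketch: ashift is the base-2 exponent of the smallest block written to a disk. The function name is made up for illustration and is not part of any ZFS tooling.

```python
def ashift_to_bytes(ashift: int) -> int:
    """Smallest per-disk block size in bytes for a given ashift (2 ** ashift)."""
    if ashift < 0:
        raise ValueError("ashift must be non-negative")
    return 1 << ashift


for ashift in (9, 12, 13):
    print(f"ashift={ashift} -> {ashift_to_bytes(ashift)} bytes")
```

So ashift=12 gives 4096 bytes (4k) and ashift=9 gives 512 bytes.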
 
Thanks for the hints. All is clear: it should be >= 4k.
 
Is there some rule-of-thumb or formula for the ZFS pool block size (which defaults to 8k)?

I have a ZFS mirror pool of 2 HDDs (with ashift (4k = 12)) and I am not sure how to set it up or what are the pros/cons for different use cases?
 
I would use something between 4K and 16K then. It really depends on your workload. If you have Linux guests with a lot of small reads/writes, use 4K so it matches the block size of your guests' ext4/xfs filesystems. If you primarily use it as storage for bigger media files, you could choose 16K so you get less overhead when writing big files and better compression ratios.
 
I am not sure I nailed it!

I intend to use only Linux VMs and containers; no data will be stored directly on the Proxmox host's ZFS filesystems. Some of these VMs will host database servers, others Docker containers with various servers, others a media server like Plex for my media files, a backup server, etc.
 
I intend to use only Linux VMs and containers; no data will be stored directly on the Proxmox host's ZFS filesystems.
LXCs will directly use the host's ZFS filesystems. But for these datasets the recordsize is used instead of the volblocksize, and the recordsize is an "up to" value and not a fixed number like the volblocksize, so there is rarely a need to change it from the default 128K recordsize.
Some of these VMs will host database servers,
If you intend to run databases, you should never use a volblocksize bigger than the block size of your DB. Which block size a DB uses depends on the DB. Usually it is 8, 16, 32 or 64K.
 
So if I understand correctly: if I have a DB that uses a 64K block size running in a VM, I will need to change the VM's OS/FS block size to 64K (since most default to 4K), in addition to setting the zvol volblocksize to 64K?
 
I changed the default 8k to 4k, to better match most of my VMs.

Is this the only change I need to perform? (I do not have any VMs yet)

My expectation is that when a new VM is created, the accompanying zvol will be automatically set to 4k by PVE?


I changed the default 8k to 4k, to better match most of my VMs.
Hi,

Bad idea ;) You have a RAID 10 (striped mirror), so any 4k VM block must be split across the mirrors (4k each). At minimum, you will need to use 8K => 4k for the first mirror + 4k for the second mirror.

Is this the only change I need to perform? (I do not have any VMs yet)

You will need to "instruct" your VM to use the same block size, at least for the guest's file system.

My expectation is that when a new VM is created, the accompanying zvol will be automatically set to 4k by PVE?

Correct.


Good luck / Bafta !
 
So if I understand correctly: if I have a DB that uses a 64K block size running in a VM, I will need to change the VM's OS/FS block size to 64K (since most default to 4K), in addition to setting the zvol volblocksize to 64K?
Datasets only use the recordsize. The default is 128K, which means ZFS can write records from 4K to 128K depending on the size of the file you want to write. If you write a 1KB file it will write a 4K record, if you write a 25KB file it will write a 32K record, and if you write a 1MB file it will write 8x 128K records.
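
The record sizing described above can be sketched as follows. This only models the simplified description in this post (round a small file up to the next power-of-two record, starting at the ashift size, and split big files into full records); real ZFS behaviour, compression and metadata are not modelled, and the function name is hypothetical.

```python
def dataset_records(file_size: int, ashift: int = 12, recordsize: int = 128 * 1024):
    """Return (record_count, record_size_bytes) under the simplified model above."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if file_size >= recordsize:
        # Big files are split into full-size records (ceiling division).
        return -(-file_size // recordsize), recordsize
    # Small files get one record, rounded up to a power of two >= 2**ashift.
    size = 1 << ashift
    while size < file_size:
        size *= 2
    return 1, size


for label, nbytes in (("1KB", 1024), ("25KB", 25 * 1024), ("1MB", 1024 * 1024)):
    count, size = dataset_records(nbytes)
    print(f"{label} file -> {count} x {size // 1024}K record(s)")
```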

But if you use zvols, the volblocksize is used instead. With a 64K volblocksize every write will be 64K, no matter how much or how little you want to write. If you just want to write 1KB, it will write a full 64K. If you write 1MB, it will write 16x 64K. So writing 1000x 1KB results in 64MB written and not just 1MB, a write amplification of factor 64, which is really bad.

A big volblocksize is really bad for all small writes because of the write amplification. A small volblocksize, on the other hand, is bad for all big writes, because you get a lot of overhead relative to the data you want to write. Let's say, for example, that ZFS writes an additional 16K of metadata for every block. If you want to write 4MB to a zvol with a 4K volblocksize, it needs to write about 1000x 4K blocks. So you write 1000x 4K of data + 1000x 16K of metadata, about 20MB in total and not just 4MB. Writing the same 4MB to a zvol with a 64K volblocksize would write 64x 64K of data + 64x 16K of metadata, so 5MB in total instead of 4MB. So the small volblocksize would cause 4 times the reads/writes because of the bad overhead relative to the data.

So you really need to decide for which workload you want good performance and for which you accept bad performance, because there is no perfect jack-of-all-trades volblocksize that works fine for all workloads. Or you use something in the middle, like a 16K volblocksize, which is not good for either small or big reads/writes but not really bad either.
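
The arithmetic in this post can be reproduced with a short sketch. The 16K of metadata per block is the hypothetical figure used above, not a measured ZFS value, and all function names are made up for illustration. Caching, compression and read-modify-write effects are ignored.

```python
KIB = 1024
MIB = 1024 * KIB


def blocks_needed(nbytes: int, volblocksize: int) -> int:
    """Number of volblocksize-sized blocks touched by a write (ceiling division)."""
    return -(-nbytes // volblocksize)


def write_amplification(io_size: int, volblocksize: int) -> float:
    """Bytes written to the zvol divided by bytes the guest asked to write."""
    return blocks_needed(io_size, volblocksize) * volblocksize / io_size


def total_written(data_bytes: int, volblocksize: int,
                  metadata_per_block: int = 16 * KIB) -> int:
    """Data plus the assumed per-block metadata, in bytes."""
    return blocks_needed(data_bytes, volblocksize) * (volblocksize + metadata_per_block)


print(write_amplification(1 * KIB, 64 * KIB))   # small write on a big block
print(total_written(4 * MIB, 4 * KIB) / MIB)    # big write on small blocks
print(total_written(4 * MIB, 64 * KIB) / MIB)   # big write on big blocks
```

With exact binary units the 4K case uses 1024 blocks rather than 1000, which is where the round 20MB figure comes from.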

Also keep in mind that PVE will use this volblocksize for all the virtual disks you create. You can manually create zvols with a different volblocksize using the CLI, but as soon as you restore a VM from a backup or migrate a VM, it will overwrite the custom volblocksize of that zvol and use the global volblocksize instead.
So yes, if your DB only does 64K blocks, a 64K volblocksize would give that DB the best performance, but a lot of other stuff might be terrible then. For example, a MySQL DB does 16K writes, so your read and write performance would be limited to 1/4 of the maximum, because for every 16K read/write a full 64K block would need to be read/written.
 
I changed the default 8k to 4k, to better match most of my VMs.

Is this the only change I need to perform? (I do not have any VMs yet)
Yep.
My expectation is that when a new VM is created, the accompanying zvol will be automatically set to 4k by PVE?
Yes, if you set the block size there, all newly created zvols will be created with a volblocksize of that value. All existing zvols will keep their volblocksize until you destroy and recreate them (which a backup restore or migration will do, for example).
 
Hi,

Bad idea ;) You have a RAID 10 (striped mirror), so any 4k VM block must be split across the mirrors (4k each). At minimum, you will need to use 8K
Right, if it is a striped mirror of 4 HDDs the minimum would be 8K.

"I have a ZFS mirror pool of 2 HDDs (with ashift (4k = 12))" sounded to me like it's just 2 HDDs as a mirror. But "i have created zpool with 4 HDDs (two mirrors) and ashift=12." sounds like it is a striped mirror.
 

Since I am also in the process of deciding the most suitable block size for my VMs (90% of them will be Windows Servers), and I also have a RAID 10 created with 4 drives, I was convinced I had to use 4k in order to avoid the padding issue. Now I notice that the RAID type and the number of disks in that RAID also come into the equation?
So I have to add more notes to my mini guides for future reference. I don't get why 8k is the minimum (since it is the default, I assumed it was the best possible block size and I would be on the safe side). What would happen, for instance, if the block was 4k on that RAID (how would that split happen)? And what if that RAID 10 was based on 8 drives and not 4?
Assuming pretty much all users will use mirror/RAID 10/raidz1 for VMs, how come there isn't a sticky thread with some examples? (Yes, I know there would have to be many examples, but even a few would give users something to base their own assumptions on.)

Thank you

PS: I mean the diagrams found online that show this type of RAID level with letters, like the one below.
1653549321265.png
OK, I get that the greater the number of mirrors, the greater the number of data chunks that need to be spread across those disks (even though RAID 10 claims this happens simultaneously).
I am trying to understand whether 4k or 8k consists of the A1 -> A8 parts. If yes, I still don't get how you arrive at <<At minimum, you will need to use 8K => 4k for first mirror + 4k for the second mirror.>> In the end, I can't correlate the above diagram with kilobytes.
 
For raidz1/2/3 this blog post basically describes it perfectly: https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz
And for a mirror/stripe/striped mirror you just use the formula: "sector size * number of striped vdevs"
So using ashift=12 means you are using a 4K sector size. And if you have 8 disks in a striped mirror, aka RAID 10, that means you have 4 mirror vdevs that are striped together. So 4K sector size * 4 striped vdevs = minimum 16K volblocksize.
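
As a quick sketch of the rule of thumb given in this post (sector size times number of striped vdevs), under the assumption that the sector size is 2 ** ashift. The function name is hypothetical, and this is the thread's rule of thumb, not an official ZFS formula.

```python
def min_volblocksize(ashift: int, striped_vdevs: int) -> int:
    """Rule-of-thumb minimum volblocksize in bytes: sector size * striped vdevs."""
    if striped_vdevs < 1:
        raise ValueError("need at least one vdev")
    return (1 << ashift) * striped_vdevs


# (ashift, number of striped mirror vdevs)
for ashift, vdevs in ((12, 2), (12, 4), (9, 2), (9, 4)):
    size = min_volblocksize(ashift, vdevs)
    print(f"ashift={ashift}, {vdevs} striped vdevs -> {size // 1024}K")
```

For the 4-disk striped mirror with ashift=12 discussed earlier in the thread, this gives 8K; for 8 disks (4 mirror vdevs) it gives 16K.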
 
Thanks for the link (I'll read it later on).
So when using ashift=12 that would mean you are using a 4K sectorsize.
I think the question is not when but why you use ashift=12. For 512e drives (4k physical, 512 logical) you use ashift=12 in order to avoid padding/shifting, or whatever it is called.
Is it <<you are using>>, or is it that your drives physically use 4k sectors?
For drives like mine, which are native 512 (physical and logical), I used ashift=9 to avoid the above issue.
So according to what you mentioned in your post, I am using a 512B sector size (not me, my drives natively).
I have 4 drives (a stripe of 2 mirrors), so 512B sector size x 2 striped vdevs = 1k volblocksize minimum? That number is weird.
 
The ashift defines the smallest block size that can be used. If you have 8 disks in a striped mirror with disks that use native 512B/512B logical/physical sectors, but you use ashift=12 (4K blocks per drive), you still have a minimum volblocksize of 4x 4K = 16K. If you use ashift=9 (512B block size per disk) it would be 4x 512B = 2K minimum volblocksize.

Also keep in mind when using SSDs that they will lie about the physical/logical block size. Internally they should all use something like 8K, 16K or even bigger, but not a single manufacturer will tell you what is actually used internally because it would make them look bad. So there it makes no sense to use an ashift below 12, as you just move the write amplification from the host to the SSD.
 
If you got 8 disks in a striped mirror with disks that use native 512B/512B logical/physical sectors but you use a ashift of 12
Why would I want to do that in the first place? One thing that is not debatable (at least I am under that impression) is the disk/ashift relationship. 512/512 means ashift=9 (2^9 = 512), and 512/4096 or 4096/4096 (not quite sure if this exists) means ashift=12 (2^12 = 4096).
If a different number is used for ashift, padding issues start.
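
The disk/ashift relationship described here is just a base-2 logarithm of the physical sector size. A minimal sketch, with a made-up function name and assuming the sector size is a power of two:

```python
def ashift_for_sector(physical_sector_bytes: int) -> int:
    """Return the ashift whose block size (2 ** ashift) equals the sector size."""
    n = physical_sector_bytes
    if n <= 0 or n & (n - 1):
        raise ValueError("sector size must be a positive power of two")
    return n.bit_length() - 1


for sector in (512, 4096):
    print(f"{sector}B physical sectors -> ashift={ashift_for_sector(sector)}")
```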

If you use ashift=9 (512B blocksize per disk) it would be 4x 512B = 2KB minimum volblocksize.
Or you could answer yes to my example, which is based on my actual configuration: 4 disks in RAID 10, so two mirrors. ashift is 9 (i.e. 512B per disk), so 512B x 2 = 1024B = 1k. Right? I have never seen anyone use this. OK, this is the minimum, but is it the optimum?
In my case, for instance, all VMs have an underlying filesystem with 4k blocks (compression on). What is the math for calculating the (at least theoretical) zvol block size for the storage those VMs will be installed on? I know it has something to do with writes and reads of 4k files when the block size is 1k. For instance, with some calculations you could see the multiplication happening during writes and reads, judge whether that number (1k here) is good or bad, and increase it if needed. So with all of the above in mind, should I use
1k, 2k, 4k, 8k (the default), or 16k?

PS: Thank you for the tip about SSDs, but I am aware of that.
Also, about the link explaining the different raidz levels in relation to the number of disks used, compression on/off, and volblocksizes: apart from the
bold letters I can't follow it, and to tell you the truth I don't get it right now. For you it is perfect; for me it is complicated (probably some missing
knowledge on my part).
 