Blocksize / Recordsize / Thin provision options

ieronymous

Hi


Every time I configure a new Proxmox node for a production environment (which happens every 1-2 years), I get stuck on the volblocksize value.
As of 2024 I noticed it defaults to 16k. My questions are as follows:

- How come the <blocksize> parameter isn't available to choose or modify during creation of the ZFS RAID under node -> Disks -> ZFS?

- Given that I have a ZFS RAID10 with 4x 512e SSDs (512B logical / 4K physical, enterprise ones) for the VMs, and they are all
Windows Server 2019 DC / SQL / RDS / 2x Win11, all NTFS formatted (so a 4K filesystem), what is the best value to set on the zvol?
Compression is left at the default (lz4), dedup = off and ashift = 12.
In my old setup I had changed it to 4K, but I'm still not sure whether that is the best value for performance and for not wearing out the drives too quickly.

- I read somewhere that this value (blocksize) can't be changed afterwards, but I don't think that is true, since it would contradict the GUI letting you change its value. Otherwise it would be a greyed-out option.

- Neither zfs get all nor zpool get all gives me info about volblocksize. Is there a command to check the current block size of a zvol via the CLI?

The recordsize option has a default value of 128K, which if I recall correctly applies to the ZFS filesystem (dataset) side, while blocksize is the block size of the zvol.
Does it have to be changed accordingly if I change the volblocksize to 4K? In general, which of the two (or both) do I need to change for my configuration above?

As for the thin provision checkbox under Datacenter -> Storage -> name_of_storage_you_created -> Options:
if someone uses raw space instead of qcow2 for the VM storage, is there a point in enabling it?
I know what it does; what I don't know is its effect on raw storage.

Thank you in advance.

PS: Over all these years of experimenting with Proxmox installation/configuration I have kept my own documentation, so that I don't ask the same things twice and have a quicker way of finding configuration parameters. Yet the questions above are still question marks in my mind, so please don't answer with general links where somewhere inside there may (or may not) be a line that answers my question. I would be grateful for answers as close as possible to my exact use case, since this is the configuration I follow for all my setups.

Thank you once more.
 
@Dunuin we had a similar debate 2-3 years ago.

According to what you said back then (in "ZFS Raid 10 mirror and stripe or the opposite"):
<<.....And stuff like the volblocksize of a zvol can only be set once at creation and that attribute is read-only after that. >>

So how come there is the ability to change the value (default 16K) to something else?

<<Recordsize is ONLY used for datasets, not zvols. Zvols ignore the recordsize and use the volblocksize instead. In general all VMs should use zvols and all LXCs use datasets. So that recordsize will only affect LXCs but not VMs. If you want your VMs' virtual disks (zvols) to operate with a 16K blocksize you need to set the volblocksize to 16K and not the recordsize.>>

Nice, this answers my recordsize question, but after that you're saying that if I want my VMs to operate at 16K I need to set it accordingly.
This raises 2 questions:
- I know that I can set it, but what you're saying contradicts your first paragraph: <<volblocksize of a zvol can only be set once at creation and that attribute is read-only after that>>
- This specific value, 16K, isn't only about what I want the VMs to operate at, but also about the hardware that lies underneath. Right?

<<Did you check the "thin" checkbox when creating the pool using the GUI? In that case your virtual disks should be thin-provisioned and it doesn't really matter if you got unprovisioned space because this unused space won't consume any capacity. For ZFS to be able to free up space you need to use a protocol like virtio SCSI that supports TRIM and also tell the VM (VM config) and guest OS to use TRIM/discard.>>

If I recall correctly, I didn't have the choice to do it during creation of the VM storage, but only afterwards from Datacenter. Of course, since we had that talk many things might have changed. Still, the question remains: since the storage is of raw-disk type, does enabling thin provisioning have any impact at all?
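
For my documentation, the PVE side of what you described back then should look roughly like this, if I got it right (the VM ID and storage name are just placeholders):

Code:
# switch the VM to VirtIO SCSI and enable discard + SSD emulation on the disk
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 local-zfs:vm-100-disk-0,discard=on,ssd=1
# then let the guest actually trim (Windows: "Optimize Drives", Linux: fstrim -av)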

You also said about thin provisioning:

<<Under "Datacenter -> Storage -> YourPool -> Edit" there is a "thin provisioning" checkbox but I also guess that only works for newly created virtual disks. But doing a backup+restore also should result in a "new" VM. Thats how I change my volblocksize for already created virtual disks, which also can only be set at creation.>>

Maybe you meant to say <<... but I also guess that only works for newly created VMs>> instead of virtual disks?
This is another paragraph where you confirm that you can change the blocksize afterwards.

I had a question about raw images and snapshots but I guess you answered it as well with
<<If you use ZFS + RAW and do a snapshot PVE will use ZFSs native snapshot functionality.
If you use a qcow2 ontop of a ZFS dataset PVE should use the qcow2s snapshot functionality.
So you can snapshot with both but they behave differently. With ZFS snapshots you can for example only roll back but never redo it, because while rolling back ZFS will destroy everything that was done after taking the snapshot you rolled back to. So rolling back is a one-way road. With qcow2 snapshots you can freely jump back and forth between snapshots.>>

But my more crucial question still remains. According to ....
<<The volblocksize should be calculated as a multiple of the blocksize the drives are using (so the used ashift). If you for example use a 3 disk raidz1 you want at least a volblocksize of 4 times the sector size. With ashift=12 each sector would be 4K and with ashift=9 a sector only would be 512B. So with an ashift of 9 the minimum useful volblocksize would be 2K (4x 512B) but with an ashift of 12 the volblocksize should be at least 16K (4x 4K). Using a bigger volblocksize always works, so you could also use a 16K volblocksize with ashift=9 but not the other way round.>>

.....mine, as mentioned in my initial post, are 512e SSD drives. Even if they lie about it and use pages instead of sectors due to how the technology differs from spinners, they are 512B logical and 4096B physical. Since the physical size is what we care about when setting the ashift value,
ashift = 12 it is. They are also in a RAID10 configuration involving 4 of them. With all this info, and the fact that 90% of the VMs are going to be Windows Servers with an NTFS filesystem (so 4K), what is the optimal blocksize for that storage?

I'll make an assumption here. I used ashift=12, so 4K for the drives. Since the volblocksize should be a multiple of 4K (the block size the drives are using) and there are 4 drives, does that mean 16K for the blocksize, which is already the default value?
If what matters, though, is the number of mirrors x block size of the disks, then we have 2 mirrors x 4K = 8K for the block size.
Now, if we also have to take into consideration how a write and a read action split across the disks in order to choose a more optimal blocksize, then I'm overburdened and can't continue from here. This is as far as I can go.

Thank you in advance
 
Trying to turn all the parameter values that can affect the zvol into an example, I came up with the following:
Even though compression is enabled, I won't include it in the calculation, even though I should (I don't know how, though).
Also, the drives are SSDs, so we just simulate those sector sizes, since SSDs use pages instead.
Yet they still need to comply somehow with the old rules the OSes dictate.

Rule of thumb: it's always bad to write data with a smaller block size to a storage with a greater block size. You can't avoid that when transferring data from virtio to the zvol, though.

For 512B logical / 4096B physical disks, 4 of them, in a ZFS RAID10 layout, with an ashift of 12, a volblocksize of 16K for the storage, the virtio driver (during VM creation) using the default 512b/512b, and an NTFS filesystem using 4K blocks in the guest, this would result in something like this:
16k volblocksize case:
-NTFS writes 4K blocks to virtual disk
since virtio only works with 512b (read/write), this means 512b x 8 (amplification factor) = 4k blocks

-virtio writes in 512b blocks to the zvol (it needs to write 4k to it)
since the zvol's blocksize is 16k, this means (16k - 512b) = 15.5k of lost space for each of the 8 times virtio is going to feed the zvol in order to fill it
with that 4k of data. Total junk data: 15.5k x 8 = 124k. So now the zvol has stored 16k x 8 = 128k that needs to pass to the pool

-the zvol writes in 16K blocks to the pool
I don't know if there is a transformation going on here, since in my mind the pool uses the zvol, so it's like talking about the same thing,
and those 16k x 8 = 128k are passed as 16k x 8 = 128k to the pool

-the pool writes in 16k blocks to the physical disks (which accept 4k blocks)
Now those 16k are split into 2 chunks of 8k, one for each of the mirrors. There is a differentiation here. If the first mirror splits that 8k of data even further, into 4k for one disk and 4k for the other, that would be ideal and there would be no additional overhead. If not, we have amplification for a
second time, since 8k of data is going to be transferred to both of the drives (they are mirrored), and I think this is what happens.
These drives, though, accept 4k blocks and not 8k, so there is the problem: they will need to use 2x4k of their blocks to store
the original OS data. This x2 amplification needs to happen 8 more times in order for that initial 4k of data to transfer from the OS
to the real drives of the pool.

Having the above as the main example, the analogous cases with 8k and 4k would be (without explanation):

8k volblocksize case:

-NTFS writes 4K blocks to the virtual disk (we always have that x8 amplification (8 x 512b = 4k))

-virtio writes in 512b blocks to the zvol (it needs to write 4k to it)
since the zvol's blocksize is 8k, this means (8k - 512b) = 7.5k of lost space for each of the 8 times virtio is going to feed the zvol in order to fill it
with that 4k of data. Total junk data: 7.5k x 8 = 60k. So now the zvol has stored 8k x 8 = 64k that needs to pass to the pool

-the zvol writes in 8K blocks to the pool (same thing for me as before; I don't know if it needs to be mentioned)

-the pool writes in 8k blocks to the physical disks (which accept 4k blocks)
Now those 8k are split into 2 chunks of 4k, one for each of the mirrors. Once more we have a problem here, depending on what happens
afterwards. If the first mirror splits that 4k of data even further, into 2k for one disk and 2k for the other, that means
each of the drives will use 4k for something that is 2k, and the extra 2k will be padding/junk data.
If that extra split doesn't happen, 4k of data is going to be transferred to both of the drives (they are mirrored),
and I think this is what happens. These drives also accept 4k blocks, so here we have an optimal transfer, at least at this layer.


4k volblocksize case:

-NTFS writes 4K blocks to the virtual disk (we always have that x8 amplification (8 x 512b = 4k))

-virtio writes in 512b blocks to the zvol (it needs to write 4k to it)
since the zvol's blocksize is 4k, this means (4k - 512b) = 3.5k of lost space for each of the 8 times virtio is going to feed the zvol in order to fill it
with that 4k of data. Total junk data: 3.5k x 8 = 28k. So now the zvol has stored 4k x 8 = 32k that needs to pass to the pool

-the zvol writes in 4K blocks to the pool (same thing for me as before; I don't know if it needs to be mentioned)

-the pool writes in 4k blocks to the physical disks (which accept 4k blocks)
Now those 4k are split into 2 chunks of 2k, one for each of the mirrors. Once more we have a problem here, depending on what happens
afterwards. If the first mirror splits that 2k of data even further, into 1k for one disk and 1k for the other, that means
each of the drives will use 4k for something that is 1k, and the extra 3k will be padding/junk data.
If that extra split doesn't happen, 2k of data is going to be transferred to both of the drives (they are mirrored),
and I think this is what happens. These drives, though, accept 4k blocks, so each of the drives will use only one block, in which
half the data will be padding.

Conclusion: still no answer as to what the best choice would be for my case, which I described in my initial post.


Don't take anything of the above as fact, unless a much more experienced user confirms or disproves it.
 
First, this is just too much text to keep an overview of, and you don't quote properly. There is a quote button ...
I don't know if I understood everything correctly, yet I'll try my best to answer:

- Neither zfs get all nor zpool get all gives me info about volblocksize. Is there a command to check the current block size of a zvol via the CLI?
zpool can't, it's on a different level, yet zfs can:

Code:
root@proxmox-zfs-storage ~ > zfs get volblocksize zpool/iscsi/vm-7777-disk-1
NAME                        PROPERTY      VALUE     SOURCE
zpool/iscsi/vm-7777-disk-1  volblocksize  4K        -

I read somewhere that this value (blocksize) can't be changed afterwards, but I don't think that is true, since it would contradict the GUI letting you change its value. Otherwise it would be a greyed-out option.
AFAIK, an already created zvol cannot be changed; you need to create a new one and copy over the data (all CLI).
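
Roughly like this, for example (pool and zvol names are made up; the new zvol must be at least as big as the old one):

Code:
# create a new zvol with the desired volblocksize, then copy the data block-wise
zfs create -V 100G -o volblocksize=8k tank/vm-100-disk-0-new
dd if=/dev/zvol/tank/vm-100-disk-0 of=/dev/zvol/tank/vm-100-disk-0-new bs=1M status=progress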

As for the thin provision checkbox under Datacenter -> Storage -> name_of_storage_you_created -> Options:
if someone uses raw space instead of qcow2 for the VM storage, is there a point in enabling it?
I know what it does; what I don't know is its effect on raw storage.
It does thin provisioning, so if you enable it, you can create as many (empty) volumes as you like, whereas if you don't check it, every volume you create will immediately reserve the space you requested, and that space is not available to other volumes anymore.

This specific value, 16K, isn't only about what I want the VMs to operate at, but also about the hardware that lies underneath. Right?
It makes no sense to have a 4K volblocksize on an ashift value of 12, because that is already matched 1:1 and will render compression useless. Therefore you use multiples of the ashift value in order to be able to use compression, which will yield better throughput while having a little read and write amplification. This is a long-term experience value, yet you can change it if you want.
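
To see how those values line up on an existing zvol, something like this should do (pool and zvol names are just examples):

Code:
zpool get ashift rpool                                            # note: shows 0 if ashift was left to auto-detect
zfs get volblocksize,compression,compressratio rpool/data/vm-100-disk-0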

If you wanted to have dedup enabled, it would be better to have a volblocksize matching your guest filesystem blocksize, which is 4K most of the time unless specified otherwise. This will lead to the best deduplication ratio, yet 4K blocks for deduplication are not very performant.


.....mine, as mentioned in my initial post, are 512e SSD drives. Even if they lie about it and use pages instead of sectors due to how the technology differs from spinners, they are 512B logical and 4096B physical. Since the physical size is what we care about when setting the ashift value,
ashift = 12 it is. They are also in a RAID10 configuration involving 4 of them. With all this info, and the fact that 90% of the VMs are going to be Windows Servers with an NTFS filesystem (so 4K), what is the optimal blocksize for that storage?
optimal with respect to what? That is the question, because it depends. There is no one answer that fits all.

If you would like to optimize for low disk usage, I'd go with ashift=9, compression and volblocksize of 4K. This will not be the most performant array, yet it would be the one with the smallest disk usage.

The more general purpose setup is the default PVE one with ashift=12 and volblocksize=16K and compression.

Your Windows 4K blocksize is also not optimized. Normally you would optimize the data disk of the VM for the data on the disk, so e.g. if you have a lot of files greater than 64k, you would want to have a bigger cluster size for NTFS to get better performance, lower fragmentation and fewer blocks.

Rule of thumb: it's always bad to write data with a smaller block size to a storage with a greater block size. You can't avoid that when transferring data from virtio to the zvol, though.
No idea what you mean by that, yet the default is to use SCSI virtio, and not virtio itself.

-virtio writes in 512b blocks to the zvol (it needs to write 4k to it)
since the zvol's blocksize is 16k, this means (16k - 512b) = 15.5k of lost space for each of the 8 times virtio is going to feed the zvol in order to fill it
with that 4k of data. Total junk data: 15.5k x 8 = 124k. So now the zvol has stored 16k x 8 = 128k that needs to pass to the pool
if that virtio stuff is correct, it depends on caching.

Now those 4k are split into 2 chunks of 2k, one for each of the mirrors.
correct in the ashift 9 case, but it's vdev, not mirror.

In an ashift 12 case: no, it's always written in ashift-sized blocks, so one vdev gets the block (and it is then written twice, once per vdev member) and the other gets nothing.
 
AFAIK, an already created zvol cannot be changed; you need to create a new one and copy over the data (all CLI).
Yes, but if the zvol is empty, why wouldn't you?

It does thin provisioning, so if you enable it, you can create as many (empty) volumes as you like, whereas if you don't check it, every volume you create will immediately reserve the space you requested, and that space is not available to other volumes anymore.
Nice to know

zpool can't, it's on a different level, yet zfs can:
Figured it out myself while trying different things, thanks anyway for giving the exact command.

It makes no sense to have a 4K volblocksize on an ashift value of 12, because that is already matched 1:1 and will render compression useless. Therefore you use multiples of the ashift value in order to be able to use compression, which will yield better throughput while having a little read and write amplification. This is a long-term experience value, yet you can change it if you want.
- This is the main issue, though. How do I decide whether it makes sense or not? (see my examples with 4 / 8 / 16k)
- Also, what if compression turns out to be useless? Set it to off and fewer CPU cycles will be used. I don't get why compressed data is more performant than raw data.
- Using either 8k or 16k as the volblocksize is already a multiple of 4k.

If you wanted to have dedup enabled, it would be better to have a volblocksize matching your guest filesystem blocksize, which is 4K most of the time unless specified otherwise. This will lead to the best deduplication ratio, yet 4K blocks for deduplication are not very performant.
Since I stated in my initial post that deduplication = off, that is good to know in general.

optimal with respect to what? That is the question, because it depends. There is no one answer that fits all.
The optimal value for the VMs' best operation. I know that SQL is a special use case by itself, but let's treat them all as Windows Server VMs that manage small to medium files with a 4K filesystem. I'm not setting this up for a bank or anything similar, which is why I blend SQL Server in as a usual Windows Server OS. It isn't my point of interest here (even if it should be). The Domain Controller and SMB shares are probably more of my interest here.

If I were to add SSD wear to the formula of interests, then you probably answered that with <<I'd go with ashift=9, compression and volblocksize of 4K. This will not be the most performant array, yet it would be the one with the smallest disk usage>>.

The more general purpose setup is the default PVE one with ashift=12 and volblocksize=16K and compression.
Yet my examples show that 8k would be ideal, since the last step indicates that the 8k split into 2x4k chunks, one for each of the mirrors, matches their block size.
Of course it all depends on whether my initial assumptions about how the data transfer happens from the guest to the disks,
and all the intermediate steps in between, are correct.
It is also crucial to know whether my assumption about the final step, the pool-to-mirrored-disks transfer, is correct: does it only split the 8k into 2x4k with each mirror getting those 4k, or does the already-split 4k go through one more division, so 2k, with each of the 4 disks then holding, per 4k of data, 2k of useful data and 2k of padding?
Again... see my examples and by all means correct everything that is wrong.

Your Windows 4K blocksize is also not optimized. Normally you would optimize the data disk of the VM for the data on the disk, so e.g. if you have a lot of files greater than 64k, you would want to have a bigger cluster size for NTFS to get better performance, lower fragmentation and fewer blocks.
That will never be optimized, since the SMB shares include small to large files and I'm a lone IT cowboy guarding 75 people.
That renders every effort to optimize SMB useless, since my occupation involves way too many fields for one person alone.
It shouldn't, but it does.

No idea what you mean by that, yet the default is to use SCSI virtio, and not virtio itself.
I don't mean anything more than a copy-paste from advanced members here, who mentioned it in the first place while giving an example of how the data transfers from the guest OS to the physical drives. Maybe the misunderstanding is that by virtio I mean the drivers used when you choose VirtIO SCSI as the interface for the VM.

if that virtio stuff is correct, it depends on caching.
Since no extra storage is used for that, does caching here mean the amount of RAM PVE is allowed to use?
Mine is 128GB. Yet I fail to see how it helps the procedure. At least nobody has mentioned it, so I haven't documented it yet.

correct in the ashift 9 case, but it's vdev, not mirror.

In an ashift 12 case: no, it's always written in ashift-sized blocks, so one vdev gets the block (and it is then written twice, once per vdev member) and the other gets nothing.
Here is the juicy part, and I have to understand what you mean.
 
- How come the <blocksize> parameter isn't available to choose or modify during creation of the ZFS RAID under node -> Disks -> ZFS?
No idea. I personally would have coded it with such an option and also with a MAYBE useful default blocksize preselected depending on the pool layout and number of disks (like TrueNAS does it).
I read somewhere that this value (blocksize) can't be changed afterwards, but I don't think that is true, since it would contradict the GUI letting you change its value. Otherwise it would be a greyed-out option.
Only possible by destroying and recreating that zvol (so for example backup+restore of a VM).
- Given that I have a ZFS RAID10 with 4x 512e SSDs (512B logical / 4K physical, enterprise ones) for the VMs, and they are all
Windows Server 2019 DC / SQL / RDS / 2x Win11, all NTFS formatted (so a 4K filesystem), what is the best value to set on the zvol?
Compression is left at the default (lz4), dedup = off and ashift = 12.
In my old setup I had changed it to 4K, but I'm still not sure whether that is the best value for performance and for not wearing out the drives too quickly.
4K is never a good idea when using ZFS. It would be better to use 8K and install your Win VMs using an 8K NTFS cluster size. The new 16K default saves some space because of a better compression ratio and a better data-to-metadata ratio, but will be worse when doing small (4K/8K) IO.

Neither zfs get all nor zpool get all gives me info about volblocksize. Is there a command to check the current block size of a zvol via the CLI?
zfs get volblocksize

The recordsize option has a default value of 128K, which if I recall correctly applies to the ZFS filesystem (dataset) side, while blocksize is the block size of the zvol.
Does it have to be changed accordingly if I change the volblocksize to 4K? In general, which of the two (or both) do I need to change for my configuration above?
Recordsize is only used for datasets. So will be used by LXCs but not by VMs.

As for the thin provision checkbox under Datacenter -> Storage -> name_of_storage_you_created -> Options:
if someone uses raw space instead of qcow2 for the VM storage, is there a point in enabling it?
I know what it does; what I don't know is its effect on raw storage.
If I recall correctly, I didn't have the choice to do it during creation of the VM storage, but only afterwards from Datacenter. Of course, since we had that talk many things might have changed. Still, the question remains: since the storage is of raw-disk type, does enabling thin provisioning have any impact at all?
If you want to use zvols natively (and not qcow2 files on top of a dataset, where you won't be able to make use of ZFS features) you always have to use the "raw" format. And "raw" supports thin provisioning. What it does when NOT setting the thin checkbox is define a "refreservation" value for that zvol, telling ZFS to always reserve the needed space. See: https://docs.oracle.com/cd/E19253-01/819-5461/gazvb/index.html
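
You can see that directly on a zvol, for example (names are made up): a non-thin zvol shows a refreservation roughly equal to its volsize, while a thin one shows none:

Code:
zfs get volsize,refreservation,used tank/vm-100-disk-0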

So how come there is the ability to change the value (default 16K) to something else?
You can change that value for the ZFS storage via the webUI, but this won't change the volblocksize of existing zvols. Only newly created zvols will use that new value. That's why you need to destroy and recreate the zvols (which a backup restore, for example, does).
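
If I'm not mistaken, the same storage option can also be set on the CLI (the storage ID is just an example); again, it only affects zvols created after the change:

Code:
pvesm set local-zfs --blocksize 8k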

- This specific value, 16K, isn't only about what I want the VMs to operate at, but also about the hardware that lies underneath. Right?
Not always possible to find the perfect solution. You should try to avoid writing a smaller block on top of a bigger block to reduce write/read amplification. If your hardware and/or pool layout limits how low your ashift/volblocksize can be, you would need to increase the blocksize of your guest OSs' filesystems or services, if possible.

.....mine, as mentioned in my initial post, are 512e SSD drives. Even if they lie about it and use pages instead of sectors due to how the technology differs from spinners, they are 512B logical and 4096B physical. Since the physical size is what we care about when setting the ashift value,
ashift = 12 it is. They are also in a RAID10 configuration involving 4 of them. With all this info, and the fact that 90% of the VMs are going to be Windows Servers with an NTFS filesystem (so 4K), what is the optimal blocksize for that storage?
Not that easy to answer. My guess would be that the usual raid rule "sector size * stripe width" for a raid10 should also apply here. So 4K * 2 = 8K when using ashift=12. The best way is still to try out multiple volblocksizes, benchmark them using fio and then choose the best performing one for your target workload. It's also possible to use different volblocksizes for different virtual disks on the same pool, so you could use different volblocksizes for different workloads.
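
A rough fio example for such a comparison (the zvol path and the values are placeholders; adjust bs/iodepth/rw to the workload you want to compare):

Code:
# 4K random writes against a test zvol created with the volblocksize under test
fio --name=volblocktest --filename=/dev/zvol/tank/fio-test --ioengine=libaio \
    --direct=1 --rw=randwrite --bs=4k --iodepth=16 --numjobs=1 \
    --runtime=60 --time_based --group_reporting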

.....mine, as mentioned in my initial post, are 512e SSD drives. Even if they lie about it and use pages instead of sectors due to how the technology differs from spinners, they are 512B logical and 4096B physical. Since the physical size is what we care about when setting the ashift value,
ashift = 12 it is. They are also in a RAID10 configuration involving 4 of them.
They probably also lie about the 4K physical. Internally the SSD might work with 8K or even 16K when reading. And writes may be even larger, as it can only erase multiple cells at once. But when using enterprise SSDs with PLP, the SSD can at least cache all the writes in DRAM and optimize them for less wear and better performance.

With all this info, and the fact that 90% of the VMs are going to be Windows Servers with an NTFS filesystem (so 4K), what is the optimal blocksize for that storage?

I'll make an assumption here. I used ashift=12, so 4K for the drives. Since the volblocksize should be a multiple of 4K (the block size the drives are using) and there are 4 drives, does that mean 16K for the blocksize, which is already the default value?
If what matters, though, is the number of mirrors x block size of the disks, then we have 2 mirrors x 4K = 8K for the block size.
Now, if we also have to take into consideration how a write and a read action split across the disks in order to choose a more optimal blocksize, then I'm overburdened and can't continue from here. This is as far as I can go.
I would use ashift=12, an 8K volblocksize and inside the VM format the NTFS filesystem with an 8K cluster size. That way NTFS will still work with 4K blocks but will try to group 2x4K blocks, if possible, to do 8K writes.

Also, the drives are SSDs, so we just simulate those sector sizes, since SSDs use pages instead.
Yet they still need to comply somehow with the old rules the OSes dictate.
Yes, and they want to deliver good results in IOPS benchmarks, which are usually done with 4K, so the firmware should be optimized to work well with 4K sectors.

Rule of thumb: it's always bad to write data with a smaller block size to a storage with a greater block size. You can't avoid that when transferring data from virtio to the zvol, though.
Correct. Also keep in mind that your virtio virtual disks are using 512B sectors by default. ;) But I didn't see a noticeable difference when benchmarking and comparing 512B and 4K blocksizes for the virtio disks.

- Also, what if compression turns out to be useless? Set it to off and fewer CPU cycles will be used. I don't get why compressed data is more performant than raw data.
LZ4 is super performant. Especially when working with slower disks (HDDs and maybe even SATA SSDs) it might be better to spend some CPU cycles on compression and read/write less data to the disks. If your HDD can only handle 150MB/s and you are able to compress the data to 50% before writing, you can effectively write at 300MB/s to that disk. So the speedup you gain from LZ4 compression is usually bigger than the slowdown from waiting for the compression to happen. As long as your disk, and not your CPU, is the bottleneck, enable LZ4.
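
You can get an idea of how much less data actually hits the disks by comparing the logical and the physical usage of a zvol, for example (names are examples):

Code:
zfs get compressratio,logicalused,used tank/vm-100-disk-0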
 
If your HDD can only handle 150MB/s and you are able to compress the data to 50% before writing, you can effectively write at 300MB/s to that disk.
If that were true, then ZFS would be the fastest filesystem ever, while no other one even has all the imaginable features it has. It's unbelievable that any other filesystems still exist then ... :)
 
One thing against the theory would be the recordsize mapped to the pool config per disk (single/mirror/raidz*), because the disk only gets its small data bonbon between all the disk seeks of 8-10ms ... just calculate the time in ms required to write the data, which is around 1/100 of the seek time: the heads mostly stay where they are, spit out a little data and seek again, so in reality nearly no writes occur, while most of each second is spent reading/searching for the next sector.
 
If that were true, then ZFS would be the fastest filesystem ever, while no other one even has all the imaginable features it has. It's unbelievable that any other filesystems still exist then ... :)
Why not? Did you test it? It's not uncommon here that I'm able to read/write faster to/from my pool than the hardware would allow. You just need lots of well-compressible data. See for example some LZ4 compression benchmarks:
Compressor | Ratio | Compression | Decompression
LZ4 default (v1.9.0) | 2.101 | 780 MB/s | 4970 MB/s
If my SATA protocol can handle 550MB/s but my CPU can decompress at 4970 MB/s, it will read 550MB of compressed data per second from the disk and then decompress it to RAM, which results in 1155 MB of uncompressed data. So in the end I'm receiving 1155 MB of data per second from a disk whose protocol only physically supports up to 600 MB/s. It will be less in practice because of the terrible overhead of ZFS. But it's not black magic. Guess why people in the past zipped files before uploading/downloading them: it's way faster to download the smaller archive and decompress it later than to download that big uncompressed file directly. Nothing else is going on here, just that the throughput of SATA/SAS is the bottleneck instead of the internet bandwidth. ;)
 
Nothing against the possible compression ratios and the compute time, which should be faster than the disk I/O so as not to throttle it.
It depends on the data, as most big data is already optimized and binary anyway, but yeah, the best compression ratios and performance you get with dd and if=/dev/null, which reaches memory speed even on a single ZFS disk :)
 
No idea. I personally would have coded it with such an option and also with a MAYBE useful default blocksize preselected depending on the pool layout and number of disks (like TrueNAS does it).
Thanks for showing up. You started the party without me :) (probably due to the time difference).

4K is never a good idea when using ZFS.
Depends. It was you, 2 years ago, who agreed with me about using 4k (all my knowledge of this matter comes from the documentation of our past conversation, though of course I don't get many aspects of it), but!!!! the difference then was that I was using SAS spinners, all 512n drives. It was again a
ZFS RAID10 with 4 drives, and after you gave me examples we concluded that 4k would be best to avoid extra unnecessary padding. Since then, and up until 2 days before making this post, I tried to locate and read all your answers to other members on that subject, where you stated that you don't fully get either how parity/striping works on raid10, and your examples were always with raidz1 and that damn (just kidding here :) ) Excel of values. Back then it was 2020-2021, so by now you may have a better grasp of the matter. After all, we all create RAID10 storages for our VMs, and that guy who created or participated in OpenZFS only had examples of raidz types. Why not raid10, which is the most performant and most dominant for storing VMs? A rhetorical question, though.

It would be better to use 8K and install your Win VMs using an 8K NTFS cluster size. The new 16K default saves some space because of a better compression ratio and a better data-to-metadata ratio, but will be worse when doing small (4K/8K) IO.
Installing the VMs again is out of the question, since this is just a new server that the old VMs are going to be migrated to, and nothing more.
Now, if the underlying storage is better than the previous one (and it is), then that is the main goal of my initial post.
The new drives are SSDs instead of spinners, with a bigger capacity of 1.92TB each instead of the spinners' 1.2TB, and all the drives are now 512e instead of the old 512n. The configuration will be the same as already mentioned: RAID10 with ashift=12, and I'm trying to find the best possible blocksize value for the storage. Of course, now I have one more factor to worry about, the wear of the SSD disks. Blocksize plays an important role in that.

Here comes another big question. By definition 512b is a thing of the past, a way to present the disk to the OS even if the actual block size is bigger.
Does ZFS care about the logical or the physical blocksize of the disk? I was really confident that it only cares about the physical one, otherwise I have to recalculate everything again.

What would be the problem with 8k as the blocksize and the VMs staying the way they are at 4k? After all, virtio SCSI operates (read/write) at 512b by default, so the data reads and writes will never be a 1:1 ratio anyway.

1. zfs get volblocksize
2. Recordsize is only used for datasets. So it will be used by LXCs but not by VMs.
3. You can change that value for the ZFS storage via the webUI, but this won't change the volblocksize of existing zvols. Only newly created zvols will use that new value. That's why you need to destroy and recreate the zvols (which a backup restore, for example, does).
4. "raw" supports thin provisioning. What it does when NOT setting the thin checkbox is define a "refreservation" value for that zvol, telling ZFS to always reserve the needed space. See: https://docs.oracle.com/cd/E19253-01/819-5461/gazvb/index.html
5. Yes, and they want to deliver good results in IOPS benchmarks, which are usually done with 4K, so the firmware should be optimized to work well with 4K sectors.
Thanks for the above. At least those are double confirmed now and added to my documentation.

Not always possible to find the perfect solution. You should try to avoid writing a smaller block on top of a bigger block to reduce write/read amplification. If your hardware and/or pool layout limits how low your ashift/volblocksize can be, you would need to increase the blocksize of your guest OSs' filesystems or services, if possible.
Yeah, as already mentioned, not possible. So what we're trying to do here is the least <<damage>> with the new hardware and the same VMs.
I probably get what you mean about the pool layout limitation. If, for example, I were to set it up with 8 drives, so 4 mirrors of 2 drives each, then the simplest rule to calculate the volblocksize would be 4k (from the ashift value) x 4 (number of mirrors in the pool) = 16k volblocksize for the storage. So in theory, on that part, we would have an optimal 1:1 data transfer without amplification (leaving out compression, dedup and such, since I'm describing the simplest way of calculating it). Was I right about that formula, though (volblocksize = 4k (ashift) x 4 (mirrors in the pool) = 16k)?

Not that easy to answer. My guess would be that the usual raid rule "sector size * stripe width" for a raid10 should also apply here. So 4K * 2 = 8K when using ashift=12.
You probably answered my above assumption.

They probably also lie about the 4K physical. Internally the SSD might work with 8K or even 16K when reading. And writes may be even larger, as it can only erase multiple cells at once. But when using enterprise SSDs with PLP, the SSD can at least cache all the writes in DRAM and optimize them for less wear and better performance.
I knew about their internal read/write mechanism; I didn't know, though, that internal SSD caching optimizes writes for less wear!!! and better overall performance. By DRAM you mean the SSD's internal memory, right?

I would use ashift=12, an 8K volblocksize and inside the VM format the NTFS filesystem with an 8K cluster size. That way NTFS will still work with 4K blocks but will try to group 2x4K blocks, if possible, to do 8K writes.
Once more, not possible in my situation. So with the VMs staying as they are at 4k, would you go with an 8k or a 16k volblocksize, taking reducing SSD wear into consideration as the first factor?

Correct. Also keep in mind that your virtio virtual disks are using 512B sectors by default. ;) But I didn't see a noticeable difference when benchmarking and comparing 512B and 4K blocksizes for the virtio disks.
Your answer has been documented since our last conversation. Yet there is a command to add inside the VM's conf, I think.
Your words on Sep 24, 2021:
<<You can add this line to the config file of your VM:
args: -global scsi-hd.physical_block_size=4k
That way virtio SCSI is using 4K and not 512B blocksize.>>
and a reply to you was ...
<<Yeah that's basically what's being done in the Bugzilla.
But it only changes the physical sector size, so blocks are still reported 512/4096 instead of 4Kn.>>
See, I've done my homework for this post. I'm not one of the many who just ask without doing research. Hahahah, you somewhat <<complained / nagged>> about those types of members.
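
For my documentation: I assume the same thing can be set with qm instead of editing the config file by hand (the VM ID is a placeholder):

Code:
qm set 100 --args '-global scsi-hd.physical_block_size=4k'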

LZ4 is super performant. Especially when working with slower disks (HDDs and maybe even SATA SSDs) it might be better to spend some CPU cycles on compression and read/write less data to the disks. If your HDD can only handle 150MB/s and you are able to compress the data to 50% before writing, you can effectively write at 300MB/s to that disk. So the speedup you gain from LZ4 compression is usually bigger than the slowdown from waiting for the compression to happen. As long as your disk, and not your CPU, is the bottleneck, enable LZ4.
Also, as mentioned, compression is left at its default value, so on = LZ4. Doesn't "on" mean LZ4? If I recall correctly, I had found a command to check whether LZ4 is being used.

When you say <<If your HDD can only handle 150MB/s and you are able to compress the data to 50% before writing>>, what do you mean by that?
The user isn't in control of the percentage of data being compressed, only of the compression process itself. Am I skipping knowledge chapters here again?

Thank you for your online presence and time.
 
What would be the problem with 8k as the blocksize and the VMs staying the way they are at 4k? After all, virtio SCSI operates (read/write) at 512b by default, so the data reads and writes will never be a 1:1 ratio anyway.
The ashift is the important part, not the physical disk blocksize, unless the ashift is lower than the disk blocksize. It is only relevant for potential read/write amplification.

Just benchmark it yourself.


The optimal value for the VMs' best operation.
That does not exist ... it depends, as stated numerous times by numerous people. Find your own optimal solution or just use the PVE default.


Doesn't "on" mean LZ4?
In PVE 8.2 it means, according to zfsprops(7):

Code:
The current default compression algorithm is either lzjb or, if the lz4_compress feature is enabled, lz4.
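
So checking whether that feature is active on your pool would be something like this (the pool name is just an example):

Code:
zpool get feature@lz4_compress rpool   # enabled/active -> compression=on uses lz4
zfs get compression rpool              # shows the configured value (e.g. on, lz4, zstd)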
 
The ashift is the important part, not the physical disk blocksize, unless the ashift is lower than the disk blocksize. It is only relevant for potential read/write amplification.

Just benchmark it yourself.
OK, but with SSDs not advertising their true internal page size most of the time (all of the time, now that I come to think about it), the ashift value would be wrong nevertheless. I have already read some posts saying that with SSDs you should always go with ashift=13 -> 8k blocksize for the disks, but how would you know whether that is optimal after all?

As for benchmarking the storage after creation: since I'm not a storage architect (those damn values tend to make you one, though), I wouldn't know what to test, with which program, what values to enter, and how to interpret the results afterwards.
 
OK, but with SSDs not advertising their true internal page size most of the time (all of the time, now that I come to think about it), the ashift value would be wrong nevertheless. I have already read some posts saying that with SSDs you should always go with ashift=13 -> 8k blocksize for the disks, but how would you know whether that is optimal after all?
Yes, 8k page sizes is also what I read. SSD internals and the caching involved are very hard to predict from the outside, so this is always a problem.
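
If you want to test ashift=13, keep in mind it can only be set at pool creation, roughly like this (disk names are placeholders for a 4-disk striped mirror, i.e. your RAID10 layout):

Code:
# only run against empty test disks
zpool create -o ashift=13 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd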

As for benchmarking the storage after creation: since I'm not a storage architect (those damn values tend to make you one, though), I wouldn't know what to test, with which program, what values to enter, and how to interpret the results afterwards.
Understandably. Maybe you have some poor man's benchmark, like installing an OS or running a database export, something that you actually do a lot and which can be compared.

For more synthetic stuff, there is always fio, which is THE benchmark tool for raw performance benchmarks. I started this page a couple of years ago and it is a good introduction, yet not complete ...
 
For more synthetic stuff, there is always fio, which is THE benchmark tool for raw performance benchmarks. I started this page a couple of years ago and it is a good introduction, yet not complete ...
I know about fio in general. Since it is installed in the OS, how does it bench the raw storage? It seems to bench the OS layer where the data gets accessed. You mentioned it in your link as well: <<you should always benchmark the final layer on which you access your data>>.
So in my case, trying to find optimal values for ashift and blocksize (if I find a way to do it, I'll test ashift=12 with volblocksize 8 / 16 / 32 and ashift=13 with volblocksize 8 / 16 / 32), which program should I use and what should I bench? The raw storage, or from inside the OS?
 
Well, with examples this time, I believe my initial urge for an 8k block size and ashift 12 (and not the default 16k), according to my plain and simple calculations, was correct ..... I guess.

So I ran IOmeter (I couldn't find the benefit of just benching the underlying raw storage) inside a WinServ2019 guest, and after each test with various parameters I backed the VM up, removed it, destroyed the storage, re-created it with other parameters and re-ran the tests.

VM specs:
4 cores / 4GB RAM (on purpose, to avoid RAM usage) / 100GB on VirtIO SCSI single emulated storage (raw), with SSD emulation, discard, guest agent and IO thread enabled. Storage thin provisioned.

IOmeter specs:
# outstanding I/Os: 16 (default 1)
Disk target: 33554432 sectors = 16GB (default 0)
Update frequency: 2 sec (irrelevant)
Run time: 30 sec (1 min took too long for all these tests, so I changed it)
Ramp-up time: 10 sec
Record results: none (irrelevant)

I created 20 scenarios, 10 with 1 worker (5 tests aligned and the other half not) and the other ten with 4 workers, as follows:
aligned 100% Sequential write (1 Worker):
aligned 100% Sequential write (4 Worker):
aligned 100% Random write (1 Worker):
aligned 100% Random write (4 Worker):
aligned 100% Sequential Read (1 Worker):
aligned 100% Sequential Read (4 Worker):
aligned 100% Random Read (1 Worker):
aligned 100% Random Read (4 Worker):
aligned 50% Read 50% write 50% Random 50% Sequential (1 Worker):
aligned 50% Read 50% write 50% Random 50% Sequential (4 Worker):
100% Sequential write (1 Worker):
100% Sequential write (4 Worker):
100% Random write (1 Worker):
100% Random write (4 Worker):
100% Sequential Read (1 Worker):
100% Sequential Read (4 Worker):
100% Random Read (1 Worker):
100% Random Read (4 Worker):
50% Read 50% write 50% Random 50% Sequential (1 Worker):
50% Read 50% write 50% Random 50% Sequential (4 Worker):

Finally, I tested the following storage scenarios underneath:
ashift 12 with 4k / 8k / 16k / 1M
ashift 13 with 4k / 8k / 16k / 1M
...after the second day I got bored of filling everything into the attached xlsx file and only did the 4k, since that is the one I care about,
as Windows NTFS is a 4k filesystem.

PS: I have frozen the left pane in the spreadsheet to make it easier to compare it with the values in each column on the right.

Any thoughts? @LnxBil @waltar @Dunuin
 

Attachments

  • IOmeter Results.zip (18.5 KB)
Lots of data and/but no clear winner to go with. Or, looking at your Excel table, maybe the best is the old PVE default with "volblocksize=8k & ashift=12" ... or what is your own conclusion?
For me there is a winner, I just need others to confirm that assumption as well. I'm not a storage architect; this thing is a job by itself.
ashift 12 and a blocksize of 8k instead of the default 16k. It has, in almost all situations, even if only slightly, better IOPS, and more importantly the latency is lower than in the other measurements. Especially in the write tests (aligned or not, random or sequential), a field very important for SSD wear, 8k gives better results than all the other measurements.

To tell you the truth, I was expecting better results with ashift 13, since SSDs tend to use page sizes (blocks, in HDD terminology) bigger than 4k, but maybe I didn't see it because I didn't test the whole block spectrum above 8k, and maybe that is where it would shine. Yet even if it did, I care about 4k blocks, so..... I have a winner and it is my initial combination of 8k / ashift 12. Of course, I expect the others to jump in and either agree or disagree and explain why.

But yeah ... well ... to add more confusion, your winner is in theory the worst ... https://discourse.practicalzfs.com/t/hard-drives-in-zfs-pool-constantly-seeking-every-second/1421 :)
Well, it isn't the worst, since ......
That link had some interesting facts, like:

-<<Can I minimize seeks even further, with an even more aggressive topology change?
Absolutely! Ditch the Z1 and go to mirrors, and now you’re only forcing a seek on two drives per write instead of three drives per write.
You’ll also get significantly higher performance and faster resilvers out of the deal.>>
Well, I'm already there. The whole issue of the guy in that post is different from mine. For starters, he has raidz1 on HDDs
and I have raid10 on SSDs.

-<<zvols don't have a variable block size>>
No news here.

--<<Every block (aka “record” for dataset or “volblock” for zvols) written to a RAIDz vdev must be split up evenly amongst n-p drives,
where n is the number of drives in the vdev and p is the parity level.>>
...still not my case here

-<<HOWEVER, if you’re using replication, the properties on the target are irrelevant
your existing blocks are sent as-is and written as-is.>> I don't use replication.

--<<Compression can still cause blocks to require fewer sectors to write than normally, which does still mean even “ideal width” vdevs
will often come out “uneven” in terms of on-disk layout…>>
Maybe when you can match the storage layout to the workload, you might be better off with compression set to off, and not have uneven sectors to split across the drives.

-<<I am a little concerned that you might not actually get the results you expect from this,
depending on how Proxmox is migrating those VMs under the hood–if it’s using OpenZFS replication for the task,
your block size won’t change despite the volblocksize serring being different; volblocksize is immutable once set,
and replication does not resize blocks, it just moves them as-is.>>
This might be a concern, since I have all the VMs already created and I'll just migrate them.
My migration is going to be like this: VMs backed up to a TrueNAS NFS share, and from there to the new storage via an NFS share.
So it goes from zvol (blocksize) -> dataset (recordsize) -> zvol (blocksize).

- <<if you migrate a VM’s disk from one storage pool to another (let’s say I’m moving it from zstore-8k to zstore-32k),
it will adopt the volblocksize of the new storage pool.>>
This is what I hope for.
 
-<<I am a little concerned that you might not actually get the results you expect from this,
depending on how Proxmox is migrating those VMs under the hood–if it’s using OpenZFS replication for the task,
your block size won’t change despite the volblocksize serring being different; volblocksize is immutable once set,
and replication does not resize blocks, it just moves them as-is.>>
This might be a concern, since I have all the VMs already created and I'll just migrate them.
My migration is going to be like this: VMs backed up to a TrueNAS NFS share, and from there to the new storage via an NFS share.
So it goes from zvol (blocksize) -> dataset (recordsize) -> zvol (blocksize).

- <<if you migrate a VM’s disk from one storage pool to another (let’s say I’m moving it from zstore-8k to zstore-32k),
it will adopt the volblocksize of the new storage pool.>>
This is what I hope for.
Maybe it's just nomenclature, yet you can set the new volblocksize, create another storage entry in PVE under a different dataset on your current pool where the data is, and just move the data online to the "new" storage via the PVE GUI. You will not need a backup & restore in that case; the data is copied, not the underlying ZFS.
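
Roughly, that would look like this (dataset, storage ID, VM ID and disk name are placeholders):

Code:
# new dataset + a second PVE storage entry with the desired blocksize
zfs create tank/vm-8k
pvesm add zfspool tank-8k --pool tank/vm-8k --blocksize 8k --content images
# move the disk online to the new storage; the new zvol gets the new volblocksize
# (older PVE versions call this "qm move_disk")
qm disk move 100 scsi0 tank-8k --delete 1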
 
