zfs creation considerations - weird GUI issue

ieronymous

I managed to create a zpool consisting of 4 x 4TB drives, and the GUI offered me every layout before creation, even raidz2. I was going for RAID10, but it seemed better to be able to lose any 2 drives than 2 specific ones (data integrity is more critical here than performance). The weird thing is that after creating the zpool I expected to see a capacity somewhere between 6.2 and 7-something TB, but to my surprise I saw an improbable 14.5TB! This can't be right. And even though I've read that you can use raidz2 with four drives, the calculator below
https://raidcalculators.com/zfs-raidz-capacity.php refuses to calculate the capacity, saying "RAID Z2 requires 5 drives or more".

If you check the uploaded picture you can see that it indeed displays 14.5TB, both in the GUI and in the shell:
:~# zpool list HH
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
HH 14.5T 1.51M 14.5T - - 0% 0% 1.00x ONLINE -

What is the deal here? Am I missing something?

By the way, the GUI options only allow you to build a ZFS storage, not to remove one?
 

Attachments

  • zfs_storage.jpg (35.1 KB)
zpool list always shows you raw storage, not usable storage. If you want to see usable storage you can run zfs list instead of zpool list.
zpool list should show you something like 16TB and zfs list around 8 TB.
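For example (the numbers are just illustrative, assuming your pool is called HH like in your screenshot):
zpool list HH   # raw size of all member disks, ~14.5T in your case
zfs list HH     # usable space after parity, roughly half of the raw size for a 4-disk raidz2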
Also keep in mind that you need to increase your volblocksize before creating your first VM, or you will waste a lot of capacity. Look at this table here for raidz2. If you use the default volblocksize of 8K with an ashift of 12 you will lose 67% of your raw capacity (and can only use 1/3 of it). You need a volblocksize of 24K or 256K if you want to lose only half of your raw capacity. You won't see this wasted space directly: it will still show you 8TB available, but everything you write will consume 150% of the space it should. Because everything is 50% bigger due to padding when the volblocksize is too small, your 8TB will be full after writing 5.33TB. And with ZFS you should never fill a pool more than 80 or 90%. Over 80% it will get slow and over 90% it will switch into panic mode. So right now, with the default volblocksize, you can only use 4.266TB if you limit the quota of the pool to 80%.
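As a rough sketch of where that 67% comes from (assuming ashift=12, i.e. 4K sectors, the default 8K volblocksize and 4 disks in raidz2):
# 8K block = 2 data sectors of 4K
# raidz2 adds 2 parity sectors                            -> 4 sectors
# allocations are padded to a multiple of parity + 1 = 3  -> 6 sectors
# 6 x 4K = 24K written for every 8K of data               -> only 1/3 of the raw space holds data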

If you are running workloads with small writes (like DBs) or want to run VMs from that pool (where you need a lot of IOPS) you really should consider using a striped mirror (raid10) instead of raidz2.
RAID never replaces a backup, so it's not that bad if you lose a pool, because you should always be able to rebuild it from your backups.

If you really need the high availability of raidz2 you could also add 2 more drives and stripe two mirrors of 3 drives each. That way you get a striped mirror where any 2 drives may fail.
 
should show you something like 16TB and zfs list around 8 TB.
Actually raw is 14.5TB and usable is 6.83TB. Thank you for the clarification of the difference between zpool list and zfs list.

If you are running workloads with small writes (like DBs) or want to run VMs from that pool (where you need a lot of IOPS) you really should consider using a striped mirror (raid10) instead of raidz2.
That is the plan actually. For data I have 4 x 4TB 7200rpm NAS drives (Red Pro and IronWolf Pro) in what I thought would be best as raidz2 (do you advise otherwise?). The data to be stored, after a portion of that pool is given to WinServ2019, will be A.D. user files and backups of the VMs, which will be around 3 or 4 in number.

For the VMs I am still expecting 4 x 1.2TB 2.5-inch 10K rpm SAS drives in RAID10, and a portion of the same pool of disks will be given to storing SQL databases (these databases are also automatically stored in the cloud after compression). Do I need to adjust the default volblocksize for RAID10 also?

If you really need the high availability of raidz2 you could also add 2 more drives and stripe two mirrors of 3 drives each. That way you get a striped mirror where any 2 drives may fail.
Isn't what you are describing a RAID10 that could be created automatically if I chose RAID10 as the type with 6 drives, or do you have to build that RAID10 layout manually once you go above 4 drives?


Also keep in mind that you need to increase your volblocksize before creating your first VM, or you will waste a lot of capacity. Look at this table here for raidz2. If you use the default volblocksize of 8K with an ashift of 12 you will lose 67% of your raw capacity (and can only use 1/3 of it). You need a volblocksize of 24K or 256K if you want to lose only half of your raw capacity. You won't see this wasted space directly: it will still show you 8TB available, but everything you write will consume 150% of the space it should. Because everything is 50% bigger due to padding when the volblocksize is too small, your 8TB will be full after writing 5.33TB. And with ZFS you should never fill a pool more than 80 or 90%. Over 80% it will get slow and over 90% it will switch into panic mode. So right now, with the default volblocksize, you can only use 4.266TB if you limit the quota of the pool to 80%.
OK, I left this for last, since you have introduced me to a whole new level of info here. I hope to sort this out ASAP because what I am trying to set up is going to be in a production environment and I don't want to end up reconfiguring everything from scratch. I have set up Proxmox several times for different purposes and never seen that info before, nor noticed the issue (although, by the way you describe it, the storage would first have to reach that 80-90% mark).

So, how do I:
- check the volblocksize currently being used (I noticed you mentioned 8K, but how can I verify it)?
- increase the volblocksize, and determine the best value for it?


If you use the default volblocksize of 8K with an ashift of 12 you will lose 67% of your raw capacity (and can only use 1/3 of it).
Does that mean I should have changed the ashift from 12 (I thought that was a good number for SSD disks) to a different number during storage creation, and if so, what does that number depend on?


So right now, with the default volblocksize, you can only use 4.266TB if you limit the quota of the pool to 80%.
Where do I set/change that, in the GUI or with a CLI command (probably both ways, since the GUI is a picture of what is going on underneath)?

Sorry for so many observations and questions, but you have suddenly (and I appreciate that) introduced a wealth of great info here, and I'd like to understand how it works now that you have lit the fire. I would be more than grateful if you could help me out with the above.

Thank you in advance!!!
 
For the VMs I am still expecting 4 x 1.2TB 2.5-inch 10K rpm SAS drives in RAID10, and a portion of the same pool of disks will be given to storing SQL databases (these databases are also automatically stored in the cloud after compression). Do I need to adjust the default volblocksize for RAID10 also?
8K should be fine for raid10. If your DBs are important and you want to maximize performance you should look at what block size your DBs are using. Sometimes they will write with 16K or 32K blocks, and it could be a little bit faster if your pool matches these.
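For example, if a DB disk turns out to write in 16K blocks, one way to handle it (just a sketch, the pool and zvol names here are made up) would be to give that DB its own zvol with a matching block size instead of changing the whole storage:
zfs create -V 100G -o volblocksize=16k tank/sql-data   # dedicated virtual disk with 16K blocks
# then pass it to the VM, or simply create a second ZFS storage entry in PVE with Block Size 16k and put the DB disk there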
But you really should consider using an SSD pool as storage for your VMs and DBs. HDDs are really crappy for parallel writes with lots of IOPS, even if they are SAS 10K disks.
Isn't what you are describing a RAID10 that could be created automatically if I chose RAID10 as the type with 6 drives, or do you have to build that RAID10 layout manually once you go above 4 drives?
Yes, that's like raid10, but with 2 striped vdevs of 3 HDDs in a mirror instead of 3 striped vdevs of 2-HDD mirrors. So everything is mirrored across 3 drives, which means any two may fail.
So, how do I:
- check the volblocksize currently being used (I noticed you mentioned 8K, but how can I verify it)?
- increase the volblocksize, and determine the best value for it?
You can set the volblocksize for a ZFS storage via the GUI: "Datacenter -> Storage -> select your ZFS Storage -> Edit -> Block Size". 8K is the default. Every zvol (virtual HDD for VMs) will be created with this block size as its "volblocksize", and this value can only be set at creation and can't be changed later. If you want to change it for your existing zvols you need to create new virtual HDDs (after you have changed the block size of the storage, of course), copy everything from the old zvols to the new ones and destroy the old ones. You can use dd to copy everything on block level. Use the search function or Google, there are some tutorials on how to do this because this is a common ZFS beginner problem.
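A rough sketch of that copy (the zvol names here are hypothetical, check yours with zfs list -t volume first, and make sure the VM is powered off):
dd if=/dev/zvol/HH/vm-100-disk-0 of=/dev/zvol/HH/vm-100-disk-1 bs=1M status=progress   # old zvol -> new zvol created with the new block size
# then point the VM at the new disk and destroy the old zvol once everything boots fine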

You can use this to show the volblocksize of your zvols: zfs get -t volume volblocksize
Does that mean I should have changed the ashift from 12 (I thought that was a good number for SSD disks) to a different number during storage creation, and if so, what does that number depend on?
No, what ashift you choose depends on the drives you are using. If your HDDs are all using a logical and physical sector size of 512B you can choose ashift=9. That way ZFS would use a 512B block size. But keep in mind that most new drives will use a sector size of 4K and 512B drives are getting rare. If you create a pool with ashift=9 you won't be able to replace a drive with a 4K sector size model later. But a big benefit of ashift=9 would be that you could lower your volblocksize, for example to 4K, and that way reduce write amplification, because a guest's filesystem with a fixed 4K block size may write to a 4K pool instead of an 8K pool. It's always bad to write data with a lower block size to a storage with a greater block size.

And for SSDs that's complicated... most people will choose an ashift of 12 or 13, but you just need to test yourself what gives the best performance for your SSDs. Most SSDs will tell you that they are using a logical/physical sector size of 512B or 4K, but internally they can only write in much larger blocks (128K or something like that). And no manufacturer will tell you what is used internally.

I just asked what ashift you are using because that table calculates with sectors. If you are using ashift=12 (4K sectors) and 4 drives in raidz2 with a volblocksize of 8K, that would be cell B5 in the table with 67% capacity loss. If the ashift were 9, each sector would only be 512B, so the 8K volblocksize would be 16 sectors and therefore cell B19 with only 52% capacity loss.
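To check what you are actually working with (pool name HH used as an example):
lsblk -o NAME,LOG-SEC,PHY-SEC   # logical/physical sector size of each disk
zpool get ashift HH             # ashift property of the pool (0 means it was auto-detected from the disks)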
Where do I set/change that, in the GUI or with a CLI command (probably both ways, since the GUI is a picture of what is going on underneath)?
You can use zfs list -o used,available YourPool to list the pool size. Add "used" and "available" and you get your pool size. Multiply that by 0.8 or 0.9 and use it as the size of the quota. You can set the quota for a pool using this command: zfs set quota=1234G YourPool
There are way more quota options. You can set them per user/group, choose whether snapshots should be included or not, ...
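With the numbers from this thread (about 6.83T usable on the pool HH) that would look roughly like this:
zfs list -o used,available HH   # add the two columns to get the pool size
zfs set quota=5.4T HH           # roughly 80% of 6.83T, adjust to your own numbers
zfs get quota HH                # verify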
 
what ashift you choose depends on the drives you are using. If your HDDs are all using a logical and physical sector size of 512B you can choose ashift=9.

Yes, but think of the future. It is quite possible that after 3-5 years you will not easily find new HDDs with 512B sectors, in which case it is better to use ashift 12 / 4K now.
 
lower your volblocksize, for example to 4K, and that way reduce write amplification, because a guest's filesystem with a fixed 4K block size may write to a 4K pool instead of an 8K pool.

By default any Linux OS will write with 512B. If you want any other, non-default block size you will need to format your Linux FS with a bigger value! It is not so simple, but it is not impossible. It is similar with other, non-Linux OSes.

Good luck / Bafta !
 
Hi,

Also note that with compression on, some figures can be different... a 128K block can become 16K... so the wasted capacity could be very different, depending on whether your data is compressible or not.

Another thing to consider is that if you use ashift 9, this will create a lot of metadata (compared with ashift 12) and you will lose IOPS, ARC performance, and maybe more...

Good luck / Bafta !
 
8K should be fine for raid10.
Which means that you lose how much of the available space?

To my understanding there is no way to set the parameters so that you don't lose any capacity beyond what you lose for parity. I mean, with 4 drives of 4TB, instead of 16TB you end up with around 7.2TB on raidz2, so you already lose some TBs that way. On top of that, how can you avoid losing even more space? I already mentioned my specs: 4 x 4TB 7200rpm drives in raidz2.

But you really should consider using an SSD pool as storage for your VMs and DBs. HDDs are really crappy for parallel writes with lots of IOPS, even if they are SAS 10K disks.
Way too expensive, and I don't have a budget to spend in the IT department; I am trying to save everywhere I can. For enterprise SSDs I would need a better HBA controller to pass through NCQ, TRIM, discard, etc., but that is not the point here, I need to set up other things first. Thank you for your interest though.

Yes, that's like raid10, but with 2 striped vdevs of 3 HDDs in a mirror instead of 3 striped vdevs of 2-HDD mirrors. So everything is mirrored across 3 drives, which means any two may fail.
Can you do this in the GUI, or from the command line only?

You can set the volblocksize for a ZFS storage via the GUI: "Datacenter -> Storage -> select your ZFS Storage -> Edit -> Block Size". 8K is the default.
I found the tab in the GUI shortly after asking. But you didn't advise me exactly what number to put there. The drives are (including the 2 mirrored SSDs for the local Proxmox installation) 512 bytes logical, 4096 bytes physical. So change 8K to what? And what is the rule to calculate the volblocksize when the SAS drives arrive?
 
By default any Linux OS will write with 512B. If you want any other, non-default block size you will need to format your Linux FS with a bigger value! It is not so simple, but it is not impossible. It is similar with other, non-Linux OSes.
Even though you are not referring to me, and this is true for the Linux OS block size, my intention is Windows VMs, so the filesystem will be NTFS / 4K (at least I think WinServ19 still uses that). So, bottom line:
HDDs with 512B logical and 4K physical sectors
Compression lz4 on
So which ashift and volblocksize?

By the way, does choosing (ticking the option for) thin provisioning play a role in ashift / volblocksize? (I know what thin provisioning does from LVM, but with ZFS you never know.)

This link has nice info about the subject: https://superuser.com/questions/1383136/downsides-of-8kb-cluster-size-for-ntfs-on-top-of-zvol
with two good points:
"However, since NTFS can do compression, if you want to use NTFS's compression instead for some reason, you should make volblocksize equal to the underlying disks' block size, to give NTFS the smallest possible unit to compress stuff into (wasting the least amount of space when it has to round up the compressed data to the next block size), and make the logical block size in NTFS match the database's desired block size. This will also not create any storage overhead through read-modify-write or extra space."
which in my head clicks as an answer.
It is like saying: use 4K if you want to use NTFS compression and disable (as I understand it) ZFS compression, or leave it at 8K and use ZFS compression.

Also a good point: "Compression on the zvol may make sense if your volblocksize > ashift and you aren't running a compressed or encrypted file system on top of it."
Since the ashift is 12 by default and the volblocksize is 8K, it seems it is not a good idea to use compression?
 
Which means that you lose how much of the available space?

To my understanding there is no way to set the parameters so that you don't lose any capacity beyond what you lose for parity. I mean, with 4 drives of 4TB, instead of 16TB you end up with around 7.2TB on raidz2, so you already lose some TBs that way. On top of that, how can you avoid losing even more space? I already mentioned my specs: 4 x 4TB 7200rpm drives in raidz2.


I found the tab in the GUI shortly after asking. But you didn't advise me exactly what number to put there. The drives are (including the 2 mirrored SSDs for the local Proxmox installation) 512 bytes logical, 4096 bytes physical. So change 8K to what? And what is the rule to calculate the volblocksize when the SAS drives arrive?
There are 2 ways: lower the ashift of the pool (which is not possible with your 4K drives) or increase the volblocksize. Look at the table and take the lowest volblocksize that is not wasting too much space. 24K would be a good point to start with for a raidz2 pool with 4 drives and ashift of 12.
Like guletz already said, that is theoretically the best value, but there is stuff like block-level compression, so the results may vary depending on the type of data. You can test 8K and 24K: write 100GB of test data to the pool and look at how long it takes to write and how much space those 100GB consume on the pool. With a volblocksize of 8K that should take around 150GB on the pool, and with a volblocksize of 24K only 100GB.
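A minimal sketch of such a test (the zvol names are made up, sparse zvols are used so only the data actually written counts, and random data is used so compression doesn't hide the padding; note that ZFS only accepts power-of-two volblocksize values, so 16K or 32K are the settable values closest to that 24K):
zfs create -s -V 150G -o volblocksize=8k HH/test8k
zfs create -s -V 150G -o volblocksize=32k HH/test32k
dd if=/dev/urandom of=/dev/zvol/HH/test8k bs=1M count=102400 status=progress
dd if=/dev/urandom of=/dev/zvol/HH/test32k bs=1M count=102400 status=progress
zfs get used,referenced HH/test8k HH/test32k   # compare how much space the same 100G really consumed
zfs destroy HH/test8k
zfs destroy HH/test32k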
Way too expensive, and I don't have a budget to spend in the IT department; I am trying to save everywhere I can. For enterprise SSDs I would need a better HBA controller to pass through NCQ, TRIM, discard, etc., but that is not the point here, I need to set up other things first. Thank you for your interest though.
For ZFS, onboard SATA would be totally fine. And there are some not that expensive enterprise SATA SSDs (Intel S4610 3.84TB for around 800€ with 22 PB TBW). If not everything needs to be super fast you could store the DBs/VMs on a fast SSD pool (hot storage) that doesn't need to be that big, and cold storage could be a HDD pool you mount into the VMs using NFS.
Can you do this in the GUI, or from the command line only?
Only CLI I think. But it's a one-liner, so it shouldn't be a big problem as long as you don't want Proxmox installed on and booting from that pool too:
zpool create MyPool -f -o ashift=12 mirror /dev/disk/by-id/disk1 /dev/disk/by-id/disk2 /dev/disk/by-id/disk3 mirror /dev/disk/by-id/disk4 /dev/disk/by-id/disk5 /dev/disk/by-id/disk6
And you would need to tell Proxmox that there is a new pool that can be used as a ZFS storage. That can be done via the GUI (Datacenter -> Storage -> Add -> ZFS).
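To double-check the resulting layout afterwards (same example pool name as above):
zpool status MyPool   # should show two mirror vdevs with 3 disks each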
 
The size of the block depends a lot on your use case (file server, DBs, and so on).
I came to that conclusion too, since one VM will be for SQL Server 19 and another one for an A.D / DNS / DHCP server, but since I'll be giving out portions of the same raidz disks I can't change the volblocksize for each VM. Once I set it, that's it, to my understanding, for all future VMs to be created.
 
increase the volblocksize to 24k
This seems a better way to go than lowering the ashift, and if 100GB takes only that much space on the pool then you are talking about 1:1 here.
24K would be a good point to start with for a raidz2 pool with 4 drives and ashift of 12
Nice, even though I think I'll create a RAID10 for the SATA drives as well. Are your numbers still valid in that case? By the way, you can't delete a ZFS raid layout from the GUI, can you? The only thing I found is to remove the ZFS storage from the datacenter, but that is just that; the extra step of making the disks visible to the node again as individual drives is missing.

New edit:
Never mind, I found it myself, the traditional way it is done in Linux. Just leaving it here in case someone else needs it:
1. Datacenter -> Storage -> ZFS and remove the zpool_name you want to recreate

2. From the shell issue the command zpool destroy zpool_name

3. fdisk -l | grep Disk to check which disks need to be wiped.
Run fdisk on each of those drives to clear the partitions, e.g. fdisk /dev/sd{a,b,c}, and use d to delete the partitions and w to write the changes.
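A slightly less manual alternative for step 3, for anyone who prefers it (device names are just examples, double-check with lsblk first):
wipefs -a /dev/sdX              # wipe partition table and filesystem/ZFS signatures from the whole disk
zpool labelclear -f /dev/sdX1   # or clear leftover ZFS labels from an old member partition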


that expensive enterprise SATA SSDs (Intel S4610 3.84TB for around 800€ with 22 PB TBW)

OK, probably you are working in Dubai or something. I am trying to set everything up with a total cost under 1300, which is why I mentioned the budget and the SSDs.
 
Only CLI I think. But it's a one-liner, so it shouldn't be a big problem
Thanks for the command, even though I remembered that I did the same thing two years ago, when Proxmox didn't have the GUI option for ZFS mirror creation, but I did it with 4 drives (although after your advice it seems quite logical to just add a third drive to each mirror before striping them).
By the way, there is no room / power cabling for 6 drives. The setup is 4 x 2.5-inch 1.2TB SAS drives for the VMs and the SQL installation (I am going to give SQL a portion of that pool, just kept separate from the VM's own disk) and 4 x 3.5-inch 7200rpm SATA NAS drives for A.D. data, backups of the VMs and probably SQL backups too.
I know that if you add up the above, the drives come to more than 6 (8 in total), but they are split into bays of 3.5-inch and 2.5-inch drives, which is what the workstation can accommodate.
 