ZFS pool layout and limiting zfs cache size

fahadshery · Dec 2, 2021

Hi All,

I have a server with `192G` RAM. I boot it off a single SSD.

I created a ZFS pool composed of 5 x 800GB enterprise SSDs in a `raidz` configuration. This pool exists only to host VMs/containers and is used as local storage.

Now when I checked the zfs summary I see this:

As you can see, the ARC max size is about 95 GiB. I know that by default it aims to use 50% of the available memory. I think it's just too much?
I am looking for advice: do I need to change it somehow? If yes, then what's the best number? Or should I just leave it as is? If I leave it as is, will it still let me stand up new VMs? I am running barely 3 VMs (with the largest RAM allocation being 32GB) and out of 192GB total RAM, it shows that 151GB is already used?
What could be the best way forward?

Thanks in advance!

Dunuin · Dec 2, 2021

fahadshery said:
Hi All,

I have a server with `192G` RAM. I boot it off a single SSD.

I created a ZFS pool composed of 5 x 800GB enterprise SSDs in a `raidz` configuration. This pool exists only to host VMs/containers and is used as local storage.

Keep in mind that with raidz you basically only get the IOPS performance of roughly a single SSD. So in terms of IOPS your pool will be slower than just using a single 800GB SSD. VM storage likes IOPS, so a striped mirror (raid10) would be preferable. Also keep in mind that with a raidz1 of 5 disks (with 4K blocksize) you need to increase your volblocksize from the default 8K to at least 32K or you will lose most of your storage to padding (you will lose 50% instead of 20% of your raw storage). But a 32K volblocksize would be terrible if you are planning to run MySQL or Postgres DBs in your VMs.

fahadshery said:
Now when I checked the zfs summary I see this:

As you can see, the ARC max size is about 95 GiB. I know that by default it aims to use 50% of the available memory. I think it's just too much?

Yes, you can lower your ARC size. See here.

fahadshery said:
I am looking for advice: do I need to change it somehow? if yes then what's the best number?

Rule of thumb is 4GB + 1GB per 1TB of raw storage if you don't plan to use deduplication. So something like 8GB for the ARC would be fine. But the bigger your ARC is, the more snappy your pool will be. So feel free to make it bigger if you don't need all that RAM. If some processes will need more RAM your ARC will also slowly shrink down until its minimum size is reached. So that RAM isn't wasted if your ARC is a bit bigger. Free RAM is wasted performance, so really not a problem if your RAM utilization is always at up to 80%.
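To make the rule of thumb concrete, here is a minimal sketch that applies "4GB + 1GB per 1TB of raw storage" to the 5 x 800GB pool and formats the result as a `zfs_arc_max` module option. The function name is hypothetical, and writing the line to `/etc/modprobe.d/zfs.conf` is an assumption about how the limit is usually applied on Linux, since the linked instructions are not reproduced in this thread.

```python
def arc_rule_of_thumb_gib(raw_storage_tb: float, base_gib: float = 4.0) -> float:
    """Rule of thumb from the post: 4 GB base plus 1 GB per TB of raw storage (no dedup)."""
    return base_gib + raw_storage_tb * 1.0


# 5 x 800 GB SSDs = 4 TB of raw storage
raw_tb = 5 * 800 / 1000
arc_gib = arc_rule_of_thumb_gib(raw_tb)

# zfs_arc_max is given in bytes
arc_bytes = int(arc_gib * 1024**3)

# Assumed target file: /etc/modprobe.d/zfs.conf
modprobe_line = f"options zfs zfs_arc_max={arc_bytes}"
print(f"{arc_gib:.0f} GiB -> {modprobe_line}")
```

The same helper can be fed a larger base if more cache is wanted, since the rule only describes a sensible lower bound.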

fahadshery · Dec 3, 2021

Dunuin said:
Keep in mind that with raidz you basically only get the IOPS performance of roughly a single SSD. So in terms of IOPS your pool will be slower than just using a single 800GB SSD.

Sorry I didn't understand what you mean by having IOPS slower than a single SSD?

Dunuin said:
VM storage likes IOPS, so a striped mirror (raid10) would be preferable.

These SSDs are 12Gbps and a bit expensive ones to run in a homelab so I wanted to squeeze as much useable storage as I possibly could. I guess I could try them in a mirror config as suggested because I currently don't need much of this storage anyway. This storage is just for the VMs and containers which are backed up to a truenas box anyway.

Dunuin said:
Also keep in mind that with a raidz1 of 5 disks (with 4K blocksize) you need to increase your volblocksize from the default 8K to at least 32K or you will lose most of your storage to padding (you will lose 50% instead of 20% of your raw storage). But a 32K volblocksize would be terrible if you are planning to run MySQL or Postgres DBs in your VMs.

Yes, I am running MySQL and Postgres for dev and production. So what's the optimal volblocksize for my setup? And most importantly, how do I change it?
I was also thinking to mount SMB/iSCSI share from truenas in an ubuntu vm to run databases but I haven't tuned any of this stuff over in truenas yet. BTW my truenas is also running SSDs in its pool. So that's a separate issue now

Dunuin said:
Yes, you can lower your ARC size. See here.

thank you, I will try this.

Dunuin said:
Rule of thumb is 4GB + 1GB per 1TB of raw storage if you don't plan to use deduplication. So something like 8GB for the ARC would be fine. But the bigger your ARC is, the more snappy your pool will be. So feel free to make it bigger if you don't need all that RAM.

So I was thinking of setting the ARC size to about 20G because I run Plex and other services.

Dunuin said:
If some processes will need more RAM your ARC will also slowly shrink down until its minimum size is reached. So that RAM isn't wasted if your ARC is a bit bigger. Free RAM is wasted performance, so really not a problem if your RAM utilization is always at up to 80%.

High RAM usage worried me initially, so I started to explore this issue... So do you think I should leave it as is, and that I can create more VMs and containers and the ARC will automatically shrink to accommodate them without me worrying about it?

Finally, I am blown away by your detailed reply. Much appreciated mate

fahadshery · Dec 3, 2021

@Stefan_R seems to think that I should leave the ARC usage alone. So I will trust the judgement of the experts here

Dunuin · Dec 3, 2021

fahadshery said:
Sorry I didn't understand what you mean by having IOPS slower than a single SSD?

Raidz IOPS doesn't scale with the number of drives. So no matter if you use 5, 10 or 100 drives in a raidz, that pool isn't faster than a single drive alone... at least for IOPS. But with raidz you are forced to choose a bigger volblocksize compared to a single drive, so there is possibly more overhead when doing writes smaller than 32K to your pool. Let's say you do random writes with 4K blocks to a raidz1 with a 32K volblocksize. Now you need to read or write 32K for each 4K block, so your performance will only be 1/8th compared to a single disk or mirror that is working with a volblocksize of 4K and only needs to write/read 4K for each 4K block.
Also, a raidz isn't doing magic. It's doing complex parity calculations that a single disk or (striped) mirror doesn't need to do. So you lose performance again. And then each disk might need to wait for the 4 other disks. You lose performance again. So in theory a raidz1 of 5 disks should get the same IOPS performance as a single disk. But because of all the additional overhead and added complexity of a raidz pool, a single disk pool will get better IOPS.
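The 1/8th figure follows from simple arithmetic. The sketch below is only a rough model under the assumption that every small I/O forces the whole volblock to be read or written; the function name is hypothetical, and it ignores parity calculation, caching and other real-world effects.

```python
def small_io_amplification(io_size_kib: int, volblocksize_kib: int) -> float:
    """Data touched on the pool per unit of data the guest asked for.

    Assumes an I/O smaller than the volblocksize still touches a full volblock.
    """
    return max(io_size_kib, volblocksize_kib) / io_size_kib


# 4K random writes on a raidz1 zvol with 32K volblocksize
amp_raidz = small_io_amplification(io_size_kib=4, volblocksize_kib=32)

# 4K random writes on a single disk or mirror with 4K volblocksize
amp_mirror = small_io_amplification(io_size_kib=4, volblocksize_kib=4)

# Relative small-block performance of the 32K layout versus the 4K layout
relative_performance = amp_mirror / amp_raidz
print(f"amplification: {amp_raidz:.0f}x vs {amp_mirror:.0f}x, "
      f"relative performance: {relative_performance}")
```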

fahadshery said:
These SSDs are 12Gbps and a bit expensive ones to run in a homelab so I wanted to squeeze as much useable storage as I possibly could. I guess I could try them in a mirror config as suggested because I currently don't need much of this storage anyway. This storage is just for the VMs and containers which are backed up to a truenas box anyway.

Yes, I am running MySQL and Postgres for dev and production. So what's the optimal volblocksize for my setup? And most importantly, how do I change it?
I was also thinking to mount SMB/iSCSI share from truenas in an ubuntu vm to run databases but I haven't tuned any of this stuff over in truenas yet. BTW my truenas is also running SSDs in its pool. So that's a separate issue now

Then you definitely want to use a striped mirror of 4 disks so your volblocksize can be lower. With that the default 8K volblocksize would be good (you could change the volblocksize pool wide by changing the pool's "Block size" under Datacenter -> Storage -> YourPool -> Edit. But keep in mind that the volblocksize can only be set at creation of a zvol. So to change it you would need to destroy and recreate all virtual disks). It would also perfectly fit the 8K blocksize that Postgres is writing with. And it would be fine for the 16K blocksize that MySQL is using too. It's always fine to write with a bigger blocksize to a lower blocksize but not the other way round.
Right now with a raidz1 of 5 disks and an 8K volblocksize you lose 20% of your raw capacity to parity + 30% of your pool capacity (you can't see that because it is indirectly caused by every zvol being 60% bigger than it should be) to padding overhead. So you lose the same 50% of capacity as with a striped mirror with 8K volblocksize... except that a striped mirror would get 2 times better IOPS and better reliability because UP TO 2 drives may fail without losing data.
And I guess if you are already using TrueNAS you know that you shouldn't fill up your pool more than 80% or it will get slow until it switches into panic mode at 90%.
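The padding numbers above can be reproduced with a small model of how raidz allocates sectors. This is a sketch, assuming the commonly described rule: each block needs its data sectors, plus one parity sector per stripe row for each parity level, and the total is rounded up to a multiple of (parity + 1). It also assumes 4K sectors (ashift=12) and no compression. The function names are hypothetical, and the model is an approximation rather than a reference implementation.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return (a + b - 1) // b


def raidz_sectors(volblocksize: int, ndisks: int, nparity: int, ashift: int = 12):
    """Return (data_sectors, allocated_sectors) for one zvol block on a raidz vdev."""
    sector = 1 << ashift
    data = ceil_div(volblocksize, sector)
    parity = ceil_div(data, ndisks - nparity) * nparity
    unit = nparity + 1  # allocations are padded to a multiple of this
    allocated = ceil_div(data + parity, unit) * unit
    return data, allocated


def space_efficiency(volblocksize: int, ndisks: int, nparity: int) -> float:
    """Fraction of the allocated raw space that holds actual data."""
    data, allocated = raidz_sectors(volblocksize, ndisks, nparity)
    return data / allocated


eff_8k = space_efficiency(8 * 1024, ndisks=5, nparity=1)    # 2 data sectors in 4 allocated
eff_32k = space_efficiency(32 * 1024, ndisks=5, nparity=1)  # 8 data sectors in 10 allocated

# What the pool expects when only parity is lost: 4 of 5 disks hold data
ideal = (5 - 1) / 5
zvol_inflation_8k = ideal / eff_8k

print(f"8K: {eff_8k:.0%} usable, 32K: {eff_32k:.0%} usable, "
      f"8K zvols appear {zvol_inflation_8k:.1f}x their size")
```

Under these assumptions an 8K block on the 5 disk raidz1 keeps only half of the raw space for data, while 32K gets back to the 80% that parity alone would allow.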

fahadshery said:
High RAM usage worried me initially, so I started to explore this issue... So do you think I should leave it as is, and that I can create more VMs and containers and the ARC will automatically shrink to accommodate them without me worrying about it?

Finally, I am blown away by your detailed reply. Much appreciated mate

fahadshery said:
@Stefan_R seems to think that I should leave the ARC usage alone. So I will trust the judgement of the experts here

In general a big ARC should be fine. But there are some cases where your host suddenly needs free RAM and ZFS can't free up the ARC fast enough. In that case the OOM killer can kick in and kill some guests. If something like that happens it might be useful to limit the ARC so you always have some free RAM.

fahadshery · Dec 3, 2021

Dunuin said:
Raidz IOPS doesn't scale with the number of drives. So no matter if you use 5, 10 or 100 drives in a raidz, that pool isn't faster than a single drive alone... at least for IOPS. But with raidz you are forced to choose a bigger volblocksize compared to a single drive, so there is possibly more overhead when doing writes smaller than 32K to your pool. Let's say you do random writes with 4K blocks to a raidz1 with a 32K volblocksize. Now you need to read or write 32K for each 4K block, so your performance will only be 1/8th compared to a single disk or mirror that is working with a volblocksize of 4K and only needs to write/read 4K for each 4K block.
Also, a raidz isn't doing magic. It's doing complex parity calculations that a single disk or (striped) mirror doesn't need to do. So you lose performance again. And then each disk might need to wait for the 4 other disks. You lose performance again. So in theory a raidz1 of 5 disks should get the same IOPS performance as a single disk. But because of all the additional overhead and added complexity of a raidz pool, a single disk pool will get better IOPS.

Ah! This makes sense, no wonder I was experiencing lower read/write speeds. Thanks for the explanation!

Dunuin said:
Then you definitely want to use a striped mirror of 4 disks so your volblocksize can be lower. With that the default 8K volblocksize would be good (you could change the volblocksize pool wide by changing the pool's "Block size" under Datacenter -> Storage -> YourPool -> Edit. But keep in mind that the volblocksize can only be set at creation of a zvol. So to change it you would need to destroy and recreate all virtual disks).

OK, I will need to recreate the zfs pool in a mirror config. So which option to choose?

Mirror or RAID10?

Dunuin said:
It would also perfectly fit the 8K blocksize that Postgres is writing with. And it would be fine for the 16K blocksize that MySQL is using too. It's always fine to write with a bigger blocksize to a lower blocksize but not the other way round.

So the Default 8K would work perfectly for both MySQL and Postgres?

Dunuin said:
Right now with a raidz1 of 5 disks and an 8K volblocksize you lose 20% of your raw capacity to parity + 30% of your pool capacity (you can't see that because it is indirectly caused by every zvol being 60% bigger than it should be) to padding overhead. So you lose the same 50% of capacity as with a striped mirror with 8K volblocksize... except that a striped mirror would get 2 times better IOPS and better reliability because UP TO 2 drives may fail without losing data.

Got that!

Dunuin said:
And I guess if you are already using TrueNAS you know that you shouldn't fill up your pool more than 80% or it will get slow until it switches into panic mode at 90%.

I have the 11 x 1.8TB SSDs in the TrueNAS VM. Would the same principle apply? What are the best tuning params for those?

Dunuin said:
In general a big ARC should be fine. But there are some cases where your host suddenly needs free RAM and ZFS can't free up the ARC fast enough. In that case the OOM killer can kick in and kill some guests. If something like that happens it might be useful to limit the ARC so you always have some free RAM.

perfect! thank you so much for your time and great n00b free explanation. Can't thank you enough mate!

Dunuin · Dec 3, 2021

fahadshery said:
OK, I will need to recreate the zfs pool in a mirror config. So which option to choose?

Mirror or RAID10?

Raid10 and then select 4 disks. Then it will create a striped mirror (so two mirrors striped together). You could add the 5th disk later as a hot spare so it will take the place of a failed disk in case one of the other 4 disks dies.

fahadshery said:
So the Default 8K would work perfectly for both MySQL and Postgres?

Yep, 8K should be the ideal blocksize for a 4 disk striped mirror (raid10) and Postgres.

fahadshery said:
I have the 11 x 1.8TB SSDs in the TrueNAS VM. Would the same principle apply? What are the best tuning params for those?

How did you setup that pool and what are you using it for? Volblocksize only matters if you got zvols and in general they are only used by TrueNAS if you run some VMs there too or if you are using iSCSI. Everything else should use datasets which are using a recordsize of 128K. But in contrast to the volblocksize the recordsize isn't fixed. So with a recordsize of 128K ZFS can write data as 4K up to 128K records. So changing the recordsize isn't that important.
And if you primarily use that TrueNAS for SMB/NFS shares to store some files raidz1/2/3 isn't that bad because you don't need that much IOPS for that workload, as long as you don't primarily store very small (few KB) files.

Am I right that you virtualize your TrueNAS on your PVE server? How do you get your 11 SSDs into that VM then?

fahadshery · Dec 3, 2021

Dunuin said:
Raid10 and then select 4 disks. Then it will create a striped mirror (so two mirrors striped together). You could add the 5th disk later as a hot spare so it will take the place of a failed disk in case one of the other 4 disks dies.

Got it! thanks

Dunuin said:
Yep, 8K should be the ideal blocksize for a 4 disk striped mirror (raid10) and Postgres.

Super. The question is, because I will be losing 50% of usable storage, what's the best way to run these databases? I mean, there is one Postgres database with a storage size of more than 900GB! Should I just take a backup on TrueNAS and clear it from the Proxmox storage? What's your opinion on this?

Dunuin said:
How did you setup that pool and what are you using it for? Volblocksize only matters if you got zvols and in general they are only used by TrueNAS if you run some VMs there too or if you are using iSCSI.

I passed the drives by using this. I created two pools. 1 with 6 drives in raidz2 config and the other with 5 drives in a raidz config. All are 1.8TB SSDs.
I don't like creating VMs in TrueNas. So all the VMs will be in the proxmox.
I do use iSCSI for different reasons such as nextcloud data, my website and blog. Some MySQL and Postgres data store etc. etc.
I also have some datasets which I mount in a VM over in proxmox using SMB share using some other reasons such as taking a backup of certain files and folders.

Dunuin said:
Everything else should use datasets which are using a recordsize of 128K. But in contrast to the volblocksize the recordsize isn't fixed. So with a recordsize of 128K ZFS can write data as 4K up to 128K records. So changing the recordsize isn't that important.

Got it!

Dunuin said:
And if you primarily use that TrueNAS for SMB/NFS shares to store some files raidz1/2/3 isn't that bad because you don't need that much IOPS for that workload, as long as you don't primarily store very small (few KB) files.

Yes, SMB shares for Apple time machine. Some open shares to store files.

Dunuin said:
Am I right that you virtualize your TrueNAS on your PVE server?

That's correct!

Dunuin said:
How do you get your 11 SSDs into that VM then?

I passed the drives using this tutorial.

Dunuin · Dec 3, 2021

fahadshery said:
Got it! thanks

Super. The question is, because I will be losing 50% of usable storage, what's the best way to run these databases? I mean, there is one Postgres database with a storage size of more than 900GB! Should I just take a backup on TrueNAS and clear it from the Proxmox storage? What's your opinion on this?

With just 5 drives your options would be:

	Raw storage:	Parity loss:	Padding loss:	Keep 20% free:	Real usable space:	8K random write IOPS:	8K random read IOPS:	big sequential write throughput:	big sequential read throughput:
5x 800 GB raidz1 @ 8K volblocksize:	4000 GB	- 800 GB	-1200 GB	-400 GB	≈1600 GB or 1490 GiB	≈1x	≈1x	4x	4x
4x 800 GB str. mirror @ 8K volblocksize:	3200 GB	- 1600 GB	0	-320 GB	≈1280 GB or 1192 GiB	≈ 2x	≈4x	2x	4x
5x 800 GB raidz1 @ 32K volblocksize:	4000 GB	- 800 GB	0	-640 GB	≈ 2560 GB or 2384 GiB	≈ 0.25x	≈0.25x	4x	4x

So you need to decide if you want capacity or performance.
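For anyone who wants to check the capacity columns, here is a minimal sketch that redoes the subtraction for the three options. It only uses the figures from the table (raw size, parity loss, padding loss, 20% kept free); the helper name and dictionary keys are made up for illustration.

```python
GB = 10**9
GIB = 2**30


def usable_gb(raw_gb: int, parity_gb: int, padding_gb: int) -> int:
    """Subtract parity and padding, then keep 20% of what is left free."""
    after_overhead = raw_gb - parity_gb - padding_gb
    reserve = after_overhead // 5  # the 20% that should stay free
    return after_overhead - reserve


options = {
    "5x raidz1 @ 8K": usable_gb(4000, 800, 1200),
    "4x striped mirror @ 8K": usable_gb(3200, 1600, 0),
    "5x raidz1 @ 32K": usable_gb(4000, 800, 0),
}

for name, gb in options.items():
    gib = round(gb * GB / GIB)  # convert decimal GB to binary GiB
    print(f"{name}: {gb} GB or {gib} GiB")
```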

fahadshery said:
I passed the drives by using this. I created two pools. 1 with 6 drives in raidz2 config and the other with 5 drives in a raidz config. All are 1.8TB SSDs.
I don't like creating VMs in TrueNas. So all the VMs will be in the proxmox.
I do use iSCSI for different reasons such as nextcloud data, my website and blog. Some MySQL and Postgres data store etc. etc.
I also have some datasets which I mount in a VM over in proxmox using SMB share using some other reasons such as taking a backup of certain files and folders.

Then you have the same problem as with your raidz1. Using a raidz2 of 6 disks with an 8K volblocksize you lose up to 67% of your raw capacity (so only 2 of 6 disks usable if only using zvols and 4 of 6 disks usable if only using datasets). You would need to increase the volblocksize to at least 16K to only lose 33% of raw capacity for both datasets and zvols. So that's bad again if you want to store Postgres on iSCSI. But without increasing the volblocksize from 8K to 16K your 800GB Postgres DB would consume 1600GB on that pool, because with the padding overhead everything stored in a zvol would be double in size.

fahadshery said:
I passed the drives using this tutorial.

That will work, but keep in mind that this is not real physical passthrough. All the disks TrueNAS can see are still virtualized and so create some overhead. That is also why you can't use SMART monitoring in TrueNAS and why TrueNAS will report all your SSDs as 512B/512B logical/physical blocksize even if the real drives may be 512B/4K logical/physical blocksize.
If you want TrueNAS to be able to directly access the real drives without the additional abstraction/virtualization layer in between, you would need to get a PCIe HBA card, attach the 6 SSDs to it and use PCI passthrough to pass the whole HBA with all disks attached to it into the TrueNAS VM.

fahadshery · Dec 4, 2021

Dunuin said:
With just 5 drives your options would be:

	Raw storage:	Parity loss:	Padding loss:	Keep 20% free:	Real usable space:	8K random write IOPS:	8K random read IOPS:	big sequential write throughput:	big sequential read throughput:
5x 800 GB raidz1 @ 8K volblocksize:	4000 GB	-800 GB	-1200 GB	-400 GB	≈1600 GB or 1490 GiB	≈1x	≈1x	4x	4x
4x 800 GB str. mirror @ 8K volblocksize:	3200 GB	-1600 GB	0	-320 GB	≈1280 GB or 1192 GiB	≈2x	≈4x	2x	4x
5x 800 GB raidz1 @ 32K volblocksize:	4000 GB	-800 GB	0	-640 GB	≈2560 GB or 2384 GiB	≈0.25x	≈0.25x	4x	4x

So you need to decide if you want capacity or performance.

WOWZA!!!! I wish there was a tool that would offer recommended ZFS pool options e.g. Get user input i.e. Do you prefer performance or storage? How many drives do you have? Are they SSDs or Spindles? Will you be running databases or just hosting files? Are you hosting VMs or not? And then offer 2-3 options with a recommended option based on what user selected. Would you be interested in building that tool and make it online with me? I am a software engineer and can help with code and share it with experts such as TechnoTim, Linus etc. etc. to fine tune it??
Now, coming back: what I am thinking of doing is to build based on the middle option in your table above, and use the 5th SSD to host Postgres only? These SSDs are 12Gbps and that should be good? I could take regular backups of this SSD to my TrueNAS in case of disk failure? Do you think it's a good idea?

Dunuin said:
Then you have the same problem as with your raidz1. Using a raidz2 of 6 disks with an 8K volblocksize you lose up to 67% of your raw capacity (so only 2 of 6 disks usable if only using zvols and 4 of 6 disks usable if only using datasets). You would need to increase the volblocksize to at least 16K to only lose 33% of raw capacity for both datasets and zvols. So that's bad again if you want to store Postgres on iSCSI. But without increasing the volblocksize from 8K to 16K your 800GB Postgres DB would consume 1600GB on that pool, because with the padding overhead everything stored in a zvol would be double in size.

This is bad news! I am using both iSCSI (mounted on Ubuntu to host Nextcloud data, Postgres and MySQL data etc.) and datasets (for file storage and backups). I am passing 11 x SSDs (1.8TB per disk) to TrueNAS and built two pools: one with 6 disks in raidz2 and one with 5 disks in a raidz config. Now I am thinking of moving Postgres and MySQL over to my newly created raid10 pool with 4 SSDs. What's the best config for these 11 SSDs over in TrueNAS?

Dunuin said:
That will work, but keep in mind that this is not real physical passthrough. All the disks TrueNAS can see are still virtualized and so create some overhead. That is also why you can't use SMART monitoring in TrueNAS and why TrueNAS will report all your SSDs as 512B/512B logical/physical blocksize even if the real drives may be 512B/4K logical/physical blocksize.
If you want TrueNAS to be able to directly access the real drives without the additional abstraction/virtualization layer in between, you would need to get a PCIe HBA card, attach the 6 SSDs to it and use PCI passthrough to pass the whole HBA with all disks attached to it into the TrueNAS VM.

Unfortunately, that would require me to change my backplane and connect an additional controller in IT mode. My current backplane is a 16 drive backplane connected to an H310 mini controller in IT mode. Now, if I want to split the backplane (i.e. 8 drives in one and the other 8 in a second backplane, with the two backplanes connected to 2 different HBA controllers), I'll obviously need to replace it with 2 backplanes and add the cables for power, signal and SAS connection. I'll also need a 2nd HBA controller for the 2nd backplane. This is a huge task and I am trying to source parts, but in the meantime I have to pass those 11 x SSDs directly.

Dunuin · Dec 4, 2021

fahadshery said:
WOWZA!!!! I wish there was a tool that would offer recommended ZFS pool options e.g. Get user input i.e. Do you prefer performance or storage? How many drives do you have? Are they SSDs or Spindles? Will you be running databases or just hosting files? Are you hosting VMs or not? And then offer 2-3 options with a recommended option based on what user selected. Would you be interested in building that tool and make it online with me? I am a software engineer and can help with code and share it with experts such as TechnoTim, Linus etc. etc. to fine tune it??
Now, coming back: what I am thinking of doing is to build based on the middle option in your table above, and use the 5th SSD to host Postgres only? These SSDs are 12Gbps and that should be good? I could take regular backups of this SSD to my TrueNAS in case of disk failure? Do you think it's a good idea?

The problem is that there are so many factors you need to take into account, because everyone has a different workload, that such a tool would be hard to use correctly. And all the numbers I calculated above are just theoretical and based on formulas for how ZFS should behave. There are a lot of other things that will change the real capacity/performance:
- what blocksize the SSDs are working with internally (which no manufacturer will tell you in any datasheet)
- what type of compression algorithm you use
- if you use deduplication or not
- how well your data deduplicates
- how well your data compresses
- ratio of datasets and zvols used
- ...
So it's always a good idea to also benchmark different setups to see what works best for you in reality. But that again is hard because you would need to create your own individual benchmark that matches your real-world workload (read/write ratio, sequential/random ratio, different blocksizes used, if your workload is highly or less parallelizable, if you are writing more on block level to zvols or on file level to datasets, different caching modes, different protocols, encryption or not, how well data compresses, how well data deduplicates, checking that the CPU or HBA or PCIe link between HBA and CPU are not bottlenecking, sync/async writes, ...).
So that is a super complex topic and you basically need to understand how ZFS works in detail to be able to make good decisions.

fahadshery said:
This is bad news! I am using both iSCSI (mounted on Ubuntu to host Nextcloud data, Postgres and MySQL data etc.) and datasets (for file storage and backups). I am passing 11 x SSDs (1.8TB per disk) to TrueNAS and built two pools: one with 6 disks in raidz2 and one with 5 disks in a raidz config. Now I am thinking of moving Postgres and MySQL over to my newly created raid10 pool with 4 SSDs. What's the best config for these 11 SSDs over in TrueNAS?

8K volblocksize for your Postgres is only possible with:
- a single disk
- 2 disk mirror
- 4 disk striped mirror.
For everything else you would need to increase the volblocksize above 8K to not lose too much capacity to padding.

16K volblocksize for your MySQL would be possible with:
- a single disk
- 2 disk mirror
- 4, 6 or 8 disk striped mirror (6 disk would not be ideal)
- 3 disk raidz1

And iSCSI adds another layer of overhead, and the network stack adds additional latency, so I would also recommend creating a local pool for your DBs. An 8K pool should be fine for all kinds of DBs if you want to run them all from the same pool. If you want even better performance you could create a dedicated pool for each DB that matches its blocksize for sync writes (so an 8K volblocksize for Postgres, 16K for MySQL, 64K for MSSQL and so on). That way ZFS could write the data with the blocksize the DB is natively using, and each DB would get its own pool, so the workload is split between pools and each pool gets less IO to handle.

Running 11x 1.8 TB SSDs you got 19.8TB raw storage.

	usable space @ 100% datasets:	usable space @ 100% zvols:	8K random write IOPS:	8K random read IOPS:	big seq. write throughput:	big seq. read throughput:	disk may fail:
11 disk raidz1 @ 8K:	14.4 TB	7.92 TB	1x	1x	10x	10x	1
11 disk raidz2 @ 8K:	12.96 TB	5.28 TB	1x	1x	9x	9x	2
11 disk raidz3 @ 8K:	11.52 TB	3.96 TB	1x	1x	8x	8x	3
11 disk raidz1 @ 64K:	14.4 TB	14.4 TB	0.125x	0.125x	10x	10x	1
11 disk raidz2 @ 64K:	12.96 TB	12.04 TB	0.125x	0.125x	9x	9x	2
11 disk raidz3 @ 32/64K:	11.52 TB	10.56 TB	0.25x / 0.125x	0.25x / 0.125x	8x	8x	3
8 disk str. mirror @ 16K:	5.76 TB	5.76 TB	2x	4x	4x	8x	1-4
3 disk raidz1 @ 16K:	2.88 TB	2.88 TB	0.5x	0.5x	2x	2x	1
4 disk str. mirror @ 8K:	2.88 TB	2.88 TB	2x	4x	2x	4x	1-2
7 disk raidz1 @ 32/64K:	10.8 TB	8.06 TB	0.25x / 0.125x	0.25x / 0.125x	6x	6x	1
2 disk mirror @ 4/8K:	1.44 TB	1.44 TB	1x	2x	1x	2x	1
9 disk raidz1 @ 64K:	11.53 TB	11.53 TB	0.125x	0.125x	8x	8x	1
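Several of the "usable space @ 100% datasets" figures can be reproduced with one line of arithmetic. The sketch below assumes each figure is the number of data disks times 1.8 TB with 20% kept free; that assumption is inferred from the numbers rather than stated in the post, and rows not listed here may rest on other assumptions. The helper name is hypothetical.

```python
DISK_GB = 1800  # each SSD is 1.8 TB


def dataset_usable_gb(data_disks: int, disk_gb: int = DISK_GB) -> int:
    """Capacity of the data disks with 20% kept free (integer GB)."""
    return data_disks * disk_gb * 4 // 5


# layout -> number of disks' worth of data capacity (total disks minus parity or mirror copies)
layouts = {
    "11 disk raidz1": 10,
    "11 disk raidz2": 9,
    "11 disk raidz3": 8,
    "8 disk str. mirror": 4,
    "4 disk str. mirror": 2,
    "3 disk raidz1": 2,
    "2 disk mirror": 1,
}

usable = {name: dataset_usable_gb(n) for name, n in layouts.items()}

for name, gb in usable.items():
    print(f"{name}: {gb / 1000:.2f} TB")
```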

So for a file server for big files (best capacity and sequential reads/writes) I would go with "11 disk raidz1 @ 64K" (but bad IOPS, so bad for DBs and small files, and bad reliability). But reliability could be increased by using "11 disk raidz2 @ 64K" instead.

If you primarily want a fast but small storage I would go with an "8 disk str. mirror @ 16K" for decent IOPS (for IOs of 16K and above, like a MySQL DB wants, it would even be 4x write and 8x read, and sequential reads are great too) + "3 disk raidz1 @ 16K" so you have a small second pool that could even be used for MySQL, so you don't waste those drives. Both together would be 8.64TB of usable space.

The best all-rounder would be "4 disk str. mirror @ 8K" (small but fast pool for your zvols that could even run Postgres) + "7 disk raidz1 @ 32/64K" (big pool great for datasets with medium to big files). So that would result in 13.68TB (10.8TB for datasets + 2.88TB for zvols/datasets) of usable storage, which isn't that much below the optimum usable capacity (14.4TB) that you would get with an "11 disk raidz1".

If you don't need that much space for zvols you can also use "2 disk mirror @ 4/8K" + "9 disk raidz1 @ 64K".

And fusion pools might also be an option. In that case you only need to create a single pool instead of two pools.
You could basically use the same 2 options as above, but instead of creating an additional mirror / striped mirror pool you could add the mirror / striped mirror as "special devices".

An example:
You create a normal pool with 7 disks as raidz1. But then you add another 4 disks to that pool as "special devices" in a striped mirror. Now you can tell TrueNAS to store everything that uses a blocksize of, for example, 4K to 64K on the striped mirror and everything that is 128K or bigger on the normal raidz1. That way you only have one pool, but all files of 128K or bigger will be stored on that 7 disk raidz1 and all files smaller than 128K (and also all zvols as long as they have a volblocksize of 64K or lower) will be stored on the 4 disk striped mirror instead. All metadata will be stored on that striped mirror too, which even speeds up the 7 disk raidz1 because it won't be hit by all the IOPS caused by metadata.
You can choose for every dataset which files should be stored on the special devices by changing the dataset's "special_small_blocks" attribute.
You can for example set your "special_small_blocks" to 16K. In that case all metadata, all 8K volblocksize zvols and all files smaller than 16K will be stored on that striped mirror and all files 16K or bigger will be stored on that raidz1. And you can set that for each dataset, so if you don't want a dataset to use the striped mirror you could set special_small_blocks to 0, and if you want all data to be written to that striped mirror you can set it to 1024K.
That way stuff that needs good IOPS (so zvols, small files and metadata) will automatically be stored on the fast striped mirror and all big files that just need good throughput and space efficiency will be stored on the raidz. That way data is sent to the storage that works best for that workload. As soon as your special devices are full, all new small data (metadata, small files, zvols) will spill over to the raidz (so it gets slower but at least keeps working).

So that might also be an interesting choice for your setup.

And if you have a lot of sync writes (from all your DBs) it might also be useful to get a very durable NVMe SSD with the lowest possible latency (Intel Optane would be best but also expensive) and add it as a SLOG. That way you would get a very fast write cache for sync writes, boosting your pool's performance even more. But it's useless for async writes, so you would really need to check first if it would make sense.

fahadshery · Dec 7, 2021

Dunuin said:
The problem is that there are so many factors you need to take into account, because everyone has a different workload, that such a tool would be hard to use correctly. And all the numbers I calculated above are just theoretical and based on formulas for how ZFS should behave. There are a lot of other things that will change the real capacity/performance:
- what blocksize are the SSDs internally working with (which no manufacturer will tell you in any datasheet)
- what type of compression algorithm you use
- if you use deduplication or not
- how well your data deduplicates
- how well your data compresses
- ratio of datasets and zvols used
- ...

Yes, but for people like me who are new to ZFS, wouldn't it give them a good starting point? I mean, something is better than absolutely nothing? We could incorporate the questions you raised as well as mine and give them some recommendations to look at? It's like a decision-tree type Q&A and then boom, here is what you 'could' look at, with some pros and cons?

Dunuin said:
So it's always a good idea to also benchmark different setups to see what works best for you in reality. But that again is hard because you would need to create your own individual benchmark that matches your real-world workload (read/write ratio, sequential/random ratio, different blocksizes used, whether your workload is highly or less parallelizable, whether you are writing more on block level to zvols or on file level to datasets, different caching modes, different protocols, encryption or not, how well the data compresses, how well the data deduplicates, checking that the CPU or HBA or PCIe link between HBA and CPU are not bottlenecking, sync/async writes, ...).

This made me do some investigating. I have 12Gb/s SSDs, but my RAID controller only supports up to 6Gb/s, so my 12Gb/s drives negotiate down and run at 6Gb/s. What I am doing now is splitting my 16-drive backplane into 2 x 8 direct-access backplanes. Both will be attached to HBA cards that support 12Gb/s. I have decided to pass one LSI controller directly to the TrueNAS VM so that I will be able to see 8 drives in TrueNAS. I will be using these mainly for storage (as a file server).

Dunuin said:
So that is a super complex topic and you basically need to understand how ZFS works in detail to be able to make good decisions.

8K volblocksize for your Postgres is only possible with:
- a single disk
- 2 disk mirror
- 4 disk striped mirror.
For everything else you would need to increase the volblocksize above 8K to not lose too much capacity to padding.

16K volblocksize for your MySQL would be possible with:
- a single disk
- 2 disk mirror
- 4, 6 or 8 disk striped mirror (6 disk would not be ideal)
- 3 disk raidz1

I will create a 3 disk raidz1 local ZFS pool at 16K volblocksize for local storage or the MySQL database. I don't wish to pass the drives directly to TrueNAS because of not being able to run the QEMU agent etc. (the point you raised earlier). Therefore I will be passing the LSI controller directly.

Dunuin said:
And iSCSI adds another layer of overhead and the network stack will add additional latency, so I would also recommend creating a local pool for your DBs. An 8K pool should be fine for all kinds of DBs if you want to run them all from the same pool. If you want even better performance you could create a dedicated pool for each DB that matches its blocksize for sync writes (so an 8K volblocksize for Postgres, 16K for MySQL, 64K for MSSQL and so on). That way ZFS could write the data with the blocksize the DB is natively using, and each DB would get its own pool, so the workload is split between pools and each pool gets less IO to handle.

Running 11x 1.8 TB SSDs you got 19.8TB raw storage.

| Layout | Usable space @ 100% datasets | Usable space @ 100% zvols | 8K random write IOPS | 8K random read IOPS | Big seq. write throughput | Big seq. read throughput | Disks that may fail |
|---|---|---|---|---|---|---|---|
| 11 disk raidz1 @ 8K | 14.4 TB | 7.92 TB | 1x | 1x | 10x | 10x | 1 |
| 11 disk raidz2 @ 8K | 12.96 TB | 5.28 TB | 1x | 1x | 9x | 9x | 2 |
| 11 disk raidz3 @ 8K | 11.52 TB | 3.96 TB | 1x | 1x | 8x | 8x | 3 |
| 11 disk raidz1 @ 64K | 14.4 TB | 14.4 TB | 0.125x | 0.125x | 10x | 10x | 1 |
| 11 disk raidz2 @ 64K | 12.96 TB | 12.04 TB | 0.125x | 0.125x | 9x | 9x | 2 |
| 11 disk raidz3 @ 32/64K | 11.52 TB | 10.56 TB | 0.25x / 0.125x | 0.25x / 0.125x | 8x | 8x | 3 |
| 8 disk str. mirror @ 16K | 5.76 TB | 5.76 TB | 2x | 4x | 4x | 8x | 1-4 |
| 3 disk raidz1 @ 16K | 2.88 TB | 2.88 TB | 0.5x | 0.5x | 2x | 2x | 1 |
| 4 disk str. mirror @ 8K | 2.88 TB | 2.88 TB | 2x | 4x | 2x | 4x | 1-2 |
| 7 disk raidz1 @ 32/64K | 10.8 TB | 8.06 TB | 0.25x / 0.125x | 0.25x / 0.125x | 6x | 6x | 1 |
| 2 disk mirror @ 4/8K | 1.44 TB | 1.44 TB | 1x | 2x | 1x | 2x | 1 |
| 9 disk raidz1 @ 64K | 11.53 TB | 11.53 TB | 0.125x | 0.125x | 8x | 8x | 1 |

Is it possible for you to share this spreadsheet with the formulas in it? I would like to check these configs with 8 disks at different volblocksizes and look at some of these numbers to make a final decision.

Dunuin said:
So for a file server for big files (best capacity and sequential reads/writes) I would go with "11 disk raidz1 @ 64K" (but bad IOPS, so bad for DBs and small files, and bad reliability). Reliability could be increased by using "11 disk raidz2 @ 64K" instead.

Is there an easier way to set 64K in TrueNAS? I haven't really seen anyone doing it...

Dunuin said:
If you primarily want a fast but small storage I would go with an "8 disk str. mirror @ 16K" for decent IOPS (for 16K and above IOPS like a MySQL DB wants it even would be 4x read and 8x write and sequential reads are great too) + "3 disk raidz1 @ 16K" so you get a small second pool that could even be used for MySQL, so you don't waste those drives. Both together would be 8.64TB of usable space.

This is the config I will be having: 8 disks raidz1 @ 16K or 32K??? I am not sure what the impact would be, because I just want to use these 8 drives purely for storage. The other 3 disk pool would be for MySQL or performance-related stuff.

Dunuin said:
And fusion pools might also be an option. In that case you only need to create a single pool instead of two pools.

I do want two pools purely because it's a little easier to upgrade. If I have one pool then I will need to upgrade all 11 drives at the same time, because if there is even a single drive with lower capacity it would pull the whole pool's capacity down with it?

Dunuin said:
You could basically use the same 2 options as above, but instead of creating an additional mirror / striped mirror pool you could add the mirror / striped mirror as "special devices".

An example:
You create a normal pool with 7 disks as raidz1. But then you add another 4 disks to that pool as "special devices" in a striped mirror. Now you can tell TrueNAS to store everything that uses a blocksize of, for example, 4K to 64K on the striped mirror and everything that is 128K or bigger on the normal raidz1. That way you only have one pool, but all files of 128K or bigger will be stored on that 7 disk raidz1 and all files smaller than 128K (and also all zvols, as long as they use a volblocksize of 64K or lower) will be stored on the 4 disk striped mirror instead. All metadata will be stored on that striped mirror too, which even speeds up the 7 disk raidz1 because it won't be hit by all the IOPS caused by metadata.

Interesting idea. But I think for my use case I am going to stick with an 8 drive pool in TrueNAS and a 3 drive local pool.

Lastly, I am so grateful for your time and helping me out on my journey. Have a great festive season!

Dunuin · Dec 8, 2021

fahadshery said:
Is it possible for you to share this spreadsheet with the formulas in it? I would like to check these configs with 8 disks at different volblocksizes and look at some of these numbers to make a final decision.

Sorry, I did that in my head, so there is no spreadsheet I could share.

For datasets the usable capacity is:
(TotalNumberOfDrives - ParityDrives) * DiskSize * 0.8
The "* 0.8" is because 20 percent of the pool should be kept free.
So an 8x 1.8TB disk raidz1 would be:
(8 - 1) * 1.8TB * 0.8 = 10.08 TB
And an 8x 1.8TB disk striped mirror would be:
(8 - 4) * 1.8TB * 0.8 = 5.76 TB
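
As a small sketch, the dataset formula above can be written as a Python function so other layouts can be plugged in. The function name `dataset_usable_tb` and the `keep_free` parameter are just illustrative names, and the result is as theoretical as the formula itself.

```python
def dataset_usable_tb(total_drives: int, parity_drives: int,
                      disk_size_tb: float, keep_free: float = 0.2) -> float:
    """(TotalNumberOfDrives - ParityDrives) * DiskSize * 0.8

    For a striped mirror, count one drive of every mirror pair as "parity".
    """
    return (total_drives - parity_drives) * disk_size_tb * (1 - keep_free)


# 8x 1.8TB raidz1 and 8x 1.8TB striped mirror, as in the two examples above
print(round(dataset_usable_tb(8, 1, 1.8), 2))  # 10.08
print(round(dataset_usable_tb(8, 4, 1.8), 2))  # 5.76
```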

For zvol usable capacity you additionally need to calculate the padding overhead and add it to the parity loss, as is done in this spreadsheet, which shows the total capacity loss by number of sectors vs. number of drives for raidz1/2/3. One sector is the blocksize determined by your ashift, so an ashift of 12 results in 4K sectors. If that spreadsheet refers to 4 sectors, it actually means 4x 4K = 16K volblocksize in case your pool was set up with an ashift of 12.

IOPS are equal to the number of stripes. A simple raidz1/2/3 would always only read/write with the IOPS performance of a single drive (=1x) no matter how many drives that raidz consists of. Two raidz1 striped together would be 2x. 4 mirrors striped together (so a striped mirror of 8 disks) would be 4x and so on.
But in the case of zvols you need to keep in mind that nothing smaller than your volblocksize can be read/written. If you want to random read 1000x 8K blocks from a zvol with 32K volblocksize, that will cause 1000x 32K read operations (so 8000 instead of 2000 4K operations) and you read 4 times the amount of data, so you only get 1/4 of the IOPS. With writes it is even worse, because in addition to writing 1000x 32K (instead of 1000x 8K) you also get an additional 1000x 32K reads: it will first read the 32K block from the pool into RAM, then add/edit 8K of that 32K block in RAM, and then write that 32K block to the pool again. So that will actually result in 16000 instead of 2000 4K operations.
For sequential instead of random writes the additional reads when writing aren't that bad, because the ARC will cache them. So for 1000x 8K sequential writes it only needs to read 250x 32K blocks (instead of 1000x 32K reads when doing random writes).
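
The amplification arithmetic above can be checked with a short Python sketch. It is a deliberately simplified model under the assumptions used in this post (4K sectors, every IO no bigger than the volblocksize, no compression, no caching for random IO); the function names are made up for illustration.

```python
def random_io_4k_ops(count: int, io_size_k: int, volblocksize_k: int,
                     write: bool = False, sector_k: int = 4) -> int:
    """4K operations caused by `count` random IOs of io_size_k on a zvol.

    Assumes io_size_k <= volblocksize_k, so each IO touches one full block.
    A partial-block write is modelled as read-modify-write (read + write).
    """
    ops_per_block = volblocksize_k // sector_k
    passes = 2 if (write and io_size_k < volblocksize_k) else 1
    return count * ops_per_block * passes


def sequential_write_block_reads(count: int, io_size_k: int,
                                 volblocksize_k: int) -> int:
    """Blocks that must be read once when the writes are sequential."""
    return (count * io_size_k) // volblocksize_k


print(random_io_4k_ops(1000, 8, 8))                # 2000  (8K volblocksize)
print(random_io_4k_ops(1000, 8, 32))               # 8000  (random reads)
print(random_io_4k_ops(1000, 8, 32, write=True))   # 16000 (random writes)
print(sequential_write_block_reads(1000, 8, 32))   # 250
```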

For raidz1/2/3 the big sequential write and read throughput is equal to the number of data bearing drives (so TotalNumberOfDrives - ParityDrives).
For a mirror or striped mirror the big sequential write throughput is equal to TotalNumberOfDrives - ParityDrives and the read throughput is equal to the total number of drives.
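
The two throughput rules above, as a minimal Python sketch. The result is a multiple of a single drive's sequential throughput; the function name and the `mirror` flag are illustrative, and real numbers will depend on the controller, CPU and workload.

```python
def seq_throughput(total_drives: int, parity_drives: int,
                   mirror: bool = False) -> tuple[int, int]:
    """Return (write, read) big sequential throughput as multiples of one drive.

    raidz1/2/3: both equal the number of data-bearing drives.
    (striped) mirror: writes scale with data-bearing drives, reads with all drives.
    """
    write = total_drives - parity_drives
    read = total_drives if mirror else write
    return write, read


print(seq_throughput(11, 1))              # (10, 10) 11 disk raidz1
print(seq_throughput(8, 4, mirror=True))  # (4, 8)   8 disk striped mirror
print(seq_throughput(2, 1, mirror=True))  # (1, 2)   2 disk mirror
```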

And if you stripe something you need to multiply the blocksize by the number of stripes. So if you have 4 mirrors with 4K blocksize you want a volblocksize of 4x 4K = 16K. Also keep in mind that ZFS only allows blocksizes that are a power of two (2^X). So in theory a 6 disk striped mirror would require 3x 4K = 12K, but only 8K or 16K are allowed, so you need to choose between 8K and 16K, neither of which is optimal.
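
A quick Python sketch of that rule of thumb: multiply the sector size by the number of stripes, and if the result is not a power of two, return the two nearest allowed sizes. The function name `volblocksize_options_k` is made up, and this only illustrates the reasoning in this post, not anything ZFS computes itself.

```python
def volblocksize_options_k(stripes: int, sector_k: int = 4) -> list[int]:
    """Candidate volblocksizes (in K) for a striped mirror with `stripes` mirrors."""
    ideal = stripes * sector_k
    if ideal & (ideal - 1) == 0:          # already a power of two
        return [ideal]
    lower = 1 << (ideal.bit_length() - 1)  # nearest power of two below
    return [lower, lower * 2]              # and the one above


print(volblocksize_options_k(4))  # [16]     4 mirrors (8 disks)
print(volblocksize_options_k(3))  # [8, 16]  3 mirrors (6 disks): 12K is not allowed
print(volblocksize_options_k(2))  # [8]      2 mirrors (4 disks)
```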

And all this is based on fixed sizes, which can be different in reality. In theory a 32K volblocksize zvol would read/write 32K of data. But most of the time you want at least lz4 block-level compression, so the real size of a block depends on the data in that block. A 32K block of an already compressed video file might not be compressible at all, so the full 32K needs to be read/written. But a 32K block of a database might be highly compressible and end up as only 8K of compressed data, so the theoretical calculation with 32K won't fit that block. The better your data compresses, the more your real-world results will differ from these theoretical calculations.

fahadshery said:
Is there an easier way to set 64K in TrueNAS? I haven't really seen anyone doing it...

When you create a new zvol in TrueNAS you can set the blocksize using the GUI (but check it afterwards... my TrueNAS server ignores my input). Or you can create that zvol manually using the CLI, where you can set the volblocksize parameter (`zfs create -V 10GB -o volblocksize=16k YourPool/YourNewZvol`).
If you mean "special_small_blocks", that can also be set for each dataset using the GUI. And ZFS will hand down all attributes to the children. So if you have a dataset with special_small_blocks set to 64K and create a zvol on top of it, that zvol will inherit the 64K special_small_blocks value from the dataset.

fahadshery said:
This is the config I will be having: 8 disks raidz1 @ 16K or 32K??? I am not sure what the impact would be, because I just want to use these 8 drives purely for storage. The other 3 disk pool would be for MySQL or performance-related stuff.

With an 8 disk raidz1 you would lose:

Volblocksize @ ashift=12:	Parity + Padding loss of raw capacity:
4K	50%
8K	50%
16K	33%
32K	20%
64K	20%
128K	16%
256K	14%
512K	14%
1M	13%

So you always need to make compromises. For the lowest padding overhead you would need the biggest volblocksize. In my opinion 32K is the sweet spot here. Increasing the volblocksize further only gives a very small improvement in overhead. Going too low with the volblocksize makes no sense, because the lower your volblocksize, the smaller the capacity advantage over a striped mirror.
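
For anyone who wants to recompute the table above for other layouts, here is a minimal Python sketch. It assumes the raidz allocation rule as it is commonly described (one set of parity sectors per stripe row, and the whole allocation rounded up to a multiple of parity + 1 sectors); the function name `raidz_loss` is made up. With these assumptions it reproduces the 8 disk raidz1 percentages listed above, but it is still only the theoretical number, without compression.

```python
def raidz_loss(disks: int, parity: int, volblocksize_k: int,
               ashift: int = 12) -> float:
    """Fraction of raw capacity lost to parity + padding for one zvol block."""
    sector_k = (1 << ashift) // 1024           # ashift=12 -> 4K sectors
    data = volblocksize_k // sector_k          # data sectors per block
    rows = -(-data // (disks - parity))        # stripe rows needed (ceiling)
    total = data + parity * rows               # data + parity sectors
    multiple = parity + 1
    total = -(-total // multiple) * multiple   # pad up to multiple of parity+1
    return 1 - data / total


# 8 disk raidz1 @ ashift=12, same volblocksizes as in the table above
for size_k in (4, 8, 16, 32, 64, 128, 256, 512, 1024):
    print(f"{size_k}K\t{raidz_loss(8, 1, size_k):.0%}")
```

Calling `raidz_loss(5, 1, 8)` and `raidz_loss(5, 1, 32)` gives 50% and 20%, which matches the 5 disk raidz1 numbers from earlier in the thread.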

fahadshery said:
I do want two pools purely because it's a little easier to upgrade. If I have one pool then I will need to upgrade all 11 drives at the same time, because if there is even a single drive with lower capacity it would pull the whole pool's capacity down with it?

Special devices don't need to be the same size as the other normal drives. And in the case of a raidz or simple mirror, replacing a small drive with a bigger one won't give you more capacity until you have replaced all the drives. I for example have 4x 8TB HDDs in raidz1 + 3x 200GB SSDs in a three-way mirror as special devices. Later I will add another 4x 8TB HDDs and recreate that pool so I get an 8x 8TB HDD raidz2 + 3x 200GB SSDs in a three-way mirror as special devices. That's totally fine because I just want the special devices to store metadata, and the rule of thumb there is that the special device capacity should be around 0.3 percent of the pool capacity (so (64TB - 16TB) * 0.003 = 144 GB of special devices).
So it would be totally fine if you only replace all of your 8 SSDs or only all of your 3 special device SSDs. In that respect it's no different from 2 independent pools.
With a striped mirror, replacing drives with bigger ones is easier because you don't need to replace them all at the same time. Using an 8 disk striped mirror of 1.8TB drives you could replace disks with bigger ones in pairs. So running a striped mirror with 2x 3.6TB + 2x 1.8TB + 2x 1.8TB + 2x 1.8TB would be possible (but not really recommended, because ZFS won't actively redistribute the data across all stripes, so you lose performance until it has passively equalized over time). It's also easier to add new drives to a striped mirror. You can always add another mirror to that striped mirror to increase its capacity and performance.
Since this year it's also possible to add new drives to a raidz1/2/3, but this will only increase the capacity, not the performance.

fahadshery · Dec 19, 2021

Right,

First things first. I want to thank you for your time and effort to educate me. This thread is extremely valuable for me, and here is a summary of my own research + your main points in case someone else stumbles upon it:

NFS:
1. Turn off sync and atime for datasets that will be mounted using NFS. IOPS improved significantly. (I ran iozone benchmarks along with blogging and Postgres benchmarks). Obviously there are negatives associated with this as well, so make sure you do a little bit of research on your own.
Datasets:
1. Record size can be increased from the 128K default to 1M for datasets that will be used for storing media files (movies, other media etc.)
2. With 192GB RAM, I doubt cache will help. SLOG isn't needed unless you have iSCSI or NFS with reliability issues.
3. RAIDZ (including Z2, Z3) is good for storing large sequential files. ZFS will allocate long, contiguous stretches of disk for large blocks of data, compressing them, and storing the parity in an efficient manner.
4. ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle bits modified after creation. This is optimal for RAIDZ. It's what most people come to ZFS for, and what the vast majority of the information out there is about. The other is storage of small, randomly written and randomly read data. This includes such things as database storage, virtual machine disk (ESXi VMDK, etc.) storage, and other uses where lots of updates are made within the data. This is optimal for mirrors.
iSCSI:
1. volblocksize only matters if you have zvols and can ONLY be set at zvol creation time. In general zvols are only used by TrueNAS if you run some VMs within TrueNAS or if you are using iSCSI. Everything else should use datasets, which use a recordsize of 128K by default.
2. The volblocksize & sparse properties can only be set at the time of zvol creation (again, making it super clear).
3. The "standard" process for iSCSI means the incoming data lands in RAM only. So the last few seconds of data written are volatile - if your TrueNAS system crashes, they're gone for good, and VMware won't know that until it goes to read them and gets garbage or an error back. Setting sync=always forces ZFS to ensure the data is on stable (non-volatile) storage before replying back, which closes the gap for data integrity.
4. sparse allows the zvol to only consume as much space as is physically written, rather than the logical space.
Proxmox:
1. According to the benchmarks, 1 CPU socket was a little bit quicker at READ AND WRITE operations, but 2 sockets assigned to a VM are better at READ ONLY operations.
2. There is no real physical passthrough. All the disks TrueNAS can see are still virtualized and so create some overhead. That's also why you can't use SMART monitoring in TrueNAS and why TrueNAS will report all your SSDs as 512B/512B logical/physical blocksize even if the real drives may be 512B/4K logical/physical blocksize.
Performance:
1. You need to use mirrors for performance.
2. SLOG (separate intent log drive) - speed improvements can be gained from only a few gigabytes of separate log storage. Your storage pool will have the write performance of an all-flash array with the capacity of a traditional spinning disk array. This is why we ship every spinning-disk TrueNAS system with a high-performance flash SLOG and make them a standard option on our FreeNAS Certified line.
3. If you need more IOPS (performance), use fewer disks per stripe. If you need more usable space, use more disks per stripe.
4. A RAIDZ2 vdev may end up about as fast as a single drive, which is fine for archival storage, but not good for storing lots of active VMs or databases.
TrueNAS:
1. Plan to use lots of vdevs for performance.
2. A RAIDZ vdev tends to adopt the IOPS characteristics of its slowest component member.

I have been thinking about installing TrueNAS on my local-lvm drive so that when Proxmox reboots or I upgrade it, TrueNAS automatically boots up first. I am booting Proxmox from a separate SSD. I have never used my local-lvm for anything else! By doing so, I could pass both of my SAS3 HBAs to the TrueNAS VM and get TrueNAS to manage my pools and disks. (Previously, I was passing one HBA with 8 disks attached to TrueNAS, and the other HBA was used within Proxmox for a local ZFS pool for VMs and containers.) The only concern I have is whether there will be a lot of reads/writes to this SSD, because it will be hosting both Proxmox and the TrueNAS VM?

After deliberation and a lot of agony, here is how I am planning to lay out my drives in TrueNAS:

So in a nutshell:

I will have 2 separate striped mirrors. One consists of 4 x 800GB SSDs and the other of 4 x 1.8TB SSDs.
Then for file storage, just one big pool of 7 x 1.8TB drives.

Why did I do it??? because these were my main use-cases and whether I needed High/Low IOPS:

Postgres and MySQL databases
1. Requires high IOPS
Media for Plex
1. Low IOPS - High sequential
Persistent Storage for Kubernetes (production and dev)
1. Mix really. Depends what they are doing and where the storage is.
System Backups (both windows and Macs)
1. Low IOPS
Hosting Website and blogs
1. Low IOPS
Creating VMs and containers using Proxmox
1. High IOPS
NFS mounts in Proxmox to host VMs or mount them inside a VM
1. High IOPS
iSCSI as a storage drive for Windows VMs to host a Steam library or other games
1. High IOPS

I didn't need fusion pools because I don't have any spinning disks at all (i.e. all 12Gb/s SSDs).

I want to thank you again for your help and efforts. Have a great festive season!

fahadshery · Dec 25, 2021

Hi,

Sorry to keep this thread alive, but this is how I created a zvol for Postgres in TrueNAS. Is this what you mean by 8K volblocksize for Postgres?

I have set the block size to 8KiB as discussed above. I am planning to export it to my proxmox as iSCSI share and then pass it to my ubuntu VM *somehow* because I am unsure if I could pass this zvol directly to a guest VM.

Dunuin · Dec 25, 2021

fahadshery said:
Hi,

Sorry to keep this thread alive, but this is how I created a zvol for Postgres in TrueNAS. Is this what you mean by 8K volblocksize for Postgres?

Yes. But check it afterwards with `zfs get volblocksize YourPool/postgres`, because here it ignores what I choose in the webUI and still uses the default values.

fahadshery said:
I have set the block size to 8KiB as discussed above. I am planning to export it to my proxmox as iSCSI share and then pass it to my ubuntu VM *somehow* because I am unsure if I could pass this zvol directly to a guest VM.

Why do you add it to proxmox at all? Why not directly use that iSCSI from within your VM?

fahadshery · Dec 25, 2021

Dunuin said:
Why do you add it to proxmox at all? Why not directly use that iSCSI from within your VM?

Just me losing my mind lol. I will do it exactly this way (by mounting it directly in my VM).

AngryAdm · Dec 25, 2021

fahadshery said:
Hi All,

I have a server with `192G` RAM. I boot it off a single SSD.

I created a ZFS pool composed of 5 x 800GB enterprise SSDs in a `raidz` configuration. This pool was created only for VMs/containers and is used for local storage.

Now when I checked the zfs summary I see this:
View attachment 31993

As you can see, the ARC max size is about 95 GiB. I know that by default it aims to use 50% of the available memory. I think that's just too much?
I am looking for advice: do I need to change it somehow? If yes, what's the best number? Or should I just leave it as is? If I leave it as is, will it still allow me to stand up new VMs? I am barely running 3 VMs (with the largest RAM allocation being 32GB) and out of 192GB total RAM, it is showing that 151GB is already used?
What could be the best way forward?

Thanks in advance!

You should in fact increase the arc_max size, or alternatively do nothing; that will be fine as well.
ARC memory will be released if the system needs the memory for anything else.

Maher Khalil · Feb 21, 2022

fahadshery said:
Right,

First things first. I want to thank you for your time and effort to educate me. This thread is extremely valuable for me, and here is a summary of my own research + your main points in case someone else stumbles upon it:

After deliberations and a lot of agony, Here is How I am planning to layout my drives in TrueNas:

So in a nutshell:

I will have 2 separate striped mirrors. One consists of 4 x 800GB SSDs and the other of 4 x 1.8TB SSDs.
Then for file storage, just one big pool of 7 x 1.8TB drives.


In your photo, you mentioned RAID10. Do you mean raidz2?

fahadshery · Nov 27, 2022

Dunuin said:
With just 5 drives your options would be:

| Layout | Raw storage | Parity loss | Padding loss | Keep 20% free | Real usable space | 8K random write IOPS | 8K random read IOPS | Big sequential write throughput | Big sequential read throughput |
|---|---|---|---|---|---|---|---|---|---|
| 5x 800 GB raidz1 @ 8K volblocksize | 4000 GB | -800 GB | -1200 GB | -400 GB | ≈1600 GB or 1490 GiB | ≈1x | ≈1x | 4x | 4x |
| 4x 800 GB str. mirror @ 8K volblocksize | 3200 GB | -1600 GB | 0 | -320 GB | ≈1280 GB or 1192 GiB | ≈2x | ≈4x | 2x | 4x |
| 5x 800 GB raidz1 @ 32K volblocksize | 4000 GB | -800 GB | 0 | -640 GB | ≈2560 GB or 2384 GiB | ≈0.25x | ≈0.25x | 4x | 4x |

So you need to decide if you want capacity or performance.

Hi after a whole year again

I am not sure if you're still around helping people but I am trying to recreate your numbers using a spreadsheet. I do not understand (now) how you were able to separate out parity from padding from the spreadsheet?

Lastly, how do you come up with 8K random write IOPS, 8K random read IOPS, big seq write throughputs and big seq. read throughputs?
Thanks very much!
