[SOLVED] Maximizing ZFS performance

DR4GON

Member
Sep 7, 2021
TL;DR: RAM cache pros/cons, and how to? Or SSD cache pros/cons, and how to? Optimal choice?

I don’t understand how to get more out of my array, and maybe someone can help me here, or point me in the right direction.

I have a pool with 2 RAIDZ2 vdevs of 6x4TB each. All drives are WD Red 4TB (WD40EFZX), the connection is 10GbE, and the server has 96GB of RAM. I have no cache, just the 12 drives in the pool “NAS”. I am in the process of adding another 6x4TB vdev.

What do I need to do to increase the reading/writing performance? Preferably read performance.
 
With two RAIDZ2 vdevs, your current pool has the random I/O performance of roughly two disks, so that is really suboptimal for performance.

If you do not want to change the slow disk pool, you could increase the overall performance with two special devices in a mirror (e.g. enterprise SSDs, even just two 240 GB ones) that will hold all the metadata and some data that really needs to be SSD-fast. You can also add two Optane drives as a mirrored SLOG device to speed up all sync writes, e.g. for database operations. Besides that, the only way to speed things up is more vdevs, because the speed scales linearly with the number of vdevs.
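As a rough sketch of what that could look like on the CLI, assuming the pool is called "NAS" as in your post and using placeholder device names (use stable /dev/disk/by-id/ paths for real devices):
Code:
# mirrored special vdev for metadata (and optionally small data blocks)
zpool add NAS special mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B
# mirrored SLOG, which only accelerates sync writes
zpool add NAS log mirror /dev/disk/by-id/nvme-OPTANE_A /dev/disk/by-id/nvme-OPTANE_B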
 
If you do not want to change the slow disk pool, you could increase the overall performance with two special devices in a mirror (e.g. enterprise SSDs, even just two 240 GB ones) that will hold all the metadata and some data that really needs to be SSD-fast. You can also add two Optane drives as a mirrored SLOG device to speed up all sync writes, e.g. for database operations.
It's a bit cost-prohibitive to go full enterprise SSDs, so I guess I'd be interested in the SSDs-for-metadata or SLOG route. I just don't know how to do that. If I add two drives to the system for that function, do I add them to the pool with a command, or is it done through the GUI?

What size drives do I need for 96TB raw, 64TB usable? And what kind of drives would be "cost friendly" but still offer better performance? The server is older and doesn't have NVMe, but I do have 4 spare SATA ports. I don't want to get the wrong SSDs, as I know there can be a big difference between technologies; I just don't know what's what.
Besides that, the only way to speed things up is more vdevs, because the speed scales linearly with the number of vdevs.
I was aware of something like that happening, and my enclosure will eventually house 4 vdevs, each RAIDZ2 6x4TB.
 
It's a bit cost-prohibitive to go full enterprise SSDs, so I guess I'd be interested in the SSDs-for-metadata or SLOG route.
Don't cheap out on SSDs. ZFS kills consumer SSDs really fast (I lost 3 in the last 3 months... and of the 20 SSDs in my homelab that use ZFS, only 4 are consumer SSDs). So that's 75% consumer SSD losses in the last 3 months.^^
Also keep in mind that special devices are not cache. If you lose that mirror, all data on all your HDDs is lost. So you maybe even want 3 enterprise SSDs in a three-way mirror to match the reliability of your other vdevs, so that any 2 SSDs of that special vdev can fail without data loss.
I just don't know how to do that. If I add two drives to the system for that function, do I add them to the pool with a command, or is it done through the GUI?
Only by using the CLI. But that's an easy one-liner.
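Something like this, for example (a sketch assuming the pool name "NAS" from this thread and placeholder device names; the ashift should match the pool):
Code:
zpool add -o ashift=12 NAS special mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B /dev/disk/by-id/ata-SSD_C
zpool status NAS    # the new "special" mirror should appear below the raidz2 vdevs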
What size drives do I need for 96TB raw, 64TB usable?
See here how to calculate it: https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
Usually about 0.4% of your total pool's capacity.
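As a rough worked example with that rule of thumb (0.4% is only an estimate; the real metadata share depends on recordsize and file sizes):
Code:
64 TB usable * 0.004 ≈ 256 GB
96 TB raw    * 0.004 ≈ 384 GB
-> a special mirror of SSDs around 256-512 GB each is a sensible ballpark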
And what kind of drives would be "cost friendly" but still offer better performance? The server is older and doesn't have NVMe, but I do have 4 spare SATA ports. I don't want to get the wrong SSDs, as I know there can be a big difference between technologies; I just don't know what's what.
Enterprise SSDs aren't expensive. They are cheap, at least if you look at the price per TB of TBW (lifespan) and not the price per TB of capacity. The initial cost of an enterprise SSD might be higher, but it's better to pay double the price for an SSD that will last 5 years than to buy a consumer SSD at half the price every year.
I was aware of something like that happening, and my enclosure will eventually house 4 vdevs, each RAIDZ2 6x4TB.
Wow, that's a nice enclosure. I already have trouble finding a silent rackable case that can fit 8 HDDs and 18 SSDs. Basically, everything I find is either way too enterprise-like and loud (proprietary parts, PSUs with 40mm fans, 5000+ RPM 80mm fans, ...) or it squeezes the HDDs so close together, without any shock damping, that I don't have a clue how my shucked WD Whites without any "anti-resonance voodoo NAS firmware" should survive that for long without damage from vibration or heat.
 
See here how to calculate it: https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
Usually about 0.4% of your total pool's capacity.
and you need to send/receive everything to take advantage of this.
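That is because data and metadata already on the pool only end up on the special vdev once they are rewritten. One hedged way to do that per dataset is a local send/receive, sketched here with a hypothetical dataset name (needs enough free space; double-check snapshots and properties before destroying anything):
Code:
zfs snapshot -r NAS/mydata@migrate
zfs send -R NAS/mydata@migrate | zfs receive NAS/mydata_new
# only after verifying the new copy:
zfs destroy -r NAS/mydata
zfs rename NAS/mydata_new NAS/mydata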

So you maybe even want 3 enterprise SSDs in a three-way mirror to match the reliability of your other vdevs
great idea!

It's a bit cost-prohibitive to go full enterprise SSDs
Not anymore. We dropped 3.5'' drives in primary storage over a decade ago and went with 2.5'' SAS only, then replaced that over the last 5 years with all-flash; they are already on par with SAS drive prices. We're running over 300 SSDs and have never had an SSD failure in that time - all enterprise, of course. We also introduced our first NVMe storage array 2 years ago, but that performance does not translate well to the fabric. It is faster than any SAS storage, but "only" 2-3x faster, nothing compared to local storage in the machine.
 
Don't cheap out on SSDs. [...] Also keep in mind that special devices are not cache.
My current array is HDDs; that's what I meant by SSDs being cost-prohibitive. I assume you mean getting enterprise SSDs for the "special device" role. That does make sense. It would be pretty annoying to spend all that time caring about HDD reliability and then forgo it for the SSDs.

I'm coming up against a few acronyms that I'm finding difficult to wrap my head around. I'll try to walk through what I've done and what I think I should do next:

If I started with zpool create -f -o ashift=12 NAS raidz2 /dev/sd* /dev/sd* etc. and zpool add -f -o ashift=12 NAS raidz2 /dev/sd* /dev/sd* etc., then as far as I can tell, ARC is set up by default:
Code:
# free -m
              total        used        free      shared  buff/cache   available
Mem:          96729       52728       42476         269        1524       43018
Swap:             0           0           0
# awk '/^size/ { print $1 " " $3 / 1048576 }' < /proc/spl/kstat/zfs/arcstats
size 48618
If I were to set up L2ARC, it would be as simple as zpool add -f -o ashift=12 NAS cache /dev/sd*, and since it is considered volatile, I wouldn't need enterprise drives yet.

Then, the step after that would be using zpool add -f -o ashift=12 NAS special mirror /dev/sd* /dev/sd* /dev/sd*, which would benefit from the reliability of enterprise SSDs and a 3-way mirror.

I'm not sure what to make of zfs set special_small_blocks=* NAS. How do I find my current block size, and how do I decide the special block size? I don't quite understand what Wendell at Level1Techs means.

Code:
# find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k:   7363
  2k:    201
  4k:    120
  8k:    458
 16k:   3182
 32k:   8402
 64k:   3419
128k:   2262
256k:   3553
512k:   6228
  1M:  11758
  2M:  11734
  4M:   6197
  8M:  24714
 16M:   2428
 32M:    810
 64M:    629
128M:   1058
256M:   2891
512M:   3294
  1G:   1662
  1G:    108
  1G:    171
  1G:      2
  1G:     38
  1G:    644
  1G:      9
Does this help me figure out what to set my special block size?

Finally, LOG and SLOG.. Do I need a LOG or a SLOG?
Only by using the CLI. But that's an easy one-liner.

See here how to calculate it: https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
Usually about 0.4% of your total pool's capacity.
Total capacity as in 96TB in drives, or total capacity as in 64TB usable storage?
Enterprise SSDs aren't expensive.
You're right, I was incorrectly assuming I would have to get a 4TB SSD to match the HDDs in the vdev.

and you need to send/receive everything to take advantage of this.
I don't know what this means.
Not anymore
Yes "anymore". The benefit of living in Australia, just means that tech costs more, or we get less variety. When a 4TB WD RED Plus (WD40EFZX) is AUD$142 and the first Enterprise SSD at my local PC shop is a Intel DC P4510 1TB (SSDPE2KX010T807) and is a whopping AUD$479, it's a bit ridiculous to upgrade a Plex server with my life savings, and future house deposit.
 
If I were to set up L2ARC, it would be as simple as zpool add -f -o ashift=12 NAS cache /dev/sd*, and since it is considered volatile, I wouldn't need enterprise drives yet.
L2ARC and SLOG most of the time won't help much.
And L2ARC only helps with reads. Also, the bigger your L2ARC is, the more RAM ZFS will need to manage it, so you are sacrificing some super-fast RAM cache to get a bigger but slower SSD cache. The usual rule is that L2ARC only makes sense if you have already upgraded your RAM to the limit your mainboard allows and your read cache is still too small to hold a complete copy of a very important big file that you always access (like a TB-sized database file). Otherwise it would be better to just buy more RAM.
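If you want a quick check of whether more read cache would even help, the overall ARC hit ratio from the same kstat file used earlier in this thread is a reasonable first look (a rough sketch; a low hit ratio under a read-heavy workload is what would justify more RAM or an L2ARC):
Code:
# overall ARC hit ratio since boot
awk '/^hits/ {h=$3} /^misses/ {m=$3} END {printf "ARC hit ratio: %.1f%%\n", h*100/(h+m)}' /proc/spl/kstat/zfs/arcstats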
"special devices" on the other hand will speed up sync+async reads+writes. Here I got a performance increase of factor 2 to 3 when working with smaller files as the HDDs IOPS performance isn't bottlenecking that early anymore as big part of all IO is metadata and these now don't hit the HDDs anymore.
Then, the step after that would be using zpool add -f -o ashift=12 NAS special mirror /dev/sd* /dev/sd* /dev/sd*, which would benefit from the reliability of enterprise SSDs and a 3-way mirror.
Jup.
I'm not sure what to make of zfs set special_small_blocks=* NAS. How do I find my current block size, and how do I decide the special block size? I don't quite understand what Wendell at Level1Techs means.
By default, all metadata will be stored on those SSDs, but the SSDs can also store data. This is where special_small_blocks comes into play. If you don't want any data stored on the SSDs, just keep special_small_blocks at its default value ("0"). But if you do want to store data in addition to metadata on those SSDs, you can use special_small_blocks as a kind of filter. Let's say you have a dataset "NAS/mydataset" and you set special_small_blocks=8K for it. Then all records (so all files) smaller than or equal to 8K will be stored on the SSDs, and all files bigger than 8K will be stored on the HDDs. You can also set special_small_blocks for zvols, but there it won't make much sense, as all blocks are the same size. It could still be used to force all blocks of a zvol onto the SSDs instead of the HDDs, though.
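As a concrete sketch based on that example (NAS/mydataset is the hypothetical dataset from above; note that the setting only affects newly written blocks, and a value at or above the dataset's recordsize would send all of its data to the SSDs):
Code:
zfs get recordsize NAS/mydataset             # default is 128K
zfs set special_small_blocks=8K NAS/mydataset
zfs get special_small_blocks NAS/mydataset   # verify; only newly written blocks are affected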
Code:
# find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k:   7363
  2k:    201
  4k:    120
  8k:    458
 16k:   3182
 32k:   8402
 64k:   3419
128k:   2262
256k:   3553
512k:   6228
  1M:  11758
  2M:  11734
  4M:   6197
  8M:  24714
 16M:   2428
 32M:    810
 64M:    629
128M:   1058
256M:   2891
512M:   3294
  1G:   1662
  1G:    108
  1G:    171
  1G:      2
  1G:     38
  1G:    644
  1G:      9
Does this help me figure out what to set my special block size?
No, that is for estimating how big your SSDs need to be, not for finding out the block size.
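If you do want to see the actual block-size distribution ZFS is using (rather than file sizes), zdb can print a block-size histogram for the whole pool; I believe this is also what the linked Level1Techs thread suggests. It walks a lot of metadata, so run it when the pool isn't busy:
Code:
zdb -Lbbbs NAS    # among other stats, prints a histogram of block sizes across the pool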
Finally, LOG and SLOG.. Do I need a LOG or a SLOG?
A SLOG will only help with sync writes, not async ones, and as long as you don't run big databases on that pool, 99% of your writes are probably async. If you've got a spare SSD it won't hurt, but usually it's not really necessary, and it will wear the SSD heavily, so consumer SSDs won't have a long lifespan.
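If you're unsure whether sync writes even matter for your workload, you can at least check how the datasets are configured (a quick sketch; "standard" lets the application decide, "always" forces every write to be sync):
Code:
zfs get -r sync NAS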
Total capacity as in 96TB in drives, or total capacity as in 64TB usable storage?

You're right, I was incorrectly assuming I would have to get a 4TB SSD to match the HDDs in the vdev.
Not completely sure. You either need 2-3x 256GB SSDs or 2-3x 512GB SSDs. But it isn't that bad if the special devices run out of space: the metadata will then spill over to the HDDs, so everything keeps working, just that the spilled-over metadata won't be any faster than it is now.
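And once a special vdev is in place, its fill level is easy to keep an eye on with per-vdev output:
Code:
zpool list -v NAS    # shows SIZE/ALLOC/FREE per vdev, including the special mirror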
 
Thanks for all the info, you’ve made so many things make sense. Now to start putting it into action. Lol
 
When a 4TB WD Red Plus (WD40EFZX) is AUD$142 and the first enterprise SSD at my local PC shop is an Intel DC P4510 1TB (SSDPE2KX010T807) at a whopping AUD$479, it's a bit ridiculous to upgrade a Plex server with my life savings and future house deposit.
Yes, but those drives are not enterprise level; they are "just NAS", which is prosumer. Enterprise is SAS, not SATA, and those drives would cost a lot more.
 
I was right? Interesting.
Yes, sorry. I was comparing enterprise hard disks with enterprise SSDs; I should have made that clear. The price difference there is negligible in 2.5'', and 3.5'' is a total niche there. If you want performance, you put in a lot of spindles and do RAID10 (with 3-way mirrors for the same redundancy as RAIDZ2, but a lot more horsepower).

(No sane person would run non-enterprise hardware in an enterprise setting and think that's OK. After the first crash or bout of slowness, when an external auditor or performance specialist comes in, those "non-enterprise is OK" people are usually fired.)
 
Yes, sorry. I was comparing enterprise hard disks with enterprise SSDs
That's why I was confused, because I specifically said this in the original post:
All drives are WD Red 4TB (WD40EFZX)

No sane person would run non-enterprise hardware in an enterprise setting
Good thing I never brought up utilizing non enterprise hardware in an enterprise environment:
TL;DR: RAM cache pros/cons, and how to? Or SSD cache pros/cons, and how to? Optimal choice?
I believe my original question has been answered at this point.
 
