Thanks @Dunuin for your reply!
Ok. I'm using my best Google-Fu to digest the concepts.
So... my original idea of joining the entire 1 TB SSD with the 2 TB HDD in the same pool makes no sense. I had hoped I wouldn't have to split data manually between them, but it seems I will have to, for best performance: frequently used data on the SSD and the least used data on the HDD.
I also see that with the thin provisioning of LVM-Thin I'll have to monitor the space closely. If the pool fills up, the result is catastrophic, usually leaving the affected VMs beyond repair, according to this source.
Now digesting ZFS options:
1) So the ARC is mandatory for ZFS. It takes a varying amount of primary memory (which I'll just call RAM), but it only helps when the same data is read repeatedly.
By default, the ARC would use between about 1 and 16 GB of a 32 GB system, but it can be limited. The rule of thumb is 1 GB of RAM per TB of disk space, i.e. the ARC takes RAM equal to 0.1% of the HDD space. The rule also asks for an additional X GB of RAM regardless of the disk space, and the recommendation for X varies between 2 GB and 8 GB depending on the source. However, this video explains that a smaller ARC will also work, and more L2ARC can make up for the limited RAM.
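To make that rule of thumb concrete for my hardware, here is a quick back-of-the-envelope sketch in Python (the 8 GiB cap at the end is just an example value I picked, and zfs_arc_max is the OpenZFS module parameter used to limit the ARC):

```python
# Rough ARC sizing for my setup: 32 GB RAM, 1 TB SSD + 2 TB HDD (assumed
# to be the whole pool capacity), using the "1 GB per TB + X GB" rule above.
pool_size_tb = 1 + 2                 # TB of disk space behind ZFS
base_gb_low, base_gb_high = 2, 8     # the "X GB regardless of disk space" part

arc_low = pool_size_tb + base_gb_low     # 5 GB
arc_high = pool_size_tb + base_gb_high   # 11 GB
print(f"rule-of-thumb ARC: {arc_low}-{arc_high} GB "
      f"(default cap would be ~16 GB, half of my 32 GB RAM)")

# If I cap it, the zfs_arc_max module parameter takes a value in bytes,
# e.g. for an 8 GiB cap (example value, not a recommendation):
print(f"zfs_arc_max = {8 * 1024**3}")    # 8589934592
```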
2) The L2ARC is optional: another cache layer, stored on the SSD. It also increases RAM usage and RAM pressure, since the L2ARC index is kept in RAM. So an L2ARC is only recommended if the ARC hit rate is below 90% and the ARC can't be grown further because there isn't enough RAM. The L2ARC should be sized at around 3x to 10x the ARC, with 5x being the most common ratio, so it usually takes around 0.5-1% of the HDD space from the SSD.
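A quick sanity check on where those percentages land, plus the RAM cost of the L2ARC index (Python sketch; the ~70 bytes per cached record is a figure I've seen quoted elsewhere, not from the sources above):

```python
# Where the "0.5-1% of the HDD space" figure comes from:
# ARC rule of thumb ~ 1 GB per TB of pool, L2ARC guideline = 3x-10x the ARC.
arc_gb_per_tb = 1
for ratio in (3, 5, 10):
    l2arc_gb_per_tb = ratio * arc_gb_per_tb
    print(f"{ratio}x ARC -> {l2arc_gb_per_tb} GB of L2ARC per TB of pool "
          f"(~{l2arc_gb_per_tb / 10:.1f}% of the pool)")

# RAM cost of the in-RAM index, assuming ~70 bytes per cached record
# (the exact header size depends on the OpenZFS version) and the default
# 128K recordsize:
l2arc_bytes = 20 * 1024**3               # e.g. a 20 GiB L2ARC (~1% of 2 TB)
records = l2arc_bytes // (128 * 1024)
print(f"~{records * 70 / 1024**2:.0f} MiB of extra RAM for the L2ARC index")
```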
3) The special device would move the HDD pool's metadata to the SSD, greatly increasing performance, and that takes very little space. Redirecting small blocks to the SSD as well takes more space, but the required size is still small. It's recommended to redirect only blocks up to 64K at most; otherwise the default ZFS block size of 128K would have to be increased, which would negatively affect performance and space usage. In the end, the special device usually takes around 3% of the HDD space from the SSD.
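Putting point 3 into numbers for my 2 TB HDD (a sketch under the assumptions above; the 3% figure is the rule of thumb, not a measurement):

```python
# Special device sizing for my 2 TB HDD pool, using the ~3% rule of thumb
# above (metadata alone takes very little; redirecting small blocks up to
# 64K is what grows it, and the real number depends on the data).
hdd_gb = 2 * 1000
special_gb = hdd_gb * 0.03
print(f"~{special_gb:.0f} GB of the SSD for the special device")   # ~60 GB

# The cutoff is set per dataset with the special_small_blocks property;
# 64K keeps it below the default 128K recordsize, so that not *all* data
# ends up on the SSD (command shown as a string only, not executed):
print("zfs set special_small_blocks=64K <pool>/<dataset>")
```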
4) Compression is optional, but it will certainly accelerate HDDs, especially for reads, except in the very specific case where the bottleneck is the CPU instead of the HDD speed.
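As a rough illustration of point 4 (the compression ratios below are made-up examples, not benchmarks):

```python
# Why compression tends to speed up an HDD: the disk physically reads
# fewer bytes, so effective throughput scales roughly with the compression
# ratio, as long as the CPU keeps up.
hdd_mb_s = 190
for compressratio in (1.0, 1.5, 2.0):
    print(f"compressratio {compressratio:.1f}x -> "
          f"~{hdd_mb_s * compressratio:.0f} MB/s effective read")
```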
5) Compression on NVMe may or may not be useful, as explained in this 2020 Proxmox article:
- They selected a drive known to do 3500 MB/s for read and write without compression, but the testing uses small 4k blocks, for which it benchmarked at 205 MB/s bandwidth and 51k IOPS.
- Then, with compression enabled, that bandwidth goes up to 500 MB/s for a single job. With concurrency it goes much higher, up to almost 3000 MB/s read speed with 32 jobs.
- The IOPS value can be better or worse, depending on the case.
- For a single job, it decreases to 2k-7k IOPS.
- For 32 jobs, it roughly triples reads to 150k IOPS and drops writes a bit, to 40k IOPS.
To better evaluate whether compression would be worth using on NVMe, I found this quote:
"While IOPS was important when measuring hard drive performance, most real-world situations do not require more than a thousand inputs/outputs per second. Therefore, IOPS is rarely viewed as an important metric in SSD performance."
Those tests didn't benchmark the latency increase due to compression, but it seems fine since the IOPS is still acceptable. Since average IO size x IOPS = throughput in MB/s, we can assume the average IO size has increased dramatically.
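Here is that arithmetic checked against the quoted figures (my own quick Python check, using the worst-case 2k IOPS single-job number):

```python
# Checking "average IO size x IOPS = throughput" against the numbers
# quoted from the article above (my arithmetic, not part of the article).
def avg_io_kb(throughput_mb_s, iops):
    return throughput_mb_s * 1000 / iops        # KB per IO

print(f"{avg_io_kb(205, 51_000):.0f} KB")   # ~4 KB  -> matches the 4k test blocks
print(f"{avg_io_kb(500, 2_000):.0f} KB")    # ~250 KB -> much larger average IOs
```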
Their CPU has 2.6x more cores than an i5-12400, and they had 4x more memory than my 32 GB of RAM, but the memory speed was the same (DDR4-3200) and the i5-12400's single-core speed is actually 75% faster, so compression should be OK on my hardware.
==
Conclusions:
I still need to evaluate whether ZFS on the NVMe SSDs is worth it, since it would give up 20% of the SSD space plus some RAM just for performance gains, and an SSD doesn't allow as much performance gain as an HDD does: the SSDs do 2400 MB/s and 7000 MB/s, which is closer to the DDR4-3200 memory bandwidth of 25,600 MB/s, while the HDDs do only around 190 MB/s. Also, DDR4 latency is in the range of a few nanoseconds while HDD latency is in the range of a few milliseconds, a million times more, with PCIe NVMe SSDs in the middle at a few microseconds. The table below summarizes this:
| | DDR4-3200 memory | PCIe 4 NVMe SSD | PCIe 3 NVMe SSD | HDDs |
| --- | --- | --- | --- | --- |
| Latency | a few nanoseconds | a few microseconds | a few microseconds | a few milliseconds |
| Latency (relative) | 1 | x1,000 | x1,000 | x1,000,000 |
| Bandwidth | 25,600 MB/s | 7,000 MB/s | 2,400 MB/s | 190 MB/s |
| Bandwidth (vs. DDR4) | 1 | x 1/3.6 | x 1/10 | x 1/135 |
| Bandwidth (vs. HDD) | x135 | x37 | x13 | 1 |
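For reference, the relative rows come from simple division (my own quick check, rounded slightly differently than in the table):

```python
# Quick check of the relative bandwidth factors in the table above.
ddr4, pcie4, pcie3, hdd = 25_600, 7_000, 2_400, 190
for name, bw in [("DDR4-3200", ddr4), ("PCIe 4 NVMe", pcie4),
                 ("PCIe 3 NVMe", pcie3), ("HDD", hdd)]:
    print(f"{name}: 1/{ddr4 / bw:.1f} of DDR4 bandwidth, {bw / hdd:.0f}x the HDD")
```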
In logarithmic scale:
==
With that, I created a new draft schema (probably still with mistakes):
Another thing I need to decide is on which device Proxmox itself will be installed. I want to install it in a way that won't change the structure above.
Its documentation says it can be installed on an ext4, XFS, BTRFS, or ZFS partition, and that when using ext4 or XFS, LVM is used, which will take at least 4 GB of RAM. But it's unclear to me whether that means it can go on the SSD side by side with the LVM-Thin pool, or whether it would add another layer of abstraction before the LVM and degrade performance. For best results, should I just use a spare HDD for Proxmox?