[SOLVED] Performance comparison between ZFS and LVM

casalicomputers

Hi,

we are evaluating ZFS for our future Proxmox VE installations, instead of the currently used LVM.
ZFS looks very promising, with a lot of features, but we have doubts about the performance; our servers contain VMs with various databases and we need good performance to provide a fluid frontend experience.

From our understanding, ZFS needs direct access to the disks, so the server requires a controller that can pass the disks through transparently (HBA/IT mode), or no controller at all.
Instead of using the controller cache, it uses system memory to cache all requests, so we expect a greater memory requirement compared to LVM.

From a hardware point of view, we use servers with a PERC H775 controller (8 GB of cache), and for a ZFS installation we are considering servers with an HBA355i.

We would like to get an idea of what kind of performance we should expect between ZFS and LVM, especially about these parameters:
- I/O performance
- throughput performance
- backup and restore time

Thanks a lot for your attention and time,
Have a good day
 
Hey there!

I personally am a big fan of ZFS because its features actually allow you to get the most out of your hardware.

Here's a short, non-exhaustive list of why I would prefer ZFS:
  • ZFS will checksum each block of data that's on the disks it uses, protecting you from bitrot and telling you when errors occur. This is its biggest feature, in my opinion.
  • ZFS uses an Adaptive Replacement Cache in RAM, which is faster than regular LRU caches.
    • Data that's read often will therefore almost always be available more quickly, as it's already sitting in memory and doesn't have to be read from disk.
    • The data cached in the ARC is oftentimes also compressed, giving you a higher effective cache size.
    • The ARC is also used to cache writes - data is put into the ARC first and then written out in the most effective manner by ZFS at regular intervals.
  • Speaking of compression, ZFS can compress each block of data that it stores using an algorithm of your choice. During read and write operations, decompressing and compressing actually improves data throughput, because a lot of idle time during reads/writes is now actually being put to use. Turning on compression is therefore practically free.
  • You can adapt ZFS specifically to your workload, using:
    • A SLOG - a "separate log" device for the ZFS Intent Log - which will first store incoming sync writes on a disk of your choice (usually e.g. an NVMe mirror or something like Intel Optane) and then write the stored data asynchronously onto your main data store. This can, in some cases, double your effective data throughput over the network, where sync writes are common.
    • A "special device" - a disk (again, usually an NVMe mirror) which will store metadata about your ZFS pool and optionally also very small files. This may in some cases speed up your pool as well.
    • An L2ARC - the ARC mentioned above, but as a second-level cache on disk (you guessed it, usually an NVMe drive; unlike a SLOG it doesn't need to be mirrored, since losing it is harmless). If you really need lots of cache, e.g. in cases where a LOT of data is being read all the time and you've run out of available RAM slots, you can designate another drive to act as a level-2 ARC.
  • ZFS can create snapshots.
  • You can create so-called "datasets" in ZFS, on which you can set individual properties that differ from the rest of your pool.
    • For example, at home I have a dataset specifically for media (pictures, movies, etc.) and because I don't usually access that data that often, I increased the compression level on that dataset and also changed the compression algorithm from lz4 to zstd before I put all my media there. No need to create/unpack .tar.zst files manually and I get a higher effective storage size.
  • You can create so-called "zvols" in ZFS. These are block devices but run on ZFS in the background. You could for example put ext4 on a zvol.
    • Additionally, you can adjust the blocksize of zvols to tune them to your specific needs. Have a look here if you're interested in more information.
Also, as you already know from your other thread, PVE supports Storage Replication using ZFS. ;)
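
To make the dataset and zvol points above a bit more concrete, here's a minimal sketch. The pool name tank and all device names are made up, so adjust them to your setup:

    # Dataset with stronger compression for rarely-read media
    zfs create tank/media
    zfs set compression=zstd-7 tank/media

    # 100G zvol with a smaller block size, formatted with ext4
    zfs create -V 100G -o volblocksize=16k tank/vmdisk
    mkfs.ext4 /dev/zvol/tank/vmdisk

    # Add a mirrored SLOG and a mirrored special device to an existing pool
    zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
    zpool add tank special mirror /dev/nvme2n1 /dev/nvme3n1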

As a disclaimer though, I'll also have to mention: ZFS will grab all your "spare" RAM and use it for its ARC (50% of your total RAM, to be precise). If your other applications use more than 50% as well, ZFS will give it back (and take it again as soon as it becomes available). So, if possible, give your servers a lot of ECC RAM. The more the better. High RAM usage is therefore completely normal if you use ZFS and isn't something you need to worry about - but if the ARC is underutilized, you won't get many of ZFS's speedy benefits. Also, stuff like a SLOG, L2ARC, deduplication, etc. should really only be added/enabled in ZFS if you know your workloads need it and can benefit from it. If added/enabled haphazardly without evaluating your use cases first, they might even impact performance negatively (especially deduplication).

LVM is a solid alternative but serves a completely different goal: it's literally just a volume manager and doesn't come with many "snazzy" features (but that might be just what you're looking for). You can arrange your drives in any layout you want and then slap a filesystem on top that suits your needs. ZFS is both a filesystem and a volume manager, so the line between the two gets very blurry. Note that LVM cannot detect and correct errors in your data, though. At the same time, LVM is "leaner" and doesn't have much overhead.

Regarding performance: ZFS is notoriously hard to "really" benchmark, because the ARC will cache all reads and writes for you. If you have something like a SLOG because you have workloads with lots of sync writes, the throughput will probably be enormously different than if you didn't. Overall it's really hard to tell whether LVM or ZFS will actually be more performant for your workload by doing some napkin math. It's probably best if you just tried it out, honestly.
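
If you do decide to benchmark, something like the following fio run, executed once against an LVM-backed filesystem and once against a ZFS dataset, is a reasonable starting point. The path is a placeholder, and keep in mind that results on ZFS will still include ARC effects unless you restrict caching (e.g. by setting primarycache=metadata on a throwaway test dataset):

    # Random 4k read/write mix, roughly database-like
    fio --name=randrw --directory=/path/to/test --rw=randrw --rwmixread=70 \
        --bs=4k --size=8G --numjobs=4 --iodepth=16 --ioengine=libaio \
        --runtime=60 --time_based --group_reporting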

So as you can see, I like ZFS quite a lot. ;) I really only "shill" it so much because it's improved the performance of my homelab drastically and, more importantly, saved my a** one time. I had mistakenly installed faulty RAM in my server, which led to a couple of kernel panics until I figured out what I had done wrong. A lot of the data written to my ZFS pool was corrupted. Or so I had thought - after replacing my RAM and performing a zpool scrub (basically letting ZFS scan the entire pool for faulty data and correct it), ZFS was able to fix more than 5,000 faulty blocks of my data. The safety that ZFS provides is, in my opinion, an even bigger selling point than its potential performance benefits.
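
For reference, the scrub described above boils down to two commands (pool name tank assumed):

    zpool scrub tank        # scan the entire pool and repair what can be repaired
    zpool status -v tank    # shows scrub progress and lists any unrecoverable files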
 
You said ZFS uses 50% of the available RAM. Can we change it to less, if our needs allow it and extreme performance is not necessary?
Thanks!
 

Yes, of course!

@Azunai333 was a little faster than me ;)

By default, new PVE installations with ZFS will set the ARC to ~10% of your RAM by the way (which can be tweaked in the advanced disk settings during install), so there's a chance you won't have to lower it. The value of 50% is just the default of ZFS on Linux in general, but not for new PVE installations. Just wanted to note that.
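
In case you want to change the limit yourself later, it's a single module parameter. A small sketch, with 8 GiB as an arbitrary example value:

    # Runtime change, takes effect immediately
    echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

    # Persist across reboots
    echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
    update-initramfs -u -k all    # needed if your root filesystem is on ZFS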
 
Thanks for the update.
I have a very small-scale homelab: 4-core CPU, 32 GB SODIMM RAM (yes, it is a mini PC), an NVMe boot drive of 256 GB, and two HDDs for storage (500 GB and 1 TB). I do not need much performance but want full utilization of my drives. Should I choose LVM or ZFS for the boot installation? I believe LVM separates the boot drive and reduces the amount of data stored on it, so for full utilization should I choose ZFS? Any other suggestions accepted.

Thanks!
 
How much ... in one place ...
ZFS uses an Adaptive Replacement Cache in RAM, which is 3-4x slower than regular LRU caches --> test yourself !!!
A "special device" - a disk (again, usually an NVME mirror) which will store metadata about your ZFS pool and optionally also very small files. This may in some cases speed up your pool as well. --> Jom Salter famous ZFS admin: "I don’t think it’s worth it. I’ve tested the `SPECIAL` fairly thoroughly, if artificially, and saw little or no real benefit in practice, even when trying to do things like run a `find` against tens of thousands of files (which means lots and lots of metadata)."
https://discourse.practicalzfs.com/...ool-worth-it-for-backup-mass-storage-nas/1638
ZFS will grab all your "spare" RAM and use it for its ARC (50% of your total RAM, to be precise). If your other applications use more than 50% as well, ZFS will - NOT - give it back, AND your applications get memory allocation errors and die while ZFS keeps its cache contents - that's completely different from what the documentation lays down as a development goal, which has not been reached to this day --> test yourself !!!
Regarding performance: ZFS is notoriously SLOW when the data in use exceeds 2x the installed amount of memory - 2-5x slower for sequential r/w and up to >100x slower when benchmarking multiple metadata requests. That's not even surprising, as the checksum overhead for small I/Os isn't there for other filesystems, or even for plain LVM (without a filesystem).
Don't forget that ZFS is eating PVE OS SSDs and NVMes with its checksum I/O.
And don't forget the susceptibility to "cannot import pool" after a power outage.
Enough for now with these high-tech features, which become downsides when you want to enjoy the highlights :)
 
We support a fileserver with 24x 16TB disks in raidz2 across 4 vdevs, running ClamAV each day; it cannot saturate the cores, but it gives us I/O waits up to the heavens. Yeah, that's the ARC at its best ... aah ... worst, in reality. Sorry, but reality isn't as rosy as ZFS likes to paint it.
 
ZFS uses an Adaptive Replacement Cache in RAM, which is 3-4x slower than regular LRU caches --> test yourself !!!
Could you elaborate on how you tested this? Because the Wikipedia article you linked starts with the following sentence (emphasis mine):

Adaptive Replacement Cache (ARC) is a page replacement algorithm with better performance than LRU (least recently used).


A "special device" - a disk (again, usually an NVME mirror) which will store metadata about your ZFS pool and optionally also very small files. This may in some cases speed up your pool as well. --> Jom Salter famous ZFS admin: "I don’t think it’s worth it. I’ve tested the `SPECIAL` fairly thoroughly, if artificially, and saw little or no real benefit in practice, even when trying to do things like run a `find` against tens of thousands of files (which means lots and lots of metadata)."
Yes, this is why I said that it may speed up your pool as well, but it obviously isn't the only way to improve performance. I haven't elaborated on this because the post was already long enough; whether a special device is useful for one's pool or not is to be determined by the administrator.

ZFS will grab all your "spare" RAM and use it for its ARC (50% of your total RAM, to be precise). If your other applications use more than 50% as well, ZFS will - NOT - give it back, AND your applications get memory allocation errors and die while ZFS keeps its cache contents - that's completely different from what the documentation lays down as a development goal, which has not been reached to this day --> test yourself !!!
Yes, it will grab up to 50% of your RAM, but it will give it back. If it doesn't in your case, please demonstrate how to replicate this; I'm really curious about it. My PVE workstation has two zpools and I regularly run huge compile workloads while my VMs run in the background; I've never encountered a single allocation issue. Perhaps you're running into some kind of edge case? That would warrant further investigation.

Regarding performance: ZFS is notoriously SLOW when the data in use exceeds 2x the installed amount of memory - 2-5x slower for sequential r/w and up to >100x slower when benchmarking multiple metadata requests. That's not even surprising, as the checksum overhead for small I/Os isn't there for other filesystems, or even for plain LVM (without a filesystem).
Could you elaborate on this? In which workloads / scenarios does this show up?

Don't forget that ZFS is eating PVE OS SSDs and NVMes with its checksum I/O.
And don't forget the susceptibility to "cannot import pool" after a power outage.
That's why we usually recommend using SSDs with power loss protection; if you have any serious workloads, you should have some kind of power loss protection anyway, preferably via a UPS even. (And also a backup.)

The SSD wearout will depend a lot on your use case; of course lots of small writes will cause faster wearout, but even in my case (using PVE as a workstation, spinning up and down lots of VMs with various workloads, doing frequent checkouts in git, etc.) my SSDs have 1% and 0% in their "Percentage Used" SMART value. And that's after one year of usage.
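
If anyone wants to check their own drives, smartmontools will show that value (the device name here is just an example):

    apt install smartmontools
    smartctl -a /dev/nvme0    # look for "Percentage Used" in the NVMe health section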

We support a fileserver with 24x 16TB disks in raidz2 across 4 vdevs, running ClamAV each day; it cannot saturate the cores, but it gives us I/O waits up to the heavens. Yeah, that's the ARC at its best ... aah ... worst, in reality. Sorry, but reality isn't as rosy as ZFS likes to paint it.
Would you mind posting the output of zpool status and zpool list here and also elaborate on your workload? I might be able to help a little.
 
Test the ARC easily, e.g. with: cd /usr; tar cf /ext4-or-xfs-mount/testfile.tar * /etc; time dd if=/ext4-or-xfs-mount/testfile.tar of=/dev/null bs=32k
and: cd /usr; tar cf /zfs-mount/testfile.tar * /etc; time dd if=/zfs-mount/testfile.tar of=/dev/null bs=32k

Unluckily, I haven't had a ZFS fileserver with a special device in my hands until now to test for myself.

We had a production fileserver with ClamAV running each day, and if you don't echo a "limit" > zfs_arc_max before starting, ClamAV endlessly gets allocation errors, so we must ensure those 64x 1.3 GB of memory stay free out of the installed RAM.

Likewise, that production fileserver is notoriously slow, as all files go through the ARC daily, so it's the best example of bad ARC file handling, where data ends up mostly uncached.

Yes, enterprise SSDs/NVMes are what's really needed for ZFS, but as you can see here, there are so many Proxmox home and small-budget users who use consumer SSDs/NVMes, which are definitely eaten by ZFS - and they aren't warned about it enough beforehand. Yes, it's said in endless threads, but mostly people just try it first and later wonder about their failing drives - buying cheap means buying twice.

It's a BeeGFS daily-backup and archive-data server, so it's mostly without any user contact: raidz2 with the default 128k recordsize, 349 TB total, 120 TB allocated, 13% fragmentation, 34% capacity, no dedup, lz4 on, xattr=sa, still atime=on, ashift=13. About one more year from today and it goes into retirement or recycling. :)
 
Test the ARC easily, e.g. with: cd /usr; tar cf /ext4-or-xfs-mount/testfile.tar * /etc; time dd if=/ext4-or-xfs-mount/testfile.tar of=/dev/null bs=32k
and: cd /usr; tar cf /zfs-mount/testfile.tar * /etc; time dd if=/zfs-mount/testfile.tar of=/dev/null bs=32k

That's an interesting way to test it, but keep the following things in mind:
  • If there is any other I/O going on in your system, it might affect how both caches behave.
    • The best way to ensure that the ARC is definitely cleared is to export your pool first and then unload the ZFS kernel module, then load the module and import the pool again (see the sketch after this list). This is obviously not possible (or a wise thing to do) if you're using ZFS on root.
  • Note that you won't necessarily "benchmark" the ARC that way: As you yourself know, ZFS does a lot of extra things behind the curtains, e.g. (de-)compression etc. This will greatly affect your results.
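
The export/reload dance mentioned in the list would look roughly like this - pool name tank assumed, and obviously don't try this on a pool your running system depends on:

    zpool export tank    # flush and detach the pool, dropping its data from the ARC
    modprobe -r zfs      # unload the ZFS kernel module (fails if it's still in use)
    modprobe zfs
    zpool import tank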
It is overall notoriously hard to accurately benchmark filesystem caches, because of the reasons above. I could elaborate much more, but this isn't really actually the point I want to make. Becaaaause ...

We had a production fileserver with ClamAV running each day, and if you don't echo a "limit" > zfs_arc_max before starting, ClamAV endlessly gets allocation errors, so we must ensure those 64x 1.3 GB of memory stay free out of the installed RAM.

Likewise, that production fileserver is notoriously slow, as all files go through the ARC daily, so it's the best example of bad ARC file handling, where data ends up mostly uncached.

I think this here is where some misunderstandings are coming from. The ARC exists for the same purpose as LRU caches in other filesystems: It's there to make reads faster (obviously). The difference between the ARC and an LRU cache is that the ARC has a higher cache hit rate, because it does additional tracking. In fact, the ARC consists of four LRU caches itself -- to quote Wikipedia again:
  1. T1, for recent cache entries.
  2. T2, for frequent entries, referenced at least twice.
  3. B1, ghost entries recently evicted from the T1 cache, but are still tracked.
  4. B2, similar ghost entries, but evicted from T2.
Without going into too much detail, these four internal LRUs let the ARC have a higher hit rate than a simple LRU cache, because it isn't as vulnerable to cache flushes.

For example, let's say you have two directories A and B with a lot of files. At first, you work mostly with the files from directory A, then you have to work with the files from directory B for a short amount of time, and then you switch back to directory A again.

If your filesystem has an LRU cache, it's possible that all files of directory A get evicted from the cache while you work in directory B for a short moment, which means you'll need to read from disk again when switching back.

If your filesystem has an ARC, it's much more likely that the files of directory A remain in the cache, because the ARC also keeps track of files that were recently evicted.

This is the main benefit of the ARC.

Of course, if you have mostly random reads throughout your fileserver, there's little benefit to a cache -- neither an ARC nor an LRU cache makes a difference here, nor does any other kind of cache.


So, what I believe is the actual issue with your setup is something else entirely. In an earlier post, you revealed a little more about your zpool:
We support a fileserver with 24x 16TB disks in raidz2 across 4 vdevs, running ClamAV each day; it cannot saturate the cores, but it gives us I/O waits up to the heavens [...]
This is not an issue with the ARC, but rather with IOPS, which is why you're getting I/O waits. Regarding IOPS, there are a couple things to keep in mind:
  • RAID-Z vdevs will each be limited to the IOPS of the slowest drive in the vdev.
  • Mirror vdevs are not limited by this -- read IOPS scale with the number of drives in the mirror vdev.
Because you have four RAID-Z2 vdevs, you essentially have the IOPS of only four disks. This has nothing to do with the ARC, you see -- the ARC can only help you if it can actually cache things, otherwise there will be no performance gain -- and if it cannot cache things, you most likely won't see any performance loss at all.

My personal recommendation would be to revise your pool's geometry. I like this site a lot personally for planning things (but unfortunately it doesn't show any information on IOPS): https://jro.io/capacity/

In summary, if you want to increase your IOPS, you need more vdevs, with fewer disks per vdev. This means that a "classic" RAID-10 setup (a bunch of striped 2-way mirror vdevs) will give you the most IOPS while still providing redundancy. (Just striping over all disks would give you the highest IOPS, but... your pool will die as soon as a single disk dies.)
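
As a sketch of the difference (pool name and device names are placeholders): your current 4x 6-disk RAID-Z2 layout gives you roughly 4 disks' worth of IOPS and 16 disks' worth of capacity, while the same 24 disks as 12 striped 2-way mirrors would give you roughly 12 disks' worth of write IOPS and 12 disks' worth of capacity:

    zpool create tank \
      mirror /dev/sda /dev/sdb \
      mirror /dev/sdc /dev/sdd \
      mirror /dev/sde /dev/sdf
    # ... and so on, up to 12 mirror vdevs for all 24 disks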

I think this rather shows that it's hard to plan ahead with ZFS when you have larger storage setups -- it requires a lot of knowledge to get things right. But at the same time, you have the power to optimise your storage as much as possible, so I personally will still remain a big fan of ZFS ;)


Yes, enterprise ssd's/nvme's are best needed for zfs but as you see here are so many proxmox home and small budget users which use consumer ssd's/nvme's which are definitive eaten by zfs which isn't warned enough for before. Yes it's said in endless threads but mostly people just try before and later are wondering about failing drives - buy cheap comes to buy twice.

Yeah I agree with you, there are unfortunately some really bad SSDs out there, but there's only so much that we can do. Chasing after terrible SSDs is a cat-and-mouse game. Even if you think you know all bad products, more cheap stuff follows, unfortunately. I personally always recommend used enterprise/datacenter SSDs to homelabbers if they can buy them somewhere, because even with a bit of wearout they will survive quite a long time in the average homelab environment.

Either way, that's beside the main topic here -- I hope my tips above may help you with your pool. To me it looks like you'll have to rebuild it if you want more IOPS, unfortunately. Or, you can always add more vdevs, but I don't know if that's an option in your case -- you already have quite a lot of disks. Good luck! And let me know if I can help with anything else regarding your pool.
 
Ok so I will throw my knickers in the ring as well ;)

In the interest of full disclosure:
- I'm a ZFS fanboy (at least since it became available on Linux and fixed some of my btrfs headaches).
- I mostly use ZFS & Ceph, then LVM if the need for it arises.

ZFS is a filesystem that will allow you to go crazy with your storage, do some really crazy data management, and squeeze every ounce of performance out of multi-disk storage, while giving you a level of reliability that three-letter agencies could only dream about 30-40 years ago - BUT that reliability sometimes comes at a cost. If you want to run containers, it's the best choice. For VMs you can hit some performance bottlenecks, because if you have, for example, a heavily utilised database, you essentially go: database layer -> VM FS -> hypervisor block storage emulation -> host FS (here ZFS) -> your storage. There is no way you will get the same performance as on bare metal, so you need to test your solution before pushing it to production.

Ceph - it's great if you have lots of nodes and you want your VMs to be reliable, in the sense of shifting between nodes within a second (even on power loss), but that comes at the price of a performance loss, because the second your data needs to hit the network you will always have latencies larger than any local FS would create. Solution: don't use it for CTs/VMs that require low latency.

LVM - well ... it has near zero funky features, but what it gives you is near RAW-PASSTHROUGH storage speed for your VM. I don't see the point of using it for containers (and don't even know if you could), since containers can use the host FS without anything in between, so they behave the same as apps on the bare-metal host. So, bottom line: if you really, really need that blazing speed of your PCI-E 8 NVMe that you had to sell your car to be able to afford, don't hinder yourself with anything in between and use LVM. I use LVM where I really want performance and don't want to splash out on a mega server to compensate for the performance loss of the ZFS intermediary layer.

PS.
RAW PASSTHROUGH - there are still some skeletons in the closet with QEMU, so I just went with LVM instead and haven't had many issues (well, not as bad as losing all my data in production during the Christmas period). But if you don't care about losing your VM (dunno, maybe you just run continuous compile and integration testing and can easily restart it from the template) - go for RAW.
 
Yeah I agree with you, there are unfortunately some really bad SSDs out there, but there's only so much that we can do. Chasing after terrible SSDs is a cat-and-mouse game.

There's nothing in your wiki or e.g. the installer to warn against choosing ZFS for low-TBW drives, at the least.

I personally always recommend used enterprise/datacenter SSDs to homelabbers if they can buy them somewhere, because even with a bit of wearout they will survive quite a long time in the average homelab environment.

It's not only cost-prohibitive; there's almost no selection for a homelab (think e.g. a mini PC) that would be meaningful, with both PLP and high TBW at the same time.

I only know of 2 in 2280 size:

- Micron 7450 (the older 7400 is not sold anymore) - this one ends at 1T capacity for the 2280 form factor;
- Kingston DC2000B (the older DC1000B is not sold anymore) - this one also ends at 1T.

The Micron at 1T has 1,700 TBW and the Kingston at 1T only 700 TBW. Compare that with e.g. the homelab-targeting WD SN700 at 1T, which is 2,000 TBW, albeit without PLP, at ~30% of the Micron's price.

The Kingston will not even fit in many cases, because it already has a (controller-only) heatsink on it.

So the recommendation [1] leaves the average homelab user in a helpless position - that is, if they understood it or discovered it in the first place.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#install_recommended_requirements
 
LVM - well ... it has near zero funky features, but what it gives you is near RAW-PASSTHROUGH storage speed for your VM. I don't see the point of using it for containers (and don't even know if you could), since containers can use the host FS without anything in between, so they behave the same as apps on the bare-metal host. So, bottom line: if you really, really need that blazing speed of your PCI-E 8 NVMe that you had to sell your car to be able to afford, don't hinder yourself with anything in between and use LVM. I use LVM where I really want performance and don't want to splash out on a mega server to compensate for the performance loss of the ZFS intermediary layer.

PS.
RAW PASSTHROUGH - there are still some skeletons in the closet with QEMU, so I just went with LVM instead and haven't had many issues (well, not as bad as losing all my data in production during the Christmas period). But if you don't care about losing your VM (dunno, maybe you just run continuous compile and integration testing and can easily restart it from the template) - go for RAW.

I quote it in full above (highlighting mine) to give the context, but I still do not understand the comparison; you could instead, however, compare with what e.g. Red Hat has been suggesting for a while (they do not do ZFS, obviously):

https://docs.redhat.com/fr/document...dm-integrity_configuring-raid-logical-volumes
 
That's a DRBD & MDRAID (lately) thread; what does that have to do with QEMU?
Yeah, passthrough direct-I/O devices ... magically becoming MDRAID. Please read the thread from the beginning, not the tail end where some peeps hijacked it. The bottom line is that there are esoteric bugs, and if people want to use this, they need to be aware of the possibility of having all their data corrupted, since even direct I/O is not as straightforward as it seems. Personally, if somebody asked me about RAW performance and had cash to spare, I would advise them to pass through a PCIe HBA and take it from there.
 
