[SOLVED] Performance comparison between ZFS and LVM

I quote it in full above (highlighting mine) to give the context, but I still do not understand the comparison. You could, however, compare it with what Red Hat has been suggesting for a while now (they do not do ZFS, obviously):

https://docs.redhat.com/fr/document...dm-integrity_configuring-raid-logical-volumes
I think you're trying to split hairs here. LVM has more features than raw but not as many as ZFS. LVM lets you put several VMs on a single storage device, whereas device passthrough means one device per VM. From my (and a few other admins') experience, LVM seems to be more reliable than raw disk passthrough, with only a tiny speed penalty.
 
Yeah, passthrough direct IO devices ... magically becoming mdraid. Please read the thread from the beginning.

You are right, I stand corrected. I just read through it, filtering out the noise, and kept to the exchange between you and the staff. This has absolutely nothing to do with MDRAID. I apologise for adding to it recently; the original diversion happened in #37. Anyhow, I should have just started a new thread and referenced it.

But now you got me interested, I am looking at #36 where the mitigation was suggested - basically default to scsi-hd.

But that also means that this is still true:
default to scsi-hd (which is not full pass-through) instead of scsi-block for pass-through, with the possibility to "opt-in" to the old behaviour with all the associated risk (until further notice)

So the old behaviour is still opt-in. And there has been NO FIX whatsoever:
https://qemu-devel.nongnu.narkive.com/7NrZkgBQ/data-corruption-in-qemu-2-7-1#post12
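For reference, the difference between the two device types at the plain QEMU level looks roughly like this - a minimal sketch with placeholder drive IDs and device paths, not the exact arguments Proxmox generates:

Code:
# scsi-hd: QEMU emulates a SCSI disk on top of the raw block device (the mitigated default)
qemu-system-x86_64 \
  -device virtio-scsi-pci,id=scsihw0 \
  -drive file=/dev/sdb,if=none,id=drive0,format=raw \
  -device scsi-hd,drive=drive0,bus=scsihw0.0

# scsi-block: the guest's SCSI commands go to the device itself (the old, opt-in behaviour)
qemu-system-x86_64 \
  -device virtio-scsi-pci,id=scsihw0 \
  -drive file=/dev/sdb,if=none,id=drive0,format=raw \
  -device scsi-block,drive=drive0,bus=scsihw0.0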

Do you mind if I ask in the original thread what happened with this?
 
Do you mind if I ask in the original thread what happened with this?
Well, the story was pretty basic. I was one of the first to actually perform the Proxmox update because it was during Christmas, and about 24 * 8TB drives' worth of data went to hell (as far as I remember the size of the shelf). I was pretty "displeased" (and maybe too harsh in the thread), but I was trying to find the culprit and spent too much time trying to figure out what was going wrong. After I noticed more and more people coming in with similar issues, I began to doubt that this was my misconfiguration and started to suspect the problem was systemic. Then I think I noticed a similar issue on a non-Proxmox forum relating to QEMU and raw passthrough to the VM - at that point I said F* that and shifted the whole storage to Ceph for that specific VM. A few years back (dunno the exact year, but it was before the pandemic) at one corp an admin floated to me the idea of passing raw disks to a VM for performance - I gave him a stern warning. He came back to me after some time saying that he had created a test VM with that sort of setup and it ate his data, so I would say it was not resolved back then. After that I've not used raw passthrough myself or for business applications. However, with LVM I had no major issues.
Another trick that I use: if I really, really need that sweet raw performance goodness and my stuff can run in a container, I will just create whatever array I need and pass it to the container through a mount point.
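For anyone who wants to replicate that container trick, the gist is just a bind mount point from the host into the container - a sketch under assumed names; the pool, paths and CT ID are placeholders:

Code:
# build whatever array/filesystem you need on the host, e.g. a ZFS mirror
zpool create tank mirror /dev/sdb /dev/sdc
zfs create -o mountpoint=/mnt/tank/media tank/media

# bind-mount it into an existing LXC container (CT 101 is a placeholder)
pct set 101 -mp0 /mnt/tank/media,mp=/data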

But hell - ask away, maybe they will tell us all is good in raw passthrough land. I'm not going to test it.
 
Well, the story was pretty basic. I was one of the first to actually perform the Proxmox update because it was during Christmas, and about 24 * 8TB drives' worth of data went to hell (as far as I remember the size of the shelf). I was pretty "displeased" (and maybe too harsh in the thread), but I was trying to find the culprit and spent too much time trying to figure out what was going wrong.

As you describe it, that's frustrating enough.

After I noticed more and more people coming in with similar issues, I began to doubt that this was my misconfiguration and started to suspect the problem was systemic. Then I think I noticed a similar issue on a non-Proxmox forum relating to QEMU and raw passthrough to the VM - at that point I said F* that and shifted the whole storage to Ceph for that specific VM.

I understand, but based on the mitigation back then, this should not be occurring (even at the time) if scsi-hd is used (as opposed to scsi-block). So it's not really storage related, it's a QEMU parameter issue.

A few years back (dunno the exact year, but it was before the pandemic) at one corp an admin floated to me the idea of passing raw disks to a VM for performance - I gave him a stern warning. He came back to me after some time saying that he had created a test VM with that sort of setup and it ate his data, so I would say it was not resolved back then.

So this is raising eyebrows because I suppose he would have used defaults, which by then would have been scsi-hd.

After that I've not used raw passthrough myself or for business applications. However, with LVM I had no major issues.
Another trick that I use: if I really, really need that sweet raw performance goodness and my stuff can run in a container, I will just create whatever array I need and pass it to the container through a mount point.

I get that takeaway, I'd just like to know the core of the issue. I could not find any further changes on the QEMU side with respect to scsi-block, and with the anecdotal evidence you mention it also makes me wonder about scsi-hd - but that should not be an issue, logically speaking, because it's not full passthrough.

But hell - ask away, maybe they will tell us all is good in raw passthrough land. I'm not going to test it.

Maybe I'll test it first, then ask. There's even a test case in the mailing list.

Thanks!
 
I understand, but based on the mitigation back then, this should not be occurring (even at the time) if scsi-hd is used (as opposed to scsi-block). So it's not really storage related, it's a QEMU parameter issue.
Yeah, the mitigation was suggested a few months later; I guess you can imagine what the business will say if you suggest "let's just sit on our hands until somebody graciously gives us a way of fixing that". I'm not knocking the Proxmox chaps - their product is based on open-source projects, so they are not 100% in control of other people's timetables either.
So this is raising eyebrows because I suppose he would have used defaults, which by then would have been scsi-hd.
Well, no. The whole discussion with him started from "hey, do you know how to pass the devices raw?", so he would not have used the defaults. Let's not kid ourselves, raw passthrough is and was far from the default ;)
Maybe I'll test it first, then ask. There's even a test case in the mailing list.
Or maybe you could test what I've reported :D Just use the disk connections I listed in the original posts and start dumping data into them. Dunno, maybe you've got a collection of pr0n that you can hash, and later you can read it back through the passed disks and check the hashes? Just be warned that the corruption would not happen instantaneously. Btrfs would fail spectacularly because it doesn't have some of ZFS's safety features, but I had problems with ext4 as well. My luck was that this specific VM was picking up footage from multiple sources and dumping it into a massive array, so writes were pretty random and large (this was also the reason to use raw passthrough, since anything else would collapse under the amount of random IO to spinning rust).
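If anyone does want to reproduce that kind of test, the basic hash-and-verify loop could look something like the sketch below; paths and sizes are placeholders, and as noted above the corruption may take a long time to show up:

Code:
# inside the VM: write a batch of large files to the passed-through disk and record their checksums
mkdir -p /mnt/passed/testdata
for i in $(seq 1 100); do
    dd if=/dev/urandom of=/mnt/passed/testdata/file$i bs=1M count=1024 status=none
done
sha256sum /mnt/passed/testdata/file* > /root/checksums.txt

# keep the VM busy with more writes for a while, then re-read and list any mismatches
sha256sum -c /root/checksums.txt | grep -v ': OK$'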

BUT:
maybe let's not hijack this thread as well - dude was asking about ZFS vs LVM.
 
Hello everyone,
we've finally decided to test internally a 2-node PVE cluster installation with ZFS and replication,
and VM performance is considerably slower than what we've gotten used to with hardware RAID and LVM.

Our test installation is meant to replace an old single-node installation running on a Dell PowerEdge R730
with a PERC H730 Mini controller and a RAID-5 of three Samsung 1TB SSDs.

The new cluster nodes are two Dell PowerEdge R440 with 128GB of RAM and a mirror zpool running on these Kingston 2TB SSDs.

On the nodes we run a bunch of services, but the most critical of all is an Oracle database that runs on Oracle Linux with a 4K-blocksize ext4 disk.
We noticed a substantial performance decrease during high-IOPS tasks after we moved the VM to the ZFS nodes;
to put it in perspective, the same VM takes 15-30% more time to complete the same tasks on ZFS compared to when it was running on LVM.
After running some tests with fio and directly performing tasks on the DB, with the same VM running on both ZFS and RAID+LVM,
I saw that the issue is mainly in random read-write workloads; sequential read and write performance is mostly the same.
The fio commands I used to run the tests are the same ones used in this post.

After reading various posts on this forum and other websites, these are the parameters that can have an effect on performance;
so far I have tried changing:

recordsize=8K, set to 8K because that's what Oracle DB uses as its blocksize.
atime=off, most sources cited that turning access times off can lead to a slight improvement in IO.
sync=disabled, I tried setting this parameter both on and off and did not notice any meaningful improvement;
also, since I do not have a SLOG device, I'd rather stay on the safe side and keep this on.
zfs_arc_max=51539607552, currently set to 48 GiB.

ashift=12, I left the default setting because I read that most SSDs have a page size of 4K;
however, this website says that the page size of the Kingston DC600M is 16KB.
Do you think that creating the zpool with ashift=14 could lead to any sort of improvement?

volblocksize=16K, the OpenZFS documentation states that tuning this can help with random workloads.
I did not change this parameter from the default 16KB; right now it's the same as the supposed page size of my SSDs.
Would setting this to the blocksize used by the DB (8K) or the filesystem (4K) make any sort of difference?
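For completeness, the settings listed above map to commands roughly like the following - a sketch using the pool and disk names from this post; note that ashift can only be chosen at pool creation and volblocksize only when a zvol is created:

Code:
# dataset-level tunables (recordsize affects datasets; zvols use volblocksize instead)
zfs set recordsize=8K zfs-2tb
zfs set atime=off zfs-2tb
zfs set sync=standard zfs-2tb

# ARC cap (51539607552 bytes = 48 GiB); persist it via /etc/modprobe.d/zfs.conf
echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_max

# ashift is fixed at pool creation: 12 = 4K, 13 = 8K, 14 = 16K sectors
zpool create -o ashift=14 zfs-2tb mirror /dev/sdX /dev/sdY

# volblocksize is fixed when the zvol is created (Proxmox uses the storage's "Block Size" setting)
zfs create -V 128G -o volblocksize=16K zfs-2tb/vm-218-disk-0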

Also, one thing I noticed is that there is a big difference in performance depending on whether I run the same fio command from inside the VM or directly on the host,
which makes me think that maybe the issue is not ZFS in itself but somewhere between the VM's VFS and the VM config itself,
which is this one: scsi2: zfs-2tb:vm-218-disk-0,cache=writeback,discard=on,iothread=1,size=128G,ssd=1
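To give a rough idea of the kind of random read/write run involved (a generic sketch, not necessarily the exact parameters from the linked post; the test file path is a placeholder), executed once on the host against the dataset and once inside the guest:

Code:
# 70/30 random 8K read/write mix with direct IO, bypassing the page cache
fio --name=randrw --filename=/path/to/testfile --size=8G \
    --rw=randrw --rwmixread=70 --bs=8k --ioengine=libaio \
    --iodepth=32 --numjobs=4 --direct=1 \
    --runtime=60 --time_based --group_reporting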

These are the properties of the zpool on which the vm is running.
Code:
root@pve02:~# zfs get all zfs-2tb
NAME     PROPERTY              VALUE                  SOURCE
zfs-2tb  type                  filesystem             -
zfs-2tb  creation              Wed Jan 22 11:41 2025  -
zfs-2tb  used                  1.35T                  -
zfs-2tb  available             336G                   -
zfs-2tb  referenced            4.04G                  -
zfs-2tb  compressratio         1.92x                  -
zfs-2tb  mounted               yes                    -
zfs-2tb  quota                 none                   default
zfs-2tb  reservation           none                   default
zfs-2tb  recordsize            8K                     local
zfs-2tb  mountpoint            /zfs-2tb               default
zfs-2tb  sharenfs              off                    default
zfs-2tb  checksum              on                     default
zfs-2tb  compression           on                     local
zfs-2tb  atime                 off                    local
zfs-2tb  devices               on                     default
zfs-2tb  exec                  on                     default
zfs-2tb  setuid                on                     default
zfs-2tb  readonly              off                    default
zfs-2tb  zoned                 off                    default
zfs-2tb  snapdir               hidden                 default
zfs-2tb  aclmode               discard                default
zfs-2tb  aclinherit            restricted             default
zfs-2tb  createtxg             1                      -
zfs-2tb  canmount              on                     default
zfs-2tb  xattr                 on                     default
zfs-2tb  copies                1                      default
zfs-2tb  version               5                      -
zfs-2tb  utf8only              off                    -
zfs-2tb  normalization         none                   -
zfs-2tb  casesensitivity       sensitive              -
zfs-2tb  vscan                 off                    default
zfs-2tb  nbmand                off                    default
zfs-2tb  sharesmb              off                    default
zfs-2tb  refquota              none                   default
zfs-2tb  refreservation        none                   default
zfs-2tb  guid                  7036852244761846984    -
zfs-2tb  primarycache          all                    default
zfs-2tb  secondarycache        all                    default
zfs-2tb  usedbysnapshots       0B                     -
zfs-2tb  usedbydataset         4.04G                  -
zfs-2tb  usedbychildren        1.35T                  -
zfs-2tb  usedbyrefreservation  0B                     -
zfs-2tb  logbias               latency                default
zfs-2tb  objsetid              54                     -
zfs-2tb  dedup                 off                    default
zfs-2tb  mlslabel              none                   default
zfs-2tb  sync                  standard               local
zfs-2tb  dnodesize             legacy                 default
zfs-2tb  refcompressratio      1.00x                  -
zfs-2tb  written               4.04G                  -
zfs-2tb  logicalused           821G                   -
zfs-2tb  logicalreferenced     4.02G                  -
zfs-2tb  volmode               default                default
zfs-2tb  filesystem_limit      none                   default
zfs-2tb  snapshot_limit        none                   default
zfs-2tb  filesystem_count      none                   default
zfs-2tb  snapshot_count        none                   default
zfs-2tb  snapdev               hidden                 default
zfs-2tb  acltype               off                    default
zfs-2tb  context               none                   default
zfs-2tb  fscontext             none                   default
zfs-2tb  defcontext            none                   default
zfs-2tb  rootcontext           none                   default
zfs-2tb  relatime              on                     default
zfs-2tb  redundant_metadata    all                    default
zfs-2tb  overlay               on                     default
zfs-2tb  encryption            off                    default
zfs-2tb  keylocation           none                   default
zfs-2tb  keyformat             none                   default
zfs-2tb  pbkdf2iters           0                      default
zfs-2tb  special_small_blocks  0                      default
zfs-2tb  prefetch              all                    default

I'll attach the output of the fio tests below, as posting them in code blocks here would probably make the post too long.
Maybe some of you can either point me in the right direction or at least confirm that with this setup the current results are somewhat expected.
 

Attachments

OK, so maybe I will chip in some detail from the "embedded side of computing". The situation with SSDs is VERY VERY VERY complicated.
NAND flash uses very large sectors; I've seen NAND chips with sector sizes ranging between 8K and 128K.
What does that mean?
Well, the process of writing to flash is that you erase the sector - which sets all bits to "1" - and when you write data you simply set the appropriate bits to 0. Simple, right? Yup. Theoretically, if you want to modify data, you simply lift the whole sector into RAM -> erase the block -> write out the new data. However, the erase process is LONG, and it wears out the actual flash. In the embedded world you use things like JFFS2, which just keeps appending data, treating the filesystem and its sectors as one long log. Every once in a while the FS driver walks in, finds parts of the log that no longer represent valid data (or are only half filled with valid data), rewrites that data into empty space and marks those sectors for slow garbage-collector erasure.
But what does that mean to me, since I run a PC and not some puny embedded platform?
Well, you are in luck, since all modern (10 years or younger) SSDs implement that sort of strategy on their own: they show you fake (logical) sectors and keep writing modifications to new empty space, quietly doing the trimming and erasing for you - hence you need a "controller" on the SSD. The problem arises when your FS sector size is different from the actual sector size of the SSD/NVMe disk. Copy-on-write filesystems are supposed to mitigate this by just copying modified data to free space - but the SSD lies to you anyway, so you would be writing to free space anyway.

Now imagine a situation where an SSD is being punched with lots of random IO. The disk will cope just fine with this, but after a short while (milliseconds) it starts walking its log to figure out which data to truncate and which sectors to erase - and this causes an internal IO storm. This is why all PC SSDs have the problem of far slower random 4K IO versus sequential. Everybody here knows that random 4K can be 100x slower than sequential on NVMe.

In the embedded world - where your CPU is connected directly to the raw flash - the difference between sequential and random 4K on JFFS2 is maybe 5x, because changing the address you write out-of-sequence data to gives maybe a 1.5x slowdown; the rest is down to the filesystem walking its logs for trimming. BUT the filesystem and the flash trim control are one and the same thing, which eliminates the IO storm - the FS knows what is written and what is about to be written, so it can schedule accordingly; a dumb SSD connected to a host has none of that luxury. Still, not all is great in JFFS2 land either - walking the log on a very fragmented FS can be very slow, and reading a whole file can sometimes give a solid slowdown. So there is no silver bullet.

The best way, theoretically, is to set the ZFS sector size to the same size as the underlying SSD's - unless the manufacturer is lying about that as well - then ZFS should work in sync with the SSD controller and diminish the internal IO storms. Using NVMe with ample RAM is beneficial, as is enterprise-grade NVMe. A large ARC helps as well, as do superfast LOG devices.
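Checking what the drive itself reports is straightforward, although most SSDs only expose their logical/physical sector sizes, not the internal NAND page size - a sketch, device and pool names are placeholders:

Code:
# SATA/SAS SSDs: reported logical and physical sector sizes
smartctl -i /dev/sda | grep -i 'sector size'

# NVMe: list the supported LBA formats and which one is in use
nvme id-ns -H /dev/nvme0n1 | grep -i 'lba format'

# what ashift ZFS actually chose for an existing pool
zdb -C zfs-2tb | grep ashift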

Now ... VMs ... I honestly don't know; current QEMU is so complicated that my only bet would be to use as raw an FS as possible, with matching sectors and NVMe emulation. Using preallocated disk images should help as well, but at the moment I'm at a loss with those.
 
On the nodes we run a bunch of services, but the most critical of all is an Oracle database that runs on Oracle Linux with a 4K-blocksize ext4 disk.
Yes, that's a very bad idea. Why don't you switch to ASM on a block device? That is much better performance-wise in comparison to the 4K blocksize of ext4.
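If you want to double-check the premise first, the guest's actual ext4 block size can be read straight from the superblock - a sketch, the device name inside the guest is a placeholder:

Code:
# inside the Oracle Linux guest: confirm the ext4 block size on the DB volume
tune2fs -l /dev/sdb1 | grep -i 'block size'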