[SOLVED] Performance comparison between ZFS and LVM

I quote it in full above (highlighting mine) to give the context, but I still do not understand the comparison. You could, however, compare it with what Red Hat has been suggesting for a while now (they do not do ZFS, obviously):

https://docs.redhat.com/fr/document...dm-integrity_configuring-raid-logical-volumes
I think you're trying to split hairs here. LVM has more features than raw but not as many as ZFS. LVM lets you put several VMs on a single storage device, whereas device passthrough means one device per VM. From my (and a few other admins') experience, LVM seems to be more reliable than raw disk passthrough, with only a tiny speed penalty.
 
Yeah, passthrough direct IO devices ... magically becoming mdraid. Please read the thread from the beginning.

You are right, I stand corrected. I just read through it, filtering out the noise, and kept to the exchange between you and the staff. This has absolutely nothing to do with MDRAID. I apologise for adding to it recently; the original diversion happened in #37. Anyhow, I should have just started a new thread and referenced it.

But now you got me interested, I am looking at #36 where the mitigation was suggested - basically default to scsi-hd.

But that also means that this is still true:
default to scsi-hd (which is not full pass-through) instead of scsi-block for pass-through, with the possibility to "opt-in" to the old behaviour with all the associated risk (until further notice)

So the old behaviour is still opt-in. And there has been NO FIX whatsoever:
https://qemu-devel.nongnu.narkive.com/7NrZkgBQ/data-corruption-in-qemu-2-7-1#post12
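For reference, the difference between the two device types at the plain QEMU level looks roughly like this - a minimal sketch with placeholder drive IDs and device paths, not the exact arguments Proxmox generates:

Code:
# scsi-hd: QEMU emulates a SCSI disk on top of the raw block device (the mitigated default)
qemu-system-x86_64 \
  -device virtio-scsi-pci,id=scsihw0 \
  -drive file=/dev/sdb,if=none,id=drive0,format=raw \
  -device scsi-hd,drive=drive0,bus=scsihw0.0

# scsi-block: the guest's SCSI commands go to the device itself (the old, opt-in behaviour)
qemu-system-x86_64 \
  -device virtio-scsi-pci,id=scsihw0 \
  -drive file=/dev/sdb,if=none,id=drive0,format=raw \
  -device scsi-block,drive=drive0,bus=scsihw0.0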

Do you mind if I ask in the original thread what happened with this?
 
Do you mind if I ask in the original thread what happened with this?
Well, the story was pretty basic. I was one of the first to actually perform the Proxmox update because it was during Christmas, and about 24 * 8TB drives' worth of data went to hell (as far as I remember the size of the shelf). I was pretty "displeased" (and maybe too harsh in the thread), but I was trying to find the culprit and spent too much time trying to figure out what was going wrong. After I noticed more and more people coming in with similar issues, I began to doubt that this was my misconfiguration and started to suspect the problem was systemic. Then I think I noticed a similar issue on a non-Proxmox forum relating to QEMU and raw passthrough to the VM - at that point I said F* that and shifted the whole storage to Ceph for that specific VM. A few years back (dunno the exact year, but it was before the pandemic) at one corp an admin floated to me the idea of passing raw disks to a VM for performance - I gave him a stern warning. He came back to me after some time saying that he had created a test VM with that sort of setup and it ate his data, so I would say it was not resolved back then. After that I've not used raw passthrough myself or for business applications. However, with LVM I had no major issues.
Another trick that I use: if I really, really need that sweet raw performance goodness and my stuff can run in a container, I will just create whatever array I need and pass it to the container through a mount point.
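For anyone who wants to replicate that container trick, the gist is just a bind mount point from the host into the container - a sketch under assumed names; the pool, paths and CT ID are placeholders:

Code:
# build whatever array/filesystem you need on the host, e.g. a ZFS mirror
zpool create tank mirror /dev/sdb /dev/sdc
zfs create -o mountpoint=/mnt/tank/media tank/media

# bind-mount it into an existing LXC container (CT 101 is a placeholder)
pct set 101 -mp0 /mnt/tank/media,mp=/data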

But hell - ask away, maybe they will tell us all is good in raw passthrough land. I'm not going to test it.
 
Well, the story was pretty basic. I was one of the first to actually perform the Proxmox update because it was during Christmas, and about 24 * 8TB drives' worth of data went to hell (as far as I remember the size of the shelf). I was pretty "displeased" (and maybe too harsh in the thread), but I was trying to find the culprit and spent too much time trying to figure out what was going wrong.

As you describe it, that's frustrating enough.

After I noticed more and more people coming in with similar issues, I began to doubt that this was my misconfiguration and started to suspect the problem was systemic. Then I think I noticed a similar issue on a non-Proxmox forum relating to QEMU and raw passthrough to the VM - at that point I said F* that and shifted the whole storage to Ceph for that specific VM.

I understand, but based on the mitigation back then, this should not be occurring (even at the time) if scsi-hd is used (as opposed to scsi-block). So it's not really storage related, it's a QEMU parameter issue.

A few years back (dunno the exact year, but it was before the pandemic) at one corp an admin floated to me the idea of passing raw disks to a VM for performance - I gave him a stern warning. He came back to me after some time saying that he had created a test VM with that sort of setup and it ate his data, so I would say it was not resolved back then.

So this is raising eyebrows because I suppose he would have used defaults, which by then would have been scsi-hd.

After that I've not used raw passthrough myself or for business applications. However, with LVM I had no major issues.
Another trick that I use: if I really, really need that sweet raw performance goodness and my stuff can run in a container, I will just create whatever array I need and pass it to the container through a mount point.

I get that takeaway, I'd just like to know the core of the issue. I could not find any further changes on the QEMU side with respect to scsi-block, and with the anecdotal evidence you mention it also makes me wonder about scsi-hd - but that should not be an issue, logically speaking, because it's not full passthrough.

But hell - ask away, maybe they will tell us all is good in raw passthrough land. I'm not going to test it.

Maybe I'll test it first, then ask. There's even a test case in the mailing list.

Thanks!
 
I understand, but based on the mitigation back then, this should not be occurring (even at the time) if scsi-hd is used (as opposed to scsi-block). So it's not really storage related, it's a QEMU parameter issue.
Yeah, the mitigation was suggested a few months later; I guess you can imagine what the business will say if you suggest "let's just sit on our hands until somebody graciously gives us a way of fixing that". I'm not knocking the Proxmox chaps - their product is based on open-source projects, so they are not 100% in control of other people's timetables either.
So this is raising eyebrows because I suppose he would have used defaults, which by then would have been scsi-hd.
Well, no. The whole discussion with him started from "hey, do you know how to pass the devices raw?", so he would not have used the defaults. Let's not kid ourselves, raw passthrough is and was far from the default ;)
Maybe I'll test it first, then ask. There's even a test case in the mailing list.
Or maybe you could test what I've reported :D Just use the disk connections I listed in the original posts and start dumping data into them. Dunno, maybe you've got a collection of pr0n that you can hash, and later you can read it back through the passed disks and check the hashes? Just be warned that the corruption would not happen instantaneously. Btrfs would fail spectacularly because it doesn't have some of ZFS's safety features, but I had problems with ext4 as well. My luck was that this specific VM was picking up footage from multiple sources and dumping it into a massive array, so writes were pretty random and large (this was also the reason to use raw passthrough, since anything else would collapse under the amount of random IO to spinning rust).
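If anyone does want to reproduce that kind of test, the basic hash-and-verify loop could look something like the sketch below; paths and sizes are placeholders, and as noted above the corruption may take a long time to show up:

Code:
# inside the VM: write a batch of large files to the passed-through disk and record their checksums
mkdir -p /mnt/passed/testdata
for i in $(seq 1 100); do
    dd if=/dev/urandom of=/mnt/passed/testdata/file$i bs=1M count=1024 status=none
done
sha256sum /mnt/passed/testdata/file* > /root/checksums.txt

# keep the VM busy with more writes for a while, then re-read and list any mismatches
sha256sum -c /root/checksums.txt | grep -v ': OK$'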

BUT:
maybe let's not hijack this thread as well - dude was asking about ZFS vs LVM.
 
Hello everyone,
we've finally decided to test internally a 2-node PVE cluster installation with ZFS and replication,
and VM performance is considerably slower than what we've gotten used to with hardware RAID and LVM.

Our test installation is meant to replace an old single-node installation running on a Dell PowerEdge R730
with a PERC H730 Mini controller and a RAID-5 of three Samsung 1TB SSDs.

The new cluster nodes are two Dell PowerEdge R440 with 128GB of RAM and a mirror zpool running on these Kingston 2TB SSDs.

On the nodes we run a bunch of services, but the most critical of all is an Oracle database that runs on Oracle Linux with a 4K-blocksize ext4 disk.
We noticed a substantial performance decrease during high-IOPS tasks after we moved the VM to the ZFS nodes;
to put it in perspective, the same VM takes 15-30% more time to complete the same tasks on ZFS compared to when it was running on LVM.
After running some tests with fio and directly performing tasks on the DB, with the same VM running on both ZFS and RAID+LVM,
I saw that the issue is mainly in random read-write workloads; sequential read and write performance is mostly the same.
The fio commands I used to run the tests are the same ones used in this post.

After reading various posts on this forum and other websites, these are the parameters that can have an effect on performance;
so far I have tried changing:

recordsize=8K, set to 8K because that's what Oracle DB uses as its blocksize.
atime=off, most sources cited that turning access times off can lead to a slight improvement in IO.
sync=disabled, I tried setting this parameter both on and off and did not notice any meaningful improvement;
also, since I do not have a SLOG device, I'd rather stay on the safe side and keep this on.
zfs_arc_max=51539607552, currently set to 48 GiB.

ashift=12, I left the default setting because I read that most SSDs have a page size of 4K;
however, this website says that the page size of the Kingston DC600M is 16KB.
Do you think that creating the zpool with ashift=14 could lead to any sort of improvement?

volblocksize=16K, the OpenZFS documentation states that tuning this can help with random workloads.
I did not change this parameter from the default 16KB; right now it's the same as the supposed page size of my SSDs.
Would setting this to the blocksize used by the DB (8K) or the filesystem (4K) make any sort of difference?
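For completeness, the settings listed above map to commands roughly like the following - a sketch using the pool and disk names from this post; note that ashift can only be chosen at pool creation and volblocksize only when a zvol is created:

Code:
# dataset-level tunables (recordsize affects datasets; zvols use volblocksize instead)
zfs set recordsize=8K zfs-2tb
zfs set atime=off zfs-2tb
zfs set sync=standard zfs-2tb

# ARC cap (51539607552 bytes = 48 GiB); persist it via /etc/modprobe.d/zfs.conf
echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_max

# ashift is fixed at pool creation: 12 = 4K, 13 = 8K, 14 = 16K sectors
zpool create -o ashift=14 zfs-2tb mirror /dev/sdX /dev/sdY

# volblocksize is fixed when the zvol is created (Proxmox uses the storage's "Block Size" setting)
zfs create -V 128G -o volblocksize=16K zfs-2tb/vm-218-disk-0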

Also, one thing I noticed is that there is a big difference in performance depending on whether I run the same fio command from inside the VM or directly on the host,
which makes me think that maybe the issue is not ZFS in itself but somewhere between the VM's VFS and the VM config itself,
which is this one: scsi2: zfs-2tb:vm-218-disk-0,cache=writeback,discard=on,iothread=1,size=128G,ssd=1
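To give a rough idea of the kind of random read/write run involved (a generic sketch, not necessarily the exact parameters from the linked post; the test file path is a placeholder), executed once on the host against the dataset and once inside the guest:

Code:
# 70/30 random 8K read/write mix with direct IO, bypassing the page cache
fio --name=randrw --filename=/path/to/testfile --size=8G \
    --rw=randrw --rwmixread=70 --bs=8k --ioengine=libaio \
    --iodepth=32 --numjobs=4 --direct=1 \
    --runtime=60 --time_based --group_reporting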

These are the properties of the zpool on which the vm is running.
Code:
root@pve02:~# zfs get all zfs-2tb
NAME     PROPERTY              VALUE                  SOURCE
zfs-2tb  type                  filesystem             -
zfs-2tb  creation              Wed Jan 22 11:41 2025  -
zfs-2tb  used                  1.35T                  -
zfs-2tb  available             336G                   -
zfs-2tb  referenced            4.04G                  -
zfs-2tb  compressratio         1.92x                  -
zfs-2tb  mounted               yes                    -
zfs-2tb  quota                 none                   default
zfs-2tb  reservation           none                   default
zfs-2tb  recordsize            8K                     local
zfs-2tb  mountpoint            /zfs-2tb               default
zfs-2tb  sharenfs              off                    default
zfs-2tb  checksum              on                     default
zfs-2tb  compression           on                     local
zfs-2tb  atime                 off                    local
zfs-2tb  devices               on                     default
zfs-2tb  exec                  on                     default
zfs-2tb  setuid                on                     default
zfs-2tb  readonly              off                    default
zfs-2tb  zoned                 off                    default
zfs-2tb  snapdir               hidden                 default
zfs-2tb  aclmode               discard                default
zfs-2tb  aclinherit            restricted             default
zfs-2tb  createtxg             1                      -
zfs-2tb  canmount              on                     default
zfs-2tb  xattr                 on                     default
zfs-2tb  copies                1                      default
zfs-2tb  version               5                      -
zfs-2tb  utf8only              off                    -
zfs-2tb  normalization         none                   -
zfs-2tb  casesensitivity       sensitive              -
zfs-2tb  vscan                 off                    default
zfs-2tb  nbmand                off                    default
zfs-2tb  sharesmb              off                    default
zfs-2tb  refquota              none                   default
zfs-2tb  refreservation        none                   default
zfs-2tb  guid                  7036852244761846984    -
zfs-2tb  primarycache          all                    default
zfs-2tb  secondarycache        all                    default
zfs-2tb  usedbysnapshots       0B                     -
zfs-2tb  usedbydataset         4.04G                  -
zfs-2tb  usedbychildren        1.35T                  -
zfs-2tb  usedbyrefreservation  0B                     -
zfs-2tb  logbias               latency                default
zfs-2tb  objsetid              54                     -
zfs-2tb  dedup                 off                    default
zfs-2tb  mlslabel              none                   default
zfs-2tb  sync                  standard               local
zfs-2tb  dnodesize             legacy                 default
zfs-2tb  refcompressratio      1.00x                  -
zfs-2tb  written               4.04G                  -
zfs-2tb  logicalused           821G                   -
zfs-2tb  logicalreferenced     4.02G                  -
zfs-2tb  volmode               default                default
zfs-2tb  filesystem_limit      none                   default
zfs-2tb  snapshot_limit        none                   default
zfs-2tb  filesystem_count      none                   default
zfs-2tb  snapshot_count        none                   default
zfs-2tb  snapdev               hidden                 default
zfs-2tb  acltype               off                    default
zfs-2tb  context               none                   default
zfs-2tb  fscontext             none                   default
zfs-2tb  defcontext            none                   default
zfs-2tb  rootcontext           none                   default
zfs-2tb  relatime              on                     default
zfs-2tb  redundant_metadata    all                    default
zfs-2tb  overlay               on                     default
zfs-2tb  encryption            off                    default
zfs-2tb  keylocation           none                   default
zfs-2tb  keyformat             none                   default
zfs-2tb  pbkdf2iters           0                      default
zfs-2tb  special_small_blocks  0                      default
zfs-2tb  prefetch              all                    default

I'll attach the output of the fio tests below, as posting them in code blocks here would probably make the post too long.
Maybe some of you can either point me in the right direction or at least confirm that with this setup the current results are somewhat expected.
 

Attachments

OK, so maybe I will chip in some detail from the "embedded side of computing". The situation with SSDs is VERY VERY VERY complicated.
NAND flash uses very large sectors; I've seen NAND chips with sector sizes ranging between 8K and 128K.
What does that mean?
Well, the process of writing to flash is that you erase the sector - which sets all bits to "1" - and when you write data you simply set the appropriate bits to 0. Simple, right? Yup. Theoretically, if you want to modify data, you simply lift the whole sector into RAM -> erase the block -> write out the new data. However, the erase process is LONG, and it wears out the actual flash. In the embedded world you use things like JFFS2, which just keeps appending data, treating the filesystem and its sectors as one long log. Every once in a while the FS driver walks in, finds parts of the log that no longer represent valid data (or are only half filled with valid data), rewrites that data into empty space and marks those sectors for slow garbage-collector erasure.
But what does that mean to me, since I run a PC and not some puny embedded platform?
Well, you are in luck, since all modern (10 years or younger) SSDs implement that sort of strategy on their own: they show you fake (logical) sectors and keep writing modifications to new empty space, quietly doing the trimming and erasing for you - hence you need a "controller" on the SSD. The problem arises when your FS sector size is different from the actual sector size of the SSD/NVMe disk. Copy-on-write filesystems are supposed to mitigate this by just copying modified data to free space - but the SSD lies to you anyway, so you would be writing to free space anyway.

Now imagine a situation where an SSD is being punched with lots of random IO. The disk will cope just fine with this, but after a short while (milliseconds) it starts walking its log to figure out which data to truncate and which sectors to erase - and this causes an internal IO storm. This is why all PC SSDs have the problem of far slower random 4K IO versus sequential. Everybody here knows that random 4K can be 100x slower than sequential on NVMe.

In the embedded world - where your CPU is connected directly to the raw flash - the difference between sequential and random 4K on JFFS2 is maybe 5x, because changing the address you write out-of-sequence data to gives maybe a 1.5x slowdown; the rest is down to the filesystem walking its logs for trimming. BUT the filesystem and the flash trim control are one and the same thing, which eliminates the IO storm - the FS knows what is written and what is about to be written, so it can schedule accordingly; a dumb SSD connected to a host has none of that luxury. Still, not all is great in JFFS2 land either - walking the log on a very fragmented FS can be very slow, and reading a whole file can sometimes give a solid slowdown. So there is no silver bullet.

The best way, theoretically, is to set the ZFS sector size to the same size as the underlying SSD's - unless the manufacturer is lying about that as well - then ZFS should work in sync with the SSD controller and diminish the internal IO storms. Using NVMe with ample RAM is beneficial, as is enterprise-grade NVMe. A large ARC helps as well, as do superfast LOG devices.
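Checking what the drive itself reports is straightforward, although most SSDs only expose their logical/physical sector sizes, not the internal NAND page size - a sketch, device and pool names are placeholders:

Code:
# SATA/SAS SSDs: reported logical and physical sector sizes
smartctl -i /dev/sda | grep -i 'sector size'

# NVMe: list the supported LBA formats and which one is in use
nvme id-ns -H /dev/nvme0n1 | grep -i 'lba format'

# what ashift ZFS actually chose for an existing pool
zdb -C zfs-2tb | grep ashift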

Now ... VMs ... I honestly don't know; current QEMU is so complicated that my only bet would be to use as raw an FS as possible, with matching sectors and NVMe emulation. Using preallocated disk images should help as well, but at the moment I'm at a loss with those.
 
On the nodes we run a bunch of services, but the most critical of all is an Oracle database that runs on Oracle Linux with a 4K-blocksize ext4 disk.
Yes, that's a very bad idea. Why don't you switch to ASM on a block device? That is much better performance-wise in comparison to the 4K blocksize of ext4.
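If you want to double-check the premise first, the guest's actual ext4 block size can be read straight from the superblock - a sketch, the device name inside the guest is a placeholder:

Code:
# inside the Oracle Linux guest: confirm the ext4 block size on the DB volume
tune2fs -l /dev/sdb1 | grep -i 'block size'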