Benchmark: ZFS vs mdraid + ext4 + qcow2

Discussion in 'Proxmox VE: Installation and configuration' started by Kurgan, Dec 19, 2018.

  1. Kurgan

    Kurgan New Member

    Apr 27, 2018
    After fighting with ZFS memory hunger, poor performance, and random reboots, I have just replaced it with mdraid (RAID 1), ext4, and simple qcow2 images for the VMs, stored on the ext4 file system. This setup should be the least efficient because of the multiple layers of abstraction (md and the file system have to be "traversed" to reach the actual VM data), but in production we have still seen a 5x improvement in VM responsiveness, in database query times inside the VMs, and in backup (vzdump) speed. Also, no more reboots, and all of the RAM is now available for the VMs instead of only half of it.

    I have just run some tests in a lab environment.

    Hardware: 64 GB ram, dual Xeon
    Disk controller: simple SATA3 integrated in the server mainboard
    Disk configuration: 2x WD GOLD 2 TB

    First test: standard installation of PVE 5.2 (from CD, no online updates), using RAIDZ-1. I uploaded a simple VM backup (3 GB of data on the virtual disk) and performed a vzdump "in place", meaning I dumped the VM from and to the same physical disks.

    Time to dump 3 GB of data (using the default "backup" function of PVE): 12 minutes.

    Second test: on the same hardware I installed Debian 9 with md RAID 1 and an ext4 file system on top of the md device. I then uploaded the same VM backup and ran the identical procedure.

    Time to dump 3 GB of data (using the default "backup" function of PVE): 2 minutes, 6 seconds.
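As a rough sketch of the arithmetic behind these two timings, the snippet below converts them into throughput and a speedup factor. It assumes "3 GB" means 3 GiB of data actually read, which the post does not state precisely.

```python
# Timings and data size as reported in the two tests above.
# Assumption: "3 GB" is treated as 3 GiB (3 * 1024 MiB).
DATA_GIB = 3
zfs_seconds = 12 * 60      # first test: 12 minutes
md_seconds = 2 * 60 + 6    # second test: 2 minutes, 6 seconds


def throughput_mib_s(gib, seconds):
    """Average throughput in MiB/s for `gib` GiB moved in `seconds`."""
    return gib * 1024 / seconds


zfs_rate = throughput_mib_s(DATA_GIB, zfs_seconds)
md_rate = throughput_mib_s(DATA_GIB, md_seconds)
speedup = zfs_seconds / md_seconds

print(f"ZFS RAIDZ-1:  {zfs_rate:.1f} MiB/s")   # 4.3 MiB/s
print(f"mdraid+ext4:  {md_rate:.1f} MiB/s")    # 24.4 MiB/s
print(f"speedup:      {speedup:.1f}x")         # 5.7x
```

Both figures are far below what the disks can sustain sequentially, which is expected when the same two disks are read from and written to at once.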


    CONCLUSIONS: Please, PVE developers, PLEASE, PLEASE, PLEASE, consider offering mdraid as a default installation path for the PVE ISO images. It is quite clear to me that ZFS has a LOT of disadvantages: it's slow, memory hungry, and crash prone (because of OOM situations).


    I know that I should use a proper RAID controller and LVM, and ditch ZFS. But why spend a lot of money on a raid controller when, in some low-end (and mid-range) setups, md raid works really well?

    I know that the first test uses PVE 5.2 and the second uses 5.3 (from the free PVE repos). Still, I don't think the performance improvement is caused by using 5.3 instead of 5.2.

    I know I can tune ZFS to make it stop using up half of the available RAM, but speed will only get worse anyway.
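For reference, the tuning mentioned here is normally done through the `zfs_arc_max` module parameter of ZFS on Linux, which takes a value in bytes. The following is only a minimal sketch that builds the corresponding modprobe option line; the 8 GiB limit is an arbitrary example, not a recommendation from this thread.

```python
def arc_max_option(limit_gib: int) -> str:
    """Build a modprobe option line capping the ZFS ARC at `limit_gib` GiB.

    zfs_arc_max expects bytes, so the GiB value is converted first.
    """
    limit_bytes = limit_gib * 1024 ** 3
    return f"options zfs zfs_arc_max={limit_bytes}"


# Assumption: 8 GiB is just an illustrative limit for a 64 GB host.
print(arc_max_option(8))  # options zfs zfs_arc_max=8589934592
```

Such a line usually lives in a file like /etc/modprobe.d/zfs.conf and takes effect when the module is next loaded.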

    I know I can set up PVE on Debian as I did, and not bother the PVE developers with a request for something I can do myself. Still, I can't believe it's so hard to support LVM on mdraid as a setup option. What's wrong with mdraid? I use it everywhere, I have used it for 15 years (maybe 20), and I have NEVER HAD ANY ISSUE AT ALL.
  2. LnxBil

    LnxBil Well-Known Member

    Feb 21, 2015
    First: I hear your complaints, but I would not paint it that black.

    My first steps with ZFS many years ago were very similar to yours. I also encountered OOMs and extreme slowness across the board. I haven't had a single OOM in almost 2 years on ZFS systems, and I use them simply for their features, not for their speed in comparison to "stupid block or file storage".

    You wrote that you have two disks and created a RAIDZ-1? Why? You're comparing apples and oranges. You should have created a mirrored vdev in this setup. More information here:

    You neglect that your ext4/qcow2 system will also use as much RAM as it can get. Most people don't realise this because it is the default behaviour of every modern operating system. For "real" benchmarks, one always uses direct I/O paths or drops the caches to get proper numbers. The thing is that ZFS uses at most 50% of RAM on Linux by default, whereas the default Linux page cache has no limit. On the other hand, both storage backends suffer from small caches and will be slow then.
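To make the page cache point concrete, here is a small sketch that reads the `MemTotal` and `Cached` fields in the format used by /proc/meminfo and reports how much RAM the page cache occupies. The sample values are made up for illustration and are not measurements from either test system.

```python
def page_cache_fraction(meminfo_text: str) -> float:
    """Return Cached / MemTotal from text in /proc/meminfo format."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are given in kB
    return fields["Cached"] / fields["MemTotal"]


# Hypothetical sample, not taken from the benchmarked machine.
SAMPLE = """MemTotal:       65536000 kB
MemFree:         2048000 kB
Cached:         49152000 kB
"""

print(f"page cache share: {page_cache_fraction(SAMPLE):.0%}")  # 75%
```

On a real host the same function can be fed the contents of /proc/meminfo. For a benchmark, the caches can be dropped beforehand by writing `3` to /proc/sys/vm/drop_caches as root.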

    One other main aspect is your hardware. ZFS is not a low-end consumer filesystem like the ext family; it is a filesystem for enterprise-class hardware and an enterprise setting. Two slow disks running at 7.2k rpm at most will not be fast enough for anything. ZFS will not perform well on those machines, and why should it? It was not designed for them. ZFS is the filesystem without any limits ... but at the far end of the spectrum. You should compare it with standard mdadm and ext4 in a bigger picture with a lot of disks. ext4 also has a lot of limitations that ZFS does not have, such as with concurrent access.

    We also used mdadm, also for decades, also without any issues, but we switched almost every system that was/is big enough to ZFS. Yes, it is not as fast as it could be, but this is mainly due to the features we really fell in love with:

    * transparent compression (this yields even higher read/write throughput than ext4 depending on your data - also on low-end hardware)
    * resilvering instead of a stupid 1:1 copy/parity on rebuild (or creating a 100 TB pool that is already highly available from the beginning)
    * in addition to scrubbing/patrol reads - transparent self healing on every read.
    * support for trimming (not towards the hardware, but towards the virtual disk)
    * 100% valid data detection in a two disk mirror. In a stupid 2 disk mdadm mirror, you cannot tell which data is correct if a block does not match on both disks.
    * snapshots and the ability to transfer them very, very efficiently (also replicate)
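The point about valid data detection in a two-disk mirror can be sketched as follows. This is a toy model, not ZFS code: it uses SHA-256 from Python's standard library purely for illustration (ZFS has its own checksum algorithms), and the function names are made up.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum stored separately from the data block itself."""
    return hashlib.sha256(block).hexdigest()


def read_mirrored(copies, expected):
    """Return a copy that matches the stored checksum, plus repaired copies.

    With a stored checksum the reader knows which side of the mirror is
    correct and can overwrite the bad side (self-healing on read).
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("no copy matches the stored checksum")
    return good, [good for _ in copies]


original = b"database page 42"
stored = checksum(original)       # written when the block was written
disk_a = original
disk_b = b"database page 4X"      # silently corrupted on the second disk

# A plain mirror only sees that the two copies differ, not which is right.
print(disk_a != disk_b)           # True

data, healed = read_mirrored([disk_b, disk_a], stored)
print(data == original)           # True
print(healed[0] == healed[1])     # True
```

Without the separately stored checksum, the mismatch between the two disks can be detected but not resolved, which is the limitation of a plain two-disk mdadm mirror described in the list.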

    The last point was the big problem on mdadm with external backups. We decreased the time of an external backup via ZFS replication by a factor of almost 100 in comparison to rsync, because we can transfer incrementally and do not have to copy the backups themselves to the external disk, only the changes. This also yields much smaller space requirements and therefore a longer history or smaller disks. This fact sold us totally on ZFS, because you simply cannot do that on a system that is not CoW-based.

    This could also help in your use case: vzdump
    In an ext4 setup, you need to vzdump in order to have something that you can copy off to another location as a backup. Yes, you can do that with ZFS too, but this is not what you should do. You should snapshot your VM and transfer the much smaller differential data to your backup site. This will always be faster than the vzdump because you only need to read the data that is present only in the newest snapshot.
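The snapshot-and-transfer workflow described here maps onto the standard `zfs snapshot`, `zfs send -i` and `zfs receive` commands. The sketch below only builds and prints the command lines; it does not run them. The dataset names, snapshot names and backup host are hypothetical placeholders.

```python
def incremental_send_commands(dataset, prev_snap, new_snap,
                              backup_host, backup_dataset):
    """Build the commands for one incremental replication step.

    Returns (snapshot, send, receive) as argument lists. The output of
    `send` is meant to be piped into `receive` on the backup host.
    """
    snapshot = ["zfs", "snapshot", f"{dataset}@{new_snap}"]
    send = ["zfs", "send", "-i",
            f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    receive = ["ssh", backup_host, "zfs", "receive", backup_dataset]
    return snapshot, send, receive


# All names below are made-up examples, not taken from the thread.
snap_cmd, send_cmd, recv_cmd = incremental_send_commands(
    dataset="rpool/data/vm-100-disk-0",
    prev_snap="monday",
    new_snap="tuesday",
    backup_host="backup.example.com",
    backup_dataset="backup/vm-100-disk-0",
)

print(" ".join(snap_cmd))
print(" ".join(send_cmd) + " | " + " ".join(recv_cmd))
```

Only the blocks that changed between the two snapshots are sent, which is why this is so much faster than dumping and copying the whole image. The previous snapshot must still exist on both sides for the incremental stream to apply.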

    If you have ZFS on your PVE (which unfortunately only works on local storage per node and not as shared storage in a cluster), you can snapshot and replicate via built-in tools, so that you have a backup system that lags behind by e.g. 15 minutes. Totally impossible in an ext4 setup with mdadm.

    So, some remarks on the performance:
    Is it slower than ext4 on some test hardware? Yes, sure. I used it on the Raspberry Pi 1-3 and it was almost dead slow, but it worked. It is definitely slower than ext4 in a 2 or 4 disk SATA system on software RAID. But in our experience (and I mean also all the ZFS advocates here) it is not even half as slow in comparison to ext4 across a broad variety of use cases.