Benchmark: ZFS vs mdraid + ext4 + qcow2

Kurgan

After fighting with ZFS memory hunger, poor performance, and random reboots, I have just replaced it with mdraid (RAID1), ext4, and simple qcow2 images for the VMs, stored on the ext4 file system. This setup should in theory be less efficient because of the multiple layers of abstraction (md and the file system have to be "traversed" to reach the actual VM data), but we have still noticed a 5x improvement in VM responsiveness, database query times inside the VMs, and backup (vzdump) speed in production. Also, no more reboots, and all of the RAM is now available for the VMs, instead of only half of it.

I have just run some tests in a lab environment.

Hardware: 64 GB RAM, dual Xeon
Disk controller: simple SATA3 integrated in the server mainboard
Disk configuration: 2x WD GOLD 2 TB

First test: Standard installation of PVE 5.2 (from CD, no online updates), using RAIDZ-1. I uploaded a simple VM backup (3 GB of data on the virtual disk) and performed a vzdump "in place", meaning I dumped the VM from and to the same physical disks.

Time to dump 3 GB of data (using the default "backup" function of PVE): 12 minutes.

Second test: on the same hardware I installed Debian 9 with md RAID 1 and an ext4 file system on top of the md device, then PVE 5.3 from the free repositories. I uploaded the same VM backup and ran exactly the same procedure.

Time to dump 3 GB of data (using the default "backup" function of PVE): 2 minutes, 6 seconds.
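For reference, the GUI "Backup now" with default settings corresponds roughly to a vzdump call like this (the VMID and dump directory are made up, and the exact defaults may differ):
Code:
vzdump 100 --mode snapshot --compress lzo --dumpdir /var/lib/vz/dump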

This is a SIXFOLD IMPROVEMENT.

CONCLUSIONS: Please, PVE developers, PLEASE, PLEASE, PLEASE, consider offering mdraid as an installation option in the PVE ISO images. It is quite clear to me that ZFS has a LOT of disadvantages: it's slow, memory hungry, and crash prone (because of OOM situations).


CAVEATS:

I know that I should use a proper RAID controller and LVM, and ditch ZFS. But why spend a lot of money on a raid controller when, in some low-end (and mid-range) setups, md raid works really well?

I know that the first test uses PVE 5.2 and the second uses 5.3 (from the free PVE repos). Still, I don't think the performance improvement is caused by using 5.3 instead of 5.2.

I know I can tune ZFS to make it stop using up half of the available RAM, but speed will only get worse anyway.

I know I can set up PVE on Debian as I did, and not bother the PVE developers asking for something I can do by myself; still, I can't believe it's so hard to support LVM on mdraid as a setup option. What's wrong with mdraid? I use it everywhere, I have used it for 15 years (maybe 20) and I have NEVER HAD ANY ISSUE AT ALL.
 
First: I hear your complaints, but I would not paint quite such a black picture.

My first steps with ZFS many years ago were very similar to yours. I also encountered OOMs and extreme slowness across the board. Nowadays I haven't had a single OOM in almost 2 years on my ZFS systems, and I use them simply for their features, not for their speed in comparison to "stupid block or file storage".

using RAIDZ-1.

You wrote you have two disks and created a RAIDZ1? Why? You're comparing apples and oranges. You should have created a mirrored vdev in this setup. More information here: https://constantin.glez.de/2010/01/23/home-server-raid-greed-and-why-mirroring-still-best/
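For two disks, a mirrored pool would look roughly like this (pool name and device paths are just examples; the PVE installer does this for you when you pick ZFS RAID1):
Code:
# pool name and devices are examples; use /dev/disk/by-id paths in practice
zpool create tank mirror /dev/sda /dev/sdb
zpool status tank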

I know I can tune ZFS to make it stop using up half of the available RAM, but speed will only get worse anyway.

You neglect that your ext4-qcow2 system will also use as much RAM as it can get. Most people don't realise that, because this is the default behaviour of every modern operating system. For "real" benchmarks, one always uses direct I/O paths or drops the caches to get proper numbers. The thing is that ZFS on Linux only uses at most 50% of the RAM by default, whereas the Linux page cache has no limit at all. On the other hand, both storage backends suffer from small cache sizes and will be slow then.
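A minimal sketch of what I mean, with a made-up file path; note that dropping the page cache does not touch the ZFS ARC, which is managed separately:
Code:
# flush dirty data and drop the Linux page cache before each run
sync; echo 3 > /proc/sys/vm/drop_caches
# read test that bypasses the page cache entirely
dd if=/var/lib/vz/images/100/vm-100-disk-0.qcow2 of=/dev/null bs=1M iflag=direct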

One other main aspect is your hardware. ZFS is not a low-end consumer filesystem like the ext family; it is a filesystem for enterprise-class hardware and an enterprise setting. Two slow, at most 7.2k rpm, disks will not be fast enough for anything. ZFS will not perform well on such machines, and why should it? It was not designed for them. ZFS is the filesystem without any limits, but at the far end of the spectrum. You should compare it in a bigger picture, with a lot of disks, against standard mdadm and ext4. ext4 also has a lot of limitations that ZFS does not have, such as concurrent access and the like.

I use it everywhere, I have used it for 15 years (maybe 20) and I have NEVER HAD ANY ISSUE AT ALL.

We also used mdadm... also for decades, also without any issues, but we have switched almost every system that was/is big enough over to ZFS. Yes, it is not as fast as it could be, but this is mainly down to the features we really fell in love with:

* transparent compression (depending on your data, this even yields higher read/write throughput than ext4, also on low-end hardware; see the sketch after this list)
* resilvering instead of a stupid 1:1 copy/parity rebuild (or creating a 100 TB pool that is highly available right from the start)
* transparent self healing on every read, in addition to scrubbing/patrol reads
* support for trimming (not towards the hardware, but towards the virtual disk)
* 100% valid data detection in a two-disk mirror. In a stupid two-disk mdadm mirror, you cannot tell which copy is correct if a block does not match on both disks.
* snapshots and the ability to transfer (and replicate) them very, very efficiently
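As a quick illustration of the compression point (dataset name is just an example, lz4 is the usual choice):
Code:
zfs set compression=lz4 rpool/data
# after some data has been written, check how well it compresses
zfs get compressratio rpool/data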

That last point was the big problem with external backups on mdadm. We decreased the time of an external backup via ZFS replication by a factor of almost 100 compared to rsync, because we can transfer incrementally and do not have to copy the backups themselves to the external disk, only the changes. This also yields much smaller space requirements and therefore a longer history or smaller disks. This fact totally sold us on ZFS, because you simply cannot do that on a non-CoW-based system.
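A minimal sketch of such an incremental transfer, with made-up pool, dataset and snapshot names:
Code:
# initial full transfer to a pool on the external disk
zfs snapshot rpool/data/vm-100-disk-0@monday
zfs send rpool/data/vm-100-disk-0@monday | zfs receive backup/vm-100-disk-0
# later runs only send the blocks changed since the previous snapshot
zfs snapshot rpool/data/vm-100-disk-0@tuesday
zfs send -i @monday rpool/data/vm-100-disk-0@tuesday | zfs receive backup/vm-100-disk-0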

This could also help in your use case: vzdump
In an ext4 setup, you need to run vzdump in order to have something you can copy off to another location as a backup. Yes, you can do that with ZFS too, but it is not what you should do: you should snapshot your VM and transfer the much smaller differential data to your backup site. This will always be faster than vzdump, because you only need to read the data present in the newest snapshot.

If you have ZFS on your PVE (which unfortunately only works on single nodes and not in a cluster with shared storage), you can snapshot and replicate via built-in tools, so that you can have a backup system that lags behind by e.g. 15 minutes. Totally impossible in an ext4 setup with mdadm.

So, some remarks on the performance:
Is it slower than ext4 on some test hardware? Yes, sure. I used it on the Raspberry Pi 1-3 and it was almost dead slow, but it worked. It is definitely slower than ext4 in a 2- or 4-disk SATA system on software RAID. But in our experience (and I mean all the ZFS advocates here as well), it is not even half as slow as ext4 across a broad variety of use cases.
 
LnxBil, thanks for your reply.

Please bear with me, I am quite desperate because of the OOM issues and crashes (slowness is an issue, but not the main one).

I made a mistake in saying that I used RAIDZ-1. It's simply RAID1 made using ZFS. I got the term wrong.

What I'm trying to accomplish is to have a stable and not terribly slow system, on more or less low-end hardware.

I have now understood that ZFS is slow on low-end hardware (sata 7200 rpm disks, no SSD caching). I also understand it has its benefits (a lot of great features, that come at the price of speed, probably).

What I don't understand is why it's so prone to OOM situations and reboots. Or, maybe more precisely, why is PVE with ZFS so prone to these issues.

If ZFS really wants high-end hardware (64 or more GB of RAM, multiple 15k SAS disks, etc.), I believe this should be clearly stated in the PVE documentation. When I approached PVE for the first time, I thought that ZFS could be the proper solution when you don't have a decent RAID controller, so I thought I could use it in a low-end server. (Well, I got crashes and reboots even on a 64 GB server, actually.)
 
I use Proxmox on consumer hardware (spinning disks, desktop motherboards, non-ECC RAM), always with ZFS, and for a very limited number of VMs it has OK speed. The best improvement I have recently started making is moving from spinning disks to SSDs. Usually I use two consumer-grade SSDs (Samsung EVO series) in ZFS RAID1 for the Proxmox installation and for the virtual disks which need more speed (OS, databases, applications), then I put 2 or more HDDs (SATA 7200 rpm) in RAID1 or RAID10 for data storage. I have at most 3 Windows VMs on each server.

I have good uptimes (some servers that get less maintenance have been up for 2 years) and I find it strange that you get OOMs and reboots; perhaps you are overprovisioning too much. I try to leave at least 512 MB of RAM free; on each server I have a conf file in /etc/sysctl.d with
Code:
vm.swappiness = 0
vm.min_free_kbytes = 524288
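The file name is up to you (something like /etc/sysctl.d/90-zfs-tuning.conf, just an example); the settings can be applied without a reboot:
Code:
sysctl --system
# verify:
sysctl vm.min_free_kbytes vm.swappiness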
 
I am not overprovisioning (or at least, I believe I am not). For example, I have a 16 GB server that had, until yesterday: 8 GB of ZFS ARC cache, 3 GB for one VM, 1 GB for another (so we are at 12 GB total), and I could not start another VM with 1 GB because KVM told me it could not allocate memory.

Now I have reduced the ARC to 4 GB (which needs a reboot, because the ARC cache NEVER shrinks) and I could start the third VM.

This server is definitely not overprovisioned, IMHO.


Another server has 32 GB RAM, 16 used in ZFS ARC, 4 VMs running, totalling 12 GB RAM (so I should have 4 GB unused) and it crashes and reboots every month or so, always while running backups (vzdump). I will try to reduce ARC size to 8 GB and see if with 12 GB free it stops crashing.
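For reference, this is how I understand the ARC cap is set persistently (8 GiB expressed in bytes; there is also a runtime knob, although as noted above the cache may not shrink right away):
Code:
# /etc/modprobe.d/zfs.conf  (8 GiB = 8589934592 bytes)
options zfs zfs_arc_max=8589934592
# if the root filesystem is on ZFS, rebuild the initramfs and reboot:
# update-initramfs -u
# runtime change, no reboot needed:
# echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max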

All of my hosts are entry level servers, Xeons with ECC memory and SATA nearline 7200 rpm disks, with swappiness set to 0, and no "min_free" set.

The only hosts that show no issues at all are the ones not using ZFS.
 
Try to set vm.min_free_kbytes, so there is always memory free for Proxmox. I set it on every Proxmox server and, as I said, I don't have problems with crashes or reboots.

In your examples I think your ZFS ARC max is too high. Linux has its own cache too, and even if that is released when needed, it takes some time. If your VMs need roughly half the host's RAM (and if you assign e.g. 1 GB to a VM, that VM doesn't use only 1 GB but somewhat more, as you can check yourself), I'd set the ZFS ARC max to one third of the RAM, no more, so you don't have too much memory pressure. Try to never go above 75% of used RAM.
 
I will surely reduce the ZFS ARC, because it's now clear to me that it takes up too much RAM and never gives it back, even under low-memory conditions. It should, the documentation says it does, but it's not true: it does not give up a single byte of RAM. Then I will try your settings and see if my servers stop crashing.
 
Hello @Kurgan ,

Yes, ZFS is very rough at first, especially if you do not read, and read, and re-read the documentation. The most difficult part is understanding how this beast works.
But if you are willing to spend many hours, then at some point your ZFS setups will start to get better and better. ZFS has a lot of settings that can be tuned, even on servers with low memory. I have been using ZFS for many years, on many desktop-like machines with 2 GB of RAM (not Proxmox at that time) running many services. But lucky me, I have only faced a few reboots.
Now with Proxmox, most of my nodes have 16 GB of RAM (databases, file servers and many others, including NFS, LizardFS, etc.). Not a single crash/reboot.

One trick is to use the ZFS cache for metadata only. Also, if you use the volblocksize recommended for your database engine, you can get very good performance (together with many other custom ZFS settings).
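Something like this, just as a sketch (dataset/volume names are made up, and volblocksize can only be set when the zvol is created):
Code:
# cache only metadata in the ARC for this dataset
zfs set primarycache=metadata rpool/data
# create a zvol with a block size matching the DB page size (e.g. 16k)
zfs create -V 32G -o volblocksize=16k rpool/data/vm-100-disk-1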

So in the end, if something is not working as you hope with ZFS, it must be your own lack of knowledge and/or a misunderstanding of your data environment (I repeat this sentence to myself over and over again until I am able to solve my own ZFS problem).

Good luck!!!
 
it takes up too much RAM and never gives it back, even under low-memory conditions. It should, the documentation says it does, but it's not true.
Your statement is false. Just keep watching the ARC size with
Code:
arcstat  1
and you'll see for yourself that ARC allocation is dynamic: it grows when there are requests and free memory, and it shrinks when memory is needed elsewhere (example: you start another VM and there is not enough free RAM). But it needs some seconds to release the allocated RAM, so if free RAM is too low and memory pressure too high, problems can happen. Hence my advice: never go above 75% used RAM (ARC total + RAM assigned to VMs + some extra RAM for each VM + some RAM for Proxmox) and set vm.min_free_kbytes
 
Guletz and mbaldini, thanks for your answers. I am not a ZFS expert, I used it for the first time in Proxmox, because it's its default choice for software RAID. I have always used md and ext4 before. What baffles me is the fact that ZFS has so many issues in Proxmox, and that, based on what I read on this forum, ZFS is very poorly tuned in Proxmox.

You (and other people too) are telling me to tune ZFS in Proxmox to make it work properly, and I ask myself why these settings are not default, and why there is no documentation about it in Proxmox.

I have so far learned that:

  1. I should not swap on ZFS (and Proxmox does by default, and this is not configurable in the installer)
  2. At least, the swap area on ZFS should have primarycache=metadata, logbias=throughput and sync=always, and this was not the case until the latest PVE releases (see the sketch after this list)
  3. I should set primarycache=metadata for all ZFS volumes and not only swap area, and Proxmox does not do it.
  4. I should limit RAM used by ARC, and Proxmox does not do it and there is very little documentation about it.
  5. I should set vm.min_free_kbytes to at least 524288 (512 MB), swappiness to zero, and anyway never use more than 75% of the RAM, and there is no documentation about this, nor is swappiness set to 0 by default by the installer.
  6. ARC size should really be dynamic, but I have witnessed cases where, after setting arc_max at runtime, the cache never gave up a single byte even when OOM situations occurred.
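For reference, points 2 and 3 translated into commands as I understand them (assuming the default rpool layout created by the installer):
Code:
# swap zvol (point 2)
zfs set primarycache=metadata rpool/swap
zfs set logbias=throughput rpool/swap
zfs set sync=always rpool/swap
# dataset holding the VM disks (point 3)
zfs set primarycache=metadata rpool/data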
I believe that all of these settings should be made default (when possible) and documented in the ZFS howto for PVE, otherwise it is not advisable to use ZFS in PVE unless you have prior knowledge of ZFS itself.

It seems to me that ZFS in PVE is a work in progress and is not stable enough to be used in production unless you are competent enough to be able to tune it based on your own knowledge of ZFS itself.


I am now interested in using primarycache=metadata for the volume containing the VM images. I have already applied the other settings and suggestions to my "test server", which is actually a production one, so I am being very conservative in testing. Does this setting require a reboot? Does it improve performance if I have a small (4 GB) ARC?
 
I should not swap on ZFS (and Proxmox does by default, and this is not configurable in the installer)
This has changed in the latest versions of the installer: now you can reserve some free space on the disk(s) when you install Proxmox and then configure swap on that free space.

I should set primarycache=metadata for all ZFS volumes and not only swap area, and Proxmox does not do it.
I never needed to do that; it can depend on your hardware, your workload, and so on.

I should limit RAM used by ARC, and Proxmox does not do it and there is very little documentation about it.
Official Proxmox documentation about ZFS: https://pve.proxmox.com/wiki/ZFS_on_Linux
there is a chapter: Limit ZFS Memory Usage
Very little documentation?


I should set vm.min_free_kbytes to at least 524288 (512 MB), swappiness to zero, and anyway never use more than 75% of the RAM, and there is no documentation about this, nor is swappiness set to 0 by default by the installer.
Again, this cannot be a default, because it depends on your VMs, your hardware, your workload, ...
If you are the person who must manage the server, you must know these things. Otherwise you can pay someone with experience, or get support directly from the Proxmox team with a subscription.


It seems to me that ZFS in PVE is a work in progress and is not stable enough to be used in production unless you are competent enough to be able to tune it based on your own knowledge of ZFS itself.
I think that if you need to manage a complex system like a virtualization host in production, you should know it well and be competent in what you manage. It's no different with VMware and other virtualization platforms.


These are only my 2c. I gave you my advice based on my experience with Proxmox, which is limited: about 30 Proxmox hosts, most of them standalone, some in a cluster; I have tried Ceph, tried DRBD, use ZFS replication, and so on. I usually test technologies in my lab before selling them, so I can be ready to manage problems and optimize for each customer's workload.

Have a nice day
 
Thanks @mbaldini, here are my additional 2c:

I should limit RAM used by ARC, and Proxmox does not do it and there is very little documentation about it.

No, you should not limit it. In fact, the default for ZFS on Linux is to use at most half of your RAM, which is totally fine and does not pose any problems. If you want to limit it further, you can just do so, but as @mbaldini pointed out, the ARC can and will shrink if space is needed.

I have no idea why it is not in your case, but it generally does. Could you please run this on your server:

Code:
pveversion -v
 
