ZFS RAID 10 vs SW RAID 10

sebschmidt1981

Sep 4, 2018
Hey guys,

I've been using Proxmox for a really long time now, but I still get confused when ZFS joins the club. I have a few questions which I hope some of you can answer. I moved away from unRAID for my home server because of its lower read and write IO performance: I plan to do a lot of nested virtualization, and while unRAID as the host is nice for sharing space over the network to the nested systems, one disk at a time is still slower than four together. So let me start with my system, my home server.

Server-HW:
  • 1 x Intel(R) Xeon(R) CPU E5-2650L v2 10c/20t
  • 6 x Hynix 16GB ECC Registered RDIMM Rank 4 sticks (96 GB RAM)
    • (Planning to move to Samsung, same specs but 8 sticks (128 GB RAM))
  • 1 x Supermicro X9SRA mainboard (can support Intel C602 FakeRAID, but I think I'm better off with a SW RAID)
  • 4 x WD Blue 4TB 3.5" disks
  • 1 x 240 GB SanDisk Ultra II SSD

Possible scenarios:
  • Debian 9.5 native minimal install with RAID 10 (manually set up RAID 10: 500 MB /boot + the rest in LVM with 16 GB swap)
  • Proxmox ISO - ZFS RAID 10 setting, next, next, done; basically nothing manual
  • Surprise me, maybe there is something cool I didn't think of

Here are my questions now:
  1. Why use ZFS? (I read about ZFS for 2-3 hours yesterday: it's new, cool, hashes blocks, has self-healing abilities, fragments with many small files, and so on. Honestly, I'm now more confused than before.)
  2. WHY would you use ZFS, and in WHICH scenario is ZFS good?
  3. What about these endless memory discussions? Do I have to care with 96/128 GB RAM?
  4. Why does it even need RAM? And does it "write" faster than SW RAID 10?
  5. Is it even better than SW RAID 10?
  6. The PVE guys wouldn't add something stupid to the native ISO installer, so is ZFS better at all? What's the reason the menu offers ZFS RAID 10 and not RAID 10 with mdadm?
  7. Which one is faster, and why?
I really hope some of you can help me. I've read a lot in this forum, and all I can say is: YOU GUYS ROCK!
 
First: ZFS is an enterprise-grade filesystem which is logically not as fast as other filesystems if you compare pure IO on a filesystem level; there is simply more software between an IO call and its return. Instead of minimizing the IO path, ZFS gives you a lot of the things you would want from a modern, enterprise-grade filesystem (a few of them are sketched right after this list):
- transparent compression
- snapshotting, CoW clones, simple, asynchronous replication based on snapshots
- easy NFS/SMB or even iSCSI shares directly from the zfs utility
- integrated volume manager (block and file storage)
- Resilvering (RAID rebuild) only copies used blocks instead of everything
- different and also adaptive blocksizes for better compressibility
- self-healing with intelligent scrubs (even possible on single disks if you have "only" block errors)
- possible deduplication (USE WITH CAUTION!)
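
To give a feel for how these features look in practice, here is a minimal sketch using the zfs utility. The pool/dataset name tank/data is a placeholder, and the property values are examples, not recommendations:
Code:
# tank/data is a placeholder dataset name.
zfs set compression=lz4 tank/data      # transparent compression
zfs snapshot tank/data@before-upgrade  # instant snapshot
zfs rollback tank/data@before-upgrade  # roll back if something goes wrong
zfs set sharenfs=on tank/data          # export the dataset over NFS
zfs set recordsize=16K tank/data       # adjust the record size for the workload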

Now to your questions:

1) New is relative ... ZFS is at least 12 years old. The feature list itself is really great; e.g. you can create pool snapshots and sync them to an off-site backup with ease, and only new data is transferred (see the sketch after this list). Compression is also very handy and will increase write throughput, because you need fewer slow disk IOs.
2) ZFS is a single-server filesystem, so if you have a cluster, you cannot use it in an HA manner - it's not cluster capable. You can however create one server with ZFS and export its LUNs. ZFS is not HA-capable per se, but there are HA solutions that try to build it in a failover manner: they import the pool on another system and re-export its stuff. I also tried running ZFS on a Raspberry Pi and it worked - not fast, but it worked. If you have ZFS everywhere, you can use asynchronous replication to have consistent, working copies of your stuff as drop-in replacements. We also use it for storing PVE backups on a big ZFS pool with external/off-site replication.
3) ZFS uses RAM to store its metadata and block copies so that it works fast. It stores more metadata than other filesystems (because of its features), therefore it also needs more RAM. Normally, Linux will cache everything anyway as long as there is enough memory, so that part is not new, but ZFS has its own cache (the ARC), so it behaves a little differently than ordinary Linux filesystems.
4) If the data is compressible, yes it writes faster.
5) Yes - the features outweigh an mdadm RAID.
6) Proxmox design decision ... maybe they will reply, or state something on the wiki, because this question has been asked before.
7) That hugely depends on the use case, and if you really want raw beast performance, don't virtualize. Most of the time that is also cheaper with respect to software licenses.
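
As an illustration of the snapshot-based replication mentioned in #1, here is a minimal sketch. The names (tank/vmdata, backuphost, backup/vmdata) are placeholders, and the backup host is assumed to be reachable over SSH with ZFS installed:
Code:
# Initial full copy to the off-site box.
zfs snapshot tank/vmdata@monday
zfs send tank/vmdata@monday | ssh backuphost zfs receive backup/vmdata

# Later: send only the blocks that changed since the last snapshot.
zfs snapshot tank/vmdata@tuesday
zfs send -i tank/vmdata@monday tank/vmdata@tuesday | ssh backuphost zfs receive backup/vmdata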
 
@LnxBil:
Thanks a lot for your feedback and insight; I think I now understand the concept around ZFS more clearly.

Regarding your answer to #7, I respect that. But I forgot to mention that I also want to transfer some virtualized stuff from my dedicated root server on the internet, which also runs PVE; mostly game-server-related IO is my point here. I don't want to install Windows 10 on a root server and set everything up in there with click-it stuff. I want the separation I got used to with PVE over the last 7 years.

I just ordered an offsite disk to back up here and there via rsync, which lets me try out Debian+PVE+RAID10 for a few weeks, and if I'm not happy I can try the PVE ISO with ZFS RAID 10.

Is there something like the ext4 "barrier=0" setting for ZFS as well?
 
6. Because md-raid has no way to check/verify whether the data on both members of a mirror is identical - and this can create big problems.
In my own case, because of this problem (what you read from an md-raid is not guaranteed to be what you wrote), I started to use ZFS.
 
Is there something like the ext4 "barrier=0" setting for ZFS as well?

You can turn off sync to get more performance, but this will compromise the safety of your data.
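
For reference, a minimal sketch of what that looks like (the dataset name tank/scratch is a placeholder). Only do this for data you can afford to lose on a power failure or crash:
Code:
zfs get sync tank/scratch            # the default is "standard"
zfs set sync=disabled tank/scratch   # acknowledge sync writes immediately - data loss risk on power failure
zfs set sync=standard tank/scratch   # revert to the safe default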

Regarding your answer to #7, I respect that.

Yes, I know :-D Often you don't need raw power. As I wanted to point out: all the features you get with ZFS are worth any possible slowness. I used mdadm and LVM for over a decade, and after Proxmox introduced ZFS in PVE, I tried it and never went back. My systems run either hardware RAID 1 (simple compute nodes with an attached SAN, just because most servers include it already) or ZFS on all other machines - even inside VMs for things like file servers and such.

One thing I did not mention: you can speed up your slow-disk setup with a proper enterprise SSD for sync writes, similar to CacheCade in RAID controllers.

I can understand that ZFS has a steep learning curve, but it is worth learning. Having everything on mdadm and LVM also yields good performance; I'd also look into QCOW2 in that case, because of its features.
 
@1 & 2
Features. And it's usually use-case dependent, so I'll leave these 2 to the community.

@3
Not having a lot of RAM isn't really a *problem* in the sense that things would start breaking (unless you go REALLY low...). But you need a certain minimum amount for reasonable performance.
I've also used it on a machine with 4 GB of RAM, but nothing running there would otherwise have needed the RAM, and if something had, ZFS would happily have freed it up. It only becomes a problem if you have a lot of sudden RAM usage spikes, because ZFS might not be able to free memory fast enough.

@4
Up to a certain point more RAM means better performance, because a lot of metadata will be accessed a *lot*, and you don't want it to have to re-read the same sector multiple times during a single transaction. After a certain point more RAM just means more "cache". How more cache relates to performance depends on a lot of factors and needs to be checked for the specific use case anyway. Additionally, remember that if you run only VMs, the guest kernels typically keep a cache anyway, just like your host.
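
If you do want to bound how much RAM the ARC may take on ZFS on Linux, a minimal sketch (the 8 GiB value is just an example, not a recommendation for this particular box):
Code:
# Cap the ARC at 8 GiB at runtime (value in bytes; example only).
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# Make it persistent across reboots on Debian/PVE.
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u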

@5
Yes. With SW RAID you don't (at least not out-of-the-box) get checksumming. If you have a file system that does this, it may report the error, but it can't really do much about it.
Since ZFS integrates both into one, it can read the data & checksums on multiple disks if you have it in a redundant layout, and if one of them is correct while the other isn't, it can try to correct that on its own.
In other words: bitrot and silent corruption can be detected and fixed, while being reported to you so you can consider swapping drives out.
DM-RAID and MD-RAID, on the other hand, will only tell you the RAID is broken and needs a resync. You then have to tell it manually which disk to sync onto the other.
There are some verification layers being worked on that you can stack onto SW RAID. The ones currently available don't really interact with the RAID subsystem as far as I know, so that's a big downside.
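
The self-healing part is driven by scrubs; a minimal sketch of how that looks in practice (the pool name tank is a placeholder):
Code:
zpool scrub tank     # read every used block, verify checksums, repair from redundancy where possible
zpool status tank    # shows scrub progress and per-device READ/WRITE/CKSUM error counters
zpool status -x      # short health summary: "all pools are healthy" if nothing is wrong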

@6
Since this comes up a lot, I'll mention it again: both DM-RAID and MD-RAID have one big issue. If you disable caching on your VMs (which, if the storage supports it, most people do and want, because the VM has a cache itself anyway), these systems do not actually guarantee that the data written to each device in a RAID1 will be the same (because of how `O_DIRECT` is implemented in them). If the data is modified in flight, the write operation on each disk may end up writing different data. The end result is that the next time those blocks are read, your DM/MD RAID will tell you the RAID is broken and you have to resync.
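
You can check for this class of mismatch yourself with MD-RAID's built-in consistency check - a sketch, assuming a hypothetical array called md0:
Code:
echo check > /sys/block/md0/md/sync_action   # start a read-only consistency check of the array
cat /sys/block/md0/md/mismatch_cnt           # number of mismatched sectors found by the check
# A non-zero count tells you the copies differ, but not which copy is the correct one.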
 
Thank you for your input, which do you think is faster? Aka #7
 
Depends on the setup. Generally, ZFS does a lot, so for a real comparison you'd need to put a file system onto the SW RAID block so that you end up with comparable features. It doesn't make much sense to compare an ext4 on SW-RAID1 to a ZFS mirror in my opinion.
Also note that you can easily add and remove a level-2 cache and a log device on your ZFS pool, giving you a massive speed boost for the common operations. For SW RAID you can try to stack a dm-cache layer on top, but I don't know how well that works in practice or how well it handles writes (if at all); there's a separate writecache layer available, too. I haven't actually seen those two in use yet ;-), and I'm not sure how many layers you want to stack up in total anyway. I suppose the Stratis folks would know more about that.
 
How would the cache disk in ZFS work?
Is it like unRAID does it: write over the network to the cache until it is full, then to the RAID itself, and at 3:40 am move everything from the cache to the RAID to free up space on the cache disk again? Or something similar?
 
There is a lot of good info on the web, for instance http://open-zfs.org/wiki/Performance_tuning#Adaptive_Replacement_Cache


For cache and log, get a data-center grade SSD, 200 GB or less; we use 80 GB, with 8 GB of that for the log. For the main spinning disks, any decent ones will do.

Consider 2 SSDs for cache/log and mirror the log. This is how my home system is set up; it backs up work and is used for home theatre:
Code:
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 7h17m with 0 errors on Sun Sep  9 07:41:44 2018
config:

        NAME                                                  STATE     READ WRITE CKSUM
        tank                                                  ONLINE       0     0     0
          raidz2-0                                            ONLINE       0     0     0
            scsi-35000c50058837f23                            ONLINE       0     0     0
            scsi-35000c50058706047                            ONLINE       0     0     0
            scsi-35000c500588380f3                            ONLINE       0     0     0
            scsi-35000c500588374bb                            ONLINE       0     0     0
            scsi-35000c500963ecda7                            ONLINE       0     0     0
            scsi-35000c50058836be7                            ONLINE       0     0     0
        logs
          mirror-1                                            ONLINE       0     0     0
            ata-INTEL_SSDSA2VP020G2_CVLC12730064020AGN-part1  ONLINE       0     0     0
            ata-INTEL_SSDSA2VP020G3_CVHA2124007Z020AGN-part1  ONLINE       0     0     0
        cache
          ata-INTEL_SSDSA2VP020G2_CVLC12730064020AGN-part2    ONLINE       0     0     0
          ata-INTEL_SSDSA2VP020G3_CVHA2124007Z020AGN-part2    ONLINE       0     0     0

Note that the log is mirrored over the 2 SSDs; the cache, I think, cannot be mirrored.
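
For reference, adding (and removing) such devices looks roughly like this. The device names are placeholders, not the disks from the output above; use your own /dev/disk/by-id/ paths. Keep in mind the log (SLOG) only absorbs synchronous writes and the cache (L2ARC) is a read cache, so neither behaves like unRAID's scheduled mover:
Code:
zpool add tank log mirror /dev/disk/by-id/ata-SSD_A-part1 /dev/disk/by-id/ata-SSD_B-part1
zpool add tank cache /dev/disk/by-id/ata-SSD_A-part2 /dev/disk/by-id/ata-SSD_B-part2

# Log and cache devices can be removed again without touching the data vdevs.
zpool remove tank /dev/disk/by-id/ata-SSD_A-part2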

I chose raidz2.

ZFS is not only built well; the documentation is superb, and CLI system administration is like working the controls of a starship.

We use Ceph for our cluster, and ZFS for NFS and off-site backup systems.
 
