[TUTORIAL] FabU: Can I use ZFS RaidZ for my VMs?

UdoB

Distinguished Member
Nov 1, 2016
Assumption: you use at least four identical devices for that. Mirrors, RaidZ, and RaidZ2 are all possible - theoretically.

Technically correct answer: yes, it works. But the right answer is: no, do not do that! The recommendation is very clear: use “striped mirrors”. This results in something similar to a classic Raid10.

(1) RaidZ1 (and Z2 too) gives you the IOPS of a single device, completely independent of the actual number of physical devices. With the “four devices, mirrored” approach this doubles, giving twice as many operations per second. For a large-file fileserver this may not matter much, but for multiple VMs running on it concurrently, IOPS as high as possible are crucial!
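A back-of-the-envelope sketch of that difference (the 200 IOPS per drive is an assumed placeholder figure, not a measurement, and the model deliberately ignores caching, the ZIL, etc.):

```python
# Rough IOPS model: a RaidZ vdev delivers roughly the IOPS of one member
# drive; a pool stripes across vdevs, so IOPS scale with the number of
# vdevs, and mirror reads can additionally fan out across mirror members.
DRIVE_IOPS = 200  # assumed per-drive random IOPS (placeholder value)

def write_iops(num_vdevs, drive_iops=DRIVE_IOPS):
    # each vdev contributes roughly one drive's worth of write IOPS
    return num_vdevs * drive_iops

def mirror_read_iops(num_vdevs, width, drive_iops=DRIVE_IOPS):
    # reads can be served by any member of each mirror vdev
    return num_vdevs * width * drive_iops

# Four drives as one RaidZ1 vdev vs. two striped two-way mirrors:
print(write_iops(1))           # RaidZ1: ~200 write IOPS
print(write_iops(2))           # striped mirrors: ~400 write IOPS
print(mirror_read_iops(2, 2))  # striped mirrors: ~800 read IOPS
```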

(2) It wastes space because of padding blocks. Dunuin has described that problem several times; an extreme example for RaidZ3: https://forum.proxmox.com/threads/zfs-vs-single-disk-configuration-recomendation.138161/post-616199 “A 8 disk raidz3 pool would require that you increase the block size from 8K (75% capacity loss) to 64K (43% capacity loss) or even 256K (38% capacity loss)“
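For the curious, the quoted numbers can be reproduced with a simplified model of RAIDZ allocation (ashift=12, i.e. 4K sectors; the function is my own sketch, not a ZFS tool):

```python
import math

def raidz_capacity_loss(ndisks, parity, volblocksize, ashift=12):
    """Fraction of an allocation lost to parity + padding, using the
    simplified RAIDZ model: ceil(data/(ndisks-parity)) stripes, parity
    sectors per stripe, padded up to a multiple of (parity + 1) sectors."""
    sector = 1 << ashift                    # 4 KiB sectors at ashift=12
    data = max(1, volblocksize // sector)   # data sectors per block
    stripes = math.ceil(data / (ndisks - parity))
    total = data + stripes * parity         # data + parity sectors
    total = math.ceil(total / (parity + 1)) * (parity + 1)  # padding
    return 1 - data / total

# 8-disk RaidZ3, matching the quoted figures:
for bs in (8, 64, 256):
    loss = raidz_capacity_loss(8, 3, bs * 1024)
    print(f"{bs}K volblocksize: {loss:.0%} capacity loss")
```

which reproduces the 75% / 43% / 38% figures from the quote.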


There seem to be some counter arguments against “only mirrors”:

(3) Resiliency: "I will use RaidZ2 with six drives to allow two to fail. Mirrors are less secure, right?"

Yes. In a single RaidZ2 vdev any two devices may fail without data loss. In a normal two-way mirror only one device may fail.

BUT: there are triple mirrors! They are discussed so rarely that I need to mention them here explicitly. Let us compare them with that RaidZ2 of six devices:

(3a) the RaidZ2 will give us the performance of a single drive and the usable capacity of four drives. Two drives may fail.

(3b) the two vdevs with triple mirrors give us the IOPS of two drives for writing data, plus six-fold read performance! Any two devices of each vdev may fail! (So up to four drives may die - but only in specific combinations.)

(4) Capacity: the only downside of (3b) is that the usable capacity shrinks down to two drives.
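The "specific combinations" in (3b) can simply be enumerated (a toy sketch; the disk numbering is arbitrary):

```python
from itertools import combinations

def pool_survives(failed, vdevs):
    # a mirrored pool survives as long as no vdev loses ALL of its members
    return all(any(d not in failed for d in vdev) for vdev in vdevs)

triple_mirrors = [(0, 1, 2), (3, 4, 5)]  # two 3-way mirror vdevs, six disks

for k in (2, 3, 4):
    combos = list(combinations(range(6), k))
    ok = sum(pool_survives(set(c), triple_mirrors) for c in combos)
    print(f"{k} failed drives: {ok}/{len(combos)} combinations survive")
```

The six-disk RaidZ2 also survives all 15 two-drive combinations, but none of the three- or four-drive ones.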


Recommendation: for VM storage use a mirrored vdev approach. For important data use RaidZ2 or RaidZ3.

In any case note that “Raid” of any flavor and/or having snapshots does not count as a backup. Never!


Beginners often confuse hardware RAID5/6 with BBU (which can cache sync writes) with ZFS RaidZ1/2 (with unfortunate block size alignment on consumer drives) just because both can deal with one/two missing drive(s). The performance behavior is indeed completely different (as well as the supported feature set) and RaidZ, as you already explained, is mostly unsuitable for VMs.
 
It couldn't hurt to add that a single vdev stripe of multiple disks, whether it's a misconfiguration or a misunderstanding of the striped mirror concept, is the worst choice of all. Even worse than using a single disk because it at least doubles the failure rate.
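That "doubles the failure rate" claim follows from basic probability (p below is an assumed per-drive annual failure probability, purely illustrative):

```python
def stripe_loss_prob(n, p):
    # a zero-redundancy stripe is lost if ANY of its n disks fails
    return 1 - (1 - p) ** n

p = 0.03  # assumed annual per-drive failure probability (illustrative)
print(stripe_loss_prob(1, p))  # single disk: 0.03
print(stripe_loss_prob(2, p))  # two-disk stripe: ~0.059, roughly double
print(stripe_loss_prob(4, p))  # four-disk stripe: ~0.115, roughly 4x
```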
 
It couldn't hurt to add that a single vdev stripe of multiple disks, whether it's a misconfiguration or a misunderstanding of the striped mirror concept, is the worst choice of all. Even worse than using a single disk because it at least doubles the failure rate.
Yes, absolutely correct. For the interested reader, let me show you two basic examples:

This is the bad approach; it has zero redundancy - if one device fails, the whole pool is gone:
Code:
# zpool create dummypool /rpool/dummy/disk-a.img /rpool/dummy/disk-b.img /rpool/dummy/disk-c.img /rpool/dummy/disk-d.img 

# zpool status dummypool
  pool: dummypool
 state: ONLINE
config:

        NAME                       STATE     READ WRITE CKSUM
        dummypool                  ONLINE       0     0     0
          /rpool/dummy/disk-a.img  ONLINE       0     0     0
          /rpool/dummy/disk-b.img  ONLINE       0     0     0
          /rpool/dummy/disk-c.img  ONLINE       0     0     0
          /rpool/dummy/disk-d.img  ONLINE       0     0     0

What we recommend instead is to use mirrors:
Code:
# zpool create dummypool  mirror /rpool/dummy/disk-a.img /rpool/dummy/disk-b.img  mirror /rpool/dummy/disk-c.img /rpool/dummy/disk-d.img 

# zpool status dummypool
  pool: dummypool
 state: ONLINE
config:

        NAME                         STATE     READ WRITE CKSUM
        dummypool                    ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            /rpool/dummy/disk-a.img  ONLINE       0     0     0
            /rpool/dummy/disk-b.img  ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            /rpool/dummy/disk-c.img  ONLINE       0     0     0
            /rpool/dummy/disk-d.img  ONLINE       0     0     0

:)
 
What if you only have three disks per node and three nodes, and still want some redundancy? Should you not use RaidZ and take the IOPS hit?

With mirroring it's either redundancy for the OS or for VMs. A single disk failure can take out the node, with no possibility of live migrating any VMs.
 
With mirroring it's either redundancy for the OS or for VMs.
No, you could create a raidz1 or mirror with three disks, install proxmox and VMs onto it. That's not the best idea (it really should be split) for IOPS/VM-performance, but "some redundancy".
General advice: get more disks. In the end, losing money is not as bad as losing data.
 
No, you could create a raidz1 or mirror with three disks, install proxmox and VMs onto it.
This was what I suggested. I was basically asking if there was any reason not to.


That's not the best idea (it really should be split) for IOPS/VM-performance, but "some redundancy".
Yes, I understand that there are performance consequences. My only retort is that it is better to have some performance and some redundancy than no redundancy and eventually no performance.

General advice: get more disks. In the end losing money is not that bad as losing data.
Some hardware only supports a limited number of disk drives.

In this particular case I’m less concerned with loss of data than with lack of availability. Local storage is synced and there are backups.
 
This was what I suggested. I was basically asking if there was any reason not to.
If you know what you're doing this is fine. Personally I like it split, so I can anytime do a reinstall on pool1, just reimport the untouched pool2 etc. and everything gets the IOPS it needs. pool1 (slow, cheap drives) for proxmox, pool2 for VMs (fast drives), pool3 for ...something.

My only retort is that it is better to have some performance and some redundancy than no redundancy and eventually no performance.
Correct...
In this particular case I’m less concerned with loss of data than with lack of availability
...then you should create a mirror out of the three disks. Nearly triple (sequential, not IOPS!) read speed, write speed of one disk and two can fail.
 
Personally I like it split, so I can anytime do a reinstall on pool1, just reimport the untouched pool2 etc. and everything gets the IOPS it needs.
I can see the benefits of this approach.

Non-redundant OS disk. Any disk failure takes down the node hard. HA failover kicks in. All is good in the world.

Redundant VM storage. Disk failure has no immediate effect. Live migrate VMs upon receiving the alert. All is good.

...then you should create a mirror out of the three disks. Nearly triple (sequential, not IOPS!) read speed, write speed of one disk and two can fail.
Not a bad idea. Too bad it cuts the available storage in half compared to RaidZ. But, as they say, there are no free lunches.
 
the post is a little bit too absolute when it says parity layouts give you only the IOPS of a single disk. That is roughly true for a single small RAIDZ vdev, but dRAID is not exactly the same. OpenZFS says dRAID performance is similar to RAIDZ, and gives a random-IOPS estimate of

floor((N-S)/(D+P)) * single-drive-IOPS,

N = total number of disks in the dRAID vdev
S = number of distributed spare disks
D = number of data disks per redundancy group
P = number of parity disks per redundancy group. 1 for dRAID1, 2 for dRAID2, 3 for dRAID3

so a dRAID vdev can expose more than one disk's worth of IOPS depending on its layout. Still, it remains a parity layout with RAIDZ-like behavior, not a mirror layout.
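The estimate is easy to put into code (the draid2:4d:2s-on-14-disks layout below is just an example I picked, not from the thread):

```python
def draid_random_iops(n_disks, spares, data_per_group, parity, drive_iops=1.0):
    # OpenZFS estimate: floor((N - S) / (D + P)) * single-drive IOPS
    return ((n_disks - spares) // (data_per_group + parity)) * drive_iops

# e.g. draid2:4d:2s across 14 disks -> two redundancy groups' worth of IOPS:
print(draid_random_iops(14, 2, 4, 2))  # -> 2.0 (twice one drive's IOPS)
```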
- If your workload is VMs, databases, containers, or lots of random sync writes, prefer striped mirrors in most cases. That is still the safe default.
- If your workload is large-block sequential storage and your array is very large, dRAID can make sense because the fast distributed-spare rebuild model is the whole point.
- dRAID also has fixed stripe width and padding, which can hurt space efficiency for small blocks. OpenZFS and TrueNAS both warn about this, and OpenZFS notes the default minimum allocation can become quite large.

As most of the time, there is no "one size fits all" - it depends on your data and use case.
For small private setups dRAID1 can be OK if you can take the performance penalty - otherwise striped mirrors are the safe option, as the OP correctly mentioned. Again, it depends on the use case. If your containers sit idle most of the time, it does not really matter. If you run a database server with high random IOPS, that's another story.
 
the post is a little bit too absolute when it says parity layouts give you only the IOPS of a single disk. That is roughly true for a single small RAIDZ vdev,
It (obviously) aimed at small systems with only a few disks ;-)

so a dRAID vdev can expose more than one-disk worth of IOPS depending on its layout.
Sure. And so does a RaidZ1 based pool with 10*4 drives = 10 vdevs ;-)

As most of the time, there is no "one size fits all" - it depends on your data and use case.
Correct! And it is great to have new possibilities.

For small private setups dRAID1 can be OK if you can take the performance penalty
Maybe. Probably there are some data hoarders with many drives. Note that the reference page https://openzfs.github.io/openzfs-docs/Basic Concepts/dRAID Howto.html#introduction talks about 10 to 90 drives :-)

In any case I would recommend running a real-world test before going into production - including provoking device failures, "nearly full" situations, scrubs, backup/restore(!), and so on.

:)