[TUTORIAL] FabU: Can I use ZFS RaidZ for my VMs?

UdoB

Distinguished Member
Nov 1, 2016
Germany
Assumption: you use at least four identical devices for this. Mirrors, RaidZ1, and RaidZ2 are all possible - theoretically.

Technically correct answer: yes, it works. But the right answer is: no, do not do that! The recommendation is very clear: use “striped mirrors”. This results in something similar to a classic Raid10.

(1) RaidZ1 (and Z2 too) gives you the IOPS of a single device, completely independent of the actual number of physical devices. With the “four devices, striped mirrors” approach the IOPS double, because writes are striped over two mirror vdevs. For a large-file fileserver this may not be so important, but for multiple VMs running on it concurrently, the highest possible IOPS are crucial!
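To make the write-IOPS argument concrete, here is a back-of-the-envelope sketch. The 200 IOPS per disk is an assumed, purely illustrative number for a spinning disk; the point is only that steady-state write IOPS scale with the number of vdevs, not with the number of disks:

```shell
# Illustrative only: 200 write IOPS per disk is an assumed number.
disk_iops=200
ndisks=4

raidz_vdevs=1                    # one RaidZ vdev, no matter how many disks
mirror_vdevs=$(( ndisks / 2 ))   # striped two-way mirrors: one vdev per pair

raidz_total=$(( disk_iops * raidz_vdevs ))
mirror_total=$(( disk_iops * mirror_vdevs ))

echo "RaidZ           write IOPS: $raidz_total"
echo "striped mirrors write IOPS: $mirror_total"
```

With four disks this prints 200 vs. 400 - the “double” from above.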

(2) RaidZ wastes space because of padding blocks: Dunuin has described that problem several times; an extreme example for RaidZ3: https://forum.proxmox.com/threads/zfs-vs-single-disk-configuration-recomendation.138161/post-616199 “A 8 disk raidz3 pool would require that you increase the block size from 8K (75% capacity loss) to 64K (43% capacity loss) or even 256K (38% capacity loss)“
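Dunuin’s percentages can be reproduced with a bit of arithmetic. The following sketch assumes ashift=12 (i.e. 4K sectors) and models the RaidZ allocation rule: data sectors plus parity sectors, rounded up to a multiple of parity+1:

```shell
#!/bin/sh
# Sketch of the RaidZ allocation math (assumption: ashift=12, 4K sectors),
# for Dunuin's example of an 8-disk RaidZ3.
ndisks=8
parity=3
sector=4096

overhead() {  # $1 = volblocksize in bytes; prints % of raw capacity lost
    data=$(( ($1 + sector - 1) / sector ))        # data sectors per block
    stripe=$(( ndisks - parity ))                 # data sectors per full row
    rows=$(( (data + stripe - 1) / stripe ))
    total=$(( data + parity * rows ))             # data + parity sectors
    # RaidZ pads each allocation up to a multiple of (parity + 1) sectors
    total=$(( ((total + parity) / (parity + 1)) * (parity + 1) ))
    awk -v d="$data" -v t="$total" 'BEGIN { printf "%.0f\n", 100 * (t - d) / t }'
}

for bs in 8192 65536 262144; do
    printf 'volblocksize %6s -> %s%% of raw capacity lost\n' "$bs" "$(overhead $bs)"
done
```

This prints 75%, 43% and 38% for 8K, 64K and 256K - exactly the numbers quoted above.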


There are some common counterarguments against “only mirrors”:

(3) Resiliency: "I will use RaidZ2 with six drives to allow two to fail. Mirrors are less secure, right?"

Yes. In a single RaidZ2 vdev any two devices may fail without data loss. In a normal (two-way) mirror only one device may fail.

BUT: there are triple mirrors! They are discussed so rarely that I need to mention them here explicitly. Let us compare them with that RaidZ2 of six devices:

(3a) the RaidZ2 will give us the performance of a single drive and the usable capacity of four drives. Two drives may fail.

(3b) the two vdevs with triple mirrors give us the write IOPS of two drives + six-fold read performance! Any two devices of each vdev may fail! (So up to four drives may die - but only in a specific combination.)

(4) Capacity: the only downside of (3b) is that the usable capacity shrinks to that of two drives.
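For the interested reader, such a pool of two triple-mirror vdevs would be created like this (the disk names are placeholders, not real devices):

```shell
# zpool create vmpool  mirror disk-1 disk-2 disk-3  mirror disk-4 disk-5 disk-6
#
# An existing two-way mirror can also be upgraded later by attaching a third
# device to it:
# zpool attach vmpool disk-1 disk-3
```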


Recommendation: for VM storage use a mirrored vdev approach. For important data use RaidZ2 or RaidZ3.

In any case note that “Raid” of any flavor and/or having snapshots does not count as a backup. Never!


See also:
 
Beginners often confuse hardware RAID5/6 with BBU (which can cache sync writes) with ZFS RaidZ1/2 (with unfortunate block size alignment on consumer drives) just because both can deal with one/two missing drive(s). The performance behavior is indeed completely different (as well as the supported feature set) and RaidZ, as you already explained, is mostly unsuitable for VMs.
 
It couldn't hurt to add that a single-vdev stripe of multiple disks, whether from a misconfiguration or a misunderstanding of the striped-mirror concept, is the worst choice of all. It is even worse than using a single disk, because striping at least doubles the failure rate.
 
It couldn't hurt to add that a single-vdev stripe of multiple disks, whether from a misconfiguration or a misunderstanding of the striped-mirror concept, is the worst choice of all. It is even worse than using a single disk, because striping at least doubles the failure rate.
Yes, absolutely correct. For the interested reader, let me show you two basic examples:

This is the bad approach; it has zero redundancy, and if one device fails, the whole pool is gone:
Code:
# zpool create dummypool /rpool/dummy/disk-a.img /rpool/dummy/disk-b.img /rpool/dummy/disk-c.img /rpool/dummy/disk-d.img 

# zpool status dummypool
  pool: dummypool
 state: ONLINE
config:

        NAME                       STATE     READ WRITE CKSUM
        dummypool                  ONLINE       0     0     0
          /rpool/dummy/disk-a.img  ONLINE       0     0     0
          /rpool/dummy/disk-b.img  ONLINE       0     0     0
          /rpool/dummy/disk-c.img  ONLINE       0     0     0
          /rpool/dummy/disk-d.img  ONLINE       0     0     0

What we recommend instead is to use striped mirrors:
Code:
# zpool create dummypool  mirror /rpool/dummy/disk-a.img /rpool/dummy/disk-b.img  mirror /rpool/dummy/disk-c.img /rpool/dummy/disk-d.img 

# zpool status dummypool
  pool: dummypool
 state: ONLINE
config:

        NAME                         STATE     READ WRITE CKSUM
        dummypool                    ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            /rpool/dummy/disk-a.img  ONLINE       0     0     0
            /rpool/dummy/disk-b.img  ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            /rpool/dummy/disk-c.img  ONLINE       0     0     0
            /rpool/dummy/disk-d.img  ONLINE       0     0     0

:)
 
What if you only have three disks per node and three nodes, and still want some redundancy? Should you not use RaidZ and take the IOPS hit?

With mirroring it's either redundancy for the OS or for VMs. A single disk failure can take out the node, with no possibility of live migrating any VMs.
 
With mirroring it's either redundancy for the OS or for VMs.
No, you could create a raidz1 or mirror with three disks, install proxmox and VMs onto it. That's not the best idea for IOPS/VM performance (it really should be split), but it gives you "some redundancy".
General advice: get more disks. In the end, losing money is not as bad as losing data.
 
No, you could create a raidz1 or mirror with three disks, install proxmox and VMs onto it.
This was what I suggested. I was basically asking if there was any reason not to.


That's not the best idea for IOPS/VM performance (it really should be split), but it gives you "some redundancy".
Yes, I understand that there are performance consequences. My only retort is that it is better to have some performance and some redundancy than no redundancy and eventually no performance.

General advice: get more disks. In the end, losing money is not as bad as losing data.
Some hardware only supports a limited number of disk drives.

In this particular case I’m less concerned with loss of data than with lack of availability. Local storage is synced and there are backups.
 
This was what I suggested. I was basically asking if there was any reason not to.
If you know what you're doing this is fine. Personally I like it split, so I can do a reinstall on pool1 at any time, just reimport the untouched pool2 etc., and everything gets the IOPS it needs: pool1 (slow, cheap drives) for Proxmox, pool2 for VMs (fast drives), pool3 for ...something.

My only retort is that it is better to have some performance and some redundancy than no redundancy and eventually no performance.
Correct...
In this particular case I’m less concerned with loss of data than with lack of availability
...then you should create a (three-way) mirror out of the three disks. Nearly triple the (sequential, not IOPS!) read speed, the write speed of one disk, and any two drives can fail.
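For illustration, such a three-disk mirror is a single vdev with three children (the device names are placeholders; the Proxmox installer can also set up a ZFS mirror during installation):

```shell
# zpool create tank  mirror disk-a disk-b disk-c
#
# zpool status tank   # would show one "mirror-0" vdev with three child devices
```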
 
Personally I like it split, so I can do a reinstall on pool1 at any time, just reimport the untouched pool2 etc., and everything gets the IOPS it needs.
I can see the benefits of this approach.

Non-redundant OS disk. A failure of that disk takes down the node hard. HA failover kicks in. All is good in the world.

Redundant VM storage. Disk failure has no immediate effect. Live migrate VMs upon receiving the alert. All is good.

...then you should create a (three-way) mirror out of the three disks. Nearly triple the (sequential, not IOPS!) read speed, the write speed of one disk, and any two drives can fail.
Not a bad idea. Too bad the available storage is halved compared to RaidZ. But, as they say, no free lunches.