Avoid IO Delay: RAIDz1 vs RAID5

xadox

Well-Known Member
I am currently running a RAIDz1 with 8 classic 3.5" HDDs and 128 GB of RAM on my Proxmox server.
I only run LXCs; these live on a separate NVMe drive. The ZFS datasets are mounted into the various LXCs via mount points.

Depending on the data activity on the ZFS datasets, I regularly see quite high IO delay values.

I am now asking myself whether I should redesign my setup by creating a real RAID5 with a hardware controller instead of RAIDz1 and simply creating a ZFS pool on top of it.
Would this give me better performance? Or can I avoid the IO delays this way?
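
For reference, a rough way to see where the wait is actually coming from while it happens; "tank" here is just a placeholder for the real pool name:

Code:
# Per-vdev latency statistics, refreshed every 5 seconds
zpool iostat -v -l tank 5
# System-wide iowait and per-disk utilisation (sysstat package)
iostat -x 5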
 
> creating a real RAID5 with a hardware controller instead of RAIDz1 and simply creating a ZFS pool on top of it

(Only) if I understand this sentence correctly: you would create a RAID5 and then add ZFS on top of that, effectively getting a pool with a single vdev.

Do NOT do that!

One of the remarkable features of ZFS is self-healing, along with per-device bit-rot detection. Both features are lost with this approach.

If I misinterpreted that sentence: better safe than sorry ;-)

My recommendation would be a pool with four vdevs, each being a mirror (similar to RAID 10). This gives you four times the IOPS. Additionally, I would try to add a (mirrored) "special device" on NVMe or SSDs...
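
A rough sketch of what that layout could look like; the pool name and device names are placeholders (in practice you would use /dev/disk/by-id paths):

Code:
# Pool of four mirror vdevs (similar to RAID 10)
zpool create tank \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd \
  mirror /dev/sde /dev/sdf \
  mirror /dev/sdg /dev/sdh

# Optionally add a mirrored "special" vdev on SSD/NVMe for metadata
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1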
 
> (Only) if I understand this sentence correctly: you would create a RAID5 and then add ZFS on top of that, effectively getting a pool with a single vdev.
That was the idea.

> One of the remarkable features of ZFS is self-healing, along with per-device bit-rot detection. Both features are lost with this approach.
That's why I am using RAIDz1 at the moment.

> My recommendation would be a pool with four vdevs, each being a mirror (similar to RAID 10). This gives you four times the IOPS. Additionally, I would try to add a (mirrored) "special device" on NVMe or SSDs...
But in that case I would lose four of the disks to mirroring. I would like to have the capacity of 7 disks.
 
> Depending on the data activity on the ZFS datasets, I regularly see quite high IO delay values.

You don't give ANY details on the disks themselves. What is the make/model/capacity? For all we know, you could be using SMR 5400 rpm drives, or lightweight shite like WD Blue spinners; those are desktop-class drives and tend to fail early.

ZFS configuration also makes a difference: are you using ashift=12? Do you have compression enabled (e.g. gzip-9 is going to absolutely kill your performance)? Is dedup on? What recordsize are you using per dataset?
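
If it helps, these settings can be read back with something like the following ("tank" stands in for the actual pool name):

Code:
# ashift as set on the pool (0 means it was auto-detected at creation)
zpool get ashift tank
# Actual per-vdev ashift from the pool config
zdb -C tank | grep ashift
# Per-dataset properties that matter for performance
zfs get -r compression,dedup,recordsize tank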

RAIDZ1 is mostly deprecated; with disks over ~2TB you should be using RAIDZ2. Z1 is basically only acceptable these days if you have fast SSDs - and spares handy. RAIDZx is also OK for bulk storage / media files, not so much for interactive response.

Did you consult any experts before setting this up, or just Leeroy Jenkins it yourself?
 
I may be reading something into it, but somehow your feedback seems slightly grumpy to me. But that shouldn't matter here.

These are Western Digital Red SATA III drives, so as you suspected, they are lightweight shit.
ashift is set to 12 for all pools. Dedup is also switched on, but it's actually no longer necessary and could therefore be deactivated.
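
If it comes to that, disabling dedup only affects newly written data; a rough sketch, with the pool name as a placeholder:

Code:
# Stops deduplication for new writes only; already-deduplicated blocks stay in the table
zfs set dedup=off tank
# Shows how much the dedup table is actually saving
zpool get dedupratio tank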

When the pools were set up about 7 years ago, I did not consult any “experts”. However, I did ask for recommendations in some forums, including here.
At that time the config was OK. But you can ask 5 people and get 12 opinions.
Apart from that, I'm simply asking "now" how best to set this up ;)

Think I found a good summary article:
https://nex7.blogspot.com/2013/03/readme1st.html
 
WD Red HDDs are slow hard disks at only 5400 rpm. I used them (and still do on a few servers), but their usefulness is quite limited.
They can be OK for backup storage (if it doesn't matter how long it takes) or a file server, but on a virtualization host they should only hold the preallocated disks of a few VMs (without a CoW filesystem inside them), with low disk usage and where performance is not important.
From my experience with those disks, compared for example to WD enterprise HDDs, you cannot get the same performance out of the Red ones even with a good hardware RAID controller and a lot of cache.
I don't have much experience with ZFS; with optimizations and a lot of cache it seems it could improve things a bit, but I have my doubts that it can achieve really good results.
 
> These are Western Digital Red SATA III drives, so as you suspected, they are lightweight shit. ashift is set to 12 for all pools. Dedup is also switched on, but it's actually no longer necessary

The article you listed is an excellent resource.

Yep, there's your problem. You might have gotten lucky, but WD started mixing SMR drives into the non-Pro Red line a few years ago. Huge uproar. I don't trust WD for anything but light desktop SSDs these days.

You're going to need to rebuild the pool; you can't just switch ZFS dedup off. I can recommend Toshiba N300 drives for speed, but they are not inexpensive, and you might also get good results with Seagate Exos (they are louder, though).
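
Before committing to a rebuild, it may be worth checking how much data is actually sitting in the dedup table; a one-line sketch, with the pool name as a placeholder:

Code:
# Dedup table (DDT) statistics for the pool
zpool status -D tank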

If you want the best possible speed from spinners, go with SAS drives - they are full-duplex where SATA is half-duplex. SAS shelves are fairly cheap on eBay these days. This is what I went with, populated with 14x 4TB used drives in a dRAID with 2-disk-failure redundancy (but again, 14 spindles is not great for interactive use; mine is tertiary backup "cold" storage that gets turned on once a month or less). Shop around for pricing, but this seller included everything - all hotswap bays and cables.

https://www.ebay.com/itm/194192688167

If you have lots of small files, add a mirrored ZFS "special" device (SSDs, and different makes/models so they don't fail around the same time) - this will help greatly with scrubs. ~23TB on my dRAID finishes in less than 6 hours.
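
A rough sketch of what that looks like; pool, dataset, and device names are placeholders:

Code:
# Mirrored special vdev on two SSDs (ideally different makes/models)
zpool add tank special mirror /dev/sdx /dev/sdy
# Optionally route small records (here <= 64K) to the special vdev as well as metadata
zfs set special_small_blocks=64K tank/data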

Suggested topology: start with a 6-disk raidz2, then later add another 6-disk raidz2. You will have 2 slots left over for hot spares or an SSD special mirror (they have special trays available).
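
As a sketch (placeholder pool and device names; use /dev/disk/by-id paths in practice), that growth path would look roughly like:

Code:
# Initial 6-disk raidz2 vdev
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
# Later, expand the pool with a second 6-disk raidz2 vdev
zpool add tank raidz2 /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl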
 
Many thanks for the tips and recommendations.

From a technical point of view this makes perfect sense and I understand it. For a private/home setup, however, the price is heading in the wrong direction for me.

I therefore conclude that a RAID5 controller makes no sense.

My original idea was to replace my existing disks with 3 x 20 TB disks. However, at that disk size a Z1 should not be used, and a Z2 should have at least 6 disks.
I don't want to spend the money on 6 x 20 TB disks. Alternatively, I could possibly use 8x Seagate Exos E 7E10 8TB, but the price per TB would be noticeably high there.

I'm currently asking myself to what extent the effort and cost add value.
Basically, my Proxmox system runs mostly smoothly; delays only occur from time to time at startup.
 
https://wintelguy.com/zfs-calc.pl

This is where you start saving up for the next build, and plan your first vdev around what you're actually using at the moment (along with some free space for snapshots, and at least 5-10% free space after that). If you go with recertified drives from Amazon/eBay it can save you a lot of money. Just be sure to burn-in test them (a full write of zeros with dd, followed by a SMART long test).
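
A rough sketch of that burn-in, with /dev/sdX as a placeholder for the drive under test (note the dd is destructive):

Code:
# DESTRUCTIVE: overwrites the entire drive with zeros
dd if=/dev/zero of=/dev/sdX bs=1M status=progress conv=fdatasync
# Extended SMART self-test, then check the results once it finishes
smartctl -t long /dev/sdX
smartctl -a /dev/sdX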

You can get 12TB CMR NAS-rated HDDs for as low as $84-$100, and you have plenty of time to look for sales. Even if you only buy one a month, in 6 months you'll have your first vdev. In the meantime you still have your existing kit :)
 
