Yet another "ZFS on HW-RAID" Thread (with benchmarks)

> So personally I see two possibilities:
  • ZFS is so unreliable that it cannot work properly on hardware solutions (on which every other file system works fine)
  • The recommendation is simply derived from an exaggerated interpretation of what is actually a very interesting feature, but not so critical in its absence
Personally, you're wrong, and you haven't done your research. The advice not to combine hardware RAID with ZFS is well documented, and has been for well over a decade. You're running an unsupported configuration, and will not get any support when it falls over - because you deliberately avoided Best Practices. Honestly I don't even know why I bother anymore trying to educate the willfully blind and reckless, but here you go.

https://www.reddit.com/r/truenas/comments/zz3sfg/hardware_raid_zfs_shoot_me_i_deserve_it/

https://www.truenas.com/community/r...bas-and-why-cant-i-use-a-raid-controller.139/

https://www.reddit.com/r/zfs/comments/135ku0i/zfs_on_top_of_multiple_hwraid0s/

https://openzfs.github.io/openzfs-d...uning/Hardware.html#hardware-raid-controllers

> On another three servers with ZFS on a proper HW-RAID controller, I've already had 2 disk failures without any issue. Replaced, rebuilt, not a single error.


Every few months, some subgenius drops in and thinks they know everything. No, you've actually built a house of cards on quicksand. The wind just hasn't come by yet. You're going to lose your data when it falls over and sinks - and you won't be able to migrate your disks to anything but the same make and model of RAID card. Please do not encourage others to try replicating your foolishness.
 
> Honestly I don't even know why I bother anymore trying to educate the willfully blind and reckless, but here you go.
>
> Every few months, some subgenius drops in and thinks they know everything...
>
> Please do not encourage others to try replicating your foolishness.


Congratulations, great behavior for a technical discussion forum! :)
You simply didn't even read what I wrote in the entire thread, not even the very clear bolded parts.

If you are not able to engage in a technical dialogue and discuss the topics without being offensive, your contribution is totally useless.
 
So, to summarize, the scenarios:
1. ZFS on raw disks
2. ZFS on HW RAID
3. ZFS + SLOG on raw disks
4. ZFS on OpenCAS (raw disks)

It would be interesting to see benchmarks for scenarios 3 and 4.
The idea of OpenCAS is similar to a SLOG, but more robust: it is middleware between the raw disks and ZFS, basically a "smart software cache" solution.

The difference between 3 and 4:
[ disk ] ----> [ zfs+slog ] -----> [ filesystem ] -----> [ VM ]
[ disk ] ----> [ opencas ] ----> [ zfs ] -----> [ filesystem ] -----> [ VM ]

In the OpenCAS scenario, the separate "cache" device can itself be a RAID (software or hardware) device, for example 2x NVMe in RAID-1, while acting as a proxy between ZFS and the raw disks.
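(A minimal sketch of that layering, assuming Open CAS Linux's casadm tool; device names are placeholders and exact options may differ by version:)

# Start an Open CAS cache instance (id 1) on the NVMe cache device
casadm -S -i 1 -d /dev/nvme0n1
# Add the raw data disks as "core" devices behind that cache
casadm -A -i 1 -d /dev/sda
casadm -A -i 1 -d /dev/sdb
# Build the ZFS pool on the exported cas devices instead of the raw disks
zpool create tank mirror /dev/cas1-1 /dev/cas1-2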
 
> To be clear, just last week I installed two servers with ZFS, flashing the PERC H330 controllers in IT mode (HBA330), because of the absence of battery backed-up cache.

No need "IT mode", just use "WriteThrough" mode if battery backed-up cache absent (not writeback).
 
> and you won't be able to migrate your disks to anything but the same make and model of RAID card
Bullshit. For example, newer LSI RAID cards are backward-compatible with older LSI RAID cards. I myself replaced a dead RAID card some years ago; after reboot it prompted to "import foreign configuration" (which is stored on the HW-RAID10 virtual drive) and everything was fine.
 
This thread started as a performance comparison between native ZFS RAID and ZFS on top of HW RAID, and half of it is about whether it is a good idea to run ZFS on HW RAID at all.

I can tell you this: it will work the way you set it up.

If you want the HW RAID to monitor the HDDs, I can suggest setting ZFS checksum=off; otherwise ZFS will return an I/O error to the application on a checksum mismatch, because ZFS has no way to fix it.
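(For reference, this is a per-dataset property; "tank/vmdata" is a placeholder name, and note that turning it off also removes ZFS's ability to even detect corruption on that dataset:)

# Disable checksumming on a specific dataset (placeholder name)
zfs set checksum=off tank/vmdata
# Verify the property
zfs get checksum tank/vmdata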


One of my personal setups, straight from the "horror story" book.

In one (external) VM I have limited HDD space. I tried other things, but I ended up creating a ZFS pool (inside the VM) from files, and those files are accessed through Samba. The Samba server is not local either :) . Mounted with the 'hard' option.

Files on the Samba server -> ZFS pool imported in the VM

The system has been working this way for about 3 years. I'm planning to replace it, but along this journey I've had to restore from backup a couple of times due to checksum problems. I take a snapshot every 10 minutes to minimize the gap.

As you can see, things can work in different ways.
 
> If you want the HW RAID to monitor the HDDs, I can suggest setting ZFS checksum=off; otherwise ZFS will return an I/O error to the application on a checksum mismatch, because ZFS has no way to fix it.
What a strange recommendation!

I want the data delivered to me to be intact. If it got damaged by any means, I at least want to know that fact - not have it hidden!

Of course, your mileage may vary. Perhaps there are circumstances where this may help with reading "half destroyed" data. But for normal daily usage this really sounds like bad advice.

Just my 2 €¢...
 
> What a strange recommendation!
>
> I want the data delivered to me to be intact. If it got damaged by any means, I at least want to know that fact - not have it hidden!
>
> Of course, your mileage may vary. Perhaps there are circumstances where this may help with reading "half destroyed" data. But for normal daily usage this really sounds like bad advice.
>
> Just my 2 €¢...

If you want to know, then yes, keep it on. In the same situation, other filesystems will continue to operate.

But keep in mind your VM will be broken, or your movie will only be half watchable.
 
Hi,

I would like to talk about a topic already discussed hundreds of times here on the forum, but in slightly more scientific and practical terms:
ZFS on top of Hardware RAID

Introduction
I already know all the warnings regarding this configuration, but since in most cases the references mentioned are experiments on small home labs, issues on cheap hardware and so on, I would like to cover this topic in the "enterprise servers" context.

I would start from an assumption: enterprise-class RAID controllers, as well as enterprise storage solutions with LUNs, have been around for decades and keep the majority of IT systems running, with any kind of FS on top, without evidence of continuous catastrophes or problems.

So in this thread, on the "hardware RAID" side, I'm talking about professional RAID cards with battery backed-up caching, redundant arrays with T10 data protection and so on... not RAID on consumer motherboard chipsets or similar cheap solutions.

Using ZFS directly on raw disks has its advantages, of course (but also disadvantages, such as the impossibility of expanding by single-disk additions).

If performance were equivalent, I think no one would have any doubts about the choice: letting ZFS manage the disks directly is certainly better!
But the reality is that ZFS disk management vs. HW RAID seems to have a huge impact on performance, especially under database-type load profiles.

Testing configuration
Let me start by saying that I have been using ZFS since Solaris 10 and, based on what I have always read, I have always thought that it was undoubtedly better to let ZFS manage the disks directly.
However, after some rather shocking tests (very low performance) on a fully ZFS-managed pool, I got very intrigued by this topic, investigated further and went down the rabbit hole, doing many kinds of tests on many configurations.

I've done many tests on two identical machines. I would like to show you the most relevant ones.

Common hardware configuration
  • Dell PowerEdge R640 with 8x 2.5" SATA/SAS backplane
  • 2x Intel(R) Xeon(R) Gold 5120
  • 128GB of RAM
  • Dell PERC H730 Mini with 1GB of cache and Battery PLP
  • 2x Crucial BX500 as Boot drives
  • 4x Kingston DC600M (3840GB) as Data drives
First server setup
  • PERC in HBA mode, bypassing all the drives to the OS
  • One ZFS 2 drive mirror for boot/OS
  • One ZFS 4-drive striped mirror (RAID 10 equivalent) for data
Second server setup
  • PERC in RAID mode, Caching Writeback and No read-ahead
  • One virtual disk with 2 drives in RAID 1 for boot/OS
  • One virtual disk with 4 drives in RAID 5 for data

Testing environment setup
I've installed PVE on both nodes, configured everything (network, etc.), and created the ZFS data pools on both nodes with the same settings (compression=lz4, ashift=12).

Then, in order to test a typical workload from our existing infrastructure, I've created a Windows 2022 virtual machine on both nodes with this config:
  • 32 vCPUs (2 sockets, 16 core, NUMA enabled)
  • 20GB of RAM (ballooning enabled, but not triggered since the host never reaches the threshold)
  • One 80GB vDisk for OS
  • One 35GB vDisk for DB Data/Log files (Formatted with 64K allocation size, according to SQL Server guidelines)
  • VirtIO SCSI single for disks
  • Caching set to "Default (No cache)", Discard enabled
  • VirtIO full package drivers installed on guest
On both hosts, I've limited the ZFS ARC to 20 GB (using the zfs_arc_max parameter).
No other ZFS options or optimizations are used on either setup.
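(For reference, a minimal sketch of the kind of commands involved; the pool name "data" and the device names are placeholders, not taken from the actual setup:)

# First server: striped mirrors (RAID 10 equivalent) with ashift=12 and LZ4 compression
zpool create -o ashift=12 -O compression=lz4 data mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
# Second server: single-vdev pool on top of the hardware RAID 5 virtual disk
zpool create -o ashift=12 -O compression=lz4 data /dev/sde
# Limit the ARC to 20 GB (20 * 1024^3 bytes) and rebuild the initramfs so it applies at boot
echo "options zfs zfs_arc_max=21474836480" > /etc/modprobe.d/zfs.conf
update-initramfs -u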

Testing methodology
I've used CrystalDiskMark for a rapid test with these parameters:
  • Duration: 20 sec
  • Interval: 10 sec
  • Number of tests: 1
  • File size: 1 GiB

SQL Server 2022 Developer Edition and HammerDB were used for DB workload benchmarking,
with this configuration/methodology:
  • Created an empty DB with 8x datafiles (20 GB total) and 1x logfile (5 GB)
  • Populated the DB with HammerDB, using 160 Warehouses
  • Backed-up the DB (useful to do multiple test starting with the same condition)
  • All tests were done after a PVE host restart, with a 2-minute wait after VM startup; no other VMs/applications/backups were running on the nodes during the tests
HammerDB testing parameters:
  • No encryption (direct connection on local DB)
  • Windows authentication
  • Use all warehouses: ON
  • Checkpoint when complete: ON
  • Virtual users: 100
  • User delay: 20ms

RESULTS

CrystalDiskMark rapid test


First server (ZFS managed disks)
View attachment 60518

Second server (ZFS on HW Raid)
View attachment 60519

No major difference here, except the sequential write.


HammerDB - Orders per minute
First server (ZFS managed disks): 28700
Second server (ZFS on HW Raid): 117000

(4 times faster on HW Raid)


HammerDB - Transaction count graphs

First server (ZFS managed disks)

View attachment 60520

Second server (ZFS on HW Raid)

View attachment 60521


Conclusions
It seems that, on the performance side and with a database-type workload, having a RAID card with (battery-protected) caching gives a huge advantage.
ZFS on HW RAID measured 4 times the performance of ZFS on raw disks on the DB workload!
Also remember that we are comparing a hardware RAID 5 against a RAID 10 on ZFS, a layout that clearly penalizes the hardware RAID under a database-type workload... and despite this, it delivered enormously superior performance.

Considering what was said in the introduction, and in light of these results, I sincerely think that the use of hardware RAID (again, on enterprise-grade platforms with battery-backed write caching) can bring great advantages, while keeping the beautiful features that ZFS has (snapshots, ARC, compression, etc.).

I also think that going through controller-managed drives can give advantages in terms of write amplification on SSDs (not tested yet, just speculation).

Still considering what was said in the introduction about enterprise-grade hardware, are the disadvantages actually serious enough to give up this performance boost of ZFS on HW RAID over raw disks?

I hope to get some opinions from you too.

If you have opinions against this setup (I imagine they will mostly concern data resilience), I would kindly ask you to bring sources and real-world examples from properly configured enterprise-hardware systems.

(Please also note that, as already said, I was totally in favour of ZFS on raw disks for years... until these tests, so this is absolutely not a provocative post.)

Bye,
Edoardo
Hi,

Sitting in the same boat here ;-)

Did you also examine Dell's FastPath feature in your tests?
https://www.dell.com/support/manual...7396f8-b45d-4c95-86fd-ff00a09ede61&lang=en-us
 
Hi,

Sitting in the same boat here ;-)

Did you also examined ya tests with dells fastpath feature ?
https://www.dell.com/support/manual...7396f8-b45d-4c95-86fd-ff00a09ede61&lang=en-us
Hi!

Yes, we always use No Read Ahead on the controller, only Write-Through caching with battery backed-up controllers. :-)
I’d like to take this opportunity to point out that we’ve had 3 servers in production with this configuration for the past 16 months, and not a single issue has emerged.
This applies both to the hosts (zfs status, etc.) and to all the running VMs (around 80 VMs from 40 different clients).

If there really were a risk, as many claim, we would certainly have encountered issues over these months.
Therefore, I can confirm that the configuration is absolutely stable.

On the physical server dedicated to backups with PBS, since we don't have an enterprise-grade controller available (as we do on the main servers), we instead chose a RAIDZ2 setup with disks fully managed by ZFS.
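(For reference, a RAIDZ2 pool on fully ZFS-managed disks is created along these lines; the pool name and device names are placeholders:)

# Six-disk RAIDZ2 vdev: any two disks can fail without data loss
zpool create -o ashift=12 -O compression=lz4 backup raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg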

Bye!
 
Thank you for this. I am moving to Proxmox from VMware. I have always used RAID 6 on my H720P Mini in my R730xd and the H710P Mini in my R720. For people who complain about bit rot: that is why there is a thing called patrol read. I am thinking of running my 8-disk RAID 6 as a single-disk ZFS pool; it sounds like this may be fine. The plan is No Read Ahead and write-through RAID 6. I saw mention of disabling checksums on ZFS; that would be interesting to look into. I have always been using enterprise HW.
 
> Yes, you're (often) able to detect it, but you are not always able to heal it. I've had RAID punch holes in all RAID controller varieties, yet never with ZFS.
It has autocorrect features. I don't know; I have been running RAID in enterprise servers since 2005 and have never had bit rot. I just have not been using XFS. Are we saying that XFS on top of a single RAID VD will corrupt the data? If so, then it is an XFS limitation or issue. A filesystem should never corrupt data. I have had disks fail plenty of times, but they rebuild fine; I have never had bit rot. I am using enterprise controllers; it would be different if I were using an eBay RAID controller.
 
> If there really were a risk, as many claim, we would certainly have encountered issues over these months.
The term "risk" doesn't mean what you think it means. There's a RISK of a meteor falling on you; the fact that it didn't happen doesn't lower that risk.

> Therefore, I can confirm that the configuration is absolutely stable.
Same goes for the term "absolutely". See the Dunning-Kruger effect.

The difference between a RAID controller and filesystem-integrated RAID (e.g., ZFS) is that ZFS is aware of faults at the FILESYSTEM level; RAID is not. RAID can (and does) lead to silent corruption that you, as the operator, would not be aware of until you access the impacted resource(s). Since typical use patterns follow the 80/20 rule (80% of requests are for only 20% of the data), this may take years, or never happen at all.

The RISK of this happening is relatively low (think 1 in 10^14 I/Os), which means you may never see it happen; but it can and does.
 
> The term "risk" doesn't mean what you think it means. There's a RISK of a meteor falling on you; the fact that it didn't happen doesn't lower that risk.
>
> Same goes for the term "absolutely". See the Dunning-Kruger effect.
>
> The difference between a RAID controller and filesystem-integrated RAID (e.g., ZFS) is that ZFS is aware of faults at the FILESYSTEM level; RAID is not. RAID can (and does) lead to silent corruption that you, as the operator, would not be aware of until you access the impacted resource(s). Since typical use patterns follow the 80/20 rule (80% of requests are for only 20% of the data), this may take years, or never happen at all.
>
> The RISK of this happening is relatively low (think 1 in 10^14 I/Os), which means you may never see it happen; but it can and does.
Depends on the RAID controller. We are talking about enterprise battery backed-up RAID controllers that do patrol read cycles to check every block on the devices for issues. When it sees an issue, it fixes it. We are not talking about a controller off Amazon in a home-built system. Sounds like ZFS scrubs, doesn't it? It does the exact same thing. I would think having HW RAID doing these checks and XFS checking the filesystem would be a good combination, not a bad one. I am talking about using a single VD device in ZFS, not using its SW RAID functionality. The plan is to leverage compression, dedupe, and caching, and keep HW RAID on the HW controller. I talked to a Linux subject-matter expert at work who has been doing ZFS for 20 years, and he asked: how would putting a ZFS filesystem on a VD be an issue?
 
I would not say it is dangerous per se, I only think it adds an unnecessary point of failure, simply because it is added hardware. Sure, the same is true for an HBA card, but not to the same extent, since it is less complex.

But my personal money quote from the linked blog post on how HW RAID is not dangerous with ZFS is this:

> While some hardware RAID cards may have a "pass-through" or "JBOD" mode that simply presents each disk to ZFS, the combination of the potential masking of S.M.A.R.T. information, high controller cost, and anecdotal evidence that any RAID mode is about 5% slower than non-RAID "target" mode results in zero reasons for using a hardware RAID card with ZFS.

Why spend money on BBU RAID, when for less money you can get a PLP SLOG?
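(For illustration, adding a power-loss-protected NVMe as a SLOG is a one-liner; the pool and device names below are placeholders:)

# Add a single PLP NVMe as a separate log device (SLOG) to an existing pool
zpool add tank log /dev/nvme0n1
# Or mirror the SLOG across two devices for redundancy
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1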
 
> Depends on the RAID controller. We are talking about enterprise battery backed-up RAID controllers that do patrol read cycles to check every block on the devices for issues. When it sees an issue, it fixes it. We are not talking about a controller off Amazon in a home-built system. Sounds like ZFS scrubs, doesn't it? It does the exact same thing.
I forgot about this thread and that your question (and almost everything else I just said) was covered earlier in the thread :)

> I would not say it is dangerous per se, I only think it adds an unnecessary point of failure.
THIS.
 
> I would not say it is dangerous per se, I only think it adds an unnecessary point of failure, simply because it is added hardware. Sure, the same is true for an HBA card, but not to the same extent, since it is less complex.
>
> But my personal money quote from the linked blog post on how HW RAID is not dangerous with ZFS is this:
>
> While some hardware RAID cards may have a "pass-through" or "JBOD" mode that simply presents each disk to ZFS, the combination of the potential masking of S.M.A.R.T. information, high controller cost, and anecdotal evidence that any RAID mode is about 5% slower than non-RAID "target" mode results in zero reasons for using a hardware RAID card with ZFS.
>
> Why spend money on BBU RAID, when for less money you can get a PLP SLOG?
Well, the key is that I already have production H710P and H720P cards, and I am hesitant to flash them to IT mode. I plan to use a single VD with a single ZFS vdev. Setting "No Read Ahead" disables the controller's ability to anticipate future reads and cache them. Write-Through means the data is written directly to the storage device and acknowledged to the host system only after the write is completed, without using the controller's cache for buffering writes. My plan is to use an enterprise Intel Optane Data Center P3700 PCIe NVMe for the SLOG when I'm done with my testing in about a month. I want to leverage ARC caching. On my ESXi server I had this Intel P3700 with a SLOG VMDK on a nested TrueNAS VM for testing. I gave it about 128 GB of memory for the ARC cache, and it worked very well over my 10-gig network to my Windows PC (which also has 10-gig networking), with a 10-gig DAC direct connect between my servers.

[Attachment: intel.jpg]

You can see that the RAID controller does consistency checks and patrol reads to look for block corruption and mismatches, kind of how scrubs work on ZFS.

[Attachment: screenshot from the Dell EMC PowerEdge RAID Controller 10 User's Guide (PERC H345, H740P, H745, ...)]

I would think this would actually not be a problem. Granted, the community may not have enterprise hardware like the original poster had. Another concern: if I have patrol read running, would it be best to turn off scrubbing in ZFS, since I am not using the built-in RAID and I already have that functionality in my PERC, as noted above? I have had probably 100 power outages over the years and have never had any controller issues, even with hard shutdowns. Granted, this was on VMware datastores, and ext3 and XFS on Linux. The Intel card is solid state as well, so it would not lose data when used as a SLOG. Also, I read that even if you lose a SLOG it does not damage the pool. I am of course setting up my UPS with NUT to do a graceful shutdown of Proxmox during an outage; it seems to have tools that will work better with my UPS than my scripted approach in ESXi.

An example of patrol read running on my PERC controller, from the TTY log; it cycles through every block, like ZFS does with its checks.


06/28/25 3:00:00: prDiskStart: starting Patrol Read on PD=00
06/28/25 3:00:00: prDiskStart: starting Patrol Read on PD=01
06/28/25 3:00:00: prDiskStart: starting Patrol Read on PD=02
06/28/25 3:00:00: prDiskStart: starting Patrol Read on PD=03
06/28/25 3:00:00: prDiskStart: starting Patrol Read on PD=04
06/28/25 3:00:00: prDiskStart: starting Patrol Read on PD=05
06/28/25 3:00:00: prDiskStart: starting Patrol Read on PD=06
06/28/25 3:00:00: prDiskStart: starting Patrol Read on PD=07
06/28/25 3:00:00: EVT#18129-06/28/25 3:00:00: 39=Patrol Read started
06/28/25 3:57:51: prCallback: PR completed for pd=05
06/28/25 4:10:50: prCallback: PR completed for pd=01
06/28/25 4:11:59: prCallback: PR completed for pd=06
06/28/25 4:13:02: prCallback: PR completed for pd=07
06/28/25 4:20:19: prCallback: PR completed for pd=03
06/28/25 4:22:08: prCallback: PR completed for pd=04
06/28/25 4:32:36: prCallback: PR completed for pd=02
06/28/25 6:40:02: prCallback: PR completed for pd=00
06/28/25 6:40:02: PR cycle complete
06/28/25 6:40:02: EVT#18130-06/28/25 6:40:02: 35=Patrol Read complete
06/28/25 6:40:02: Next PR scheduled to start at 07/05/25 3:00:00
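(For comparison, the ZFS-side equivalent, with "tank" as a placeholder pool name:)

# Start a scrub, which reads every allocated block and verifies it against its checksum
zpool scrub tank
# Check progress and any repaired or unrecoverable errors
zpool status -v tank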
 