Yet another "ZFS on HW-RAID" Thread (with benchmarks)

EdoFede

Member
Nov 10, 2023
Hi,

I would like to talk about a topic already discussed hundreds of times here on the forum, but in slightly more scientific and practical terms:
ZFS on top of Hardware RAID

Introduction
I already know all the warnings regarding this configuration, but since most of the references mentioned are experiments on small home labs, issues on cheap hardware and so on, I would like to cover this topic in the "enterprise servers" context.

I would start from an assumption: enterprise-class RAID controllers, as well as enterprise storage solutions with LUNs, have been around for decades and keep the majority of IT systems running without evidence of continuous catastrophes or problems, with any kind of FS on top.

So in this thread, on the "hardware RAID" side, I'm talking about professional RAID cards with battery-backed caching, redundant arrays with T10 data protection and so on... not RAID on consumer motherboard chipsets or similar cheap solutions.

Using ZFS directly on raw disks has its advantages of course (but also disadvantages, such as not being able to expand an array by adding single disks).

If performance were equivalent, I think no one would have any doubts about the choice: letting ZFS manage the disks directly is certainly better!
But the reality is that ZFS disk management vs. HW RAID seems to have a huge impact on performance, especially under database-type load profiles.

Testing configuration
Let me start by saying that I have been using ZFS since Solaris 10 and, based on what I have always read, I have always thought it was undoubtedly better to let ZFS manage the disks directly.
However, after some rather shocking tests (very low performance) on a fully ZFS-managed pool, I got very intrigued by this topic, investigated further and went down the rabbit hole, running many kinds of tests on many configurations.

I've run many tests on two identical machines. I would like to show you the most relevant ones.

Common hardware configuration
  • Dell PowerEdge R640 with 8x 2.5" SATA/SAS backplane
  • 2x Intel(R) Xeon(R) Gold 5120
  • 128GB of RAM
  • Dell PERC H730 Mini with 1GB of cache and Battery PLP
  • 2x Crucial BX500 as Boot drives
  • 4x Kingston DC600M (3840GB) as Data drives
First server setup
  • PERC in HBA mode, passing all the drives through to the OS
  • One ZFS 2-drive mirror for boot/OS
  • One ZFS 4-drive pool of striped mirrors (RAID 10 equivalent) for data
Second server setup
  • PERC in RAID mode, write-back caching and no read-ahead
  • One virtual disk with 2 drives in RAID 1 for boot/OS
  • One virtual disk with 4 drives in RAID 5 for data

Testing environment setup
I've installed PVE on both nodes, configured everything (network, etc.), and created the ZFS data pools on both nodes with the same settings (LZ4 compression, ashift=12).

Then, in order to test a typical workload from our existing infrastructure, I've created a Windows 2022 virtual machine on both nodes with this config:
  • 32 vCPUs (2 sockets, 16 cores, NUMA enabled)
  • 20GB of RAM (ballooning enabled, but never triggered since the hosts never reach the threshold)
  • One 80GB vDisk for OS
  • One 35GB vDisk for DB Data/Log files (Formatted with 64K allocation size, according to SQL Server guidelines)
  • VirtIO SCSI single for disks
  • Caching set to "Default (No cache)", Discard enabled
  • VirtIO full package drivers installed on guest
On both hosts, I've limited the ZFS ARC to 20 GB (using the zfs_arc_max parameter).
No other ZFS options or optimizations are used on either setup.
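For reference, a minimal sketch of the equivalent commands (the actual pools were created through the PVE installer/GUI; the device names below are placeholders, and the HW RAID virtual disk ID is taken from a later post):

Bash:
# First server: pool of striped mirrors (RAID 10 equivalent) on the raw disks
zpool create -o ashift=12 zdt mirror sda sdb mirror sdc sdd
# Second server: single-vdev pool on the PERC RAID 5 virtual disk
zpool create -o ashift=12 zdt /dev/disk/by-id/scsi-361866da08b4d34002d199fdb6dee2ef8
# Both: LZ4 compression on the pool root dataset
zfs set compression=lz4 zdt
# Both: cap the ARC at 20 GB (value in bytes); persistent via modprobe.d
# (refresh the initramfs with update-initramfs -u), or immediately via sysfs
echo "options zfs zfs_arc_max=21474836480" > /etc/modprobe.d/zfs.conf
echo 21474836480 > /sys/module/zfs/parameters/zfs_arc_max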

Testing methodology
I've used CrystalDiskMark for a rapid test with these parameters:
  • Duration: 20 sec
  • Interval: 10 sec
  • Number of tests: 1
  • File size: 1 GiB

SQL Server 2022 Developer Edition and HammerDB were used for DB workload benchmarking,
with this configuration/methodology:
  • Created an empty DB with 8x datafiles (20 GB total) and 1x logfile (5 GB)
  • Populated the DB with HammerDB, using 160 Warehouses
  • Backed up the DB (useful for running multiple tests starting from the same condition)
  • All tests are run after a PVE host restart, with a 2-minute wait after VM startup; no other VMs/applications/backups are running on the nodes during the tests
HammerDB testing parameters:
  • No encryption (direct connection on local DB)
  • Windows authentication
  • Use all warehouses: ON
  • Checkpoint when complete: ON
  • Virtual users: 100
  • User delay: 20ms

RESULTS

CrystalDiskMark rapid test


First server (ZFS managed disks)
CrystalDiskMark - ZFS managed disks.png

Second server (ZFS on HW Raid)
CrystalDiskMark - ZFS on HW Raid.png

No major difference here, except for sequential writes.


HammerDB - Orders per minute
First server (ZFS managed disks): 28700
Second server (ZFS on HW Raid): 117000

(4 times faster on HW Raid)


HammerDB - Transaction count graphs

First server (ZFS managed disks)

HammerDB TPM - ZFS managed disks.jpeg

Second server (ZFS on HW Raid)

HammerDB TPM - ZFS on HW Raid.jpeg


Conclusions
It seems that, on the performance side and with a database-type workload, having a RAID card with (battery-protected) caching gives a huge advantage.
ZFS on HW RAID measured 4 times the performance of ZFS on raw disks under a DB workload!
Remember also that we are comparing a hardware RAID 5 against a RAID 10 equivalent on ZFS, a layout that clearly works against the hardware RAID under a database-type workload... and despite this, it delivered enormously superior performance.

Considering what was said in the introduction, and in light of these results, I sincerely think that using a hardware RAID (again, on enterprise-grade platforms with battery-backed write caching) can bring great advantages, while keeping the beautiful features that ZFS offers (snapshots, ARC, compression, etc.).

I also think that going through controller-managed drives may also bring advantages in terms of write amplification on SSDs (not tested yet, just speculation).

Still considering what was said in the introduction about enterprise-grade hardware, are there really disadvantages serious enough to give up this performance boost of ZFS on HW RAID over raw disks?

I hope to get some opinions from you too.

If you have objections to this setup (I imagine mostly on data-resilience grounds), I would kindly ask you to bring sources and real-world examples from properly configured enterprise-hardware systems.

(Please also note that, as already said, I was totally in favour of ZFS on raw disks for years... until these tests, so this is absolutely not a provocative post.)

Bye,
Edoardo
 
I already know all the warnings regarding this configuration, but since most of the references mentioned are experiments on small home labs, issues on cheap hardware and so on, I would like to cover this topic in the "enterprise servers" context.
The ZFS documentation has a nice summary of the reasons not to use ZFS on top of HW RAID:
https://openzfs.github.io/openzfs-docs/Performance and Tuning/Hardware.html#hardware-raid-controllers

So it's more a decision of reliability vs. performance.

But thanks for the detailed benchmarks.
 
Hi Dunuin,

thanks for your response.

This is one of the dozens of documentation pages I've read while evaluating the configuration to use on our systems.
(it is precisely one of the documents that made me think about this topic)

While I understand that, from the ZFS developers' point of view, the best solution is to give ZFS the raw drives (as it was designed, so no surprise there), I find most of the points questionable.

Allow me to quote and comment on them.

Hardware RAID will limit opportunities for ZFS to perform self healing on checksum failures.
Of course true, from the ZFS point of view (and I find its handling of data errors great).
But aren't checksums and self-healing also a feature of most HW RAID controllers?


RAID controller failures can require that the controller be replaced with the same model, or in less extreme cases, a model from the same manufacturer. Using ZFS by itself allows any controller to be used.
Not always correct, but even so... not a problem on enterprise servers, since a replacement is always made with the same model.
It's very difficult to have critical servers in production whose hardware components are so old that they are no longer available.


If a hardware RAID controller’s write cache is used, an additional failure point is introduced that can only be partially mitigated by additional complexity from adding flash to save data in power loss events.

...
Not a real problem.
On this point it seems the authors ignore the existence of battery-backed ECC caches in enterprise-grade controllers (which have been the standard for many years).


Behavior during RAID reconstruction when silent corruption damages data is undefined.

...
Not enough experience to say much here (it has never happened to me yet), but I think this should be better detailed: what problems were found, and on which grade of controllers?
Since array reconstruction has been a routine activity on hardware controllers for decades, is it really such a big problem? (And why would it only be a big problem for ZFS?)


IO response times will be reduced whenever the OS blocks on IO operations because the system CPU blocks on a much weaker embedded CPU used in the RAID controller. This lowers IOPS relative to what ZFS could have achieved
This is not true, as clearly highlighted by our tests.


The controller’s firmware is an additional layer of complexity that cannot be inspected by arbitrary third parties. The ZFS source code is open source and can be inspected by anyone.
Well, if we think in these terms, we must consider that disks also have "black box" firmware, particularly SSDs, which have rather complex write and cache management. So, what's the point?


If multiple RAID arrays are formed by the same controller and one fails, the identifiers provided by the arrays exposed to the OS might become inconsistent.
I don't know if other controllers work differently (I don't think so...), but in my case this is not true:
the controller exposes the two virtual disks to the OS with unique IDs, exactly as with raw disk access.

Bash:
root@pve1:~# zpool list -v
NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zdt                                       10.5T   149G  10.3T        -         -     0%     1%  1.00x    ONLINE  -
  scsi-361866da08b4d34002d199fdb6dee2ef8  10.5T   149G  10.3T        -         -     0%  1.39%      -    ONLINE

Bash:
root@pve1:~# ls -l /dev/disk/by-id/
...
lrwxrwxrwx 1 root root  9 Dec 30 09:40 scsi-361866da08b4d34002d199fdb6dee2ef8 -> ../../sdb
lrwxrwxrwx 1 root root 10 Dec 30 09:40 scsi-361866da08b4d34002d199fdb6dee2ef8-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Dec 30 09:40 scsi-361866da08b4d34002d199fdb6dee2ef8-part9 -> ../../sdb9


I really appreciate ZFS, all its wonderful functions and the data security it makes possible, even on cheap systems... but faced with these performance tests and the information I've read, I wonder if it is really so "dangerous" to run ZFS on top of a proper hardware RAID (again, on enterprise-grade hardware, not cheap gear!).

Bye,
Edoardo
 
But aren't checksums and self-healing also a feature of most HW RAID controllers?
At the block level, not the filesystem level. You can have filesystem corruption with the RAID controller being none the wiser. It's rare, but possible.
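For what it's worth, this is exactly where a scrub behaves differently in the two setups: it detects such corruption either way, but it can only repair it automatically when ZFS itself holds the redundancy (its own mirror/RAIDZ), not when all it sees is a single HW RAID virtual disk. A quick check looks like:

Bash:
zpool scrub zdt       # read every block and verify it against its checksum
zpool status -v zdt   # shows checksum errors and, if any, the affected files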

Thanks for the benchmarks, but they're too shallow to be instructive. A 4x increase in a benchmark on the same disks just means your test is small enough to fit in cache. What you may want to do is use fio, which lets you control all the variables to simulate actual usage patterns.
 
Thanks for the benchmarks, but they're too shallow to be instructive. A 4x increase in a benchmark on the same disks just means your test is small enough to fit in cache. What you may want to do is use fio, which lets you control all the variables to simulate actual usage patterns.

Sorry, but I don't see the point.
I think you either didn't read the test methodology carefully or don't know HammerDB.

I agree with you if you are referring to the CrystalDiskMark test (which in fact I only included as a quick, "on the fly" comparison),
but with HammerDB we benchmarked both systems on a real, medium-sized database (20 GB) with a "real-life" workload.

This test reflects the performance expected in production from a system under a real DB load; it's not a "quick" benchmark that reads/writes a few MB of data.
It generates a typical OLTP workload, so I don't think there is anything closer to a real-world workload (and, in fact, HammerDB is widely used, even by hardware manufacturers, to test and compare complete systems under realistic load). The test is absolutely reproducible (I've run it 2-3 times), with only a slight difference in the final values.

There is no difference in terms of caching between the two, because the only read cache is the ZFS ARC (which was identical on both systems, as written).
The cache on the HW RAID controller is used only for writes, not for reads (and I think that's exactly where the biggest difference comes from, as previously written).
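For completeness, the VD cache policy can also be double-checked from the OS with Dell's perccli (a rebadged storcli, so the exact syntax may differ between versions); something along these lines:

Bash:
perccli64 /c0/v0 show all          # show the virtual disk's current read/write cache policy
perccli64 /c0/v0 set rdcache=nora  # no read-ahead
perccli64 /c0/v0 set wrcache=wb    # write-back (BBU/flash protected)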

Bye,
Edoardo
 

Done a rapid test on both hosts, as requested.

Created a dataset with compression=off and caching only for metadata.
Bash:
zfs create \
    -o compression=off \
    -o primarycache=metadata \
    -o secondarycache=metadata \
    -o sync=standard \
    zdt/test

4K Sequential READ tests
Command: fio --ioengine=libaio --direct=1 --sync=1 --rw=read --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name seq_read --filename=/zdt/test/testfile

ZFS on RAW Disks (RAID 10)
Code:
seq_read: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=673MiB/s][r=172k IOPS][eta 00m:00s]
seq_read: (groupid=0, jobs=1): err= 0: pid=86739: Wed Jan  3 22:11:31 2024
  read: IOPS=175k, BW=684MiB/s (718MB/s)(40.1GiB/60001msec)
    slat (usec): min=3, max=931, avg= 4.43, stdev= 3.35
    clat (nsec): min=870, max=146316, avg=1030.15, stdev=623.83
     lat (usec): min=3, max=938, avg= 5.46, stdev= 3.43
    clat percentiles (nsec):
     |  1.00th=[  900],  5.00th=[  940], 10.00th=[  948], 20.00th=[  964],
     | 30.00th=[  972], 40.00th=[  980], 50.00th=[  980], 60.00th=[  988],
     | 70.00th=[ 1004], 80.00th=[ 1012], 90.00th=[ 1032], 95.00th=[ 1048],
     | 99.00th=[ 1160], 99.50th=[ 4512], 99.90th=[11968], 99.95th=[16768],
     | 99.99th=[17280]
   bw (  KiB/s): min=660968, max=709352, per=100.00%, avg=701359.46, stdev=5215.44, samples=119
   iops        : min=165242, max=177338, avg=175339.83, stdev=1303.90, samples=119
  lat (nsec)   : 1000=68.78%
  lat (usec)   : 2=30.49%, 4=0.07%, 10=0.50%, 20=0.15%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%
  cpu          : usr=23.77%, sys=76.14%, ctx=256, majf=0, minf=19
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10510846,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=684MiB/s (718MB/s), 684MiB/s-684MiB/s (718MB/s-718MB/s), io=40.1GiB (43.1GB), run=60001-60001msec


ZFS on HW RAID (RAID 5 - PERC H730 Mini)
Code:
seq_read: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=672MiB/s][r=172k IOPS][eta 00m:00s]
seq_read: (groupid=0, jobs=1): err= 0: pid=545456: Wed Jan  3 22:11:31 2024
  read: IOPS=177k, BW=692MiB/s (726MB/s)(40.6GiB/60001msec)
    slat (usec): min=3, max=1406, avg= 4.38, stdev= 2.68
    clat (nsec): min=876, max=996559, avg=1026.13, stdev=599.99
     lat (usec): min=4, max=1407, avg= 5.41, stdev= 2.76
    clat percentiles (nsec):
     |  1.00th=[  940],  5.00th=[  956], 10.00th=[  972], 20.00th=[  980],
     | 30.00th=[  980], 40.00th=[  988], 50.00th=[  988], 60.00th=[  996],
     | 70.00th=[ 1012], 80.00th=[ 1032], 90.00th=[ 1064], 95.00th=[ 1080],
     | 99.00th=[ 1160], 99.50th=[ 1192], 99.90th=[11840], 99.95th=[16768],
     | 99.99th=[17024]
   bw (  KiB/s): min=666376, max=727064, per=100.00%, avg=709074.22, stdev=7123.43, samples=119
   iops        : min=166594, max=181766, avg=177268.67, stdev=1780.85, samples=119
  lat (nsec)   : 1000=63.67%
  lat (usec)   : 2=36.07%, 4=0.01%, 10=0.13%, 20=0.12%, 50=0.01%
  lat (usec)   : 100=0.01%, 1000=0.01%
  cpu          : usr=24.48%, sys=75.45%, ctx=221, majf=0, minf=20
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10630522,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1


Run status group 0 (all jobs):
   READ: bw=692MiB/s (726MB/s), 692MiB/s-692MiB/s (726MB/s-726MB/s), io=40.6GiB (43.5GB), run=60001-60001msec

Very similar results (just to confirm that the controller read cache is not in use).


4K Sequential WRITE tests (Sync)
Now we see big differences
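The exact fio invocation for the write runs isn't shown here; judging by the job name and the "Laying out IO file (1 file / 10240MiB)" line, it was presumably the same command as the read test with the direction switched, i.e. something like:

Bash:
fio --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=4K \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --size=10G --name seq_write --filename=/zdt/test/testfile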

ZFS on RAW Disks (RAID 10)
Code:
seq_write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
seq_write: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=10.8MiB/s][w=2759 IOPS][eta 00m:00s]
seq_write: (groupid=0, jobs=1): err= 0: pid=92657: Wed Jan  3 22:18:18 2024
  write: IOPS=2769, BW=10.8MiB/s (11.3MB/s)(649MiB/60001msec); 0 zone resets
    slat (usec): min=292, max=28075, avg=355.74, stdev=230.37
    clat (nsec): min=1704, max=145732, avg=3798.42, stdev=803.32
     lat (usec): min=293, max=28084, avg=359.53, stdev=230.46
    clat percentiles (nsec):
     |  1.00th=[ 2256],  5.00th=[ 2832], 10.00th=[ 3504], 20.00th=[ 3632],
     | 30.00th=[ 3664], 40.00th=[ 3664], 50.00th=[ 3696], 60.00th=[ 3728],
     | 70.00th=[ 3952], 80.00th=[ 4256], 90.00th=[ 4320], 95.00th=[ 4384],
     | 99.00th=[ 4640], 99.50th=[ 4832], 99.90th=[15424], 99.95th=[17536],
     | 99.99th=[25216]
   bw (  KiB/s): min= 9616, max=12144, per=100.00%, avg=11085.01, stdev=502.97, samples=119
   iops        : min= 2404, max= 3036, avg=2771.25, stdev=125.74, samples=119
  lat (usec)   : 2=0.11%, 4=70.78%, 10=28.97%, 20=0.11%, 50=0.03%
  lat (usec)   : 250=0.01%
  cpu          : usr=1.81%, sys=18.55%, ctx=332351, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,166167,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=10.8MiB/s (11.3MB/s), 10.8MiB/s-10.8MiB/s (11.3MB/s-11.3MB/s), io=649MiB (681MB), run=60001-60001msec


ZFS on HW RAID (RAID 5 - PERC H730 Mini)
Code:
seq_write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
seq_write: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=34.5MiB/s][w=8841 IOPS][eta 00m:00s]
seq_write: (groupid=0, jobs=1): err= 0: pid=552388: Wed Jan  3 22:18:15 2024
  write: IOPS=7286, BW=28.5MiB/s (29.8MB/s)(1708MiB/60001msec); 0 zone resets
    slat (usec): min=72, max=59266, avg=133.48, stdev=257.26
    clat (nsec): min=1161, max=50400, avg=2637.99, stdev=876.00
     lat (usec): min=74, max=59275, avg=136.11, stdev=257.38
    clat percentiles (nsec):
     |  1.00th=[ 1272],  5.00th=[ 1528], 10.00th=[ 1768], 20.00th=[ 2024],
     | 30.00th=[ 2160], 40.00th=[ 2288], 50.00th=[ 2416], 60.00th=[ 2704],
     | 70.00th=[ 3056], 80.00th=[ 3536], 90.00th=[ 3632], 95.00th=[ 3696],
     | 99.00th=[ 4048], 99.50th=[ 4256], 99.90th=[13632], 99.95th=[15808],
     | 99.99th=[18816]
   bw (  KiB/s): min=15248, max=37568, per=99.99%, avg=29143.39, stdev=5791.67, samples=119
   iops        : min= 3812, max= 9392, avg=7285.87, stdev=1447.88, samples=119
  lat (usec)   : 2=18.32%, 4=80.63%, 10=0.92%, 20=0.12%, 50=0.01%
  lat (usec)   : 100=0.01%
  cpu          : usr=3.50%, sys=30.79%, ctx=500149, majf=0, minf=17
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,437207,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=28.5MiB/s (29.8MB/s), 28.5MiB/s-28.5MiB/s (29.8MB/s-29.8MB/s), io=1708MiB (1791MB), run=60001-60001msec


4K Random WRITE tests (Sync)
Let's do a quick test also on Random sync writes.
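As above, the exact command isn't shown; presumably the same invocation with --rw=randwrite, i.e. something like:

Bash:
fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4K \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --size=10G --name rnd_write --filename=/zdt/test/testfile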

ZFS on RAW Disks (RAID 10)
Code:
rnd_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=5420KiB/s][w=1355 IOPS][eta 00m:00s]
seq_write: (groupid=0, jobs=1): err= 0: pid=109438: Wed Jan  3 22:38:11 2024
  write: IOPS=1395, BW=5581KiB/s (5715kB/s)(327MiB/60001msec); 0 zone resets
    slat (usec): min=299, max=16258, avg=709.70, stdev=554.98
    clat (nsec): min=1617, max=191982, avg=4339.36, stdev=1553.80
     lat (usec): min=300, max=16267, avg=714.03, stdev=555.85
    clat percentiles (nsec):
     |  1.00th=[ 3056],  5.00th=[ 3568], 10.00th=[ 3632], 20.00th=[ 3696],
     | 30.00th=[ 3728], 40.00th=[ 3760], 50.00th=[ 4016], 60.00th=[ 4128],
     | 70.00th=[ 4256], 80.00th=[ 4448], 90.00th=[ 5152], 95.00th=[ 7904],
     | 99.00th=[ 8256], 99.50th=[ 8896], 99.90th=[17280], 99.95th=[21376],
     | 99.99th=[43264]
   bw (  KiB/s): min= 1472, max= 8680, per=100.00%, avg=5585.14, stdev=1753.12, samples=119
   iops        : min=  368, max= 2170, avg=1396.29, stdev=438.28, samples=119
  lat (usec)   : 2=0.02%, 4=49.61%, 10=50.19%, 20=0.12%, 50=0.05%
  lat (usec)   : 100=0.01%, 250=0.01%
  cpu          : usr=1.19%, sys=16.96%, ctx=197413, majf=0, minf=506
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,83719,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=5581KiB/s (5715kB/s), 5581KiB/s-5581KiB/s (5715kB/s-5715kB/s), io=327MiB (343MB), run=60001-60001msec

ZFS on HW RAID (RAID 5 - PERC H730 Mini)
Code:
rnd_write: (groupid=0, jobs=1): err= 0: pid=572338: Wed Jan  3 22:38:16 2024
  write: IOPS=1922, BW=7689KiB/s (7873kB/s)(451MiB/60001msec); 0 zone resets
    slat (usec): min=108, max=27307, avg=513.75, stdev=992.19
    clat (nsec): min=1733, max=49718, avg=4036.34, stdev=1068.24
     lat (usec): min=110, max=27315, avg=517.79, stdev=992.65
    clat percentiles (nsec):
     |  1.00th=[ 2704],  5.00th=[ 3216], 10.00th=[ 3504], 20.00th=[ 3664],
     | 30.00th=[ 3728], 40.00th=[ 3792], 50.00th=[ 3920], 60.00th=[ 3984],
     | 70.00th=[ 4080], 80.00th=[ 4192], 90.00th=[ 4384], 95.00th=[ 4896],
     | 99.00th=[ 8096], 99.50th=[ 8256], 99.90th=[17280], 99.95th=[20352],
     | 99.99th=[32640]
   bw (  KiB/s): min=  664, max=13104, per=100.00%, avg=7710.12, stdev=2872.96, samples=119
   iops        : min=  166, max= 3276, avg=1927.53, stdev=718.24, samples=119
  lat (usec)   : 2=0.03%, 4=60.63%, 10=39.15%, 20=0.14%, 50=0.05%
  cpu          : usr=1.59%, sys=21.23%, ctx=251345, majf=0, minf=16
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,115335,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=7689KiB/s (7873kB/s), 7689KiB/s-7689KiB/s (7873kB/s-7873kB/s), io=451MiB (472MB), run=60001-60001msec



I'll do some more tests as soon as I have time.
If you want me to run any particular tests, let me know.

Bye,
Edoardo
 
Conclusions
It seems that, on the performance side and with a database-type workload, having a RAID card with (battery-protected) caching gives a huge advantage.
ZFS on HW RAID measured 4 times the performance of ZFS on raw disks under a DB workload!

Nice study.

To make the setups more even, please add an SSD log device to ZFS (since the HW RAID has a RAM cache), and maybe also an SSD cache. Thanks.
 
Nice study.

To make the setups more even, please add an SSD log device to ZFS (since the HW RAID has a RAM cache), and maybe also an SSD cache. Thanks.

Thanks!

I have received some NVMe SSDs (WD Red 500GB) that I would like to try as a SLOG (testing the DB load) and as special devices on both systems, but unfortunately I don't think that's possible anymore on these servers.

We had a critical emergency last week: one of the datacenters where we have some servers in colocation declared bankruptcy without any warning, and we narrowly managed to pull out all our VMs via VPN before they cut the connectivity... (our servers are stuck there, awaiting the investigation... we can neither retrieve nor physically access them).

During the weekend I had to put one of the two machines I had tested on into production on the fly to restore services.

Really absurd situation...!! :mad: but fortunately all data and VMs are safe now.

I can do some testing with NVMe on the "HW RAID" configuration (currently in use), but to repurpose the system to "ZFS SW RAID" I would have to make two trips to the new datacenter (which is 1 hour away).
I can't do it remotely, because I would have to physically pull the boot disks and put others in (so as not to lose all the configuration done so far).
:(

I'll see if I can do some more tests in the future on other servers.
 
Aside from this issue, I'm a bit surprised that no concrete considerations have emerged regarding the use of ZFS on enterprise-grade HW RAID...

The advantages are clear after these tests, but is there a real cost/risk?

I mean, beyond the "usual" recommendations that I have been reading in posts and guides for a long time, I have no objective, practical evidence of any problems (on enterprise HW).

Is it a real risk (is there any documentation about it?), or simply a best practice that is over-applied regardless of the underlying system in use?
 
Using ZFS over HW RAID 10 (MegaRAID SAS 9361-4i with CacheVault BBU) for years (since 2019) with no issues (zvols as VM disks). Replication + PBS as well.
Also noticed a performance increase (compared to raw disks).
Consistency checks and disk patrol reads are performed by the HW RAID firmware.
 
Using ZFS over HW RAID 10 (MegaRAID SAS 9361-4i with CacheVault BBU) for years (since 2019) with no issues ......

This is exactly the point.

I had already noticed that ZFS has very poor performance (especially for random writes) compared to other file systems on the same HW:
Single disk ext4: 45,600 orders/min
Single disk ZFS: 4,100 orders/min

Same server/HW/VM/DB/data used for testing; only the host FS was changed.
Roughly 10 times better performance on ext4.
Trying it with HW RAID out of pure curiosity, the performance became acceptable again, and I wondered why it was so strongly discouraged.

After years of documentation and posts read about how dangerous this configuration is and how it should be avoided, investigating further I find more and more reports of ZFS being used successfully on HW RAID, without any kind of problem, over years of use.

So I wonder whether what I have read so far is actually well founded (even on enterprise HW), or simply parroted endlessly without any scientific basis.

I have carefully read the entire official OpenZFS article, also posted by Dunuin, but I find that almost all the arguments in favor of the raw-disk solution are questionable, and they all seem to relate to RAID on low-cost HW, not enterprise gear.
(I commented on most of the document's points in a previous post.)

It's absurd: if I hadn't run this test out of pure curiosity, I would never have even considered this configuration.

I think it's time to treat this topic with a more scientific approach (starting with the authors of the official OpenZFS documentation themselves), and to stop repeating "absolutely not, it should not be done" without ever giving a documented, real explanation of why.

If performance were equivalent, it wouldn't even make sense to talk about it (obviously), but since there is a huge difference, perhaps it is time to re-evaluate the topic a bit. No?
 
Trying it with HW RAID out of pure curiosity, the performance became acceptable again, and I wondered why it was so strongly discouraged.
Because "performance" isn't the only consideration for storage. This is explained in the admonition. When you use ZFS on top of a simulated block device, you lose all the data-resiliency features it provides, at which point you'd be better served by LVM anyway, since performance will be better still and you'll have less write amplification.
 
When you use ZFS on top of a simulated block device, you lose all the data-resiliency features it provides
Sure,
but this is a (beautiful) additional feature that almost no other file system has (and they have run fine for many, many years).

While I appreciate it very much (I'm also currently building two TrueNAS servers, both with raw access to the disks, for a customer with data-resilience requirements), in 20 years of work in IT I have only seen silent file corruption twice (not catastrophic for the file system, only at the file level)... and both times on consumer-grade systems (without ECC RAM and so on).

you'd be better served by LVM anyway, since performance will be better still and you'll have less write amplification.
From a pure performance point of view, sure...
But on Proxmox, fundamental features such as replication are lost without ZFS, as indicated by rj45 above.
So the choice of ZFS is practically mandatory for small setups (2-3 servers without shared storage).

There are also many other extremely interesting ZFS functions, such as snapshots and compression, which bring great advantages to a virtualization system.


My point is: since it doesn't seem that tragic to run ZFS on enterprise-grade HW RAID, why is it so strongly discouraged (leaving the clear performance advantage aside)?
 
Is there a way to find out whether HW RAID only wins at fast sync writes because of the BBU? I wonder if SSDs with PLP on ZFS are exactly as fast as a single similar SSD without PLP behind an (enterprise) HW RAID. Maybe this can be tested by making sync writes unsafe (just for testing)? Or maybe by testing a single Optane drive with and without HW RAID? If write performance is similar (even though ZFS does more checksumming), then additional tests (read & write) with multiple drives would be interesting.
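(If someone wants to try the "unsafe sync" comparison on ZFS, it's a per-dataset one-liner; strictly for testing, since it drops the sync guarantee entirely:)

Bash:
zfs set sync=disabled zdt/test   # acknowledge sync writes before they reach stable storage - TEST ONLY
# ... run the fio / HammerDB workload ...
zfs set sync=standard zdt/test   # restore normal behaviour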
 
in 20 years of work in IT I have only seen silent file corruption twice (not catastrophic for the file system, only at the file level)... and both times on consumer-grade systems (without ECC RAM and so on).
How do you know? I don't know what you mean by "working in IT", but taking responsibility for customer data means not taking such an unnecessary risk. The costs involved in "doing it right" are negligible. Regardless, you can take whatever risks you feel you should.
My point is: since it doesn't seem that tragic to run ZFS on enterprise-grade HW RAID, why is it so strongly discouraged (leaving the clear performance advantage aside)?
You keep asking, but you don't actually listen to the answer. It was first pointed out here: https://forum.proxmox.com/threads/y...aid-thread-with-benchmarks.138947/post-620404

You don't need to agree with it, but don't keep pretending it doesn't exist.
 
How do you know? I don't know what you mean by "working in IT", but taking responsibility for customer data means not taking such an unnecessary risk. The costs involved in "doing it right" are negligible. Regardless, you can take whatever risks you feel you should.

Pure statistics.
In 20 years and 4 companies I've been involved with every kind of enterprise storage solution, from simple servers with a 3-drive RAID 5 to storage hosting hundreds of VMs and huge databases (e.g. an Oracle RAC for a public utility company with tables holding several TB of data), with every kind of infrastructure and file system.
I have never had any data-corruption issue on an enterprise system I managed or worked on (no OS errors or anomalous behavior, no database corruption, no user/customer who ever reported problems with the data...).

This leads me to a simple line of reasoning:
the entire IT world has been running for decades on file systems without the advanced data-resilience features that ZFS offers, while maintaining excellent reliability (otherwise we would be seeing these problems on a daily basis).

So personally I see two possibilities:
  • ZFS is so unreliable that it cannot work properly on hardware solutions on which every other file system works fine
  • The recommendation simply derives from an exaggerated reading of what is actually a very interesting feature, but one whose absence is not so critical
I absolutely lean towards the second hypothesis, and this is precisely why I am trying to understand this situation in depth, since I have not yet found a single documented case of catastrophic ZFS data corruption on an enterprise-grade HW RAID solution.


You keep asking, but you don't actually listen to the answer. It was first pointed out here: https://forum.proxmox.com/threads/y...aid-thread-with-benchmarks.138947/post-620404

You don't need to agree with it, but don't keep pretending it doesn't exist.

No, I've read everything very carefully (I've been researching this topic for 3 months).
Indeed, it seems that no one wants to engage with the very simple technical objections I raised in the post that followed (objections that any IT technician with experience on these topics could make).

Just to be clear: I have no interest in necessarily being right. :)
But if I'm wrong, I would like to be corrected with real data in hand, because it can be useful not only to me but also to anyone else reading.

Right now it feels more like a religious discussion than a technical one: "either you believe what is written, or there is no point in talking about it."
I think the technical world can only benefit from discussions conducted on a technical level. We're almost all technicians here, aren't we?
 
Is there a way to find out whether HW RAID only wins at fast sync writes because of the BBU? I wonder if SSDs with PLP on ZFS are exactly as fast as a single similar SSD without PLP behind an (enterprise) HW RAID. Maybe this can be tested by making sync writes unsafe (just for testing)? Or maybe by testing a single Optane drive with and without HW RAID? If write performance is similar (even though ZFS does more checksumming), then additional tests (read & write) with multiple drives would be interesting.

This is very interesting.
Certainly the write-protected cache on these controllers brings notable benefits for writes.

I don't think I will have the opportunity to do this type of testing on these systems now, because they are already installed in the datacenter.

Instead, I'm doing some tests right now with NVMe disks added to these servers, and I'm seeing very strange results that I want to investigate further. For example, adding two NVMe drives as log vdevs to the pool led to a huge decrease in performance (from 122k to 18k orders/min o_O) in the test with a lot of parallelism.
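For reference, the log vdev was presumably added and removed with something like the following (device paths and the vdev name are placeholders; the exact name to remove is whatever zpool status reports):

Bash:
zpool add zdt log mirror /dev/nvme0n1 /dev/nvme1n1   # mirrored SLOG
zpool status zdt                                     # note the log vdev's name (e.g. mirror-1)
zpool remove zdt mirror-1                            # detach the SLOG again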

I'll let you know!
 
For example, adding two NVMe drives as log vdevs to the pool led to a huge decrease in performance (from 122k to 18k orders/min o_O) in the test with a lot of parallelism.
Do you mean as a SLOG for the pool? Then all (sync) writes go to the NVMe first (which probably does not have the PLP that the HW RAID's BBU provides) and will be as slow as sync writes usually are without PLP. It sounds like a real waste of NVMe drives, which have limited write endurance, if they don't have PLP.
 
