Best practices: Hardware vs software RAID

nycvelo

New Member
Sep 6, 2021
Greetings. I'm new to Proxmox, and am looking for config recommendations to ensure the best balance of performance and data integrity.

On a Supermicro server with an LSI 9260-4i RAID controller (no battery backup) and 4 HDDs attached, is it better to use software RAID with ZFS rather than hardware RAID? Why or why not?

In my lab I installed Proxmox VE 7.2 on this relatively old Supermicro server, which has an LSI MegaRAID 9260-4i controller and 4 x Seagate IronWolf 1TB NAS drives. The controller is currently configured as a RAID 6 array with read and write caches enabled.

pveperf reports relatively slow performance, especially with fsyncs:

root@somehost:~# pveperf
CPU BOGOMIPS: 121600.20
REGEX/SECOND: 1503580
HD SIZE: 93.93 GB (/dev/mapper/pve-root)
BUFFERED READS: 305.86 MB/sec
AVERAGE SEEK TIME: 9.66 ms
FSYNCS/SECOND: 185.26
DNS EXT: 49.78 ms
DNS INT: 2.08 ms (subnet.example.tld)

Reading through other threads here, most seem to suggest going with ZFS software RAID. A few say hardware RAID is OK, but only with caching disabled. I'm unclear on which is preferable and could use your guidance before rebuilding this server.

Thanks in advance for your configuration clues!
 
Hi,

On a Supermicro server with an LSI 9260-4i RAID controller (no battery backup) and 4 HDDs attached, is it better to use software RAID with ZFS rather than hardware RAID? Why or why not?
AFAICT software RAID is usually easier to recover. If your hardware RAID controller fails, it can be difficult to recover the information about which disk carries which data. ZFS is typically easier to recover, plus you can enable scrubs and ZED notifications to warn you about failing arrays [1]. Note that if you use ZFS you should turn off any underlying hardware RAID functionality.

I think you would probably be best served by using ZFS in RAIDZ2 (that's more or less the ZFS term for RAID 6).
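As a rough sketch of what that could look like (the disk paths are placeholders, use your real /dev/disk/by-id names; ashift=12 assumes 4K-sector drives):

zpool create -o ashift=12 tank raidz2 /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4

For the notifications, set ZED_EMAIL_ADDR in /etc/zfs/zed.d/zed.rc as described in [1], and you can start a manual scrub at any time with:

zpool scrub tank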

[1]: https://pve.proxmox.com/wiki/ZFS_on_Linux#_configure_e_mail_notification
 
Ideally, the RAID controller should pass the drives directly through to the operating system for ZFS (aka IT mode or HBA mode). Unfortunately I don't think the LSI 9260 supports IT mode, so the best you could do is JBOD mode, but then you may run into integrity issues because ZFS does not 'see' the drives or their SMART data directly.
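As an aside, you can usually still read SMART data through a MegaRAID controller from the host using smartctl's megaraid passthrough (the device id and node below are placeholders), but ZFS itself still won't manage the raw disks:

smartctl -a -d megaraid,0 /dev/sda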

Best to consider replacing the controller if you wish to use ZFS. Second-hand LSI controllers are widely available (PM me if you are in the UK).

Secondly, you're never going to get great performance from hard drives, even under ZFS. Striped mirrors are going to be your best option, but that leaves only 50% of the raw capacity usable (the other half goes to the mirror copies).

A couple of enterprise-grade SSDs will be your best bet for performance.
 
Secondly, you're never going to get great performance from hard drives, even under ZFS. Striped mirrors are going to be your best option, but that leaves only 50% of the raw capacity usable (the other half goes to the mirror copies).

A couple of enterprise-grade SSDs will be your best bet for performance.
True, but if you really want to use spinners (I do that too), I recommend a three-tier setup:
- spinners (LOTS of them, the more the better, used as striped mirrors)
- 2x enterprise SSD (mirrored) as a special vdev that stores the metadata for all blocks on the spinners
- PCIe Optane for the SLOG (optionally mirrored if you can't afford to lose the last ~5 seconds of sync writes should the SLOG device fail)

Everything should be redundant: for example multiple HBAs instead of a SAS expander/multiplexer, and the two halves of each mirror vdev on different controllers. Something like the sketch below.
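As a rough sketch of that pool layout (all device names below are placeholders):

zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/hdd-A1 /dev/disk/by-id/hdd-B1 \
  mirror /dev/disk/by-id/hdd-A2 /dev/disk/by-id/hdd-B2 \
  special mirror /dev/disk/by-id/ssd-1 /dev/disk/by-id/ssd-2 \
  log /dev/disk/by-id/optane-1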
 
Update: I reconfigured the RAID controller, only to learn that it doesn't support JBOD. After some misadventures with PXE boot (this host is remote), I set up RAID 6 again, this time with caching disabled. Performance is much improved.

BEFORE:
FSYNCS/SECOND: 185.26

AFTER:
FSYNCS/SECOND: 1946.91

Still not stellar, but I think any further gains will require new hardware, like a controller that does JBOD and some SSDs. Thanks to all who responded!
 
I set up RAID 6 again, this time with caching disabled. Performance is much improved.

BEFORE:
FSYNCS/SECOND: 185.26

AFTER:
FSYNCS/SECOND: 1946.91
Are you sure it's not the opposite? Fsyncs are sync write IOPS, so 185 is what I would expect from an HDD RAID 6. Anything much higher than that must be cached, meaning sync writes are being handled as async writes, which can be dangerous if your RAID controller has no BBU attached.
Now your HDDs are reported to be faster than my Intel S3700 enterprise MLC SSDs in RAID 1.
 
Are you sure it's not the opposite? Fsyncs are sync write IOPS, so 185 is what I would expect from an HDD RAID 6. Anything much higher than that must be cached, meaning sync writes are being handled as async writes, which can be dangerous if your RAID controller has no BBU attached.
Now your HDDs are reported to be faster than my Intel S3700 enterprise MLC SSDs in RAID 1.
The before and after numbers are correct as observed. On subsequent runs of pveperf with caching disabled, FSYNCs dropped a bit, down to the high 1600s.

My RAID controller does not have a BBU. This host has redundant power supplies and resides in a data center that has had 100.00 percent uptime for at least the past five years.

Yes, I would prefer to have a newer controller, one with BBU and JBOD support, and I'll switch to one next time I visit the data center, which is in a different state. I don't want to tempt fate, but for now I'm not all that concerned about power loss. Or were you referring to some other scenario where data loss could occur? Thanks.
 
Please check with fio, those numbers look very bogus and wrong.
Pardon my ignorance, but I'm unfamiliar with the fio tool. Which of these fio tests would offer a meaningful check of pveperf's fsync numbers?

random reads
file random read/writes
random read/writes
sequential reads

And are we looking to measure IOPS or throughput or latency?
 
And are we looking to measure IOPS or throughput or latency?
sync write IOPS/latency.

So something like this: fio --directory=/path/to/your/mounted/storage/ --name=sync_write_iops --rw=randwrite --bs=4K --direct=1 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --runtime=300 --time_based
Will synchronously random write 4K blocks without any caching into folder /path/to/your/mounted/storage/ for 5 minutes.

And for async read iops: fio --directory=/path/to/your/mounted/storage/ --name=async_read_iops --rw=randread --bs=4K --direct=1 --sync=0 --numjobs=1 --ioengine=libaio --iodepth=64 --refill_buffers --runtime=300 --time_based
Will asynchronously random-read 4K blocks (queue depth 64) from folder /path/to/your/mounted/storage/ for 5 minutes.

These will hit your HDDs as hard as possible, so don't expect high IOPS or throughput. They basically show the worst-case scenario.
 
sync write IOPS/latency.

So something like this: fio --directory=/path/to/your/mounted/storage/ --name=sync_write_iops --rw=randwrite --bs=4K --direct=1 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --runtime=300 --time_based
Will synchronously random write 4K blocks without any caching into folder /path/to/your/mounted/storage/ for 5 minutes.

And for async read iops: fio --directory=/path/to/your/mounted/storage/ --name=async_read_iops --rw=randread --bs=4K --direct=1 --sync=0 --numjobs=1 --ioengine=libaio --iodepth=64 --refill_buffers --runtime=300 --time_based
Will asynchronously random-read 4K blocks (queue depth 64) from folder /path/to/your/mounted/storage/ for 5 minutes.

These will hit your HDDs as hard as possible, so don't expect high IOPS or throughput. They basically show the worst-case scenario.

Thanks for these. Both commands want a --size parameter. Just checking: Are you looking for "--size=4K" here? If not, what other value?
 
You can add a "--size=1G" to read/write up to 1GB of data.

OK, thanks.

I ran each test five times and saw very consistent results. I'm pasting the full output below of iteration 1 from each test. I've also pasted the result from a pveperf test run after both sets of fio benchmarks. Let me know if anyone wants to see the output of all five iterations of each test.

I should also note the storage tested here is an NFS mount of a TrueNAS 13.0-U1 box on the same subnet. That system has four spinning disks in a raidz2 array.

root@somehost:/mnt/pve/proxhosts_tafi# for i in {1..5} ; do fio --directory=/mnt/pve/proxhosts_tafi --name=sync_write_iops --rw=randwrite --bs=4K --direct=1 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --runtime=300 --time_based --size=1G ; done
sync_write_iops: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
sync_write_iops: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=204KiB/s][w=51 IOPS][eta 00m:00s]
sync_write_iops: (groupid=0, jobs=1): err= 0: pid=308775: Sat Jul 23 14:56:43 2022
write: IOPS=98, BW=392KiB/s (402kB/s)(115MiB/300008msec); 0 zone resets
clat (msec): min=5, max=233, avg=10.19, stdev= 7.43
lat (msec): min=5, max=233, avg=10.19, stdev= 7.43
clat percentiles (msec):
| 1.00th=[ 9], 5.00th=[ 9], 10.00th=[ 9], 20.00th=[ 9],
| 30.00th=[ 9], 40.00th=[ 9], 50.00th=[ 9], 60.00th=[ 9],
| 70.00th=[ 9], 80.00th=[ 9], 90.00th=[ 12], 95.00th=[ 17],
| 99.00th=[ 50], 99.50th=[ 60], 99.90th=[ 79], 99.95th=[ 92],
| 99.99th=[ 117]
bw ( KiB/s): min= 80, max= 480, per=99.97%, avg=392.17, stdev=92.25, samples=599
iops : min= 20, max= 120, avg=98.04, stdev=23.06, samples=599
lat (msec) : 10=84.04%, 20=11.47%, 50=3.56%, 100=0.91%, 250=0.02%
cpu : usr=0.25%, sys=0.73%, ctx=29458, majf=0, minf=30
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,29409,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

root@somehost:/mnt/pve/proxhosts_tafi# for i in {1..5} ; do fio --directory=/mnt/pve/proxhosts_tafi --name=async_write_iops --rw=randread --bs=4K --direct=1 --sync=0 --numjobs=1 --ioengine=libaio --iodepth=64 --refill_buffers --runtime=300 --time_based --size=1G ; done
async_write_iops: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.25
Starting 1 process
async_write_iops: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [r(1)][100.0%][r=109MiB/s][r=27.9k IOPS][eta 00m:00s]
async_write_iops: (groupid=0, jobs=1): err= 0: pid=314636: Sat Jul 23 15:30:23 2022
read: IOPS=27.9k, BW=109MiB/s (114MB/s)(31.0GiB/300003msec)
slat (usec): min=2, max=268, avg= 6.84, stdev= 3.52
clat (usec): min=1069, max=7803, avg=2282.64, stdev=94.17
lat (usec): min=1078, max=7809, avg=2289.80, stdev=93.47
clat percentiles (usec):
| 1.00th=[ 2073], 5.00th=[ 2147], 10.00th=[ 2180], 20.00th=[ 2212],
| 30.00th=[ 2212], 40.00th=[ 2245], 50.00th=[ 2278], 60.00th=[ 2311],
| 70.00th=[ 2343], 80.00th=[ 2376], 90.00th=[ 2409], 95.00th=[ 2442],
| 99.00th=[ 2507], 99.50th=[ 2507], 99.90th=[ 2573], 99.95th=[ 2606],
| 99.99th=[ 2671]
bw ( KiB/s): min=110384, max=111848, per=100.00%, avg=111774.08, stdev=64.20, samples=599
iops : min=27596, max=27962, avg=27943.52, stdev=16.05, samples=599
lat (msec) : 2=0.10%, 4=99.90%, 10=0.01%
cpu : usr=12.34%, sys=25.72%, ctx=1742146, majf=0, minf=89
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=8378367,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

root@somehost:/mnt/pve/proxhosts_tafi# pveperf
CPU BOGOMIPS: 121593.96
REGEX/SECOND: 1486538
HD SIZE: 93.93 GB (/dev/mapper/pve-root)
BUFFERED READS: 290.56 MB/sec
AVERAGE SEEK TIME: 9.43 ms
FSYNCS/SECOND: 1341.35
DNS EXT: 38.57 ms
DNS INT: 1.83 ms (subnet.example.tld)
 
My knowledge of storage would fill a thimble, but if it's at all like network benchmarking, where I have a bit of experience, then (a) different tools can measure different things and (b) all benchmarks are broken in some way.

The question here: Is anything with this setup or these results "very bogus and wrong"? (Not your quote, I realize.)

As previously stated, I would prefer to have a RAID controller that does JBOD, and some enterprise-grade SSDs. But this is the hardware I have for now. I'm seeing acceptable (to me) performance from a few VMs, and am not too worried about data loss, even without a BBU, because of stable and well-conditioned power in the data center where this host lives.

Thanks!
 
The question here: Is anything with this setup or these results "very bogus and wrong"? (Not your quote, I realize.)
The fsync benchmark is wrong; the fio numbers are what we (Dunuin and I) expected from such a setup.

I would prefer to have a RAID controller that does JBOD
So, just an HBA without any RAID logic at all (not merely disabled). @bobmc already stated that it would be best to just buy a used HBA for that; e.g. a SAS2008-based HBA with IT firmware would do the trick and you could use ZFS.
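If you go that route, sas2flash should tell you whether a card is already running the IT firmware (I'm going from memory, so double-check the invocation against the tool's help for your card):

sas2flash -list

The firmware/product id line in the output indicates IT vs. IR firmware.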

I'm seeing acceptable (to me) performance from a few VMs, and am not too worried about data loss, even without a BBU, because of stable and well-conditioned power in the data center where this host lives.
The BBU just enables (by default) the write cache. You can also force the write cache on if you have a good and stable power supply, as you stated. This is not a disk cache but a controller cache: every write goes to the cache first and eventually to the disks (write-back caching). Your initial high fsync values implied that such a controller-based write cache was enabled, which is why we concluded that those numbers could not be direct (uncached) numbers.
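If you want to check or change the controller cache policy remotely, MegaCli can do it from within the host. The syntax below is from memory, so please verify it against your MegaCli version's help before relying on it:

megacli -LDGetProp -Cache -LAll -aAll   (show the current cache policy of all logical drives)
megacli -LDSetProp WT -LAll -aAll       (force write-through, i.e. no controller write cache)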

If your system seems sufficiently fast to you, just go with it.
 
Leaving things alone, then, at least until I lay hands on a new controller and SSDs. Thanks all for your guidance.
 
