ZFS on DELL NVMe direct drives vs HWRAID Drives

guido.lamoto

New Member
Jul 3, 2025
Hi,

I'm new to the Proxmox world. I've searched for an answer on this topic but couldn't work it out.

We would like to buy 4 Dell R760/R770 servers to run PVE. Right now we have 4 old Dell servers connected to an old SAN with LVM on top and plain KVM/QEMU. We run ~50 VMs and the setup is quite static.

We have never used ZFS, so we are newbies to this world. We have read many times not to use ZFS on top of HW RAID for data integrity reasons. Still, since our Proxmox trial might not work out, we would like to leave the door open to going back to LVM+ext4, and we are undecided between these two options:

16 800GB NVMe direct drives, Smart Flow
or
16 800GB NVMe HWRAID Drives, Smart Flow, Dual Controller, Front PERC 12

With the first option we would have the chance to present all 16 drives as an HBA would, correct? Then we could pool them together with ZFS and manage everything directly with it. A colleague says that there will still be some type of controller in between (like an S160!?), but I have read that each NVMe has dedicated PCIe lanes to the CPU, so it doesn't make any sense to have anything between them.
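What we have in mind is roughly this (just a sketch with made-up device names, assuming we'd go for striped mirrors for the VM pool):

Code:
# Assumed device names nvme0n1..nvme15n1; in practice use /dev/disk/by-id paths
zpool create -o ashift=12 tank \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1 \
  mirror /dev/nvme4n1 /dev/nvme5n1
# ... and so on for the remaining pairs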

With the second option we would leave that door open, BUT we don't understand whether configuring all the disks as "non-RAID" still uses some kind of cache on the HW RAID controllers. Following this, it seems that "non-RAID" disks get the "Write-Through" option (good, right?) but will be presented as SAS (and not NVMe), so I don't get whether we would lose some ZFS functionality.
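Once we get a test box we plan to simply check what the OS actually sees, something along the lines of (device path is just a placeholder):

Code:
# Show how each disk is presented to the OS (transport type, model, size):
lsblk -d -o NAME,TRAN,MODEL,SIZE
# Per-disk details for one drive:
smartctl -i /dev/sda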

Aside from the fact that the PERC option will cost us more than 1000 euro more than the "direct drives" option, is the second option a viable road, or could it lead to the data loss or corruption that everyone warns about with ZFS on top of HW RAID?

Also, would a Proxmox Basic (or Standard) subscription cover this type of question, or is this something for the ZFS folks?
 
With the first option we would have the chance to present all 16 drives as an HBA would, correct? Then we could pool them together with ZFS and manage everything directly with it. A colleague says that there will still be some type of controller in between (like an S160!?), but I have read that each NVMe has dedicated PCIe lanes to the CPU, so it doesn't make any sense to have anything between them.
I sadly don't know yet whether there is an S160 in between, but something like a multiplexer could be: Dell machines with Intel CPUs most likely do not have enough PCIe lanes to drive all the NVMes directly from the CPU(s) (depending on the slot count). This heavily depends on the layout used, and only Dell can answer that for you. You may need to switch to AMD in order to do that: the Xeon Gold 6444Y has 80 lanes, the EPYC 9124 has 128 lanes.
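Once you have hardware in hand, you can check the topology from Linux yourself; a rough sketch, assuming pciutils is installed:

Code:
# Show the PCIe topology as a tree; NVMe controllers sitting behind an extra
# bridge/switch rather than directly under a root port hint at a mux:
lspci -tv
# List just the NVMe controllers and their PCI addresses:
lspci | grep -i 'non-volatile'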

16 800GB NVMe HWRAID Drives, Smart Flow, Dual Controller, Front PERC 12
We run a couple of R760s with the PERC H965i and 6x 3.84 TiB NVMe CD8 U (according to Dell), and it is underwhelming performance-wise. 8K workloads are much slower than you might expect, even slower than a single NVMe in an AMD setup we did last year. All tests were done with fio on the block devices, without any volume manager / filesystem in between:

fio 8K sequential read, 1 thread, QD32: 2.6 GB/s; 16 threads: "only" 3 GB/s; with a 128K block size: 22 GB/s.
fio 4K random read, 1 thread, QD32: 1.3 GB/s; 16 threads: 1.5 GB/s; with a 128K block size: 20 GB/s.
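For context, tests along those lines can be reproduced with something like this (a sketch; the device path is just a placeholder, and the parameters only roughly mirror the numbers above):

Code:
# 8K sequential read, 1 job, queue depth 32, directly against the block device:
fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=8k --iodepth=32 \
    --numjobs=1 --direct=1 --ioengine=libaio --runtime=60 --time_based --group_reporting
# 4K random read, 16 jobs:
fio --name=randread --filename=/dev/nvme0n1 --rw=randread --bs=4k --iodepth=32 \
    --numjobs=16 --direct=1 --ioengine=libaio --runtime=60 --time_based --group_reporting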

If you have a system/application with good read-ahead, e.g. an Oracle database (db_file_multiblock_read_count), you will get the full performance, but not every workload is like that. Also, YMMV.

So with a ZFS zvol using a 16K volblocksize, this may not be that fast if the blocks are not aligned, whereas a Ceph or LVM setup with a large enough block size will be faster (Ceph on HW RAID is also a no-no). You may get better performance by playing around with the default ZFS block size.
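If you want to experiment with that, a rough sketch (the storage and pool names are placeholders, and the setting only affects newly created zvols):

Code:
# Raise the default volblocksize for new VM disks on an assumed ZFS storage "local-zfs":
pvesm set local-zfs --blocksize 64k
# Or create a test zvol by hand and benchmark it with fio:
zfs create -V 100G -o volblocksize=64k tank/testvol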

On the AMD system we had entry-level PCIe 4.0 NVMes without a controller and reached > 5 GB/s in both tests on a single NVMe, so that setup was better suited for ZFS workloads.
 
I'm not an expert on this, but if you use those shiny new PCIe Gen5 NVMe SSDs, then you should avoid any controller and go for direct-attached drives with a ZFS mirror. The RAID controller itself only has a PCIe Gen4 x16 uplink, so you max out at about 32 GB/s, while the newest NVMe SSDs can go up to 14 GB/s each; and that's just the maximum bandwidth, latency and IOPS will most likely be better too if you avoid any device in between.
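Rough back-of-the-envelope math: PCIe Gen4 carries roughly 2 GB/s per lane, so a x16 uplink tops out around 16 × 2 ≈ 32 GB/s for everything behind the controller combined, while 16 direct-attached Gen5 drives at up to 14 GB/s each could in theory move well over 200 GB/s in aggregate.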

Sure, ZFS will take away some of this theoretical performance ...
 
Direct means just that: direct to the CPU, no HBA or muxing. However, in some cases this may not be faster due to NUMA, i.e. which CPU's lanes a given drive hangs off versus where the VM's cores are pinned...
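If NUMA locality matters to you, you can check which node each drive is attached to, roughly like this (assuming the nvme driver and numactl are available):

Code:
# NUMA node of each NVMe controller (-1 means no affinity reported):
for d in /sys/class/nvme/nvme*; do
    echo "$(basename "$d"): NUMA node $(cat "$d"/device/numa_node)"
done
# Host NUMA layout for comparison:
numactl --hardware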

Then there is yet another option on the R760: "Direct Drive Switched", which adds a PCIe switch between the drives and the CPUs, for cases where you have more x4 drives than available PCIe lanes.

If you do go with a HW RAID card, then unless you need HA or replication across 2 hosts, I think LVM is a better choice than ZFS for speed and simplicity.
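In that case the usual layout on top of the PERC virtual disk would be LVM-thin; a minimal sketch with placeholder device/volume names:

Code:
# Placeholder device: the PERC virtual disk as seen by Linux
pvcreate /dev/sdb
vgcreate vmdata /dev/sdb
lvcreate -l 95%FREE --thinpool data vmdata
# Register it as VM storage in Proxmox:
pvesm add lvmthin vm-thin --vgname vmdata --thinpool data --content images,rootdir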

Check this basic comparison of Direct vs HWRAID (does not cover "Direct switched"):
https://www.storagereview.com/review/dell-poweredge-direct-drives-vs-perc-12-review

Another option, if you want to keep your current shared-storage design for HA: just build stripped-down compute nodes with no local storage and use Blockbridge storage - https://www.blockbridge.com/proxmox/ (Dell's storage markups are stupid).
 