Yet another "ZFS on HW-RAID" Thread (with benchmarks)

Exactly, two mirrored consumer-grade NVMe drives (Transcend MTE220S), no PLP, but it's just an experiment.
In the same test, run some time ago with pure ZFS on raw disks, they brought an improvement, but with the HW RAID with BBU cache they seem to become a bottleneck on the DB workload (I didn't expect it to be this big).

I'll also do fio testing on the host.
 
Exactly, two mirrored consumer-grade NVMe drives (Transcend MTE220S), no PLP, but it's just an experiment.
In the same test, run some time ago with pure ZFS on raw disks, they brought an improvement, but with the HW RAID with BBU cache they seem to become a bottleneck on the DB workload (I didn't expect it to be this big).
Well, you're sending all the sync writes (from the DB, which could easily be sync-write heavy for data integrity) to devices that cannot handle them very well (instead of using the HW RAID, which can)...
 
Yeah, but I didn't expect that from these NVMe drives.
I ran a fio benchmark and they deliver about 1/4 of the declared IOPS.
I'll check whether it's a problem with the PCIe adapter card (even though it shouldn't be... it uses the motherboard's bifurcation) or a configuration mistake.
 
Yeah, but I didn't expect that from these NVMe drives.
I ran a fio benchmark and they deliver about 1/4 of the declared IOPS.
I'll check whether it's a problem with the PCIe adapter card (even though it shouldn't be... it uses the motherboard's bifurcation) or a configuration mistake.
A SLOG's only purpose is to absorb sync writes. That's all it does. Consumer SSDs have no PLP, so they can't cache sync writes in DRAM = horrible performance. Using a consumer SSD is similar to running a HW RAID without cache+BBU. So it makes no sense to use anything except an enterprise SSD for the SLOG.

You probably didn't run those fio benchmarks using sync writes but async ones?
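If you want to double-check, something roughly like this shows the difference (the file path and sizes are just placeholders; point it at a scratch file, not production data):

```
# async 4K random writes (roughly what the spec sheets quote)
fio --name=async4k --filename=/tank/fio-test --size=10G --rw=randwrite --bs=4k \
    --iodepth=32 --numjobs=4 --ioengine=libaio --direct=1 --runtime=60 --group_reporting

# same workload but synced after every write (what a DB or SLOG actually sees)
fio --name=sync4k --filename=/tank/fio-test --size=10G --rw=randwrite --bs=4k \
    --iodepth=1 --numjobs=1 --ioengine=sync --fsync=1 --runtime=60 --group_reporting
```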
 
Of course, but it's only a test with some NVMe drives bought for another project that has been delayed.

The tests show that these drives struggle to reach the declared performance anyway (I saw IOPS numbers somewhat close to the declared ones only at a certain queue depth... and the manufacturer doesn't publish the test conditions behind those numbers... so...).
The NUMA nodes probably also play a role (running the test pinned to a specific NUMA node gives a more acceptable result), given that the server has two CPUs and the adapter cards are connected one to the first CPU and one to the second.
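For reference, I pinned the benchmark roughly like this (device names and node numbers are just examples; check where each card actually sits first):

```
# which NUMA node each NVMe hangs off
cat /sys/class/nvme/nvme0/device/numa_node
cat /sys/class/nvme/nvme1/device/numa_node

# run the benchmark bound to that node's CPUs and memory
numactl --cpunodebind=0 --membind=0 fio --name=numa-test --filename=/dev/nvme0n1 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=4 --ioengine=libaio --direct=1 \
    --runtime=60 --group_reporting
```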

Just to play a little more :)

The test reported in my previous post was done via HammerDB inside a VM, so sync writes.
The fio benchmark was done with async writes and shows that the drives fall short of the declared figures even there.

I have no need for a SLOG on this system; most of the VMs run accounting or small ERP software with few users and low load.


Anyway, speaking of SLOG, I had searched for NVMe M.2 drives with PLP for another project with NFS and sync writes (for a recording studio), but only found this old and messy list online:
https://www.truenas.com/community/threads/list-of-ssds-with-power-loss-protection.63998/

There seems to be very little choice.

Is there an updated and well-made list somewhere?
 
Is there an updated and well-made list somewhere?
It's a German site:
2280 M.2 with PLP (so the only options are the Kingston DC1000B and Micron 7300/7400/7450 Max/Pro, up to 1TB): https://geizhals.de/?cat=hdssd&xf=4643_Power-Loss+Protection~4832_3~4836_7
22110 M.2 with PLP (so Micron 7300/7450 Pro, Samsung PM983/9A3, Solidigm D7-P4511, up to 4TB): https://geizhals.de/?cat=hdssd&xf=4643_Power-Loss+Protection~4832_3~4836_8

And yes, there are only a few models. Enterprise SSDs usually use U.2/U.3 and not M.2, as M.2 is a terrible form factor for everything except laptops and thin clients...
 
Totally agree.
Unfortunately, on some servers it is a "forced" choice given that they only have SAS/SATA backplanes.

In the mentioned project I will have to recondition two dated servers, and unfortunately the only way to add NVMe drives, avoiding SATA, is to use adapter cards and M.2 drives :(

Yes, very little choice... understandable anyway, given the size of the form factor and the need to have the capacitor bank on board.
Furthermore, they are practically all rated for "read intensive" use, so they are probably also unsuitable, or at least not the best choice, for this purpose.

Thanks for the links!
I used that site some time ago to look for SATA SSDs with a parametric search, but I had forgotten about it!! :D
 
Hi everybody in this thread,
I'm reading the posts carefully, because I'm more interested in reliability than in performance.
In addition, I've been working with VMware for about 20 years, and we mostly used SAN-based shared storage and let VMware handle the filesystem. Why is it not recommended to do the same with Proxmox? See, I'm new to Proxmox and I'd like to plan and build our little environment "the right way".

As far as I see things, if an enterprise-class storage system does its job, then a host sees LUNs as local disks and the storage system keeps the LUNs consistent. This is what I have experienced over many years with systems like VMware or Windows. So it's the job of the host to keep the filesystem consistent. Why are such precautions needed when thinking about LUNs for "ZFS over iSCSI"?
And why is "LVM / Ext4 over iSCSI" not that much of an issue?
For my use case replication is optional, but shared iSCSI storage is a must, since I already have such hardware in place which I can't replace quickly.

Aside from all this: I find it hard to see the advantage of spending compute cycles on software ZFS RAID on the main CPUs, which also drive the VMs, rather than offloading this job to a dedicated controller.
 
First of all - why do you want to use ZFS?

All the other stuff can do it too: HW RAID, mdadm, LVM, filesystems, etc.

But none of them do data integrity checks - except for other filesystems like btrfs.

If you don't need data integrity validation before the data reaches the program, then use the regular tools and be happy.

If you want online data validation and automatic online healing, don't make it hard for ZFS to detect problems.


A few points:

* A long time ago I played with Ceph (BlueStore didn't exist yet) and I created a situation with broken data - Ceph did not validate it and returned the corrupted data to the program.

* When an enterprise HDD starts to fail, ZFS reports the problem before you can see anything in S.M.A.R.T.
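A quick way to see this per device (pool name and device are just examples):

```
zpool scrub tank        # reads every block and verifies it against its checksum
zpool status -v tank    # per-device READ/WRITE/CKSUM error counters
smartctl -a /dev/sdX    # compare with what SMART reports for the same drive
```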
 
In addition, I've been working with VMware for about 20 years, and we mostly used SAN-based shared storage and let VMware handle the filesystem. Why is it not recommended to do the same with Proxmox?
No one is NOT recommending this ;) The only caveat with this approach is the loss of snapshots, as you have to use LVM thick provisioning.

So it's the job of the host to keep the filesystem consistent. Why are such precautions needed when thinking about LUNs for "ZFS over iSCSI"? And why is "LVM / Ext4 over iSCSI" not that much of an issue?
ZFS is not multiple-initiator aware. As such, it doesn't allow multiple simultaneous host connections to a file system. It's possible to shoehorn such a solution on top of additional corosync management, but this is not included in Proxmox. LVM is clusterable, but as mentioned above it's got limitations.
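As a rough sketch of the LVM route (untested here; the device path and storage names are placeholders, and the LUN must already be visible on every node):

```
# on one node: put LVM directly on the shared LUN
pvcreate /dev/disk/by-id/scsi-EXAMPLE_LUN
vgcreate vg_san /dev/disk/by-id/scsi-EXAMPLE_LUN

# register it cluster-wide as shared, thick-provisioned LVM storage
pvesm add lvm san-lvm --vgname vg_san --shared 1 --content images,rootdir
```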
Aside from all this: I find it hard to see the advantage of spending compute cycles on software ZFS RAID on the main CPUs, which also drive the VMs, rather than offloading this job to a dedicated controller.
External storage that can do both block- and file-level checksums, snapshots, inline compression, and deduplication has a cost. Even then, without specialized storage drivers, snapshots and alternate streams would not be visible to the host. ZFS lets you achieve all of that with only your host; the host performance requirements are negligible for mirror pools - the only direct cost is the added RAM (and you wouldn't use parity RAID for VM storage anyway).
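For a VM store that's typically as simple as (disk IDs are placeholders):

```
# two-way mirror, 4K sectors, cheap inline compression; dedup deliberately left off
zpool create -o ashift=12 tank mirror /dev/disk/by-id/nvme-DISK_A /dev/disk/by-id/nvme-DISK_B
zfs set compression=lz4 tank
zfs set atime=off tank
```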

iSCSI/FCAL/FCF isn't necessarily required for shared storage in a vSphere environment - vSAN, Nutanix, etc. exist. The same goes for Proxmox - Ceph is integrated into the distribution :)

First of all - why do you want to use ZFS?
see above :)

But none of them do data integrity checks - except for other filesystems like btrfs.
In a perfect world, btrfs would be used for everything and we could let ZFS ride off into the sunset. Unfortunately, btrfs code quality is spotty, and it does not have sufficient development resources behind it to be considered true competition. As long as single-host virtualization continues to be an application people want to deploy, ZFS will continue to be the best option.
A long time ago I played with Ceph (BlueStore didn't exist yet) and I created a situation with broken data - Ceph did not validate it and returned the corrupted data to the program.
"ceph" doesnt care what you put on it. Neither filestore or bluestore will allow pg corruption. You would have to force it to operate with min_size of 1- and if you did that, thats on you- its not the fault of the system for not protecting you from yourself.
 
First of all - why do you want to use ZFS?

In my case, mainly because of replication (ZFS is the only filesystem supported by PVE for replication out of the box).
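For reference, the jobs can be created from the GUI or, if I remember the syntax right, with something like this (VM ID, node name and schedule are just examples):

```
# replicate VM 100 to node pve2 every 15 minutes
pvesr create-local-job 100-0 pve2 --schedule "*/15"
pvesr status
```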



A few points:

* A long time ago I played with Ceph (BlueStore didn't exist yet) and I created a situation with broken data - Ceph did not validate it and returned the corrupted data to the program.

* When an enterprise HDD starts to fail, ZFS reports the problem before you can see anything in S.M.A.R.T.

Excellent points, in particular on Ceph, with which I have no experience yet.
I would like to try it in another lab in the near future, as soon as I have hardware available to play with.
 
Hello EdoFede,
I found your article very useful. However, I'm going crazy with Windows VMs.

I have the following setup:
- Dell PowerEdge R640 with 378GB of ram
- Proxmox installed in a dedicated SSD
- 2 x SSD zfs RAIDZ0 -> zpool storage1
- 4 x HDD 2TB Seagate BarraCuda RAIDZ1 ashift: 12 -> zpool storage2
- SATA controller in AHCI mode

I usually have only Linux containers/VMs on my Proxmox node; however, I need to build a Windows Server 2022 VM, and that's where the problems started.

I configured the VM to run on "storage2", so I expected it could be a bit slow, but instead it was a complete disaster. Installing the OS was a nightmare, and after every task that requires read/write activity the VM also seems to drop off the network, making it impossible to manage it over RDP.
The console also freezes every time, or it's impossible to connect to the console at all.

I tried a couple of ZFS settings with no success, like disabling ZFS sync (`zfs set sync=disabled storage2/VMs`), changing the dirty data value to 64MB or 128MB as suggested in other discussions on the forum, and setting min/max values for the ZFS ARC.
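For completeness, this is roughly how I applied them (the exact values below are just examples; the ARC and dirty-data limits go through the OpenZFS module parameters):

```
zfs set sync=disabled storage2/VMs     # per-dataset, only as a test

# /etc/modprobe.d/zfs.conf (then update-initramfs -u and reboot):
#   options zfs zfs_dirty_data_max=134217728   # 128 MB
#   options zfs zfs_arc_min=4294967296         # 4 GB
#   options zfs zfs_arc_max=17179869184        # 16 GB
```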

I know this is not the best setup ever made; anyway, do you have suggestions or things that I could try?
 
Install a PCIe enterprise HW RAID card and connect the drives to it as RAID-10 (created in the RAID card BIOS). Then create a ZFS RAID0 on top via the Proxmox GUI or shell.

NO. Just no. That is terrible advice, and you are setting this poor guy up for failure down the road when disks start failing.

> I have the following setup:
- Dell PowerEdge R640 with 378GB of ram
- Proxmox installed in a dedicated SSD
- 2 x SSD zfs RAIDZ0 -> zpool storage1
- 4 x HDD 2TB Seagate BarraCuda RAIDZ1 ashift: 12 -> zpool storage2
- SATA controller in AHCI mode

@djzoidberg - First of all, running SSDs in RAID0 is going to be a disaster when one drive dies, taking out the whole pool. Rebuild it as a mirror unless you have reliable daily backups and 1-2 spare drives. Add more 2-drive mirrors if you need more space.

Secondly - Seagate Barracudas are DESKTOP-rated drives (and they don't tend to last long). Putting them in raidz1 gives you the IOPS of one drive, plus the ZFS overhead.

I would buy IronWolf or IronWolf Pro drives (or even better, Toshiba NAS drives for speed - I've had good results with N300 drives) and rebuild as a mirror pool if you need speed out of it.
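i.e. a striped-mirror layout instead of raidz1, something like this (disk IDs are placeholders):

```
# two mirror vdevs: roughly double the IOPS of a single raidz1 vdev, at the cost of capacity
zpool create -o ashift=12 storage2 \
    mirror /dev/disk/by-id/ata-DISK_1 /dev/disk/by-id/ata-DISK_2 \
    mirror /dev/disk/by-id/ata-DISK_3 /dev/disk/by-id/ata-DISK_4
```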

Do NOT put your ZFS pool on top of hardware RAID. You want a SAS HBA in IT mode (actively cooled) unless you like failure and pain, weird errors, and restoring from backups. You could also consider SAS drives, as they are full-duplex where SATA is half-duplex.
 
:)) It's pretty safe to run ZFS on enterprise HW RAID. All disk checks (patrol reads and consistency checks) are performed by the HW RAID firmware.
I've replaced many disks on a hwraid10 with a ZFS swraid0 on top.

How does HW RAID protect from data corruption or bit rot?
 
- 4 x HDD 2TB Seagate BarraCuda RAIDZ1 ashift: 12 -> zpool storage2

Why ashift 12?
On spinning drives it's better to match the logical/physical block size reported by smartctl.
I don't know the 2TB Barracuda, but desktop spinning disks are more likely to have 512-byte sectors, so try with ashift 9.
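i.e. check first, then create accordingly (device names are examples; note that ashift can't be changed after pool creation):

```
smartctl -i /dev/sdb | grep -i 'sector size'    # e.g. "512 bytes logical/physical"

# 512-native disks -> ashift=9; 4Kn or 512e disks -> ashift=12
zpool create -o ashift=9 storage2 raidz1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
```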
 
NO. Just no. That is terrible advice, and you are setting this poor guy up for failure down the road when disks start failing.


Do NOT put your ZFS pool on top of hardware RAID. You want a SAS HBA in IT mode (actively cooled) unless you like failure and pain, weird errors, and restoring from backups. You could also consider SAS drives, as they are full-duplex where SATA is half-duplex.

This entire thread is exactly about this topic.

I quote part of my own post from the previous page.


The entire IT world has been running for decades on filesystems without the advanced data-resilience features that ZFS offers, while maintaining excellent reliability (otherwise we would see these problems on a daily basis).

So personally I see two possibilities:
  • ZFS is so unreliable that it cannot work properly on hardware solutions (on which every other filesystem works fine)
  • The recommendation is simply derived from an exaggerated interpretation of what is actually a very interesting feature, but one whose absence is not so critical
I absolutely lean towards the second hypothesis, and this is precisely why I am trying to understand this situation in depth, since I have not yet found a single documented case of catastrophic ZFS data corruption on enterprise-grade HW RAID solutions.


Just to be clear: I have no interest in necessarily being right. :)
But if I'm wrong, I would like to be corrected with real data in hand, because it can be useful not only to me but also to anyone else reading.

Right now it seems more like a religious discussion than a technical one. "Either you believe what is written, or there is no point in talking about it."
I think that the technical world can only benefit from discussions addressed on a technical level. We're almost all technicians here, aren't we?

I have not yet found detailed posts regarding this "terrible advice" and its consequences when using enterprise-grade hardware.

To be clear, just last week I installed two servers with ZFS, flashing the PERC H330 controllers to IT mode (HBA330), because of the absence of a battery-backed cache.
So I absolutely support this solution, but only when it actually makes technical sense.

On another three servers with ZFS on a proper HW RAID controller, I've already had two disk failures without any issue.
Replaced, rebuilt, not a single error.
 
