Yet another "ZFS on HW-RAID" Thread (with benchmarks)

Exactly: two mirrored consumer-grade NVMe drives (Transcend MTE220S), no PLP, but it's just an experiment.
In the same test run a while ago with plain ZFS on raw disks they brought an improvement, but combined with the HW RAID with BBU cache they seem to become a bottleneck on the DB workload (I didn't expect it to be this big).

I'll also do some fio testing on the host.
 
Well, you're sending all the sync writes (from the DB, which can easily be sync-write heavy for data integrity) to devices that can't handle them very well (instead of using the HW RAID that can)...
 
Yeah, but I didn't expect that from these NVMe drives.
I ran a fio benchmark and they deliver about 1/4 of the declared IOPS.
I'll check whether it's a problem with the PCIe adapter card (even though it shouldn't be... it uses the motherboard's bifurcation) or a configuration mistake.
 
A SLOG's only purpose is handling sync writes; that's all it does. Consumer SSDs have no PLP, so they can't cache sync writes in DRAM, which means horrible performance. Using a consumer SSD is similar to running a HW RAID without cache+BBU. So it makes no sense to use anything except an enterprise SSD for a SLOG.
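For illustration, attaching a dedicated SLOG looks like this (a sketch only; pool and device names are placeholders, and ideally you would use a mirrored pair of PLP-protected SSDs):

```
# Attach a mirrored SLOG vdev to an existing pool
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# Verify: the log vdev shows up as a separate section in the pool layout
zpool status tank
```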

You probably didn't run those fio benchmarks with sync writes, but with async ones?
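For anyone who wants to reproduce the comparison, something like the following fio runs shows the gap (a sketch only; the device path is a placeholder, and writing to it destroys its contents, so point it at a scratch disk or a test file):

```
# Async 4k random writes at high queue depth (what spec-sheet IOPS usually reflect)
fio --name=async-randwrite --filename=/dev/nvme0n1 --ioengine=libaio \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --direct=1 \
    --runtime=60 --time_based --group_reporting

# Sync 4k random writes at queue depth 1 (closer to a DB/SLOG workload)
fio --name=sync-randwrite --filename=/dev/nvme0n1 --ioengine=psync \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --direct=1 --sync=1 \
    --runtime=60 --time_based --group_reporting
```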
 
Of course, but it's only a test with some NVMe drives bought for another project that has been delayed.

However, the tests show that these drives struggle to reach their declared performance (I only saw IOPS figures anywhere near the spec at a certain queue depth... and the manufacturer doesn't publish the test conditions behind the declared numbers, so...).
NUMA is probably also a factor (running the test pinned to a specific NUMA node gives a more acceptable result), given that the server has two CPUs and the adapter cards are connected one to the first CPU and one to the second.
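As a sketch of how such a NUMA-pinned run can be done (device path and node number are placeholders):

```
# Find which NUMA node the NVMe card sits on
cat /sys/class/nvme/nvme0/device/numa_node

# Run the benchmark bound to that node's CPUs and memory
numactl --cpunodebind=1 --membind=1 fio --name=local-node-test \
    --filename=/dev/nvme0n1 --ioengine=libaio --rw=randwrite --bs=4k \
    --iodepth=32 --direct=1 --runtime=60 --time_based --group_reporting
```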

Just to play a little more :)

The test reported in my previous post was done via HammerDB inside a VM, so sync writes.
The fio benchmark was done with async writes, and even there the drives fall short of their declared specs.

I have no need for a SLOG on this system; most of the VMs run accounting or small ERP software with few users and low load.


Anyway, speaking of SLOG, I had searched for NVMe M.2 drives with PLP for another project involving NFS and sync writes (for a recording studio), but I only found this old and messy list online:
https://www.truenas.com/community/threads/list-of-ssds-with-power-loss-protection.63998/

There seems to be very little choice.

Is there an updated and well-made list somewhere?
 
It's German:
2280 M.2 with PLP (so the only options are Kingston DC1000B and Micron 7300/7400/7450 Max/Pro, up to 1TB): https://geizhals.de/?cat=hdssd&xf=4643_Power-Loss+Protection~4832_3~4836_7
22110 M.2 with PLP (so Micron 7300/7450 Pro, Samsung PM983/9A3, Solidigm D7-P4511 and up to 4TB): https://geizhals.de/?cat=hdssd&xf=4643_Power-Loss+Protection~4832_3~4836_8

And yes, there are only a few models. Enterprise SSDs usually use U.2/U.3 and not M.2, as M.2 is a terrible form factor for everything except laptops and thin clients...
 
Totally agree.
Unfortunately, on some servers it is a "forced" choice given that they only have SAS/SATA backplanes.

In the project mentioned above I'll have to recondition two dated servers, and unfortunately the only way to add NVMe drives while avoiding SATA is to use adapter cards and M.2 drives :(

Yes, very little choice... understandable anyway, given the size of the form factor and the need to have the capacitor bank on-board.
Furthermore, they are practically all rated for "read intensive" use, so they're probably also unsuitable, or at least not the best choice, for this purpose.

Thanks for the links!
I used that site some time ago for a parametric search of SATA SSDs, but I had forgotten about it!! :D
 
Hi everybody in this thread,
I'm reading the posts carefully, because I'm more interested in reliability than in performance.
In addition, I've been working with VMware for about 20 years, and we mostly used SAN-based shared storage and let VMware handle the filesystem. Why is it not recommended to do the same with Proxmox? See, I'm new to Proxmox and I'd like to plan and build our little environment "the right way".

As far as I see things, if an enterprise-class storage system does its job, then a host sees LUNs as local disks and the storage system keeps the LUNs consistent. This is what I've experienced over many years with systems like VMware or Windows. So it's the job of the host to keep the filesystem consistent. Why are such precautions needed when thinking about LUNs for "ZFS over iSCSI"?
And why is "LVM / ext4 over iSCSI" not that much of an issue?
For my use case replication is optional, but shared iSCSI storage is a must, since I already have such hardware in place and can't replace it quickly.

Aside from all this: I find it hard to see the advantage of spending compute cycles of the main CPUs, which also drive the VMs, on software ZFS RAID rather than offloading this job to a dedicated controller.
 
First of all: why do you want to use ZFS?

All the other stuff can do that too: HW RAID, mdadm, LVM, filesystems, etc.

But none of them do data integrity checking, except other filesystems like btrfs.

If you don't need data integrity validation before the data reaches the program, then use regular tools and be happy.

If you want online data validation and automatic online healing, don't make it hard for ZFS to detect problems.


A few points:

* A long time ago I played with Ceph (this was before BlueStore existed) and I created a situation with broken data; Ceph did not validate it and returned corrupted data to the program.

* When an enterprise HDD starts to go bad, ZFS reports the problem first, before you can see anything in S.M.A.R.T.
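For reference, that early detection comes from ZFS's per-block checksums; a scrub plus the pool status is where it surfaces (pool name is a placeholder):

```
# Read and verify every block against its checksum
zpool scrub tank

# Per-device read/write/checksum error counters and any affected files
zpool status -v tank

# Quick health check across all pools (prints only pools with problems)
zpool status -x
```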
 
In addition, I've been working with VMware for about 20 years, and we mostly used SAN-based shared storage and let VMware handle the filesystem. Why is it not recommended to do the same with Proxmox?
No one is recommending against this ;) The only caveat with this approach is the loss of snapshots, as you have to use LVM thick provisioning.

So it's the job of the host to keep the filesystem consistent. Why are such precautions needed when thinking about LUNs for "ZFS over iSCSI"? And why is "LVM / ext4 over iSCSI" not that much of an issue?
ZFS is not multi-initiator aware; as such, it doesn't allow multiple hosts to connect to a filesystem simultaneously. It's possible to shoehorn such a solution on top of additional corosync management, but this is not included in Proxmox. LVM is clusterable, but as mentioned above it has its limitations.
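For completeness, the clusterable-LVM route is usually wired up roughly like this on PVE (a rough sketch with made-up names; the storage IDs, portal, IQN, and LUN path are placeholders):

```
# Make the SAN LUN visible on every node (no VM disks directly on the iSCSI layer)
pvesm add iscsi san0 --portal 192.0.2.10 --target iqn.2005-10.org.example:lun0 --content none

# Put a volume group on the LUN (run once, on one node; <lun-path> is a placeholder)
pvcreate /dev/disk/by-path/<lun-path>
vgcreate vg_san /dev/disk/by-path/<lun-path>

# Register the VG cluster-wide; 'shared' tells PVE every node may activate it (thick LVs only)
pvesm add lvm san-lvm --vgname vg_san --shared 1 --content images
```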
Aside from all this: I find it hard to see the advantage of spending compute cycles of the main CPUs, which also drive the VMs, on software ZFS RAID rather than offloading this job to a dedicated controller.
External storage that can do both block- and file-level checksums, snapshots, inline compression, and deduplication has a cost. Even then, without specialized storage drivers, snapshots and alternate streams would not be visible to the host. ZFS allows you to achieve all of that with only your host; the host performance requirements are negligible for mirror pools, and the only direct cost is the added RAM (and you wouldn't use parity RAID for VM storage anyway).
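As a small illustration of that point (pool and dataset names are placeholders; checksums are on by default and shown only for completeness):

```
# All of this lives on the single host, per dataset
zfs set compression=lz4 tank/vmdata
zfs get checksum tank/vmdata            # 'on' (fletcher4) unless explicitly disabled
zfs snapshot tank/vmdata@pre-upgrade
zfs list -t snapshot -r tank/vmdata
```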

iSCSI/FC-AL/FCF isn't necessarily required for shared storage in a vSphere environment; vSAN, Nutanix, etc. exist. The same goes for Proxmox: Ceph is integrated into the distribution :)

First of all: why do you want to use ZFS?
see above :)

But none of them do data integrity checking, except other filesystems like btrfs.
In a perfect world, btrfs would be used for everything and we could let ZFS ride off into the sunset. Unfortunately, btrfs code quality is spotty, and it does not have sufficient development resources behind it to be considered true competition. As long as single-host virtualization remains something people want to deploy, ZFS will continue to be the best option.
A long time ago I played with Ceph (this was before BlueStore existed) and I created a situation with broken data; Ceph did not validate it and returned corrupted data to the program.
"ceph" doesn't care what you put on it. Neither FileStore nor BlueStore will allow PG corruption. You would have to force it to operate with a min_size of 1, and if you did that, that's on you; it's not the fault of the system for not protecting you from yourself.
 
First of all: why do you want to use ZFS?

In my case, mainly because of replication (ZFS is the only filesystem supported by PVE for out-of-the-box replication).
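For the record, such a replication job can be created from the CLI roughly like this (a sketch; the guest ID, job number, target node name, and schedule are placeholders):

```
# Replicate guest 100 (job 0) to node pve2 every 15 minutes
pvesr create-local-job 100-0 pve2 --schedule "*/15"

# Check the state of all replication jobs
pvesr status
```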



A few points:

* A long time ago I played with Ceph (this was before BlueStore existed) and I created a situation with broken data; Ceph did not validate it and returned corrupted data to the program.

* When an enterprise HDD starts to go bad, ZFS reports the problem first, before you can see anything in S.M.A.R.T.

Excellent pointers, in particular on Ceph, with which I have no experience yet.
I would like to try it in another lab in the near future, as soon as I have hardware available to play with.
 
Hello EdoFede,
I found your article very useful. However, I'm going crazy with Windows VMs.

I have the following setup:
- Dell PowerEdge R640 with 378GB of ram
- Proxmox installed in a dedicated SSD
- 2 x SSD zfs RAIDZ0 -> zpool storage1
- 4 x HDD 2TB Seagate BarraCuda RAIDZ1 ashift: 12 -> zpool storage2
- SATA controller in AHCI mode

I usually only have Linux containers/VMs on my Proxmox node; however, I need to build a Windows Server 2022 VM, and that's where the problems started.

I configured the VM to run on "storage2", so I expected it to be a bit slow, but instead it was a complete disaster. Installing the OS was a nightmare, and after every task that requires read/write operations the VM also seems to drop off the network, making it impossible to manage it over RDP.
The console interface also freezes every time, or it's impossible to connect to the console at all.

I tried a couple of ZFS settings with no success, like disabling sync (`zfs set sync=disabled storage2/VMs`), changing the dirty data limit to 64MB or 128MB as suggested in other discussions on the forum, and setting min/max values for the ZFS ARC.
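(For anyone following along, the dirty-data and ARC limits referred to here are ZFS module parameters; a sketch of how they are typically applied, with example values only, not a recommendation:)

```
# Runtime change (lost at reboot): cap the ARC at 8 GiB
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# Persistent: 8 GiB ARC cap and 128 MiB dirty-data limit
# (note: this replaces any existing zfs.conf), then rebuild the initramfs
echo "options zfs zfs_arc_max=8589934592 zfs_dirty_data_max=134217728" \
    > /etc/modprobe.d/zfs.conf
update-initramfs -u
```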

I know this is not the best setup ever made; by the way, do you have suggestions or things that I could try?
 
Install an enterprise PCIe HW RAID card and connect the drives to it as RAID 10 (created with the RAID card's BIOS). Then create a ZFS RAID0 on top via the Proxmox GUI or shell.

NO. Just no. That is terrible advice, and you are setting this poor guy up for failure down the road when disks start failing.

> I have the following setup:
- Dell PowerEdge R640 with 378GB of ram
- Proxmox installed in a dedicated SSD
- 2 x SSD zfs RAIDZ0 -> zpool storage1
- 4 x HDD 2TB Seagate BarraCuda RAIDZ1 ashift: 12 -> zpool storage2
- SATA controller in AHCI mode

@djzoidberg - First of all, running SSDs in RAID0 is going to be a disaster when 1 drive dies, taking out the whole pool. Rebuild as a mirror unless you have reliable, daily backups and 1-2 spare drives. Add more 2-drive mirrors if you need more free space.

Secondly - Seagate Barracudas are DESKTOP-rated drives (and they don't tend to last long). Putting them in raidz1 gives you the IOPS of one drive, plus the ZFS overhead.

I would buy Ironwolf or Ironwolf Pro drives (or even better, Toshiba NAS for speed - I've had good results with N300 drives) and rebuild as a mirror pool if you need speed out of it.
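A sketch of what that mirror layout looks like (pool and device names are placeholders, and `zpool create` wipes whatever is on those disks):

```
# Two 2-drive mirrors striped together (RAID10-like) instead of raidz1
zpool create storage2 mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

# Later, extend capacity and IOPS by adding another 2-drive mirror vdev
zpool add storage2 mirror /dev/sde /dev/sdf
```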

Do NOT put your ZFS pool on top of hardware RAID. You want a SAS HBA in IT mode (actively cooled) unless you like failure and pain, weird errors, and restoring from backups. You could also consider SAS drives, as they are full-duplex where SATA is half-duplex.
 
:)) ZFS on enterprise HW RAID is pretty safe. All disk checks (patrol reads and consistency checks) are performed by the HW RAID firmware.
I've replaced many disks on HW RAID10 with ZFS software RAID0 on top.

How does HW RAID protect against data corruption or bit rot?
 
- 4 x HDD 2TB Seagate BarraCuda RAIDZ1 ashift: 12 -> zpool storage2

Why ashift 12?
On spinning drives it's better to match the logical/physical block size reported by smartctl.
I don't know the 2TB Barracuda, but desktop spinning disks are more likely to have 512-byte sectors, so try ashift 9.
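A quick way to check what the drive reports and to set ashift accordingly at creation time (device names are placeholders; which ashift is actually right for these particular disks is exactly the open question above):

```
# What the firmware reports (512n, 512e or 4Kn)
smartctl -i /dev/sda | grep -i sector

# ashift is fixed per vdev at creation: 9 = 512-byte sectors, 12 = 4K sectors
zpool create -o ashift=9 storage2 raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd
```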
 
NO. Just no. That is terrible advice, and you are setting this poor guy up for failure down the road when disks start failing.


Do NOT put your ZFS pool on top of hardware RAID. You want a SAS HBA in IT mode (actively cooled) unless you like failure and pain, weird errors, and restoring from backups. You could also consider SAS drives, as they are full-duplex where SATA is half-duplex.

This entire thread is exactly about this topic.

I quote part of my own post from the previous page.


The entire IT world has been running on file systems for decades without the advanced data resilience features that ZFS offers, while maintaining excellent reliability (otherwise we would have these problems on a daily basis).

So personally I see two possibilities:
  • ZFS is so unreliable that it cannot work properly on hardware solutions (on which every other filesystem works fine)
  • The recommendation simply derives from an exaggerated interpretation of what is actually a very interesting feature, whose absence is not that critical
I absolutely lean towards the second hypothesis, and this is precisely why I am trying to understand this situation in depth, since I have not yet found a single documented case of catastrophic ZFS data corruption on an enterprise-grade HW RAID solution.


Just to be clear: I have no interest in necessarily being right. :)
But if I'm wrong, I would like to be corrected with real data in hand, because it can be useful not only to me but also to anyone else reading.

Right now it seems like a religious discussion more than a technical discussion. "Either you believe what is written, or there is no point in talking about it."
I think the technical world can only benefit from discussions conducted on a technical level. We're almost all technicians here, aren't we?

I have not yet found a detailed account of this "terrible advice" and its consequences when using enterprise-grade hardware.

To be clear, just last week I installed two servers with ZFS, flashing the PERC H330 controllers to IT mode (HBA330) because they lack a battery-backed cache.
So I absolutely support this solution, but only when it actually makes technical sense.

On another three servers with ZFS on a proper HW RAID controller, I've already had two disk failures without any issue.
Replaced, rebuilt, not a single error.
 
