Hardware Raid or ZFS

Digitaldaz · Thursday at 16:02

Tomorrow I have 4 x DC600M 3.84TB disks arriving. The whole reason I am getting these is due to very slow restores when using HDD.

The question now is do I use hardware raid or zfs. I have a hardware raid card with BBU and also LSI HBAs so I can go with either.

Also in terms of raid level, I would generally always use 10 or striped mirrors using zfs.

Would using something like RAID5/RAIDZ have a significant impact?

Any input would be appreciated.

Thanks
Daz

tcabernoch · Thursday at 20:02

Oh, you just wanted to start a ZFS fight to kick off the new year, didn't you?
Have you done this every year since 2014? So its like a decade now? That's cool. I like a good nerd battle.

I know you're teasing, but the question deserves a response.

ZFS gives you additional capabilities that LVM does not, particularly in a clustered Proxmox environment.
There's an overhead to these capabilities. A virtualized filesystem like ZFS will never be quite as fast LVM.

ZFS is still maturing, and has recently adapted to new high-speed storage with Special VDEVs. The metadata and small blocks get stored on fast SSD, while large blocks go to the array. Consider it.

Special vdev
https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954

More special vdev. This is gold.
https://klarasystems.com/articles/openzfs-understanding-zfs-vdev-types/

waltar · Thursday at 20:20

tcabernoch said:
A virtualized filesystem like ZFS will never be quite as fast LVM.

Digitaldaz said:
The whole reason I am getting these is due to very slow restores when using HDD.

I think it's for a pbs installation ...

tcabernoch said:
with Special VDEVs. The metadata and small blocks get stored on fast SSD, while large blocks go to the array.

Additional mirrored nvme's to 4x (4TB) sata ssd's are quiet unnormal ...

If you are a zfs fan go with zfs. A zfs dual mirror with lz4 would give about same capacity as with hw-raid5 while hw-raid10 just 2/3 (of dual mirror).
If you are a hw-raid fan ... is your raid-ctrl. ready for ssd's (eg has 4GB cache means actual) otherwise you may observe bad iops.

tscret · Thursday at 20:53

Digitaldaz said:
Tomorrow I have 4 x DC600M 3.84TB disks arriving. The whole reason I am getting these is due to very slow restores when using HDD.

The question now is do I use hardware raid or zfs. I have a hardware raid card with BBU and also LSI HBAs so I can go with either.

Also in terms of raid level, I would generally always use 10 or striped mirrors using zfs.

Would using something like RAID5/RAIDZ have a significant impact?

Any input would be appreciated.

Thanks
Daz

Hi Daz

Answer is as allway it depends... when you're host is built with some extra RAM (ECC?) and CPU I'd prefer ZFS over an HW RAID.
The snapshot and replication possibilities of ZFS are usefull and even if you're RAID-Controller might fail, you easy could export / import the ZFS-Pool on an other Machine later on. With four 3.84 Disks I'd go for a RAIDZ1 (simplified 1 Disk parity).

May have an read on this: https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_raid_considerations

!! One small point I struggled sometimes was the vzdump Backup of LXC Containers on ZFS Storage - but there are Tutorial in this Forum how to edit the Config !!

Other Point: Is it an Proxmox Backup Server or an PVE? And are you sure that not the SATA Interface might be the bottleneck of your "slow" restores?

Regards
tscret

UdoB · Thursday at 20:55

Digitaldaz said:
Would using something like RAID5/RAIDZ have a significant impact?

For VM storage? My thoughts about it: https://forum.proxmox.com/threads/fabu-can-i-use-zfs-raidz-for-my-vms.159923/

tcabernoch · Thursday at 21:10

:]

Haha. I was in first!

Might as well summon the devil too. This team produced a testing script that can be used to benchmark your installation. If you run it before you rebuild, you'll be able to quantify the improvements and accurately evaluate any subsequent tuning. The script is not for the faint of heart, and (of course) should not be run on an in-use production system.

PBS .chunks Abuse Script - Der Harry

D

[TUTORIAL] Thread 'Datastore Performance Tester for PBS'

Jun 11, 2024

This is a performance tester for datastores of your PBS.

-> Intended before you setup a production PBS. <-

Bash:

apt-get update
apt-get install git
git clone https://github.com/egandro/pbs-storage-perf-test.git
cd pbs-storage-perf-test
# replace datastore-dir with your own directory
./create_random_chunks.py /datastore-dir
rm -rf /datastore-dir/dummy-chunks

- Read how/what/and why we test in this way: https://github.com/egandro/pbs-storage-perf-test/blob/main/README.md
- Our results: https://github.com/egandro/pbs-storage-perf-test/blob/main/results.md
- Our conclusion...

waltar · Thursday at 22:15

I would hardly prefere elbencho for simulating pbs throughput on any raid config - https://github.com/breuner/elbencho
Generate pbs like data:
elbencho -w -d -t 2 -n 256 -N 512 -s 4m -b 1m /your_test_raid_set_mount(/dir|dataset)
Read data:
elbencho -r -t 2 -n 256 -N 512 -s 4m -b 1m /your_test_raid_set_mount(/dir|dataset)

Digitaldaz · Thursday at 23:50

Thanks for all your replies.

This is for a PBS server. The bottleneck is definitely not the SATA interface. I have been doing some migrating of VMs between various datacenters and this is where the slowness of the restore came to light.

Its certainly PBS related as to speed things up I started just backing up to NFS, ie full backup and then restoring from there which was much faster.

It was when I started digging into the slowness of it that I came across recommendations for PBS that said you should not use HDD.

This is when I decided to build a new PBS server. I do have time to do some testing as I'm not going to upgrade the existing one, I'm going to replace it.

fraroy · 2025-01-06T05:00:31+0100

Daz,

I'm also building a pbs server with same drives, and I'd like to add a third option to your query. But first a description of my main objective: I want restores to the pve hosts to saturate the 10Gbps links (100% utilization) of the network. Since a single SATA interface has an effective throughput of approx. 5Gbps (for SSDs such as the DC600M), you need to add up the bandwidth of at least 2 drives to get 10Gbps. How you organize those drives (parity raid, mirror, stripe...) is the next logical question. Contrary to common wisdom, I chose stripe (raid0). This is a homelab and my resources are limited. Backups are very important to me, but not required to run the "production" load. To alleviate the future failure of the stripe, after each backup window the pbs datastore residing on the stripe is synched to a removable datastore (external sata HDD). The pbs server is powered-on only a few hours every week, most of this time spent synching the just finished backup to the external drive. I have 2 external drives, which are rotated offsite.

Now, back to the original question. Hardware raid has been out of my scope from the start. The pbs "server" (a desktop tower, i7-10700 with dual-ports 10Gbps nic) has no pci slot avalaible. As for software raid options with pbs, I benchmarked zfs and btrfs. btrfs consumes less cpu and much less RAM than zfs. It also came on top for effective throughput. I tried to tweak the benchmark to account for the very specific load of pbs, a continuous stream of files, sizes between 1Mbytes and 4Mbytes. Here are the results for a stripe of three DC600M 7.6TB:

I can provide detailed configuration of setup and also exact commands used for the benchmark. My hypothesis: zfs is hindered by it's cache and all associated overhead for a continuous stream of files that are referenced only once during a job.

I copied a .chunks directory over the network to confirm the benchmark results. To rule out the impact of encryption, I used rsync. The pbs server was configured with the rsync daemon and the pve host pulled the .chunks directory from pbs, writing it to the VMs datastore (a pair of mirrored pcie4.0 NVMe). This flow mimicks a restore from the network and storage points of view. With a single rsync task, the observed throughput:

tlsx3008f-1#show system-info int te 1/0/2
Port RX Utilization - 0.18%
Port TX Utilization - 77.39%

tlsx3008f-1#show system-info int te 1/0/2
Port RX Utilization - 0.18%
Port TX Utilization - 75.87%

tlsx3008f-1#show system-info int te 1/0/2
Port RX Utilization - 0.15%
Port TX Utilization - 74.68%

tlsx3008f-1#show system-info int te 1/0/2
Port RX Utilization - 0.19%
Port TX Utilization - 77.43%

tlsx3008f-1#show system-info int te 1/0/2
Port RX Utilization - 0.18%
Port TX Utilization - 76.52%

Note the throughput isn't perfectly steady in this sample, varying between 7,46Gbps and 7,74Gbps. I then started two simultaneous instances of rsync, the first copying chunks from 0000 to 7fff and the second copying chunks from 8000 to ffff, saturating the link:

tlsx3008f-1#show system-info int te 1/0/2
Port RX Utilization - 0.77%
Port TX Utilization - 99.94%

tlsx3008f-1#show system-info int te 1/0/2
Port RX Utilization - 1.02%
Port TX Utilization - 99.92%

tlsx3008f-1#show system-info int te 1/0/2
Port RX Utilization - 0.73%
Port TX Utilization - 99.97%

tlsx3008f-1#show system-info int te 1/0/2
Port RX Utilization - 0.89%
Port TX Utilization - 99.94%

tlsx3008f-1#show system-info int te 1/0/2
Port RX Utilization - 0.80%
Port TX Utilization - 100.00%

Next steps: finish customizing the pbs server, then rebuild the two pve hosts, from 8.1 to 8.3. Again, if you have any questions about my setup, please don't hesitate.

François

UdoB · 2025-01-06T08:49:33+0100

fraroy said:
Contrary to common wisdom, I chose stripe (raid0). ... . Backups are very important to me

For me this doesn't make sense. Especially as I realize you know this already: as soon as one single device fails (and at some point in the future it will - without any doubt) the whole pool is toast. I have more than one PBS in my Homelab, and still the third one has redundant drives. (Mostly older hardware with ZFS, turned on only once per week.)

Of course it is fine, as it seems to work for you. Until it doesn't...

Good luck! (I mean it!)

_gabriel · 2025-01-06T09:01:36+0100

fraroy said:
To rule out the impact of encryption, I used rsync. [...] . This flow mimicks a restore from the network and storage points of view.

PBS flow isn't sequential as rsync.
PBS restore task get each required chunks listed in the backup manifest (=pbs call it snapshot).

Do you have numbers about real PBS restore ?

waltar · 2025-01-06T09:16:10+0100

For pbs read performance is important as writes could be neglect nearly total.

fraroy · 2025-01-07T05:37:08+0100

_gabriel said:
PBS flow isn't sequential as rsync.
PBS restore task get each required chunks listed in the backup manifest (=pbs call it snapshot).

Do you have numbers about real PBS restore ?

Not yet. I understand your point about rsync. This is the best approximation I could come up with at this point. PBS is performing a lot more (indexing, compression, encryption...) in the background, so I'll be very surprised if it can reach the same speed. I just needed to set a baseline for the network and storage. Also, consider this:

When I look at the distribution of file sizes in the test .chunks directory I have here, over 85% of the files (count) are larger than 1M, occupying 98% of the total space. A surprising number of chunks are larger than 5M. In this post from Ramalama, you can see similar numbers. My point being files of size greater than 1M should be considered as "sequential" (if sector size of drive is 4K, this means >256 hopefully contiguous sectors, and even more so with 512b drives). On the other hand, each of those chunks/file will be written only once (dedup) and scattered all over the multi-terabyte datastore, this is the "random" part. Maybe I'm wrong, but SSDs should have no problem reading scattered files at line rate, provided they are large enough. Many benchmarks will use 1M blocksizes to test sequential access.

fraroy · 2025-01-07T16:19:21+0100

waltar said:
For pbs read performance is important as writes could be neglect nearly total.

For me too the read throughput is important as my first objective is very fast restores, ideally saturating the 10Gbps network. I wanted to thank you for suggesting elbencho with parameters to replicate pbs workloads. I read your post as I was completing raid0 zfs vs btrfs benchmarking. Btrfs was a little ahead in throughput and also consumed less cpu&memory, but nothing decisive. However the read througputs with elbencho (722MiB/S vs 1322MiB/s) made the choice very easy. I tripled checked this result. I even recreated the zfs pool with ashift=12 (instead of 9), but the resulting throughputs were the same, with btrfs still a clear winner.

waltar · 2025-01-07T16:52:30+0100

ashift should just make rosa noise to benchmark results but recordsize from default 128k to 1M make big difference on hdd pool while then metadata performance go down extreme which then again could be neclect if implement a special device. Don't know yet if in a ssd (nvme) pool a recordsize would a make a throughput win or not but if you like zfs you should try that too with elbencho read.
Nevertheless I like ever to test the extreme for a future use, eg. if a fileserver normally has 40M files I test before production with 100M (smaller eg. elbencho or/and couple of some copies of production) files which give validity to not run into problems next time as even a storage should always run a "burn in" (but that's personally).

IsThisThingOn · 2025-01-07T17:02:18+0100

fraroy said:
My hypothesis: zfs is hindered by it's cache

ZFS does not have "cache"

fraroy said:
This flow mimicks a restore from the network and storage points of view

There are multiple possible bottlenecks that are not storage based.
There is a reason why there is a PBS benchmark tool. Use that.

fraroy said:
However the read througputs with elbencho (722MiB/S vs 1322MiB/s) made the choice very easy. I tripled checked this result. I even recreated the zfs pool with ashift=12 (instead of 9), but the resulting throughputs were the same, with btrfs still a clear winner.

ZFS just like BTRFS can operate at hard drive speeds. If you notice a difference, that is because you "misconfigured" something.

Anyway, since you don't care about availability, and PBS does not profit from a CoW filesystem. I would probably recommend just enabling RAID in your BIOS (which could be faster if it CPU based like VROC) and use PBS with the none CoW filesystem ext4.

waltar · 2025-01-07T17:09:07+0100

IsThisThingOn said:
Also for me personally, speeds of creating a backup are way more important than restoring one.

For shure backup is daily and restore perhaps better but reads are harder to reach and writes of 1-4MB files are easy and even for zfs - so creating pbs files is not a problem on a pbs host with local zfs but eg. for a pbe which uses a nfs store for.

fraroy · 2025-01-07T17:16:49+0100

for future reference here is the zfs config I used for benchmarking (most parameters straight from Ramalama in this thread)

zfs.conf:
options zfs zfs_arc_min=17179869184
options zfs zfs_arc_max=21474836480
#
options zfs zfs_flags=0x10 < - - < < checked with/without and only a few percent penalty, so left enabled since no ECC RAM

then:

zpool create \
-m /mntzfs/sataraid0 \
-o ashift=9 \ < - - < < also tried 12 for a few spotchecks, very little difference
-O compression=off \
-O xattr=sa \
-O dnodesize=auto \
-O recordsize=1M \
sataraid0 \
/dev/disk/by-id/ata-KINGSTON_SEDC600M7680G_****redacted**** \
/dev/disk/by-id/ata-KINGSTON_SEDC600M7680G_****redacted**** \
/dev/disk/by-id/ata-KINGSTON_SEDC600M7680G_****redacted****

Search

Search

Hardware Raid or ZFS

Digitaldaz

Renowned Member

tcabernoch

Active Member

waltar

Renowned Member

tscret

New Member

UdoB

Distinguished Member

tcabernoch

Active Member

[TUTORIAL] Thread 'Datastore Performance Tester for PBS'

waltar

Renowned Member

Digitaldaz

Renowned Member

fraroy

New Member

UdoB

Distinguished Member

_gabriel

Renowned Member

waltar

Renowned Member

fraroy

New Member

fraroy

New Member

waltar

Renowned Member

IsThisThingOn

Active Member

waltar

Renowned Member

fraroy

New Member