ZFS Config Help for Proxmox Backup Server (PBS) - 22x 16TB HDDs (RAIDZ2 vs. dRAID2)

Nov 8, 2024
Hi everyone.

I’m building a dedicated Proxmox Backup Server and would appreciate your feedback on the best ZFS layout for my hardware. My primary goals are high random-I/O performance (for garbage collection and small writes), robust data integrity, and reasonable capacity.

Hardware

  • 22 × 16 TB HDDs (2 reserved as hot spares)
  • 2 × 3.84 TB MU NVMe (to be used as a mirrored special-metadata vdev)
  • 2 × 480 GB RI NVMe (mirrored OS boot; remaining partitions for SLOG)

Option A: Traditional RAIDZ2

  • Data pool: 2 × 10-drive RAIDZ2 vdevs
  • Hot spares: 2 HDDs as global spares
  • Performance vdevs:
    – Special metadata: mirror of 3.84 TB NVMes
    – SLOG: mirror of leftover 480 GB NVMe partitions
Pros: Well-understood, excellent parallel I/O from two vdevs, proven reliability
Cons: Slow resilver on 16 TB drives
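For reference, the create command I have in mind for Option A would look roughly like this (pool name, drive letters and partition numbers are placeholders; I'd use /dev/disk/by-id paths in practice):

Code:
    zpool create backup \
        raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj \
        raidz2 sdk sdl sdm sdn sdo sdp sdq sdr sds sdt \
        spare sdu sdv \
        special mirror nvme0n1 nvme1n1 \
        log mirror nvme2n1p4 nvme3n1p4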


Option B: dRAID2

  • Data pool: single draid2:10d:22c:2s vdev (10 data + 2 parity per stripe, 2 distributed spares)
  • Performance vdevs: same NVMe mirrors as in Option A
Pros: 5–10× faster resilver, instant use of distributed spares
Cons: Fixed stripe width reduces space efficiency for small blocks (they are padded to a full stripe), still only 2 drive failures tolerated
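And Option B as a single dRAID vdev (same placeholder names; the two spares are part of the draid2:10d:22c:2s layout, so there are no separate spare devices):

Code:
    zpool create backup \
        draid2:10d:22c:2s \
        sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk \
        sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv \
        special mirror nvme0n1 nvme1n1 \
        log mirror nvme2n1p4 nvme3n1p4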


My Questions:

  1. Random I/O & Throughput: Which option gives better IOPS/throughput for PBS workloads?
  2. Resilver Time & Risk: Real-world resilver times on 16 TB drives—does dRAID’s speed justify its lower failure tolerance?
  3. Capacity Efficiency: Post-parity usable space difference between configurations?
  4. NVMe Metadata/SLOG: Does using a special vdev and SLOG make the HDD layout choice less critical?
  5. Complexity vs. Expansion: For a fixed, large pool, is dRAID worth the added complexity over RAIDZ2?
I’m currently leaning toward 2 × 10-drive RAIDZ2 for its maturity, but dRAID’s faster rebuilds are tempting. Any real-world experience, benchmarks, or tuning tips would be hugely helpful!

Thanks in advance.
 
2 × 480 GB RI NVMe (mirrored OS boot; remaining partitions for SLOG)
A SLOG only helps with sync writes. As far as I know, PBS writes its .chunks "normally", i.e. async. (Can someone prove me wrong?)

And that SLOG only has to absorb the incoming data of a single 5-second interval while a second TXG (transaction group) is being flushed out to the HDDs, so it holds at most about 10 seconds' worth of data. With 1 GBit/s you can get 125 MB/s × 5 s × 2 TXGs = 1.25 GB. With 10 GBit/s that's 12.5 GB, and with 100 GBit/s it's 125 GB. If the SLOG is larger than that, the additional space will never be used.
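If you want to verify that on your own box, watch the log vdev during a backup run and see whether it gets any writes at all. A quick sketch, assuming a pool named backup with a datastore dataset backup/pbs (both placeholders):

Code:
    # Per-vdev I/O stats once per second; watch the "logs" section during a backup.
    zpool iostat -v backup 1

    # The dataset's sync setting (default "standard" = only honor explicit sync requests).
    zfs get sync backup/pbs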

The Special Device is much more important - and worth it.

If you go for RaidZ2, the Special Device should be a three-way mirror. RaidZ2 tolerates losing two HDDs --> the special vdev should also tolerate two device failures, because losing the special vdev means losing the pool.
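A minimal sketch of what that looks like (device and dataset names are placeholders, and the special_small_blocks value is just an example, not a PBS recommendation):

Code:
    # Three-way special mirror at creation time:
    #   special mirror nvme0n1 nvme1n1 nvme2n1

    # Or, if the pool already exists with a two-way special mirror,
    # attach a third NVMe to that mirror:
    zpool attach backup nvme0n1 nvme2n1

    # Optionally let small blocks land on the special vdev too (per dataset):
    zfs set special_small_blocks=64K backup/pbs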


Which option gives better IOPS/throughput for PBS workloads?
Mirrored vdevs! Plus that Special Device.

Restoring large VMs will always be slow if your main storage is a) HDD and b) spread over only a couple of vdevs. Remember that a RaidZx vdev delivers roughly the IOPS of a single drive!

And for reading data the physical heads still have to seek a zillion times - with or w/o Special Device! There is no optimization like on the write path, where ZFS can aggregate a whole 5-second TXG into mostly sequential I/O.
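If you want numbers rather than theory, a small random-read test directly on the datastore path makes that per-vdev IOPS limit visible. A rough sketch, assuming the datastore lives at /mnt/datastore/pbs (path, sizes and job counts are placeholders, not a tuned PBS benchmark):

Code:
    # Small random reads -- the pattern that hurts most on wide RAIDZ vdevs.
    fio --name=pbs-randread --directory=/mnt/datastore/pbs \
        --rw=randread --bs=4k --size=8G --numjobs=4 --iodepth=16 \
        --ioengine=libaio --runtime=60 --time_based --group_reporting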


Disclaimer: your setup is way larger than my small clusters and PBS instances - which is where my own experience comes from.