PBS Verify duration on large HDD-based datastore — how to tune settings?

SRU

Well-Known Member
Dec 2, 2020
Environment

- PBS 4.1.1 on ZFS RAIDZ2, 14× 20TB HDD + NVMe mirror as Special Device
- `special_small_blocks = 0`
- Datastore: ~75TB used, 257 groups, ~11,400 snapshots, growing
- Verify config: 1 reader, 4 workers, 30-day interval, chunk iteration order: inode

---

Observation

After a full initial sync from the source server the verify run took approximately 5 days. During that time the server is largely unresponsive for UI operations — snapshot browsing times out consistently. We expect this to be the baseline duration for every subsequent 30-day run, growing as the datastore fills.

We measured resource utilization during an active verify run:

- IO-Delay: 1.5% — HDDs are nearly idle
- Load average: 1.76 / 1.88 / 1.84 on a 64-core system — effectively 2.75% CPU utilization
- Transfer rate: 50–200 MB/s, IOPS: ~100 — well below what 14 HDDs in RAIDZ2 could sustain

Neither CPU nor HDDs appear to be the bottleneck. The verify process appears to be limited by a single sequential read path — which matches the default configuration of
1 verification reader.

---

What we have considered

Increasing readers might seem counterproductive on HDDs due to seek contention. However, given that IO-Delay is only 1.5% and IOPS are around 100, the drives appear to have significant headroom. With `chunk iteration order: inode` reducing seek overhead, it is unclear whether additional readers would cause contention or simply better utilize available throughput.

Increasing workers beyond 4 seems unlikely to help — SHA256 computation is fast and the workers appear to be starved by the single reader rather than being the bottleneck themselves.

Namespace-based staggering does not help either since deduplication is datastore-wide — chunks referenced across namespaces exist once physically and must still be read regardless of which namespace is being verified.

---

Questions

Given the measured utilization numbers — 1.5% IO-Delay, ~2.75% CPU, ~100 IOPS on 14 HDDs — how should verification readers and workers be tuned for a datastore of this size? Is 1 reader intentionally conservative, and what are the trade-offs of increasing it?

Is there a recommended approach for large HDD-based datastores that we are missing?

Thanks for your efforts.
Greetings, Stefan
 
Find out yourself with your hardware:
  • apt install sysstat
  • Start two terminals:
    • iostat -dx 2
    • top -H
  • Increase readers until the %util column in iostat shows ~90% for every disk (or some lower limit, to leave headroom for other activity on the same zpool).
  • Increase workers by one if you see all four default worker threads mostly at 100%.
IME, on RAIDZ + special device, using readers == number of HDDs gives good performance while still leaving headroom. Workers depend entirely on your CPU performance, so adjust as stated above. I've only had to increase workers on NVMe-only PBS servers.
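To make the iostat output easier to read while a verify runs, the per-disk %util can be averaged over a few samples. A minimal sketch, assuming sysstat is installed and your disks match the `sd*` naming pattern (adjust the pattern to your devices):

```shell
# Average the %util column (last field of iostat -dx) per disk over 5 samples.
# -y skips the since-boot summary so only live samples are counted.
iostat -dxy 2 5 | awk '/^sd/ {u[$1]+=$NF; n[$1]++} END {for (d in u) printf "%s %.1f%%\n", d, u[d]/n[d]}'
```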

As you may already know, RAIDZ will perform roughly like your slowest single disk, so don't expect miracles: to read the whole 75TB at a (very optimistic) average of 250MB/s you'll still need three and a half days.
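That estimate checks out arithmetically:

```shell
# 75 TB read at a sustained 250 MB/s, expressed in days
awk 'BEGIN {printf "%.2f days\n", 75e12 / 250e6 / 86400}'
# → 3.47 days
```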
 
ZFS RAIDZ2, 14× 20TB HDD

A single vdev? That's the worst case - there is no way to make it slower than that! (Even a RAIDZ3 has the same IOPS = same speed...)

My recommendation would have been 7*mirrors --> seven times the IOPS of that construct...

Sorry, I have no trivial cure for that situation. You may increase the size of the special device and set special_small_blocks higher, but this is only a marginal change and only helps for newly written data, not for a verify.
 
Thanks to both of you.

@VictorSTS
Will do, that might give me some insight.

@UdoB
That is a very valuable hint that I will definitely try with the remaining PBS that needs to be deployed.

For other readers:
I had success in cutting down sync and verification job times using the following settings in the above scenario:

Code:
zfs set special_small_blocks=0 storage-bk   # keep data blocks on the HDDs, metadata on the special device
zfs set recordsize=1M storage-bk            # larger records suit PBS' multi-MiB chunk files
echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max   # 64 GiB ARC
# Make that persist across reboots
echo "options zfs zfs_arc_max=68719476736" >> /etc/modprobe.d/zfs.conf
update-initramfs -u
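For reference, the zfs_arc_max value above is exactly 64 GiB expressed in bytes, which can be derived rather than hard-coded:

```shell
# 64 GiB in bytes, for zfs_arc_max
echo $((64 * 1024 * 1024 * 1024))
# → 68719476736
```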
 
It's worse than you think!
PBS 4.1.1 on ZFS RAIDZ2, 14× 20TB HDD + NVMe mirror as Special Device
There is only one NVMe mirror as the Special Device, so you can lose everything; you have no second or third drive in your Special Device!
Use the Special Device as a ZFS mirror with N >= 2 devices.
 
There is only one NVMe mirror as Special Device
Good point!! I did assume OP had at least a mirror of NVMe for special device. Should be a mirror of 3 to give it the same redundancy as the HDD part.

I had success in cutting down sync and verification job times using the following settings in the above scenario:
None of those settings will really help verify performance on your big zpool; they may help sync, if a bigger ARC allows some frequently used chunks to stay in RAM.

My recommendation would have been 7*mirrors --> seven times the IOPS of that construct...
For better IOPs consider switching your setup to striped mirrors ( aka RAID10) plus special device
Or at least two RAIDZ2 vdevs to get more usable space (~10 disks' worth), increased resiliency (you may lose up to 4 disks without data loss if you are lucky enough that they fall on the right vdevs) and potentially double the performance.
 
Or at least two RAIDZ2 vdevs to get more usable space (~10 disks' worth), increased resiliency (you may lose up to 4 disks without data loss if you are lucky enough that they fall on the right vdevs) and potentially double the performance.

I might be wrong, but if I recall correctly RAIDZ (no matter which level) performs worse than a mirror or striped-mirror setup. Of course this might be an acceptable tradeoff if one prefers the higher resiliency. Or do you suggest building a mirror that consists of two RAIDZ2 vdevs?
 
I mean a "RAID0 of two RAIDz2 vdevs". Something like this:

Code:
zpool create tank \
  raidz2 disk1 disk2 disk3 disk4 disk5 disk6 disk7 \
  raidz2 disk8 disk9 disk10 disk11 disk12 disk13 disk14

I might be wrong, but if I recall correctly RAIDZ (no matter which level) performs worse than a mirror or striped-mirror setup
That's true, as I mentioned above:
As you may already know, RAIDZ will perform roughly like your slowest single disk, so don't expect miracles: to read the whole 75TB at a (very optimistic) average of 250MB/s you'll still need three and a half days.

That happens because ZFS can only issue iops at vdev level, independently of the number of disks that compose it.
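As a rough back-of-the-envelope for the layouts discussed in this thread (the ~150 random IOPS per HDD figure is an assumption for 7200 rpm drives, not a measurement):

```shell
# Random IOPS scale with vdev count, not disk count.
# 150 IOPS per HDD is an assumed figure for 7200 rpm drives.
echo "1 vdev  (RAIDZ2 of 14):    ~$((1 * 150)) IOPS"
echo "2 vdevs (2x RAIDZ2 of 7):  ~$((2 * 150)) IOPS"
echo "7 vdevs (striped mirrors): ~$((7 * 150)) IOPS"
```

This is why the striped-mirror layout comes out roughly seven times faster for random reads despite using the same 14 disks.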
 