"I was seeing write speeds to the RAIDZ of somewhere in the region of 500 MB/s"
That bandwidth figure does not help at all. IOPS is the key metric, not the maximum speed for large sequential files.
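A quick back-of-envelope shows why (the numbers are purely illustrative assumptions: ~150 random IOPS for one 7.2k rpm spindle, and a RAIDZ vdev delivering roughly the random IOPS of a single member disk):
Code:
# throughput = IOPS x I/O size; both inputs are assumptions, not measurements
echo "$(( 150 * 4096 / 1024 )) MiB/s at 150 IOPS x 4 MiB"   # large I/O still looks fine
echo "$(( 150 * 64 / 1024 )) MiB/s at 150 IOPS x 64 KiB"    # small random I/O collapses
The same pool can therefore look fast when streaming large files and still crawl as soon as the I/O gets small and random.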
I know that the following is not new, has been discussed several times, and does not solve your problem, but let me add it here nevertheless as a random data point. This is from one machine in my homelab:
One of my pools consists of four 6 TB Western Digital / Seagate drives, and I had the glorious idea to go for RAIDZ2 without an adequate "special device". I added a simple (read) cache later, but that does not help at all here. To show where this gets me, I run fio like this:
Code:
zfs create -o compression=off rpool/fio
Code:
/rpool/fio# fio --name=randrw --ioengine=libaio --direct=1 --rw=randrw --bs=2M --numjobs=1 --iodepth=16 --size=20G --time_based --runtime=60
randrw: (g=0): rw=randrw, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=libaio, iodepth=16
read: IOPS=20, BW=40.9MiB/s (42.9MB/s)(2458MiB/60108msec)
write: IOPS=21, BW=42.3MiB/s (44.3MB/s)(2542MiB/60108msec); 0 zone resets
That bs=2M was chosen because PBS chunks are 4 MiB uncompressed; after compression there are probably more 2 MiB chunks than 4 MiB ones (a rough way to check this is sketched below the next result). For comparison, the same test with bs=4M:
Code:
:/rpool/fio# fio --name=randrw --ioengine=libaio --direct=1 --rw=randrw --bs=4M --numjobs=1 --iodepth=16 --size=20G --time_based --runtime=60
read: IOPS=16, BW=66.4MiB/s (69.6MB/s)(3984MiB/60014msec)
write: IOPS=17, BW=70.4MiB/s (73.8MB/s)(4224MiB/60014msec); 0 zone resets
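If one wanted to check that 2 MiB assumption on a real datastore, the actual chunk sizes can be inspected directly; PBS stores the chunks below the datastore's .chunks directory (the path below is a placeholder):
Code:
find /path/to/datastore/.chunks -type f -printf '%s\n' \
  | awk '{ s+=$1; n++ } END { if (n) printf "%d chunks, avg %.1f MiB\n", n, s/n/1048576 }'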
From tests like this I get my totally (not) surprising confirmation that spinning rust is slow ;-)
If you really need to use classic HDDs, go for multiple vdevs (use mirrors!) AND add a "special device" early in the process. If you add it later, only newly written data and metadata land on it, so you need to send/recv all existing data (or copy it once by other means).
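For completeness, adding a mirrored special vdev later looks roughly like this (device names are placeholders; and as said, only data written afterwards benefits):
Code:
# always mirror the special vdev - losing it means losing the whole pool
zpool add rpool special mirror /dev/disk/by-id/nvme-AAA /dev/disk/by-id/nvme-BBB
# optionally store small data blocks on it as well, not just metadata
zfs set special_small_blocks=4K rpool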
In any case, those ~20 IOPS are barely usable...
At the same time I can (wrongly) "prove" that I am able to write at 300 MB/s:
Code:
:/rpool/fio# dd if=/dev/urandom bs=2M count=2000 of=4gb.dd status=progress
4171235328 bytes (4.2 GB, 3.9 GiB) copied, 13 s, 321 MB/s
2000+0 records in
2000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 13.1896 s, 318 MB/s
The above is lying; the proof is the same command plus an extra "sync":
Code:
:/rpool/fio# time ( dd if=/dev/urandom bs=2M count=2000 of=4gb.dd status=progress ; sync )
4173332480 bytes (4.2 GB, 3.9 GiB) copied, 13 s, 321 MB/s
2000+0 records in
2000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 13.1658 s, 319 MB/s
real 1m6.581s
user 0m0.008s
sys 0m9.233s
These are async writes with the default ZIL/sync behavior: "dd" returns early, after 13 seconds, but the "real" time measures 66 seconds, five times longer, because the data only reaches the disks when the trailing "sync" completes.
So the bandwidth is more like 60 MB/s than 300 MB/s...
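The arithmetic behind that estimate, using the byte count reported by dd and the wall-clock time including the sync:
Code:
echo "scale=1; 4194304000 / 66.581 / 1000000" | bc   # prints ~63 (MB/s)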
The result gets even worse if dd directly requests sync writes:
Code:
:/rpool/fio# dd if=/dev/urandom bs=2M count=1000 of=2gb.dd status=progress oflag=sync
2090860544 bytes (2.1 GB, 1.9 GiB) copied, 168 s, 12.4 MB/s
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB, 2.0 GiB) copied, 168.575 s, 12.4 MB/s
Now I am down to 12 MB/s. This is what I meant by "performance of a single spindle" and "multiple head movements due to metadata handling, etc.".
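Those head movements can actually be watched while such a test is running, for example per vdev or per physical device:
Code:
# per-vdev view of the pool, updated every 5 seconds
zpool iostat -v rpool 5
# or per physical disk, with utilization and wait times (iostat is from the sysstat package)
iostat -x 5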
These were write tests. For reading data a cache may help, but only for repeated requests for the same data.
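Whether the read cache helps for a given workload can be checked by watching the ARC/L2ARC hit rates, e.g.:
Code:
# live ARC statistics every 5 seconds
arcstat 5
# raw counters, including L2ARC hits/misses
grep -wE '^(hits|misses|l2_hits|l2_misses)' /proc/spl/kstat/zfs/arcstats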