PBS and ZFS Special Allocation Class VDEV ... aka Fusion Drive

tcabernoch

I've been chewing on this for months. Getting hardware into place to test it. Won't bore you with the story. I didn't waste much money if it isn't wonderful, but I'm hoping for wonderful.

Special VDEV is ... you can add a couple of SSDs to a ZFS pool to speed it up. This isn't an old-style hybrid drive with a cache up front. Like most ZFS stuff, it's sorta like that, but different. The SSD vdev holds all the metadata for the pool. You can also tell it to accept data blocks up to a certain size. That gives you fast searches and ... this part is me interpreting ... lets you address some of the worst part of the performance curve for enterprise storage.

There are two config items at play.
recordsize is the point where a file gets broken into blocks: anything larger is cut up into pieces of recordsize.
special_small_blocks is the largest block size that will be written to the special vdev. Blocks larger than special_small_blocks get written to the main pool.
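Both of them are ordinary per-dataset ZFS properties, so you can look at the current values before touching anything. A minimal sketch, assuming the datastore sits on a dataset called rpool/datastore (adjust the name to wherever your datastore actually lives):

Code:
# show the current recordsize and the special vdev small-block cutoff for a dataset
zfs get recordsize,special_small_blocks rpool/datastore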

I need to figure out where the balance point between recordsize and special_small_blocks is for a PBS server.
Here are histograms from two of them.
(Warning: if you run this code, do it in the datastore with the files you want to count, and be aware that it might run overnight.)

On b0x1, I think
recordsize=512k
special_small_blocks=256k

On b0x2, I think
recordsize=1M
special_small_blocks=256k
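For what it's worth, applying one of these guesses would look something like the sketch below (the b0x1 numbers are shown, and the dataset name is just a placeholder for wherever the datastore actually lives):

Code:
# sketch: apply the proposed b0x1 values to the datastore dataset (placeholder name)
zfs set recordsize=512K rpool/datastore
zfs set special_small_blocks=256K rpool/datastore
# special_small_blocks needs to stay below recordsize; if it is set equal or higher,
# every block qualifies as "small" and all new data lands on the special vdev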


I'm fairly new to ZFS. This is an advanced topic. Any insight, or even just your own interpretation of these histograms would be welcome.
Thanks.


Code:
b0x1: /mnt/datastore/Backups]# find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k:   3484
  2k:   2391
  4k:   3367
  8k:   6027
 16k:  10471
 32k:  16789
 64k:  32644
128k:  74453
256k: 215039
512k: 383804
  1M: 385362
  2M: 405238
  4M:  69865
  8M:      7

Code:
b0x2:/rpool/BACKUP# find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k:  10372
  2k:   7137
  4k:   3602
  8k:   6272
 16k:  10094
 32k:  20460
 64k:  33754
128k: 100055
256k: 195302
512k: 453410
  1M: 394942
  2M: 530326
  4M:  80923
  1G:      1
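Side note on the one-liners above: piping everything through xargs and ls -l stats every file a second time, which is part of why they can run overnight. An untested, lighter sketch of the same idea lets find print the sizes directly and buckets them the same way (without the human-readable formatting pass):

Code:
# same power-of-two histogram as above, without spawning ls on every batch of files
find . -type f -printf '%s\n' | awk '
  { n = int(log($1)/log(2)); if (n < 10) n = 10; size[n]++ }
  END { for (i in size) printf "%d %d\n", 2^i, size[i] }' | sort -n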
 
Well, I did it. b0x2

zpool add rpool -f -o ashift=12 special mirror scsi-<gptid>-part3 scsi-<gptid>-part3
zfs set recordsize=1M rpool
zfs set special_small_blocks=256K rpool
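To sanity-check the result, both of these are handy (pool name as above):

Code:
# the special mirror should show up under its own "special" heading
zpool status rpool
# per-vdev size and allocation; useful later to watch how fast the special vdev fills
zpool list -v rpool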

The create_random_chunks.py PBS storage abuse script shows some interesting changes.
(That script can be obtained here. https://forum.proxmox.com/threads/datastore-performance-tester-for-pbs.148694/)

I'm not sure how much that tells me, other than some operations are using the SSD and others aren't.

Type of Abuse                         Before     After
sha256_name_generation                0.77s      0.76s
create_buckets                        3.61s      3.77s
create_random_files                   749.22s    91.57s
create_random_files_no_buckets        84.91s     46.72s
read_file_content_by_id               571.08s    8.44s
read_file_content_by_id_no_buckets    4.26s      4.30s
stat_file_by_id                       6.08s      5.81s
stat_file_by_id_no_buckets            1.40s      1.49s
find_all_files                        552.26s    52.77s
find_all_files_no_buckets             0.45s      0.44s


Write Latency Test
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=1G count=1 oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=64M count=1 oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=1M count=256 oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=8K count=10K oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=512 count=1000 oflag=dsync
dd test             Before                  After
bs=1G count=1       1.24446 s, 863 MB/s     1.1 GB/s
bs=64M count=1      0.099322 s, 676 MB/s    756 MB/s
bs=1M count=256     2.20857 s, 122 MB/s     124 MB/s
bs=8K count=10K     95.1172 s, 882 kB/s     967 kB/s
bs=512 count=1000   8.88222 s, 57.6 kB/s    59.3 kB/s

I should have done more pre-testing with bs<256K. That was not well thought-out.
Fortunately, this is currently a test build, and I will have another chance to test when I rebuild.
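For the next round, a small loop might save some copy-paste and make it easier to include the sub-256K block sizes. A sketch using the same dd invocation and test path as above:

Code:
# run the dsync write test over a list of block sizes, keep only dd's summary line
for args in "bs=1G count=1" "bs=64M count=1" "bs=1M count=256" \
            "bs=8K count=10K" "bs=512 count=1000" "bs=256 count=1000"; do
  echo "== $args =="
  # $args is deliberately left unquoted so bs= and count= split into two dd arguments
  dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img $args oflag=dsync 2>&1 | tail -n 1
done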
 
I got to finish this up.
TLDR; It did not speed up backups. I think my bottleneck on the backups must be the local server disk speed.

I did test file sizes less than 256k. I expected to see the most improvement there, but no, it's essentially unchanged.

Write latency re-test (dd if=/dev/zero of=/rpool/backups/testy/test2.img oflag=dsync), before vs. after adding the special vdev:

dd test             Before                   After
bs=1G count=1       1.43998 s, 746 MB/s      1.22767 s, 875 MB/s
bs=64M count=1      0.101041 s, 664 MB/s     0.0869583 s, 772 MB/s
bs=1M count=256     2.26298 s, 119 MB/s      2.18099 s, 123 MB/s
bs=8K count=10K     92.9258 s, 903 kB/s      86.8977 s, 965 kB/s
bs=512 count=1000   9.31836 s, 54.9 kB/s     8.61145 s, 59.5 kB/s
bs=256 count=1000   7.45161 s, 34.4 kB/s     8.38339 s, 30.5 kB/s
bs=128 count=1000   8.81583 s, 14.5 kB/s     8.51855 s, 15.0 kB/s
bs=64 count=1000    8.1428 s, 7.9 kB/s       8.72252 s, 7.3 kB/s

And I again confirmed that a number of disk functions are much faster.
Deleting a chunks folder with its 65 thousand entries takes just seconds after the special vdev is added.

The create_random_chunks.py storage abuse script again showed significant improvement on a couple items.
                                      Before     After
filesystem detected by stat(1)        zfs        zfs
files to write                        500000     500000
files to read/stat                    50000      50000
buckets                               65536      65536
sha256_name_generation                0.70s      0.69s
create_buckets                        3.47s      3.44s
create_random_files                   686.38s    90.63s
create_random_files_no_buckets        48.48s     46.32s
read_file_content_by_id               156.37s    10.30s
read_file_content_by_id_no_buckets    4.30s      4.26s
stat_file_by_id                       6.04s      5.85s
stat_file_by_id_no_buckets            1.59s      1.49s
find_all_files                        195.05s    53.96s
find_all_files_no_buckets             0.45s      0.43s



Along with all that before and after testing, I ran some backups.
I did not see a speed increase when running individual backups. I've concluded that the bottleneck must be local disk speed, not the target.

Trying to saturate the PBS server, I ran backups on all servers at once.
It ran pretty darn well.
With all six servers in the cluster backing up simultaneously, every one except the fastest maintained the backup speeds previously sampled.

I don't think this setup will push the PBS server to its limits.

--------------------------
Note that it's hard to benchmark PBS performance. You have to feed it brand new data it's never seen before, or the dedupe features completely b0rk your test.
This log is for a brand new backup that this PBS server has never seen before, and it still came out 81% deduped.
That's amazing, but it makes performance very difficult to test and/or tweak.

INFO: backup was done incrementally, reused 147.39 GiB (81%)
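One workaround I can think of is to generate fresh, incompressible data inside the guest right before each timed backup run, so there is nothing for PBS to dedupe against. A sketch (the size and path are just examples):

Code:
# write ~4 GiB of random data into the guest before each timed backup run
dd if=/dev/urandom of=/root/pbs_bench_blob.bin bs=1M count=4096 status=progress
sync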

--------------------------

Here's the final zpool iostat result while I'm running 6 backups.
3/4 of the writes are going to the SSD vdev. That's probably not good. Not awful, but it's just gonna fill up.
Here's the config, if anyone has suggestions.
I could re-run the file size histogram, but I tried this based on the last results, and it seems like more of a guesstimate anyway.

ashift=12
zfs set recordsize=1M rpool
zfs set special_small_blocks=256K rpool
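Two knobs that seem relevant if the special vdev keeps filling at this ratio (pool name as above; note that changing the property only affects blocks written after it is set):

Code:
# watch per-vdev capacity to see how fast the special mirror is filling
zpool list -v rpool
# lower the cutoff if it fills too fast (or set it to 0 for metadata-only);
# existing blocks stay where they are, only new writes are affected
zfs set special_small_blocks=128K rpool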

[screenshot: zpool iostat output while running 6 backups]

@fiona is gonna ignore my posts forever for asking, but I wonder if this has been discussed by the devs.
What's the proper tuning for a zfs special vdev that holds a PBS chunks system?
 
The zpool iostats seem to have changed quite a bit. And they aren't changing when I have active backups, so this seems to be the final shape.
I don't know, I guess I like this shape. Seems proportional to the respective roles.

[screenshot: zpool iostat output]
 
I tried using 2 special vdevs. Ya, don't do that.

  • Starting with 2 spinners in a mirror + 6 unallocated SSD.
  • Test 'before' performance w/create_random_chunks.py PBS storage abuse script.
  • Add a 3-disk mirror special vdev.
  • Configure recordsize=1M / special_small_blocks=512K.
  • Test.
  • Add second 3-disk mirror special vdev.
  • Test.

TLDR;
There are the expected dramatic improvements, roughly an order of magnitude on some operations, with the addition of the first special vdev.
The second special vdev provides incremental improvements, but does not have a very significant impact.
I don't think 2 special vdevs is an effective way to deploy this group of disks, and I'll try a different geometry.

I'm considering building on a pair of SSD raidz1 vdevs and then just adding the spinners as a mirror vdev.
I understand ZFS will sort things out and deprioritize the slower vdev without my having to do anything special at all.
(I'm good with single-disk redundancy here. It's a backup server. It needs more capacity than the SSDs provide.)
I bet that's as fast as or faster than this 2-special-vdev mistake I've created today.
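A sketch of that layout, with placeholder pool and device names; zpool will warn about mixing raidz1 and mirror vdevs in one pool, which is why the -f is there:

Code:
# sketch: two 3-disk raidz1 SSD vdevs plus a mirror of large spinners in one pool
# (pool name and /dev/disk/by-id names are placeholders)
zpool create -f -o ashift=12 rpool \
  raidz1 /dev/disk/by-id/ssd1 /dev/disk/by-id/ssd2 /dev/disk/by-id/ssd3 \
  raidz1 /dev/disk/by-id/ssd4 /dev/disk/by-id/ssd5 /dev/disk/by-id/ssd6 \
  mirror /dev/disk/by-id/hdd1 /dev/disk/by-id/hdd2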

=====================================================


Add a special vdev.
zpool add rpool -f -o ashift=12 special mirror scsi-<>-part3 scsi-<>-part3 scsi-<>-part3

Configure it.
zfs set recordsize=1M rpool
zfs set special_small_blocks=512K rpool
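For completeness, the "add second 3-disk mirror special vdev" step from the list above would just be another add of the same form (disk ids elided as above):

Code:
# second 3-disk mirror special vdev, added the same way as the first
zpool add rpool -f -o ashift=12 special mirror scsi-<>-part3 scsi-<>-part3 scsi-<>-part3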


[attached screenshot]

=====================================================
Test results

filesystem detected by stat(1)        Just spinners   1 special vdev   2 special vdevs
files to write                        500000          500000           500000
files to read/stat                    50000           50000            50000
buckets                               65536           65536            65536
sha256_name_generation                1.01s           1.00s            1.11s
create_buckets                        2.98s           2.92s            2.89s
create_random_files                   682.80s         127.97s          122.96s
create_random_files_no_buckets        77.95s          40.44s           40.47s
read_file_content_by_id               274.66s         7.93s            8.24s
read_file_content_by_id_no_buckets    3.69s           3.57s            3.90s
stat_file_by_id                       6.84s           6.15s            5.42s
stat_file_by_id_no_buckets            1.29s           1.26s            1.27s
find_all_files                        334.89s         47.20s           49.79s
find_all_files_no_buckets             0.40s           0.39s            0.40s
 
Well, this has been a long and educational day.
I'm done. Both of these machines are going into production.

  • The spinner mirror + 2 SSD special vdevs was just a waste of resources. No advantage was gained from the second special vdev, and a lot of disks were used.
  • A pair of SSD raidz1 vdevs performed the best. This was unsurprising. I imagine if speed was the goal, 3 vdevs of mirrors would have done better. If speed was the only goal, we'd have stuck more SSD in the spinner bays.
  • And finally ... just adding a mirror of fat spinners to the pair of SSD vdevs ... that didn't hurt the test results much at all, and it gives the array sufficient capacity to fill the PBS role. This surprised me. I expected the whole array to just die when I added the spinners. Apparently ZFS has ways of managing slow disks and vdevs. (I've seen this hidden management and balancing alluded to, but haven't read anything about it.) This configuration appeared to be the most efficient use of the resources.

This project was largely motivated by the terrible performance we were seeing out of our original virtualized PBS deployments.
I wanted to make sure any actual hardware I allocated was being well used and configured optimally.
Both of the servers I got had 2 empty drive bays, so the all-SSD server got a pair of capacity spinners, and the all-spinner server got a pair of mixed-use (MU) SSDs.
I think I've done OK here. I spent less than $2k, learned an enormous amount, and converted two not-so-useful Dell Gen12 machines into high-performance bare-metal Proxmox Backup Server deployments.

PBS is extremely difficult to test and benchmark, given all the dedupe it does. It would be easy to shoot holes in my methods here.
If there are better ways to look at PBS servers, their disk performance, and backup performance, I'm certainly interested.
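One built-in data point that might be worth adding to this kind of comparison is the PBS client's own benchmark. As far as I know it measures TLS upload speed to the repository plus local hashing/compression/crypto rates rather than datastore disk I/O, so it complements rather than replaces the tests above (repository string is a placeholder):

Code:
# run against a repository to include the TLS upload speed test
proxmox-backup-client benchmark --repository <user>@<realm>@<pbs-host>:<datastore>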


                                      spinner mirror + 2 SSD special vdevs   2 raidz1 SSD vdevs   2 raidz1 SSD vdevs + mirror vdev of spinners
files to write                        500000                                 500000               500000
files to read/stat                    50000                                  50000                50000
buckets                               65536                                  65536                65536
sha256_name_generation                1.11s                                  1.13s                1.16s
create_buckets                        2.89s                                  2.92s                3.73s
create_random_files                   122.96s                                121.02s              116.64s
create_random_files_no_buckets        40.47s                                 40.68s               41.12s
read_file_content_by_id               8.24s                                  8.12s                8.50s
read_file_content_by_id_no_buckets    3.90s                                  3.93s                3.51s
stat_file_by_id                       5.42s                                  5.52s                6.05s
stat_file_by_id_no_buckets            1.27s                                  1.26s                1.27s
find_all_files                        49.79s                                 48.85s               47.58s
find_all_files_no_buckets             0.40s                                  0.40s                0.40s
 
