PBS and ZFS Special Allocation Class VDEV ... aka Fusion Drive

tcabernoch

Apr 27, 2024

I've been chewing on this for months and getting hardware into place to test it. I won't bore you with the story. If it isn't wonderful, I haven't wasted much money, but I'm hoping for wonderful.

A special vdev is ... you can add a couple of SSDs to a ZFS pool to speed it up. This isn't an old-style hybrid drive with a cache in front. Like most ZFS things, it's sort of like that, but different. The SSD vdev holds all of the metadata for the pool. You can also tell it to accept data blocks up to a certain size. That gives you fast searches and ... this part is my interpretation ... lets you address some of the worst part of the performance curve for enterprise storage.

There are two config items at play.
recordsize is the maximum block size for files in a dataset. Anything larger than recordsize gets cut up into recordsize-sized pieces.
special_small_blocks is the largest block that will be written to the special vdev. Blocks larger than special_small_blocks get written to the main pool.
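
Both are ordinary dataset properties, so you can see what a dataset is currently using before touching anything. A quick check, using rpool as the example dataset name:

Code:
# Show the current values. The defaults are recordsize=128K and
# special_small_blocks=0 (0 means only metadata goes to the special vdev).
zfs get recordsize,special_small_blocks rpool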

I need to figure out where the balance point between recordsize and special_small_blocks is for a PBS server.
Here are file-size histograms from two of them.
(Warning: if you run this code, do it in the datastore with the files you want to count, and be aware that it might run overnight.)

On b0x1, I think
recordsize=512k
special_small_blocks=256k

On b0x2, I think
recordsize=1M
special_small_blocks=256k


I'm fairly new to ZFS, and this is an advanced topic. Any insight, or even just your own interpretation of these histograms, would be welcome.
Thanks.


Code:
b0x1: /mnt/datastore/Backups]# find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k:   3484
  2k:   2391
  4k:   3367
  8k:   6027
 16k:  10471
 32k:  16789
 64k:  32644
128k:  74453
256k: 215039
512k: 383804
  1M: 385362
  2M: 405238
  4M:  69865
  8M:      7

Code:
b0x2:/rpool/BACKUP# find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k:  10372
  2k:   7137
  4k:   3602
  8k:   6272
 16k:  10094
 32k:  20460
 64k:  33754
128k: 100055
256k: 195302
512k: 453410
  1M: 394942
  2M: 530326
  4M:  80923
  1G:      1
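
For anyone who wants to pick the one-liner apart, here is the same pipeline spread out with comments (same logic, just easier to read):

Code:
# Same histogram pipeline as above, reformatted and commented.
find . -type f -print0 \
  | xargs -0 ls -l \
  | awk '{
      # Bucket each file by floor(log2(size)); everything under 1k lands in the 1k bucket.
      n = int(log($5)/log(2)); if (n < 10) n = 10;
      size[n]++
    } END {
      for (i in size) printf("%d %d\n", 2^i, size[i])
    }' \
  | sort -n \
  | awk '
    # Repeatedly divide by 1024 to pick a human-readable unit.
    function human(x) { x[1] /= 1024; if (x[1] >= 1024) { x[2]++; human(x) } }
    { a[1] = $1; a[2] = 0; human(a);
      printf("%3d%s: %6d\n", a[1], substr("kMGTEPYZ", a[2]+1, 1), $2) }'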
 
Well, I did it, on b0x2.

Code:
zpool add -f -o ashift=12 rpool special mirror scsi-<gptid>-part3 scsi-<gptid>-part3
zfs set recordsize=1M rpool
zfs set special_small_blocks=256K rpool
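
To double-check that the special mirror actually joined the pool and is taking allocations:

Code:
# The special mirror shows up as its own vdev, with its own
# alloc/free numbers separate from the main pool.
zpool status rpool
zpool list -v rpool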

The create_random_chunks.py PBS storage abuse script shows some interesting changes.
I'm not sure how much that tells me, other than some operations are using the SSD and others aren't.

Type of Abuse                        Before     After
sha256_name_generation               0.77s      0.76s
create_buckets                       3.61s      3.77s
create_random_files                  749.22s    91.57s
create_random_files_no_buckets       84.91s     46.72s
read_file_content_by_id              571.08s    8.44s
read_file_content_by_id_no_buckets   4.26s      4.30s
stat_file_by_id                      6.08s      5.81s
stat_file_by_id_no_buckets           1.40s      1.49s
find_all_files                       552.26s    52.77s
find_all_files_no_buckets            0.45s      0.44s
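
To see which vdevs the I/O is actually hitting while a test like that runs, per-vdev iostat is handy (same pool name as above):

Code:
# Per-vdev throughput, refreshed every second. The special mirror gets its
# own rows, so you can watch what lands on the SSDs vs. the main pool vdevs.
zpool iostat -v rpool 1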


Write Latency Test

Code:
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=1G count=1 oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=64M count=1 oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=1M count=256 oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=8K count=10K oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=512 count=1000 oflag=dsync
                                          Before                  After
1073741824 bytes (1.1 GB, 1 GiB) copied   1.24446 s, 863 MB/s     1.1 GB/s
67108864 bytes (67 MB, 64 MiB) copied     0.099322 s, 676 MB/s    756 MB/s
268435456 bytes (268 MB, 256 MiB) copied  2.20857 s, 122 MB/s     124 MB/s
83886080 bytes (84 MB, 80 MiB) copied     95.1172 s, 882 kB/s     967 kB/s
512000 bytes (512 kB, 500 KiB) copied     8.88222 s, 57.6 kB/s    59.3 kB/s

I should have done more pre-testing with bs<256K. That was not well thought out.
Fortunately, this is currently a test build, and I will have another chance to test when I rebuild.
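
When I rebuild, I'll probably bracket the 256K cutoff with batches of whole small files instead of one big file, since a single large file ends up as recordsize-sized records no matter what bs I give dd. A rough sketch of what I have in mind (the loop and file names are just illustration, not something I've run):

Code:
# Files just under the 256K cutoff: each becomes a single 128K record,
# which should land on the special vdev.
time for i in $(seq 1 1000); do
  dd if=/dev/urandom of=/rpool/BACKUP/testy/small_$i.img bs=128K count=1 oflag=dsync status=none
done

# Files just over it: each becomes a single 512K record, which stays on the main pool.
time for i in $(seq 1 1000); do
  dd if=/dev/urandom of=/rpool/BACKUP/testy/big_$i.img bs=512K count=1 oflag=dsync status=none
done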
 
