PBS and ZFS Special Allocation Class VDEV ... aka Fusion Drive

tcabernoch

I've been chewing on this for months. Getting hardware into place to test it. Won't bore you with the story. I didn't waste much money if it isn't wonderful, but I'm hoping for wonderful.

Special VDEV is ... you can add a couple of SSDs to a ZFS pool to speed it up. This isn't an old-style hybrid drive with a cache up front. Like most ZFS stuff, it's sorta like that, but different. The SSD vdev holds all the metadata for the pool. You can also tell it to accept data blocks up to a certain size. That gives you fast searches and ... this part is me interpreting ... lets you address some of the worst part of the performance curve for enterprise storage.

There are two config items at play.
recordsize is the point where a file gets broken into blocks: anything larger is cut up into pieces of recordsize.
special_small_blocks is the largest block size that will be written to the special vdev. Blocks larger than special_small_blocks get written to the main pool.
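Both of them are ordinary per-dataset ZFS properties, so you can look at the current values before touching anything. A minimal sketch, assuming the datastore sits on a dataset called rpool/datastore (adjust the name to wherever your datastore actually lives):

Code:
# show the current recordsize and the special vdev small-block cutoff for a dataset
zfs get recordsize,special_small_blocks rpool/datastore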

I need to figure out where the balance point between recordsize and special_small_blocks is for a PBS server.
Here are histograms from two of them.
(Warning: if you run this code, do it in the datastore with the files you want to count, and be aware that it might run overnight.)

On b0x1, I think
recordsize=512k
special_small_blocks=256k

On b0x2, I think
recordsize=1M
special_small_blocks=256k
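For what it's worth, applying one of these guesses would look something like the sketch below (the b0x1 numbers are shown, and the dataset name is just a placeholder for wherever the datastore actually lives):

Code:
# sketch: apply the proposed b0x1 values to the datastore dataset (placeholder name)
zfs set recordsize=512K rpool/datastore
zfs set special_small_blocks=256K rpool/datastore
# special_small_blocks needs to stay below recordsize; if it is set equal or higher,
# every block qualifies as "small" and all new data lands on the special vdev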


I'm fairly new to ZFS. This is an advanced topic. Any insight, or even just your own interpretation of these histograms would be welcome.
Thanks.


Code:
b0x1: /mnt/datastore/Backups]# find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k:   3484
  2k:   2391
  4k:   3367
  8k:   6027
 16k:  10471
 32k:  16789
 64k:  32644
128k:  74453
256k: 215039
512k: 383804
  1M: 385362
  2M: 405238
  4M:  69865
  8M:      7

Code:
b0x2:/rpool/BACKUP# find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k:  10372
  2k:   7137
  4k:   3602
  8k:   6272
 16k:  10094
 32k:  20460
 64k:  33754
128k: 100055
256k: 195302
512k: 453410
  1M: 394942
  2M: 530326
  4M:  80923
  1G:      1
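Side note on the one-liners above: piping everything through xargs and ls -l stats every file a second time, which is part of why they can run overnight. An untested, lighter sketch of the same idea lets find print the sizes directly and buckets them the same way (without the human-readable formatting pass):

Code:
# same power-of-two histogram as above, without spawning ls on every batch of files
find . -type f -printf '%s\n' | awk '
  { n = int(log($1)/log(2)); if (n < 10) n = 10; size[n]++ }
  END { for (i in size) printf "%d %d\n", 2^i, size[i] }' | sort -n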
 
Well, I did it. b0x2

zpool add rpool -f -o ashift=12 special mirror scsi-<gptid>-part3 scsi-<gptid>-part3
zfs set recordsize=1M rpool
zfs set special_small_blocks=256K rpool
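To sanity-check the result, both of these are handy (pool name as above):

Code:
# the special mirror should show up under its own "special" heading
zpool status rpool
# per-vdev size and allocation; useful later to watch how fast the special vdev fills
zpool list -v rpool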

The create_random_chunks.py PBS storage abuse script shows some interesting changes.
(That script can be obtained here. https://forum.proxmox.com/threads/datastore-performance-tester-for-pbs.148694/)

I'm not sure how much that tells me, other than some operations are using the SSD and others aren't.

Type of Abuse                         Before     After
sha256_name_generation                0.77s      0.76s
create_buckets                        3.61s      3.77s
create_random_files                   749.22s    91.57s
create_random_files_no_buckets        84.91s     46.72s
read_file_content_by_id               571.08s    8.44s
read_file_content_by_id_no_buckets    4.26s      4.30s
stat_file_by_id                       6.08s      5.81s
stat_file_by_id_no_buckets            1.40s      1.49s
find_all_files                        552.26s    52.77s
find_all_files_no_buckets             0.45s      0.44s


Write Latency Test
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=1G count=1 oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=64M count=1 oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=1M count=256 oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=8K count=10K oflag=dsync
dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img bs=512 count=1000 oflag=dsync
dd test             Before                  After
bs=1G count=1       1.24446 s, 863 MB/s     1.1 GB/s
bs=64M count=1      0.099322 s, 676 MB/s    756 MB/s
bs=1M count=256     2.20857 s, 122 MB/s     124 MB/s
bs=8K count=10K     95.1172 s, 882 kB/s     967 kB/s
bs=512 count=1000   8.88222 s, 57.6 kB/s    59.3 kB/s

I should have done more pre-testing with bs<256K. That was not well thought-out.
Fortunately, this is currently a test build, and I will have another chance to test when I rebuild.
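For the next round, a small loop might save some copy-paste and make it easier to include the sub-256K block sizes. A sketch using the same dd invocation and test path as above:

Code:
# run the dsync write test over a list of block sizes, keep only dd's summary line
for args in "bs=1G count=1" "bs=64M count=1" "bs=1M count=256" \
            "bs=8K count=10K" "bs=512 count=1000" "bs=256 count=1000"; do
  echo "== $args =="
  # $args is deliberately left unquoted so bs= and count= split into two dd arguments
  dd if=/dev/zero of=/rpool/BACKUP/testy/test2.img $args oflag=dsync 2>&1 | tail -n 1
done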
 
I got to finish this up.
TLDR; It did not speed up backups. I think my bottleneck on the backups must be the local server disk speed.

I did test file sizes less than 256k. I expected to see the most improvement there, but no, it's essentially unchanged.

Write latency re-test (dd if=/dev/zero of=/rpool/backups/testy/test2.img oflag=dsync), before vs. after adding the special vdev:

dd test             Before                   After
bs=1G count=1       1.43998 s, 746 MB/s      1.22767 s, 875 MB/s
bs=64M count=1      0.101041 s, 664 MB/s     0.0869583 s, 772 MB/s
bs=1M count=256     2.26298 s, 119 MB/s      2.18099 s, 123 MB/s
bs=8K count=10K     92.9258 s, 903 kB/s      86.8977 s, 965 kB/s
bs=512 count=1000   9.31836 s, 54.9 kB/s     8.61145 s, 59.5 kB/s
bs=256 count=1000   7.45161 s, 34.4 kB/s     8.38339 s, 30.5 kB/s
bs=128 count=1000   8.81583 s, 14.5 kB/s     8.51855 s, 15.0 kB/s
bs=64 count=1000    8.1428 s, 7.9 kB/s       8.72252 s, 7.3 kB/s

And I again confirmed that a number of disk functions are much faster.
Deleting a chunks folder with its 65 thousand entries takes just seconds after the special vdev is added.

The create_random_chunks.py storage abuse script again showed significant improvement on a couple items.
                                      Before     After
filesystem detected by stat(1)        zfs        zfs
files to write                        500000     500000
files to read/stat                    50000      50000
buckets                               65536      65536
sha256_name_generation                0.70s      0.69s
create_buckets                        3.47s      3.44s
create_random_files                   686.38s    90.63s
create_random_files_no_buckets        48.48s     46.32s
read_file_content_by_id               156.37s    10.30s
read_file_content_by_id_no_buckets    4.30s      4.26s
stat_file_by_id                       6.04s      5.85s
stat_file_by_id_no_buckets            1.59s      1.49s
find_all_files                        195.05s    53.96s
find_all_files_no_buckets             0.45s      0.43s



Along with all that before and after testing, I ran some backups.
I did not see a speed increase when running individual backups. I've concluded that the bottleneck must be local disk speed, not the target.

Trying to saturate the PBS server, I ran backups on all servers at once.
It ran pretty darn well.
With all six servers in the cluster backing up simultaneously, every one except the fastest maintained the backup speeds previously sampled.

I don't think this setup will push the PBS server to its limits.

--------------------------
Note that it's hard to benchmark PBS performance. You have to feed it brand new data it's never seen before, or the dedupe features completely b0rk your test.
This log is for a brand new backup that this PBS server has never seen before, and it still came out 81% deduped.
That's amazing, but it makes performance very difficult to test and/or tweak.

INFO: backup was done incrementally, reused 147.39 GiB (81%)
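One workaround I can think of is to generate fresh, incompressible data inside the guest right before each timed backup run, so there is nothing for PBS to dedupe against. A sketch (the size and path are just examples):

Code:
# write ~4 GiB of random data into the guest before each timed backup run
dd if=/dev/urandom of=/root/pbs_bench_blob.bin bs=1M count=4096 status=progress
sync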

--------------------------

Here's the final zpool iostat result while I'm running 6 backups.
3/4 of the writes are going to the SSD vdev. That's probably not good. Not awful, but it's just gonna fill up.
Here's the config, if anyone has suggestions.
I could re-run the file size histogram, but I tried this based on the last results, and it seems like more of a guesstimate anyway.

ashift=12
zfs set recordsize=1M rpool
zfs set special_small_blocks=256K rpool
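Two knobs that seem relevant if the special vdev keeps filling at this ratio (pool name as above; note that changing the property only affects blocks written after it is set):

Code:
# watch per-vdev capacity to see how fast the special mirror is filling
zpool list -v rpool
# lower the cutoff if it fills too fast (or set it to 0 for metadata-only);
# existing blocks stay where they are, only new writes are affected
zfs set special_small_blocks=128K rpool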

[screenshot: zpool iostat output while running 6 backups]

@fiona is gonna ignore my posts forever for asking, but I wonder if this has been discussed by the devs.
What's the proper tuning for a zfs special vdev that holds a PBS chunks system?
 
The zpool iostats seem to have changed quite a bit. And they aren't changing when I have active backups, so this seems to be the final shape.
I don't know, I guess I like this shape. Seems proportional to the respective roles.

[screenshot: zpool iostat output]
 
I tried using 2 special vdevs. Ya, don't do that.

  • Starting with 2 spinners in a mirror + 6 unallocated SSD.
  • Test 'before' performance w/create_random_chunks.py PBS storage abuse script.
  • Add a 3-disk mirror special vdev.
  • Configure recordsize=1M / special_small_blocks=512K.
  • Test.
  • Add second 3-disk mirror special vdev.
  • Test.

TLDR;
There are the expected dramatic improvements, roughly an order of magnitude on some operations, with the addition of the first special vdev.
The second special vdev provides incremental improvements, but does not have a very significant impact.
I don't think 2 special vdevs is an effective way to deploy this group of disks, and I'll try a different geometry.

I'm considering building on a pair of SSD raidz1 vdevs and then just adding the spinners as a mirror vdev.
I understand ZFS will sort things out and deprioritize the slower vdev without my having to do anything special at all.
(I'm good with single-disk redundancy here. It's a backup server. It needs more capacity than the SSDs provide.)
I bet that's as fast as or faster than this 2-special-vdev mistake I've created today.
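A sketch of that layout, with placeholder pool and device names; zpool will warn about mixing raidz1 and mirror vdevs in one pool, which is why the -f is there:

Code:
# sketch: two 3-disk raidz1 SSD vdevs plus a mirror of large spinners in one pool
# (pool name and /dev/disk/by-id names are placeholders)
zpool create -f -o ashift=12 rpool \
  raidz1 /dev/disk/by-id/ssd1 /dev/disk/by-id/ssd2 /dev/disk/by-id/ssd3 \
  raidz1 /dev/disk/by-id/ssd4 /dev/disk/by-id/ssd5 /dev/disk/by-id/ssd6 \
  mirror /dev/disk/by-id/hdd1 /dev/disk/by-id/hdd2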

=====================================================


Add a special vdev.
zpool add rpool -f -o ashift=12 special mirror scsi-<>-part3 scsi-<>-part3 scsi-<>-part3

Configure it.
zfs set recordsize=1M rpool
zfs set special_small_blocks=512K rpool
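For completeness, the "add second 3-disk mirror special vdev" step from the list above would just be another add of the same form (disk ids elided as above):

Code:
# second 3-disk mirror special vdev, added the same way as the first
zpool add rpool -f -o ashift=12 special mirror scsi-<>-part3 scsi-<>-part3 scsi-<>-part3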


[attached screenshot]

=====================================================
Test results

filesystem detected by stat(1)        Just spinners   1 special vdev   2 special vdevs
files to write                        500000          500000           500000
files to read/stat                    50000           50000            50000
buckets                               65536           65536            65536
sha256_name_generation                1.01s           1.00s            1.11s
create_buckets                        2.98s           2.92s            2.89s
create_random_files                   682.80s         127.97s          122.96s
create_random_files_no_buckets        77.95s          40.44s           40.47s
read_file_content_by_id               274.66s         7.93s            8.24s
read_file_content_by_id_no_buckets    3.69s           3.57s            3.90s
stat_file_by_id                       6.84s           6.15s            5.42s
stat_file_by_id_no_buckets            1.29s           1.26s            1.27s
find_all_files                        334.89s         47.20s           49.79s
find_all_files_no_buckets             0.40s           0.39s            0.40s
 
Well, this has been a long and educational day.
I'm done. Both of these machines are going into production.

  • The spinner mirror + 2 SSD special vdevs was just a waste of resources. No advantage was gained from the second special vdev, and a lot of disks were used.
  • A pair of SSD raidz1 vdevs performed the best. This was unsurprising. I imagine if speed was the goal, 3 vdevs of mirrors would have done better. If speed was the only goal, we'd have stuck more SSD in the spinner bays.
  • And finally ... just adding a mirror of fat spinners to the pair of SSD vdevs ... that didn't hurt the test results much at all, and it gives the array sufficient capacity to fill the PBS role. This surprised me. I expected the whole array to just die when I added the spinners. Apparently ZFS has ways of managing slow disks and vdevs. (I've seen this hidden management and balancing alluded to, but haven't read anything about it.) This configuration appeared to be the most efficient use of the resources.

This project was largely motivated by the terrible performance we were seeing out of our original virtualized PBS deployments.
I wanted to make sure any actual hardware I allocated was being well used and configured optimally.
Both of the servers I got had 2 empty drive bays, so the all-SSD server got a pair of capacity spinners, and the all-spinner server got a pair of mixed-use (MU) SSDs.
I think I've done OK here. I spent less than $2k, learned an enormous amount, and converted two not-so-useful Dell Gen12 machines into high-performance bare-metal Proxmox Backup Server deployments.

PBS is extremely difficult to test and benchmark, given all the dedupe it does. It would be easy to shoot holes in my methods here.
If there are better ways to look at PBS servers, their disk performance, and backup performance, I'm certainly interested.
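One built-in data point that might be worth adding to this kind of comparison is the PBS client's own benchmark. As far as I know it measures TLS upload speed to the repository plus local hashing/compression/crypto rates rather than datastore disk I/O, so it complements rather than replaces the tests above (repository string is a placeholder):

Code:
# run against a repository to include the TLS upload speed test
proxmox-backup-client benchmark --repository <user>@<realm>@<pbs-host>:<datastore>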


                                      spinner mirror + 2 SSD special vdevs   2 raidz1 SSD vdevs   2 raidz1 SSD vdevs + mirror vdev of spinners
files to write                        500000                                 500000               500000
files to read/stat                    50000                                  50000                50000
buckets                               65536                                  65536                65536
sha256_name_generation                1.11s                                  1.13s                1.16s
create_buckets                        2.89s                                  2.92s                3.73s
create_random_files                   122.96s                                121.02s              116.64s
create_random_files_no_buckets        40.47s                                 40.68s               41.12s
read_file_content_by_id               8.24s                                  8.12s                8.50s
read_file_content_by_id_no_buckets    3.90s                                  3.93s                3.51s
stat_file_by_id                       5.42s                                  5.52s                6.05s
stat_file_by_id_no_buckets            1.27s                                  1.26s                1.27s
find_all_files                        49.79s                                 48.85s               47.58s
find_all_files_no_buckets             0.40s                                  0.40s                0.40s
 
