Gomo

New Member
Jun 16, 2023
Hello all,

I would like to configure a metadata special device for my ZFS mirror pool (2x HDDs, 18TB each) but I am not entirely sure of the steps. I've found some documentation online, but I am still hesitant as I don't want to mess things up & then have to re-build the mirror. I do have the data backed up, it's just a lot of terabytes and restoring it would take days :X So here I am, asking for advice from someone who has done this before / knows how to do it properly.

I purchased 2x WD Red SSDs (500GB) which I'd like to have in a mirror config and use as a metadata special device for my HDD mirror pool.
I don't mind setting the "special_small_blocks" parameter to a somewhat higher value; actually, I would like to have it at around 2MB, not sure if that's smart (just below the average smartphone picture size)?

I would greatly appreciate it if someone would be kind enough to spare a few minutes and advise on which commands to use & how to set this up.

Thanks in advance!
 
It is more or less not much more than:

zpool add POOLNAME special mirror /dev/sdX /dev/sdY

and for the blocksize:
zfs set special_small_blocks=1M POOLNAME
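
Note that device names like sdX can change between reboots, so it is usually safer to reference stable /dev/disk/by-id paths instead; for example (POOLNAME and the by-id names below are placeholders, replace them with your own):

Code:
# same add command, but with persistent by-id paths
zpool add POOLNAME special mirror /dev/disk/by-id/ata-SSD_SERIAL_1 /dev/disk/by-id/ata-SSD_SERIAL_2
# verify the special mirror shows up in the pool layout
zpool status POOLNAME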
 
I don't mind setting the "special_small_blocks" parameter to a somewhat higher value; actually, I would like to have it at around 2MB, not sure if that's smart (just below the average smartphone picture size)?
Depends on your recordsize or volblocksize. If that is lower than or equal to special_small_blocks, you will put everything on your SSDs.

In order to use the metadata device for existing data, you need to send/receive your data within your own pool so that the metadata gets rewritten to the SSDs. Best to set up the pool completely (including the special device) and then fill it.
 
Depends on your recordsize or volblocksize. If that is lower than or equal to special_small_blocks, you will put everything on your SSDs.
To explain that a bit more:
"special_small_blocks" defines what data gets written to the special devices. Every record or block smaller or equal that "special_small_blocks" will be written to the special device. Everything bigger will be written to the normal vdevs.

Let's say you've got a zvol with an 8K volblocksize and a dataset with a 128K recordsize (which are the PVE 7 defaults).

No matter how big your zvol is, it will always be chopped into blocks of 8K (your volblocksize). So even if your zvol is 8TB, it will consist of roughly 1,000,000,000x 8K blocks. So if your special_small_blocks is greater than or equal to 8K, all of the 8TB will be stored on the special devices. If your special_small_blocks is smaller than 8K, then everything will be stored on the normal vdevs (your HDDs).

For files in a dataset it depends. The recordsize is more like an "up to" value. Store a 3KB file and it will create a 4K record. Store a 20KB file and it will create a 32K record. Store a 1MB file and it will create 8x 128K records (as the 128K recordsize is the upper limit; everything bigger will be chopped into records of at most this 128K recordsize). So with a 128K special_small_blocks and a 128K recordsize, everything would be written to the special device, no matter if that file is 1KB or 1TB.
So special_small_blocks=1M would be a terrible idea, unless your recordsize is 2M or bigger. You usually want the special_small_blocks smaller than your recordsize for your datasets.
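
To see what you are currently working with, you can check the relevant properties for the pool and everything below it (POOLNAME is a placeholder):

Code:
# recordsize/volblocksize vs. special_small_blocks for the pool and all children
zfs get -r recordsize,volblocksize,special_small_blocks POOLNAME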
 
Thank you all for your answers!
So special_small_blocks=1M would be a terrible idea, unless your recordsize is 2M or bigger. You usually want the special_small_blocks smaller than your recordsize for your datasets.

I ran Wendell's command for a histogram of file sizes on my pool, and this is what I got:
datasets.png

What "special_small_blocks" would you recommend here?
Also, are there any downsides when it comes to the metadata special device being separate from the main storage devices, in regard to pool degradation / rebuilding, etc.? Anything I should keep in mind (other than running it in a mirror to avoid data loss)?

And one more thing: if I go ahead with the setup, I don't need to create a ZFS mirror with the SSDs first, right? In itNGO's command, "sdX" and "sdY" represent each of the SSDs (in my scenario), right?
zpool add POOLNAME special mirror /dev/sdX /dev/sdY

Thank you!
 
What "special_small_blocks" would you recommend here?
Personally, I have distinct datasets with their own special_small_blocks, i.e. I want to control what goes where. It's hard to auto-optimize, and if you run out of space on the special device, your metadata will go to the hard disks. The only advice I can give is to split your data among as many datasets as you can, so that you can send/receive them in order to rebalance if you ever need to do so.
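
As a sketch of what that per-dataset control can look like (dataset names and values are just examples, not a recommendation):

Code:
# a dataset for small files: everything up to 64K goes to the special vdev
zfs create -o special_small_blocks=64K POOLNAME/documents
# a dataset for large media: bigger recordsize, only metadata on the SSDs
zfs create -o recordsize=1M -o special_small_blocks=0 POOLNAME/media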

Also, are there any downsides when it comes to the metadata special device being separate from the main storage devices, in regard to pool degradation / rebuilding, etc.? Anything I should keep in mind (other than running it in a mirror to avoid data loss)?
Rebuilding is done on a vdev basis, so you have to rebuild your data disks from your data disks and your special device from your special device.

And one more thing: if I go ahead with the setup, I don't need to create a ZFS mirror with the SSDs first, right? In itNGO's command, "sdX" and "sdY" represent each of the SSDs (in my scenario), right?
Yes, just add the special mirror and then send/receive each dataset of your pool in order to split the metadata. If you don't do that, only the newly created metadata goes to the SSDs.
 
The only advice I can give ...
Oh, that is actually wrong. A while back I ran into a problem with the "free space" of the pool. You can run out of space on your data vdevs and not see it directly via zpool list.

I have a virtual appliance that acts as a ZFS-over-iSCSI storage for PVE and it has three disks on different backend storage devices so that I can create a pool with data, metadata and SLOG/ZIL:

Code:
root@proxmox-zfs-storage ~ > zpool list -v
NAME                                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zpool                                   1.12T   856G   291G        -         -    62%    74%  1.00x    ONLINE  -
  scsi-0QEMU_QEMU_HARDDISK_drive-scsi3  1020G   837G   183G        -         -    62%  82.0%      -  ONLINE
special                                     -      -      -        -         -      -      -      -  -
  scsi-0QEMU_QEMU_HARDDISK_drive-scsi2   127G  19.3G   108G        -         -    13%  15.2%      -  ONLINE
logs                                        -      -      -        -         -      -      -      -  -
  scsi-0QEMU_QEMU_HARDDISK_drive-scsi1  7.50G  20.4M  7.48G        -         -     0%  0.26%      -  ONLINE

I ran out of capacity even though the plain zpool list (without -v) showed less used space.
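
Comparing the two views makes this visible: the plain listing only shows pool-wide totals, while -v breaks ALLOC/FREE/CAP down per vdev (zpool is the pool name from the example above):

Code:
root@proxmox-zfs-storage ~ > zpool list zpool
root@proxmox-zfs-storage ~ > zpool list -v zpool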
 
Now that I've checked the recordsize and realized it is the default value (128K), I see that my plan of setting "special_small_blocks" to 128K or higher is not going to work. So I guess I'd need to go with 64K, right?
Yes, just add the special mirror and then send/receive each dataset of your pool in order to split the metadata. If you don't do that, only the newly created metadata goes to the SSDs.

I'd appreciate if you could tell me how to do this "send/receive" part for existing files?

Thanks! :)
 
I'd appreciate if you could tell me how to do this "send/receive" part for existing files?
Not files, datasets. If you have only one dataset and not enough free space, you cannot move the metadata onto your SSDs. In order to send/receive your datasets, you need to create a snapshot. Let's walk through it:

Assume you have a dataset mydata; I'll create it for the sake of this example:
Code:
root@proxmox-zfs-storage ~ > zfs create zpool/mydata

root@proxmox-zfs-storage ~ > zfs list zpool/mydata
NAME           USED  AVAIL     REFER  MOUNTPOINT
zpool/mydata    24K   132G       24K  /zpool/mydata

Then create a snapshot (you have to stop everything that could potentially write to the dataset)

Code:
root@proxmox-zfs-storage ~ > zfs snapshot zpool/mydata@move

Now comes the send/receive part:

Code:
root@proxmox-zfs-storage ~ > zfs send -v zpool/mydata@move | zfs receive -v zpool/mydata-moved
full send of zpool/mydata@move estimated size is 12.6K
total estimated size is 12.6K
receiving full stream of zpool/mydata@move into zpool/mydata-moved@move
received 43.9K stream in 1 seconds (43.9K/sec)

Now you have the dataset cloned:

Code:
root@proxmox-zfs-storage ~ > zfs list zpool/mydata zpool/mydata-moved
NAME                 USED  AVAIL     REFER  MOUNTPOINT
zpool/mydata          24K   132G       24K  /zpool/mydata
zpool/mydata-moved    24K   132G       24K  /zpool/mydata-moved

You can delete the snapshot and the old data

Code:
root@proxmox-zfs-storage ~ > zfs destroy zpool/mydata@move
root@proxmox-zfs-storage ~ > zfs destroy zpool/mydata

Now the data should be split.
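
If you want the copied dataset to carry the original name (and mountpoint) again, a possible final step, using the example names from above, is a rename:

Code:
root@proxmox-zfs-storage ~ > zfs rename zpool/mydata-moved zpool/mydata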
 
Emm, to be honest I completely lost you there.

I have a ZFS mirror pool with 2x 18TB HDDs; this pool is not attached to any VMs. Now I'd like to add a special device for metadata storage, in a mirror setup, for the whole HDD pool -> 2x 18TB HDDs in a mirror with 2x 500GB SSDs as the special device, also in a mirror. The HDD pool has the default recordsize (128K).

The question is, how do I add those two SSDs as a mirrored special device for that HDD pool? And when configured with special_small_blocks of 64K, will that mean that all files of 64KB or smaller will be stored on the SSDs, and larger files will only have their metadata on the SSDs? Or, well, actually it will only start applying to NEW files, not existing ones, right?

But then, how do I ensure my new setup gets applied to existing files?
 
And when configured with special_small_blocks of 64K, will that mean that all files of 64KB or smaller will be stored on the SSDs, and larger files will only have their metadata on the SSDs? Or, well, actually it will only start applying to NEW files, not existing ones, right?
Yes, all NEW files.

But then, how do I ensure my new setup gets applied to existing files?
As I tried to lay out ... send/receive or copy all data to another machine and back.
 
You really want to write a new copy of all the data and delete the old one. A clone would just reference the old records and not write them again.
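
For completeness, a file-level sketch of that "write a new copy" approach, assuming hypothetical dataset names under POOLNAME (verify the copy before destroying anything):

Code:
# create a fresh dataset; new writes will honor special_small_blocks
zfs create POOLNAME/data-new
# copy the files, preserving permissions, hard links, ACLs and xattrs
rsync -aHAX /POOLNAME/data/ /POOLNAME/data-new/
# only after verifying the copy: drop the old dataset and take over its name
zfs destroy POOLNAME/data
zfs rename POOLNAME/data-new POOLNAME/data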
 
So, after backing up the main pool, I destroyed and re-created it, added the special device (mirrored SSDs), and set "special_small_blocks" to 64K. Afterwards I copied all files from my backup location back into this pool.
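
For reference, that setup might look roughly like this on the command line (pool and device names are placeholders, not the ones actually used):

Code:
zpool create POOLNAME mirror /dev/disk/by-id/HDD_1 /dev/disk/by-id/HDD_2
zpool add POOLNAME special mirror /dev/disk/by-id/SSD_1 /dev/disk/by-id/SSD_2
zfs set special_small_blocks=64K POOLNAME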

There's one folder on this main storage pool which is served via Samba to my Windows devices, and when I right-click 'Properties' on it, it still takes as long to count the number of files and folders as it did before. I don't think this is an SMB-related limitation (although it could be), as there's only 1 ms or less of latency between the two machines (same local network).

I expected those speeds to drastically improve.

pool with special device.png

Another thing: how do I check the space allocation of my special devices in that pool? They're not shown as a separate pool in Proxmox, my Zabbix agent isn't pulling any info from them... and I'm not that Linux savvy.
 
Check zpool list -v to see how much space your special devices have left.
Oh, I see, it says that 131 GB are allocated. Okay, so I guess the metadata as well as the small-block part did indeed work. Or well, at least I don't think that those 131 GB are purely metadata.

Any way to check where the <=64K files are stored?

Running Wendell's histogram command from within the root of the pool
Code:
find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
lists pretty much the same results as before (of course, because the histogram only looks at file sizes, not placement, and the special device is part of the same pool).
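
Two commands that can at least show how much ended up on the special mirror and whether it is being hit for reads (POOLNAME is a placeholder):

Code:
# per-vdev allocation: the special mirror's ALLOC column shows what landed there
zpool list -v POOLNAME
# per-vdev I/O statistics, refreshed every 5 seconds, while you browse the share
zpool iostat -v POOLNAME 5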
 
No matter how big your zvol is, it will always be chopped into blocks of 8K (your volblocksize). So even if your zvol is 8TB, it will consist of roughly 1,000,000,000x 8K blocks. So if your special_small_blocks is greater than or equal to 8K, all of the 8TB will be stored on the special devices.
Firstly, sorry about resurrecting this old thread.
Does the special_small_blocks apply to zvols? I thought it only applied to datasets.
 
Firstly, sorry about resurrecting this old thread.
Does the special_small_blocks apply to zvols? I thought it only applied to datasets.
Yes, it does. For zvols the volblocksize is what counts, and it needs to be bigger than special_small_blocks or everything will end up on the special devices.
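
A quick way to check this for an existing zvol (the PVE-style volume name is just an example):

Code:
zfs get volblocksize,special_small_blocks POOLNAME/vm-100-disk-0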
 
