ZFS special device: how to use the unused free space?

Mecanik

I know this might be a stupid question, but it may also be a good one. I made a mistake when I set up a special device with 700+ GB in mirrored mode, the mistake being that far too much space is dedicated to it. I already have a separate partition for the SLOG, but that's unrelated to the question.

The special-device disks are high-end NVMe drives and the main ones are normal HDDs. The special vdev only uses around 15 GB with my workload, and I suspect it will stay that way.

However, I would like to use some of the free space, and I don't know how. Basically there are a good few hundred GB just sitting there doing nothing, and I would like to use them for storing images, templates, etc.

From what I've read, removing special devices is not possible, so I have to leave it as is. Remaking the pool at this point is not possible either, so the setup stays as it is.

Is this even possible? I could not find anything on Google, or maybe I searched for the wrong thing.

Please advise.
 
If the rest of the pool is raidz, you can't remove the special device(s). If it's a mirror and you have enough space, you can, but you will incur a performance penalty until all the data previously stored on the special devices is no longer needed (it will be 'remapped' and thus needs one extra layer of indirection for each access).

Code:
$ man zpool
[...]
     zpool remove [-np] pool device...
             Removes the specified device from the pool.  This command supports removing hot spare, cache, log, and both mirrored and non-redundant
             primary top-level vdevs, including dedup and special vdevs.  When the primary pool storage includes a top-level raidz vdev only hot spare,
             cache, and log devices can be removed.

             Removing a top-level vdev reduces the total amount of space in the storage pool.  The specified device will be evacuated by copying all
             allocated space from it to the other devices in the pool.  In this case, the zpool remove command initiates the removal and returns, while
             the evacuation continues in the background.  The removal progress can be monitored with zpool status.  If an IO error is encountered during
             the removal process it will be cancelled.  The device_removal feature flag must be enabled to remove a top-level vdev, see zpool-features(5).

             A mirrored top-level device (log or data) can be removed by specifying the top-level mirror for the same.  Non-log devices or data devices
             that are part of a mirrored configuration can be removed using the zpool detach command.

             -n      Do not actually perform the removal ("no-op").  Instead, print the estimated amount of memory that will be used by the mapping table
                     after the removal completes.  This is nonzero only for top-level vdevs.

             -p      Used in conjunction with the -n flag, displays numbers as parsable (exact) values.
[...]

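For illustration only (the pool and vdev names here are hypothetical; check zpool status for the real ones), a dry run followed by the actual removal of a mirrored special vdev would look roughly like this:

Code:
# dry run: only prints the estimated mapping-table memory cost, removes nothing
$ zpool remove -n rpool mirror-2

# actual removal; evacuation runs in the background and can be watched with zpool status
$ zpool remove rpool mirror-2
$ zpool status rpool
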
If you are using GRUB and this is your rpool, I'd be wary, since GRUB probably does not understand the device_removal feature and your system will likely not boot anymore.

The best solution would probably be to re-create your pool from scratch.
 

Thank you, I've read about this already. Unfortunately I can't rebuild the rpool at this point.

I was thinking of simply mounting it at "/mnt/sdd", for example; is that not an option? As I said, I would only store templates, etc.
 
No, you can't mount a special vdev (or any other vdev).
 
Yes, that's what I'm asking the question about.

So what is your question then? You can in fact use it as described on the webpage, and it works, so if you have read the page you already know how to proceed:

Code:
root@proxmox ~ > zpool list -v zpool
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zpool                                           11,0T  5,59T  5,43T        -         -     0%    50%  1.00x    ONLINE  -
  [...]
special                                             -      -      -        -         -      -      -      -  -
  mirror                                         111G  9,93G   101G        -         -     5%  8,94%      -  ONLINE
    sda2                                            -      -      -        -         -      -      -      -  ONLINE
    sdb2                                            -      -      -        -         -      -      -      -  ONLINE
  [...]

root@proxmox ~ > zfs create -o special_small_blocks=1M zpool/test-small-blocks

root@proxmox ~ > dd if=/dev/urandom of=/zpool/test-small-blocks/test1 bs=1M count=1024
[...]

root@proxmox ~ > zpool list -v zpool
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zpool                                           11,0T  5,59T  5,43T        -         -     0%    50%  1.00x    ONLINE  -
  [...]
special                                             -      -      -        -         -      -      -      -  -
  mirror                                         111G  10,9G   100G        -         -     5%  9,79%      -  ONLINE
    sda2                                            -      -      -        -         -      -      -      -  ONLINE
    sdb2                                            -      -      -        -         -      -      -      -  ONLINE
  [...]

You can see that the special mirror now holds 1 GB more - that's why I'm using it. I also extremely oversized the special device (111G in my case).
 
Usually you should thin provision so you don't run into such problems.

You could just add a child vdev and mount it?

You need to give us something to work with; I'm not a fortune teller.

"zpool status" "zpool list" "zfs list"
 

Well, this looks interesting, but I'm a bit confused. You are creating an empty dataset with a small-block size of 1M (why?) and then writing 0's to it. From what I can see it sits only on the special vdev, which is great.


I'm already using thin provisioning, so that's not the issue here. Space is not the problem; I could keep them on the pool where they are now. The issue is that I want to take advantage of the free space on the NVMes to help reduce the I/O when cloning.

Sorry I didn't mention this from the get-go, but that's the ultimate goal.
 

I meant: please post the output of the three commands
"zpool status" "zpool list" "zfs list"

so we can take a look at the current configuration.
 

Sure, here it is:

Bash:
~# zpool status
  pool: rpool
state: ONLINE
  scan: scrub repaired 0B in 0 days 02:03:32 with 0 errors on Sun May 10 02:27:33 2020
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sda2       ONLINE       0     0     0
            sdb2       ONLINE       0     0     0
        special
          mirror-2     ONLINE       0     0     0
            nvme1n1p2  ONLINE       0     0     0
            nvme0n1p2  ONLINE       0     0     0
        logs
          mirror-1     ONLINE       0     0     0
            nvme1n1p1  ONLINE       0     0     0
            nvme0n1p1  ONLINE       0     0     0

errors: No known data errors

Bash:
~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool  6.20T  1.56T  4.65T        -         -    13%    25%  1.00x    ONLINE  -

Bash:
~# zfs list
NAME                               USED  AVAIL     REFER  MOUNTPOINT
rpool                             2.46T  2.82T      136K  /rpool
rpool/ROOT                        8.31G  2.82T       96K  /rpool/ROOT
rpool/ROOT/pve-1                  8.31G  2.82T     8.31G  /
rpool/data                          96K  2.82T       96K  /rpool/data
rpool/swap                        4.25G  2.82T      993M  -
/// machines disks
/// templates disks
 
Well, this looks interesting, but I'm a bit confused. You are creating an empty dataset with a small-block size of 1M (why?) and then writing 0's to it. From what I can see it sits only on the special vdev, which is great.

The special_small_blocks value is the upper limit for blocks that get stored in the special device class, as stated on the wiki. So, depending on the recordsize (also 1M in my case), all records smaller than that (I don't know offhand whether the limit is 'less than' or 'less than or equal') land on the special device, which is what you asked about in the thread title. I use this e.g. for the PVE root, so that you boot off the SSDs in a special class of an otherwise pure HDD pool. All other "need-to-be-fast" stuff can go into its own datasets there. The default for special_small_blocks is 0, so that nothing goes onto the special devices. You can also set it globally to 4K (as mentioned in the wiki article) so that all small files up to 4K go directly onto the SSDs.

So in summary, this is what I thought you wanted to know.
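
As a sketch of what that looks like in practice (the dataset name here is made up), the property can be set pool-wide or per dataset:

Code:
# pool-wide default, inherited by child datasets: records up to 4K go to the special vdev
$ zfs set special_small_blocks=4K rpool

# per-dataset: with recordsize=1M and special_small_blocks=1M, practically everything
# written to this dataset ends up on the special vdev
$ zfs create -o recordsize=1M -o special_small_blocks=1M rpool/fast
$ zfs get special_small_blocks,recordsize rpool/fast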
 

Thank you, however I have already set this to 128K to match the recordsize of 128K. The volblocksize is 8K and I cannot change it... unfortunately. It works pretty well, considering I only have Windows KVM machines at the moment.

What I wanted to know is whether I can literally put 5 templates on that device and clone from there, to reduce the HDDs' I/O cost and delay.
 
then writing 0's to it.

Oh, I forgot to mention this: /dev/urandom is not zeros; I read that in another thread and commented there as well, but it wasn't understood properly. The device that gives zeros is /dev/zero. The character device /dev/urandom is a pseudorandom bit stream that should be incompressible, which is why I used it to generate 1 GB of random data. If I had written zeros, you would not have seen anything special because of ZFS and its compression, and I think that is why you're puzzled.
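
A quick way to see that difference for yourself (assuming compression, e.g. lz4, is enabled on the dataset from my earlier example):

Code:
# zeros compress to almost nothing, random data does not
$ dd if=/dev/zero    of=/zpool/test-small-blocks/zeros  bs=1M count=1024
$ dd if=/dev/urandom of=/zpool/test-small-blocks/random bs=1M count=1024

# on-disk usage and the resulting compression ratio
$ du -h /zpool/test-small-blocks/zeros /zpool/test-small-blocks/random
$ zfs get compression,compressratio zpool/test-small-blocks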
 
What I wanted to know is whether I can literally put 5 templates on that device and clone from there, to reduce the HDDs' I/O cost and delay.

Yes, that should work flawlessly (as far as the template logic goes). After the template and the real VM diverge following updates, you get a very strange I/O response time from the VM's point of view.
 
Thank you, however I have already set this to 128K to match the recordsize of 128K. The volblocksize is 8K and I cannot change it... unfortunately. It works pretty well, considering I only have Windows KVM machines at the moment.

You cannot change it afterwards. Just create a new zvol with the correct volblocksize and special_small_blocks and dd the old device onto the new one. Afterwards, destroy the old dataset and rename the new one to the old name.
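
A rough sketch of that procedure, with made-up names, sizes and block sizes (the VM should be stopped while copying, and the new zvol must be at least as large as the old one):

Code:
# new zvol with the desired properties
$ zfs create -V 32G -o volblocksize=64K -o special_small_blocks=64K rpool/data/vm-100-disk-1

# copy the old zvol block for block onto the new one
$ dd if=/dev/zvol/rpool/data/vm-100-disk-0 of=/dev/zvol/rpool/data/vm-100-disk-1 bs=1M status=progress

# swap names: drop the old zvol and rename the new one into its place
$ zfs destroy rpool/data/vm-100-disk-0
$ zfs rename rpool/data/vm-100-disk-1 rpool/data/vm-100-disk-0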
 
Yes, that should work flawlessly (as far as the template logic goes). After the template and the real VM diverge following updates, you get a very strange I/O response time from the VM's point of view.

Interesting, can you give me an example of how to put a template there? I can't figure out how to literally drop a file there.

As for "After the template and the real VM diverge following updates, you get a very strange I/O response time from the VM's point of view" - I'm not sure I understand; is this bad?

You cannot change it afterwards. Just create a new zvol with the correct volblocksize and special_small_blocks and dd the old device onto the new one. Afterwards, destroy the old dataset and rename the new one to the old name.

Unfortunately I cannot do that... because it's production, and I cannot modify disks or risk failures, etc., so it is what it is.
 

What you can do is create a new dataset with the aforementioned special_small_blocks setting and add it as a new ZFS storage (e.g. zfs-fast) to PVE; then you can move the disk onto this new storage online.
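
A hedged CLI sketch of that idea (dataset name, storage ID, VM ID and disk are all made up; the same can be done through the GUI):

Code:
# dataset whose (small) blocks land on the special vdev
$ zfs create -o special_small_blocks=128K rpool/fast

# register it as an additional ZFS storage in PVE
$ pvesm add zfspool zfs-fast --pool rpool/fast --content images,rootdir --sparse 1

# move a VM disk onto the new storage while the VM is running
$ qm move_disk 100 scsi0 zfs-fast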

As for "After the template and the real VM diverge after updates, you get a very strange I/O response time from the view of the VM. " - Not sure I understand, is this bad ?

Not bad, but strange. Different I/O times can "feel" strange.

Interesting, can you give me an example of how to put a template there? I can't figure out how to literally drop a file there.

Either use the new dataset as a directory storage for storing files, or use it for zvols (the default in PVE) for VMs. I'd clone the dataset to be used as a shared template, but you have to do that manually. I don't use PVE templates; I use normal machines and clone them, so I have an updatable template, which is much nicer than a static template. I only do this for Windows; I can reinstall a Linux via netboot automatically, including all security updates, faster than I could clone and update it afterwards.
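
For example, cloning such an "updatable template" VM onto the fast storage could look like this (the VM IDs, name and the zfs-fast storage from the sketch above are made up):

Code:
# full clone of VM 9000 into a new VM 123, placing the new disks on the zfs-fast storage
$ qm clone 9000 123 --name win10-clone --full --storage zfs-fast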
 

This is very interesting information, thank you. I will do some testing and see.
 
