ZFS special device: how to use the unused free space?

Mecanik

I know this might be a stupid question, but it may also be a good one. I made a mistake when I set up a special device with 700+ GB in mirrored mode, the mistake being that far too much space is dedicated to it. I already have a separate partition for the SLOG, but that's unrelated to the question.

The special-device disks are high-end NVMe drives and the main ones are normal HDDs. The special vdev only uses around 15 GB with my workload, and I suspect it will stay that way.

However, I would like to use some of the free space, and I don't know how. Basically there are a good few hundred GB just sitting there doing nothing, and I would like to use them for storing images, templates, etc.

From what I've read, removing special devices is not possible, so I have to leave it as is. Remaking the pool at this point is not possible either, so the setup stays as it is.

Is this even possible? I could not find anything on Google, or maybe I searched for the wrong thing.

Please advise.
 
If the rest of the pool is raidz, you can't remove the special device(s). If it's a mirror and you have enough space, you can, but you will incur a performance penalty until all the data previously stored on the special devices is no longer needed (it will be 'remapped' and thus needs one extra layer of indirection for each access).

Code:
$ man zpool
[...]
     zpool remove [-np] pool device...
             Removes the specified device from the pool.  This command supports removing hot spare, cache, log, and both mirrored and non-redundant
             primary top-level vdevs, including dedup and special vdevs.  When the primary pool storage includes a top-level raidz vdev only hot spare,
             cache, and log devices can be removed.

             Removing a top-level vdev reduces the total amount of space in the storage pool.  The specified device will be evacuated by copying all
             allocated space from it to the other devices in the pool.  In this case, the zpool remove command initiates the removal and returns, while
             the evacuation continues in the background.  The removal progress can be monitored with zpool status.  If an IO error is encountered during
             the removal process it will be cancelled.  The device_removal feature flag must be enabled to remove a top-level vdev, see zpool-features(5).

             A mirrored top-level device (log or data) can be removed by specifying the top-level mirror for the same.  Non-log devices or data devices
             that are part of a mirrored configuration can be removed using the zpool detach command.

             -n      Do not actually perform the removal ("no-op").  Instead, print the estimated amount of memory that will be used by the mapping table
                     after the removal completes.  This is nonzero only for top-level vdevs.

             -p      Used in conjunction with the -n flag, displays numbers as parsable (exact) values.
[...]

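For illustration only (the pool and vdev names here are hypothetical; check zpool status for the real ones), a dry run followed by the actual removal of a mirrored special vdev would look roughly like this:

Code:
# dry run: only prints the estimated mapping-table memory cost, removes nothing
$ zpool remove -n rpool mirror-2

# actual removal; evacuation runs in the background and can be watched with zpool status
$ zpool remove rpool mirror-2
$ zpool status rpool
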
If you are using GRUB and this is your rpool, I'd be wary, since GRUB probably does not understand the device_removal feature and your system will likely not boot anymore.

The best solution would probably be to re-create your pool from scratch.
 

Thank you, I've read about this already. Unfortunately I can't rebuild the rpool at this point.

I was thinking of simply mounting it at "/mnt/sdd", for example; is that not an option? As I said, I would only store templates, etc.
 
No, you can't mount a special vdev (or any other vdev).
 
Yes, that's what I'm asking the question about.

So what is your question then? You can in fact use it as described on the webpage, and it works, so if you have read the page you already know how to proceed:

Code:
root@proxmox ~ > zpool list -v zpool
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zpool                                           11,0T  5,59T  5,43T        -         -     0%    50%  1.00x    ONLINE  -
  [...]
special                                             -      -      -        -         -      -      -      -  -
  mirror                                         111G  9,93G   101G        -         -     5%  8,94%      -  ONLINE
    sda2                                            -      -      -        -         -      -      -      -  ONLINE
    sdb2                                            -      -      -        -         -      -      -      -  ONLINE
  [...]

root@proxmox ~ > zfs create -o special_small_blocks=1M zpool/test-small-blocks

root@proxmox ~ > dd if=/dev/urandom of=/zpool/test-small-blocks/test1 bs=1M count=1024
[...]

root@proxmox ~ > zpool list -v zpool
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zpool                                           11,0T  5,59T  5,43T        -         -     0%    50%  1.00x    ONLINE  -
  [...]
special                                             -      -      -        -         -      -      -      -  -
  mirror                                         111G  10,9G   100G        -         -     5%  9,79%      -  ONLINE
    sda2                                            -      -      -        -         -      -      -      -  ONLINE
    sdb2                                            -      -      -        -         -      -      -      -  ONLINE
  [...]

You can see that the special mirror now holds 1 GB more - that's why I'm using it. I also extremely oversized the special device (111G in my case).
 
Usually you should thin provision so you don't run into such problems.

You could just add a child vdev and mount it?

You need to give us something to work with; I'm not a fortune teller.

"zpool status" "zpool list" "zfs list"
 

Well, this looks interesting, but I'm a bit confused. You are creating an empty dataset with a small-block size of 1M (why?) and then writing 0's to it. From what I can see it sits only on the special vdev, which is great.


I'm already using thin provisioning, so that's not the issue here. Space is not the problem; I could keep them on the pool where they are now. The issue is that I want to take advantage of the free space on the NVMes to help reduce the I/O when cloning.

Sorry I didn't mention this from the get-go, but that's the ultimate goal.
 

I meant: please post the output of the three commands
"zpool status" "zpool list" "zfs list"

so we can take a look at the current configuration.
 

Sure, here it is:

Bash:
~# zpool status
  pool: rpool
state: ONLINE
  scan: scrub repaired 0B in 0 days 02:03:32 with 0 errors on Sun May 10 02:27:33 2020
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sda2       ONLINE       0     0     0
            sdb2       ONLINE       0     0     0
        special
          mirror-2     ONLINE       0     0     0
            nvme1n1p2  ONLINE       0     0     0
            nvme0n1p2  ONLINE       0     0     0
        logs
          mirror-1     ONLINE       0     0     0
            nvme1n1p1  ONLINE       0     0     0
            nvme0n1p1  ONLINE       0     0     0

errors: No known data errors

Bash:
~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool  6.20T  1.56T  4.65T        -         -    13%    25%  1.00x    ONLINE  -

Bash:
~# zfs list
NAME                               USED  AVAIL     REFER  MOUNTPOINT
rpool                             2.46T  2.82T      136K  /rpool
rpool/ROOT                        8.31G  2.82T       96K  /rpool/ROOT
rpool/ROOT/pve-1                  8.31G  2.82T     8.31G  /
rpool/data                          96K  2.82T       96K  /rpool/data
rpool/swap                        4.25G  2.82T      993M  -
/// machines disks
/// templates disks
 
Well, this looks interesting, but I'm a bit confused. You are creating an empty dataset with a small-block size of 1M (why?) and then writing 0's to it. From what I can see it sits only on the special vdev, which is great.

The special_small_blocks value is the upper limit for blocks that get stored in the special device class, as stated on the wiki. So, depending on the recordsize (also 1M in my case), all records smaller than that (I don't know offhand whether the limit is 'less than' or 'less than or equal') land on the special device, which is what you asked about in the thread title. I use this e.g. for the PVE root, so that you boot off the SSDs in a special class of an otherwise pure HDD pool. All other "need-to-be-fast" stuff can go into its own datasets there. The default for special_small_blocks is 0, so that nothing goes onto the special devices. You can also set it globally to 4K (as mentioned in the wiki article) so that all small files up to 4K go directly onto the SSDs.

So in summary, this is what I thought you wanted to know.
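
As a sketch of what that looks like in practice (the dataset name here is made up), the property can be set pool-wide or per dataset:

Code:
# pool-wide default, inherited by child datasets: records up to 4K go to the special vdev
$ zfs set special_small_blocks=4K rpool

# per-dataset: with recordsize=1M and special_small_blocks=1M, practically everything
# written to this dataset ends up on the special vdev
$ zfs create -o recordsize=1M -o special_small_blocks=1M rpool/fast
$ zfs get special_small_blocks,recordsize rpool/fast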
 

Thank you, however I have already set this to 128K to match the recordsize of 128K. The volblocksize is 8K and I cannot change it... unfortunately. It works pretty well, considering I only have Windows KVM machines at the moment.

What I wanted to know is whether I can literally put 5 templates on that device and clone from there, to reduce the HDDs' I/O cost and delay.
 
then writing 0's to it.

Oh, I forgot to mention this: /dev/urandom is not zeros; I read that in another thread and commented there as well, but it wasn't understood properly. The device that gives zeros is /dev/zero. The character device /dev/urandom is a pseudorandom bit stream that should be incompressible, which is why I used it to generate 1 GB of random data. If I had written zeros, you would not have seen anything special because of ZFS and its compression, and I think that is why you're puzzled.
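
A quick way to see that difference for yourself (assuming compression, e.g. lz4, is enabled on the dataset from my earlier example):

Code:
# zeros compress to almost nothing, random data does not
$ dd if=/dev/zero    of=/zpool/test-small-blocks/zeros  bs=1M count=1024
$ dd if=/dev/urandom of=/zpool/test-small-blocks/random bs=1M count=1024

# on-disk usage and the resulting compression ratio
$ du -h /zpool/test-small-blocks/zeros /zpool/test-small-blocks/random
$ zfs get compression,compressratio zpool/test-small-blocks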
 
What I wanted to know is whether I can literally put 5 templates on that device and clone from there, to reduce the HDDs' I/O cost and delay.

Yes, that should work flawlessly (as far as the template logic goes). After the template and the real VM diverge following updates, you get a very strange I/O response time from the VM's point of view.
 
Thank you, however I have already set this to 128K to match the recordsize of 128K. The volblocksize is 8K and I cannot change it... unfortunately. It works pretty well, considering I only have Windows KVM machines at the moment.

You cannot change it afterwards. Just create a new zvol with the correct volblocksize and special_small_blocks and dd the old device onto the new one. Afterwards, destroy the old dataset and rename the new one to the old name.
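
A rough sketch of that procedure, with made-up names, sizes and block sizes (the VM should be stopped while copying, and the new zvol must be at least as large as the old one):

Code:
# new zvol with the desired properties
$ zfs create -V 32G -o volblocksize=64K -o special_small_blocks=64K rpool/data/vm-100-disk-1

# copy the old zvol block for block onto the new one
$ dd if=/dev/zvol/rpool/data/vm-100-disk-0 of=/dev/zvol/rpool/data/vm-100-disk-1 bs=1M status=progress

# swap names: drop the old zvol and rename the new one into its place
$ zfs destroy rpool/data/vm-100-disk-0
$ zfs rename rpool/data/vm-100-disk-1 rpool/data/vm-100-disk-0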
 
Yes, that should work flawlessly (as far as the template logic goes). After the template and the real VM diverge following updates, you get a very strange I/O response time from the VM's point of view.

Interesting, can you give me an example of how to put a template there? I can't figure out how to literally drop a file there.

As for "After the template and the real VM diverge following updates, you get a very strange I/O response time from the VM's point of view" - I'm not sure I understand; is this bad?

You cannot change it afterwards. Just create a new zvol with the correct volblocksize and special_small_blocks and dd the old device onto the new one. Afterwards, destroy the old dataset and rename the new one to the old name.

Unfortunately I cannot do that... because it's production, and I cannot modify disks or risk failures, etc., so it is what it is.
 

What you can do is create a new dataset with the aforementioned special_small_blocks setting and add it as a new ZFS storage (e.g. zfs-fast) to PVE; then you can move the disk onto this new storage online.
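
A hedged CLI sketch of that idea (dataset name, storage ID, VM ID and disk are all made up; the same can be done through the GUI):

Code:
# dataset whose (small) blocks land on the special vdev
$ zfs create -o special_small_blocks=128K rpool/fast

# register it as an additional ZFS storage in PVE
$ pvesm add zfspool zfs-fast --pool rpool/fast --content images,rootdir --sparse 1

# move a VM disk onto the new storage while the VM is running
$ qm move_disk 100 scsi0 zfs-fast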

As for "After the template and the real VM diverge after updates, you get a very strange I/O response time from the view of the VM. " - Not sure I understand, is this bad ?

Not bad, but strange. Different I/O times can "feel" strange.

Interesting, can you give me an example of how to put a template there? I can't figure out how to literally drop a file there.

Either use the new dataset as a directory storage for storing files, or use it for zvols (the default in PVE) for VMs. I'd clone the dataset to be used as a shared template, but you have to do that manually. I don't use PVE templates; I use normal machines and clone them, so I have an updatable template, which is much nicer than a static template. I only do this for Windows; I can reinstall a Linux via netboot automatically, including all security updates, faster than I could clone and update it afterwards.
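
For example, cloning such an "updatable template" VM onto the fast storage could look like this (the VM IDs, name and the zfs-fast storage from the sketch above are made up):

Code:
# full clone of VM 9000 into a new VM 123, placing the new disks on the zfs-fast storage
$ qm clone 9000 123 --name win10-clone --full --storage zfs-fast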
 

This is very interesting information, thank you. I will do some testing and see.
 
