ZFS storage is very full, we would like to increase the space but...

Just for fun, we can do the math for your setup and see what you should get with 128k.
Let's look at a 132k write.

The first block is an incompressible 128k write. The second is also a 128k block, but with only 4k of data and the rest zeros. LZ4 can compress that down to a 4k block.
Side note: this is one of the reasons why you should never disable compression, even if your data is not compressible.

Ok, let's take a look at the first 128k block.
A single full stripe looks like this:
P + P + D + D + D + D + D + D.
So a single stripe offers 6 * 4k = 24k of data.
128k / 24k = 5.333
So we take 5 "full stripes" to get 5 * 24k = 120k.
We still need another 8k.
So we add another, shorter stripe like this:
P + P + D + D.

In total we now have 5 * 8 sectors + 4 sectors = 44 sectors. RAIDZ2 rounds every allocation up to a multiple of parity + 1 = 3 sectors, and 44 is not a multiple of 3, so we add one padding sector to get 45 sectors.

The second block contains 4k of real data and the rest zeros. It is still a 128k volblock, but LZ4 compresses it down to a 4k actual write:
P + P + D. Another 3 sectors (already a multiple of 3, so no padding needed).

We now have 48 sectors, or 48 * 4k = 192k of raw space, to store the 128k of incompressible data.
128 / 192 * 100 = 66.66%

Now, I don't know what happens at the OS level if you write another 132k. I think it should reuse the previous second volblock that is mostly empty, but that is way above my pay grade :)

So for simplicity, if you really want to run a test, I would do it with a 16k volblocksize and just fill the zvol with random data.
You will get 66.66% instead of 75% for an 8-wide RAIDZ2.
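If you want to sanity-check those numbers, here is a rough sketch of the allocation math above as a shell snippet. It assumes 4k sectors (ashift=12) and an 8-wide RAIDZ2, and it ignores metadata and compression entirely, so treat it as the simplified model from this post, not as a reimplementation of the ZFS allocator.

# Simplified model of RAIDZ2 allocation: 4k sectors (ashift=12), 8-wide vdev,
# metadata and compression ignored. Numbers match the walkthrough above.
sector=4096; width=8; parity=2
data_disks=$(( width - parity ))

for vb in $(( 128 * 1024 )) $(( 16 * 1024 )); do
    data=$(( vb / sector ))                                # data sectors per volblock
    stripes=$(( (data + data_disks - 1) / data_disks ))   # stripes needed (rounded up)
    total=$(( data + stripes * parity ))                  # data + parity sectors
    mult=$(( parity + 1 ))                                 # RAIDZ pads to a multiple of parity+1
    alloc=$(( (total + mult - 1) / mult * mult ))          # allocated sectors incl. padding
    echo "volblock $(( vb / 1024 ))k -> $alloc sectors, efficiency $(( 100 * data / alloc ))%"
done
# volblock 128k -> 45 sectors, efficiency 71%
# volblock 16k  ->  6 sectors, efficiency 66%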
 
Being a German, I know that German saying...

As you saw, the actual real-world results with a volblocksize of 128k do not show what the theory says. On a side note, I have a Proxmox Backup Server VM with a 2TByte data disk on the same raidz2 array. It shows inside the VM:

Filesystem Size Used Avail Use% Mounted on
pbs 2.0T 1.3T 676G 66% /mnt/datastore/pbs

The data should be hardly compressible, because PBS only writes compressed data. The data has been collected over months, so the write pattern should also be realistic. Also note that, according to your cited article, I am now officially "that guy that has a 16TB Windows Server VM disk, just to host Plex :)" - in that I did the dumb thing of using a large VM disk volume with lots of padding instead of using a ZFS dataset.

And here - tada - is the result on the PVE host:
NAME USED REFER AVAIL VOLBLOCK
huge/backup/vm-130-disk-0 1.28T 1.28T 37.1T 128K




But you wanted a 16k volblocksize. So there you go: I created the 1G VM disk again on the same array, but this time with the VM stopped. Then I did:

zfs destroy huge/backup/vm-105-disk-0
zfs create -V 1G -b 16k -s -o compression=off huge/backup/vm-105-disk-0
zfs list -o name,used,refer,avail,volblocksize huge/backup/vm-105-disk-0


which gave:

NAME USED REFER AVAIL VOLBLOCK
huge/backup/vm-105-disk-0 154K 154K 37.1T 16K

as it would be if the volblocksize of the array had been 16k in the first place. Also, there will be no compression this time, regardless of what data I write.
Then I started up the VM and again did:

dd if=/dev/urandom of=/dev/vdb bs=4K count=262144 status=progress oflag=direct,sync

I had to interrupt this one at ~100 MByte, because the data was written at only 60 kB/s due to an ongoing ZFS scrub on the array. The result was:

NAME USED REFER AVAIL VOLBLOCK
huge/backup/vm-105-disk-0 104M 104M 37.1T 16K

Note the USED to REFER ratio, which is 1.
What I saw in the first test underlined my point: when I issued the zfs list command, I saw both USED and REFER increase in increments of ~2 MByte, which seems to be the size of the ZFS transaction groups.

dd if=/dev/urandom of=/dev/vdb bs=64K count=16384 status=progress oflag=direct,sync

NAME USED REFER AVAIL VOLBLOCK
huge/backup/vm-105-disk-0 1.07G 1.07G 37.1T 16K

Again, ratio is 1.

dd if=/dev/urandom of=/dev/vdb bs=68K count=15269 status=progress oflag=direct,sync

NAME USED REFER AVAIL VOLBLOCK
huge/backup/vm-105-disk-0 1.07G 1.07G 37.1T 16K

Again, ratio is 1.

dd if=/dev/urandom of=/dev/vdb bs=132K count=7866 status=progress oflag=direct,sync

NAME USED REFER AVAIL VOLBLOCK
huge/backup/vm-105-disk-0 1.06G 1.06G 37.1T 16K

Just to complete this, I also tried:

dd if=/dev/urandom of=/dev/vdb bs=12K count=87381 status=progress oflag=direct,sync

NAME USED REFER AVAIL VOLBLOCK
huge/backup/vm-105-disk-0 1.07G 1.07G 37.1T 16K

I also tried writing on the PVE host itself, with the same results.

So while I can trigger the theoretical problem with synthetic tests, I almost never see the anticipated worst case in real tests, which is what I said from the start.

P.S.: This whole discussion does not even apply to LXCs, because they use datasets instead of volumes, anyway.

P.P.S.: As you might have guessed from the abysmal speed of my array, this one is built of HDDs, so I am not thrashing my SSDs ;-)
 
As you saw, the actual real-world results with a volblocksize of 128k do not show what the theory says.
No, that is not an actual real-world result, considering how complicated free space reporting is in ZFS.
That is: "wer misst, misst Mist" (who measures, measures rubbish).
A real-world test would be you setting up a 4-wide RAIDZ1 vdev with 1TB HDDs, creating a 16k zvol with compression disabled, putting a RAW image inside that zvol and filling the disk with data, just to see that you won't be able to write your expected 3TB, because the pool will run out of space sooner.
That is a real-world test :)
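If someone actually wants to run that, a hedged sketch of the commands could look like the following. Pool name, device paths and the zvol size are placeholders for a throwaway pool, and it simplifies things by filling the zvol directly instead of going through a RAW image in a VM:

# throwaway 4-wide RAIDZ1 pool of 1TB disks; adjust device paths to your hardware
zpool create -o ashift=12 testpool raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd
# sparse 3T zvol with 16k volblocksize and compression off
zfs create -s -V 3T -b 16k -o compression=off testpool/rawtest
# fill it with incompressible data; expect ENOSPC well before 3TB have been written
dd if=/dev/urandom of=/dev/zvol/testpool/rawtest bs=1M status=progress
zpool list testpool
zfs list -o name,used,refer,avail,volblocksize testpool/rawtest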

Since I think we are getting into a dead end here, let me try a different approach.
Why do you think there are many, many users reporting more storage used than anticipated?
Why do you think that happens and what is the culprit for these users?

Also, discussing a 128k volblocksize is fairly meaningless, since that is an absurdly high volblocksize for most users.

But hey, you don't have to believe me.
Go to https://www.truenas.com/docs/references/zfscapacitycalculator/
Use a 16k recordsize, RAIDZ1, 4 drives wide, 1TB per drive.
What do you get? Even worse numbers than I gave you (because I ignore some stuff).

[Screenshot: TrueNAS capacity calculator result for a 4-wide RAIDZ1 with 16k recordsize]


You can do the same thing with your setup, 8-wide 18TB drives, RAIDZ2 with 64k volblocksize, and get this:
[Screenshot: TrueNAS capacity calculator result for an 8-wide RAIDZ2 with 18TB drives and 64k volblocksize]

So you see, my numbers are not a worst case scenario. On the contrary, they are a best case scenario.

My wild guess is that you have so much free space that you simply have not noticed yet that you fell into that ZFS trap and only get roughly 71% instead of 75%.
 
BTW, that calculator has an interesting note regarding the zfs list command you used:
Note: The zfs list command always assumes a 128KiB recordsize for capacity calculations. The usable capacity values displayed below represent how much data you can store on a pool with a given recordsize, but changing your recordsize will not change the capacity that zfs list displays.
Maybe this is what skewed your tests? I honestly have no idea.
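If anyone wants to dig into that, one way to cross-check (using the zvol name from the test above; adjust to your own pool) is to compare the logical space the guest wrote with what ZFS actually charged for it, and to look at the raw pool-level allocation:

# exact byte values for logical vs. charged space of the test zvol
zfs get -p used,referenced,logicalused,logicalreferenced,volblocksize huge/backup/vm-105-disk-0
# raw pool-level view: ALLOC includes parity and padding
zpool list -v huge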

Here is another, IMHO even better, calculator. If you scroll all the way down to the bottom, there are also explanations of the math behind it:
https://jro.io/capacity
 
Another great explanation of that topic:
https://forums.truenas.com/t/understanding-openzfs-capacity/2444

It even has a picture of a 7-wide RAIDZ2 with 128k. Not exactly your 8-wide, but pretty close:

[Image from the linked post: sector layout of a single 128k block on a 7-wide RAIDZ2, including two padding sectors marked X]
See the two X sectors? That is 8k of padding for every single 128k volblock.

It is similar with your 8 drives.
P + P + D + D + D + D + D + D offers 24k of data per stripe.
You need that 5 times to get 120k.
You still need another 8k.
So we add another stripe that looks like this:
P + P + D + D.

In total we now have 5 * 8 sectors + 4 sectors = 44 sectors. That is not a multiple of 3, so we add one padding sector to get 45 sectors. So in the end, for every single 128k volblock, you get exactly this; there is no way around it.

P + P + D + D + D + D + D + D
P + P + D + D + D + D + D + D
P + P + D + D + D + D + D + D
P + P + D + D + D + D + D + D
P + P + D + D + D + D + D + D
P + P + D + D + X

One single 128k volblock needs 45 4k sectors.
128 / (45 * 4) * 100 = 71.111%

But even in a fantasy universe without that padding, you would not get 75%.
128 / (44 * 4) * 100 = 72.727%

Why?
Because your last stripe is not optimal in size and needs 8k parity just to store 8k data.
So your last stripe is only 50% efficient instead of 75%.
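To put numbers on that, here is the same arithmetic as a small shell snippet, once with and once without the padding sector, next to the ideal 6-of-8 ratio. This is just the calculation from above, nothing measured:

awk 'BEGIN {
    padded = 5*8 + 4 + 1   # five full stripes + short stripe + 1 padding sector = 45
    nopad  = 5*8 + 4       # same layout without the padding sector = 44
    printf "with padding:    %.3f %%\n", 128 / (padded * 4) * 100
    printf "without padding: %.3f %%\n", 128 / (nopad  * 4) * 100
    printf "ideal 6 of 8:    %.3f %%\n", 6 / 8 * 100
}'
# with padding:    71.111 %
# without padding: 72.727 %
# ideal 6 of 8:    75.000 %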

Damn, I love ZFS :) But it is hard to get that across. That is why I wrote that GitHub thing.
I wasted a lot of time on it, but I can't really think of a way to explain it better.
 