ZFS storage is very full, we would like to increase the space but...

cybermod

Active Member
Sep 21, 2019
Hi, I've read a lot of opinions here on the forum, but I wanted to share my case anyway.
I have a ZFS RAID. The disks are exposed in JBOD format. I wanted to add two disks to increase the space, but I understand that the wisest option, considering ZFS, is to destroy the datastore and recreate it.
Is this really the best option?

Code:
root@pve:~# zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
DATASTORE  5.23T  2.09T  3.14T        -         -    38%    40%  1.00x    ONLINE  -
root@pve:~# zpool status
  pool: DATASTORE
 state: ONLINE
  scan: scrub repaired 0B in 01:12:36 with 0 errors on Sun Nov  9 01:36:38 2025
config:

        NAME                                                      STATE     READ WRITE CKSUM
        DATASTORE                                                 ONLINE       0     0     0
          raidz1-0                                                ONLINE       0     0     0
            ata-MTFDDAK1T9QDE-2AV1ZF_02JG916D7A50537LEN_2DD52E8D  ONLINE       0     0     0
            ata-MTFDDAK1T9QDE-2AV1ZF_02JG916D7A50537LEN_2DD52E89  ONLINE       0     0     0
            ata-MTFDDAK1T9QDE-2AV1ZF_02JG916D7A50537LEN_2DD526D2  ONLINE       0     0     0

errors: No known data errors

I did some testing on a lab PVE. I got some additional space, but it's like creating a second raid that's added to the existing one (I'm not sure if I'm explaining myself, I'm not very technical, so I apologize).

The need arose from an incorrect space estimate and a lack of knowledge about ZFS's RAM requirements.

We were considering destroying the RAID array, recreating it in hardware mode by adding the two new disks, formatting it, and restoring all the VMs using Proxmox Backup.

We also have the enterprise license, but I think that discussing these things on the forum is more helpful, as it helps to keep the topics fresh over time.

In any case, as always, your opinions/advice are greatly appreciated.
 
That ZFS pool is not a JBOD, but a ZFS raid-z1 with one of the three SATA disks as redundancy. With the current ZFS versions on Proxmox 9, you can safely add new disks to this setup, provided that they are of the same size as the existing disks. Existing data will still be spread over only 3 disks, so you do not get the same 80% net storage capacity from the get-go (especially if the old pool were nearly full, which it isn't). Unlike with a normal RAID5 expansion, ZFS does not have to redistribute the data immediately.

Newly written data (that includes modified old data) will spread over all 5 disks, though.

As always with these kinds of operations: Backup first!
 
Last edited:
  • Like
Reactions: cybermod
That ZFS pool is not a JBOD, but a ZFS raid-z1 with one of the three SATA disks as redundancy. With the current ZFS versions on Proxmox 9, you can safely add new disks to this setup, provided that they are of the same size as the existing disks. Existing data will still be spread over only 3 disks, so you do not get the same 80% net storage capacity from the get-go (especially if the old pool were nearly full, which it isn't). Unlike with a normal RAID5 expansion, ZFS does not have to redistribute the data immediately.

Newly written data (that includes modified old data) will spread over all 5 disks, though.

As always with these kinds of operations: Backup first!
I didn't express myself well; I tried to provide as much information as possible.
I'll start again:
The disks are in the server, but no hardware RAID has been created on them, because we know that ZFS doesn't support hardware RAID.

The Proxmox version is 8.4.1.
Are the operations to enlarge the storage done via GUI or shell?

Thanks for specifying that the disks must be identical. I assumed that, but it's not good to assume that.

So, if all goes well, I can expect something like this:

Code:
NAME                                                      STATE     READ WRITE CKSUM
        DATASTORE                                                 ONLINE       0     0     0
          raidz1-0                                                ONLINE       0     0     0
            ata-MTFDDAK1T9QDE-2AV1ZF_02JG916D7A50537LEN_2DD52E8D  ONLINE       0     0     0
            ata-MTFDDAK1T9QDE-2AV1ZF_02JG916D7A50537LEN_2DD52E89  ONLINE       0     0     0
            ata-MTFDDAK1T9QDE-2AV1ZF_02JG916D7A50537LEN_2DD526D2  ONLINE       0     0     0
            ata-MTFDDAK1T9QDE-2AV1ZF_02JG916D7A50537LEN_NEWDISK1  ONLINE       0     0     0
            ata-MTFDDAK1T9QDE-2AV1ZF_02JG916D7A50537LEN_NEWDISK2  ONLINE       0     0     0



"As always with these kinds of operations: Backup first!"
amen! certainly!! ;)
 
Yes. However, I do not know exactly whether Proxmox 8.4.1 already has the ZFS support to enlarge raidz1 vdevs; it was added only in ZFS 2.3, AFAIR.

The release notes say that ZFS is only at 2.2.7 with PVE 8.4: https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_8.4

So, I guess you should first upgrade to PVE 9.

You will have to physically install the disks; they will most probably be detected (even when hot-added).

You will have to find them in /dev/disk/by-id/. You will find the three existing disks alongside the new ones.

The command to attach each new disk to the existing raidz1 vdev is (one disk at a time; let the expansion finish before attaching the next one):
Code:
zpool attach DATASTORE raidz1-0 /dev/disk/by-id/ata....
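For example, the whole procedure could look roughly like this (the disk ids are placeholders; check zpool status and let the first expansion finish before attaching the second disk):
Code:
# find the ids of the new disks
ls -l /dev/disk/by-id/ | grep ata
# attach the first new disk to the existing raidz1 vdev
zpool attach DATASTORE raidz1-0 /dev/disk/by-id/ata-NEWDISK1
# watch the expansion progress
zpool status DATASTORE
# once it is done, attach the second disk the same way
zpool attach DATASTORE raidz1-0 /dev/disk/by-id/ata-NEWDISK2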
 
Last edited:
  • Like
Reactions: cybermod
I would recommend you to do
- Backup the VMs
- Backup your Proxmox settings
- destroy, reinstall Proxmox
- use mirrors!
- reimport the VMs

I would not use 5 drives as a RAIDZ1. With the default 16k volblocksize you only get 66% instead of your expected 80% storage efficiency. That is not much better than mirrors with 50%.
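For example, a striped-mirror pool from four of the five disks could look roughly like this (device names are placeholders; the fifth disk could serve as a hot spare):
Code:
zpool create -o ashift=12 DATASTORE \
  mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
  mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4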

Thanks for specifying that the disks must be identical. I assumed that, but it's not good to assume that.
They don't have to be identical. It is recommended, but not required.
 
Yes. However, I do not know exactly whether Proxmox 8.4.1 already has the ZFS support to enlarge raidz1 vdevs; it was added only in ZFS 2.3, AFAIR.

The release notes say that ZFS is only at 2.2.7 with PVE 8.4: https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_8.4

So, I guess you should first upgrade to PVE 9.

You will have to physically install the disks; they will most probably be detected (even when hot-added).

You will have to find them in /dev/disk/by-id/. You will find the three existing disks alongside the new ones.

The command to attach each new disk to the existing raidz1 vdev is (one disk at a time; let the expansion finish before attaching the next one):
Code:
zpool attach DATASTORE raidz1-0 /dev/disk/by-id/ata....
I will do further testing in a virtual environment for the sake of learning the steps better.
I had already tried something, but in the meantime, thank you for the advice.
I would recommend you to do
- Backup the VMs
- Backup your Proxmox settings
- destroy, reinstall Proxmox
- use mirrors!
- reimport the VMs

I would not use 5 drives as a RAIDZ1. With the default 16k volblocksize you only get 66% instead of your expected 80% storage efficiency. That is not much better than mirrors with 50%.


They don't have to be identical. It is recommended, but not required.
- Backup: certainly
- Backup your Proxmox settings: any advice on how to do this intelligently? Perhaps with a cron script that copies the configurations to a NAS drive (something like the sketch below)?
- Destroy, reinstall Proxmox: WHY? I have Proxmox installed on two other disks in RAID1 (from the controller), so I can manage the datastore without breaking Proxmox. Explain why you recommend this, I'm curious.
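To make the cron idea concrete, something like this sketch is what I had in mind (paths and schedule are just an example, assuming the NAS share is already mounted):
Code:
# /etc/cron.d/pve-config-backup
30 2 * * * root tar czf /mnt/nas/pve-config-$(date +\%F).tar.gz /etc/pve /etc/network/interfaces /etc/hosts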


I would not use 5 drives as a RAIDZ1. With the default 16k volblocksize you only get 66% instead of your expected 80% storage efficiency. That is not much better than mirrors with 50%.
Forgive me, but I'm not an expert.
Are you telling me that ZFS with five disks doesn't perform well or doesn't offer me the same amount of space?
Can you tell me more, explaining it as if to a child?
 
A raid mirror usually has the read and write performance of a single disk (for a single thread - it can perform better with more threads if the data is distributed over more than one disk per mirror side). A raidz1 can perform much better on reads, like 4x when 5 disks are in use, because the data will be distributed over 4 disks plus 1 disk redundancy and all read operations can be done in parallel.

This is true if the data is evenly distributed, which your old data will not be initially after expansion, as I explained. New data will use all disks and because of the COW architecture of ZFS, modified old data will also be redistributed. The 66% storage efficiency is the one you reach now with 2 usable disks and 1 for parity data. With 5 disks, you would have 80% storage efficiency. Because you did not start out with the 5-disk layout, you will land in between 66% and 80%.

You can force a better efficiency by copying your old data, causing it to redistribute and hence, approach 80% efficiency.
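A rough sketch of one way to do that for a single VM disk (the zvol name is just an example; stop the VM first and make sure there is enough free space for the copy):
Code:
# rewrite a zvol so that its blocks are reallocated across all 5 disks
zfs snapshot DATASTORE/vm-100-disk-0@rebalance
zfs send DATASTORE/vm-100-disk-0@rebalance | zfs receive DATASTORE/vm-100-disk-0-new
zfs rename DATASTORE/vm-100-disk-0 DATASTORE/vm-100-disk-0-old
zfs rename DATASTORE/vm-100-disk-0-new DATASTORE/vm-100-disk-0
# after verifying that the VM still boots:
# zfs destroy -r DATASTORE/vm-100-disk-0-old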

A mirror obviously always has 50% efficiency which is not much less than 66%, but:

a. The real storage efficiency is likely way more than 66%, especially considering the fact that you only had 40% capacity used until now.
b. The performance on writes will be even between raidz1 and mirrored ZFS.
c. Read performance will be faster on raidz1.
d. You cannot build a mirror with 5 disks.
e. It also takes way more time and effort to rebuild the whole machine than to add two disks.

Therefore, I would not recommend a mirror setup in your situation.
 
  • Like
Reactions: UdoB
Of course I know that you (and most users here in the forum) know the facts, but this sentence:
d. You cannot build a mirror with 5 disks.
calls for "Mr. Obvious", stating "yes, you can!" ;-)

Code:
root@pnz:~# zpool create multimirror mirror sdc sdd sde sdf sdg

root@pnz:~# zpool status multimirror
  pool: multimirror
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        multimirror  ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sdc      ONLINE       0     0     0
            sdd      ONLINE       0     0     0
            sde      ONLINE       0     0     0
            sdf      ONLINE       0     0     0
            sdg      ONLINE       0     0     0

To be clear: the result is a pool with the capacity of (a little bit below) a single drive. Four drives may fail at the same time.
Another view:
Code:
root@pnz:~# zpool list multimirror -v
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
multimirror  11.5G   117K  11.5G        -         -     0%     0%  1.00x    ONLINE  -
  mirror-0   11.5G   117K  11.5G        -         -     0%  0.00%      -    ONLINE
    sdc      12.0G      -      -        -         -      -      -      -    ONLINE
    sdd      12.0G      -      -        -         -      -      -      -    ONLINE
    sde      12.0G      -      -        -         -      -      -      -    ONLINE
    sdf      12.0G      -      -        -         -      -      -      -    ONLINE
    sdg      12.0G      -      -        -         -      -      -      -    ONLINE
 
  • Like
Reactions: news and Johannes S
Obviously, I meant "not in a sensible way" - at least for what the OP asked for, namely to increase his storage by twice the size of one of his physical disks. Using 5 disks in a mirror setup either leaves one of them unused (so it does not increase the net size either) or just enhances redundancy, like you describe.

And because that has been said before: yes, you also can use different sizes of disks, in which case the combination of them will have the size of the smallest disk (striping aside, to be completely clear).
 
  • Like
Reactions: Johannes S and UdoB
Destroy, reinstall Proxmox: WHY? I have Proxmox installed on two other disks in RAID1 (from the controller), so I can manage the datastore without breaking Proxmox. Explain why you recommend this, I'm curious.
Sorry, I misunderstood. In that case you can leave it.
Forgive me, but I'm not an expert.
Are you telling me that ZFS with five disks doesn't perform well or doesn't offer me the same amount of space?
Can you tell me more, explaining it as if to a child?
Sure. Proxmox uses the good default of a 16k volblocksize.
That means that all your VMs' ZFS raw disks (zvols) are made up of 16k blocks.

Now let's look at how ZFS stores a 16k block.
You have RAIDZ1. The sector size of your SSDs is 4k.
So each 16k block is:
- one 4k parity sector
- four 4k data sectors = 16k data

So that is a 20k stripe to offer 16k of data. So far so good. But there is a problem:
ZFS does not want to leave gaps on disk that can never be filled.

Since ZFS does not use fixed-size stripes, we could potentially run into a situation where empty sectors sit in between data and are too small to ever be used. That is why ZFS reserves some padding to make such a situation impossible: ZFS requires every allocation to be a multiple of p+1 sectors, to prevent unusably small gaps on the disk.

p in your case is 1, because RAIDZ1. So p+1 = 2.
Now, your 20k stripe from above is 5 sectors.
Can you divide 5 sectors by 2? No.
So we need to add one 4k padding sector.
Now we have a 24k stripe of 6 sectors.
6 sectors can be divided by 2.
Otherwise we would have needed to add yet another one.

So now we used 24k of raw space to provide a 16k block. 66.66% efficiency instead of your expected 80%.
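If you want to play with the numbers yourself, here is a rough sketch of that calculation in shell (my own illustration; it only covers the simple case where all data sectors fit into one stripe row, which is true for a 16k volblock on your 5-disk RAIDZ1):
Code:
#!/bin/bash
# raw space needed for one volblock on RAIDZ1 with 4k sectors
volblock_k=16                                     # volblocksize in KiB
sector_k=4                                        # ashift=12 -> 4k sectors
parity=1                                          # RAIDZ1
data_sectors=$(( volblock_k / sector_k ))         # 4 data sectors
sectors=$(( data_sectors + parity ))              # plus 1 parity sector = 5
mult=$(( parity + 1 ))                            # allocations are padded to a multiple of p+1 = 2
padded=$(( (sectors + mult - 1) / mult * mult ))  # 5 -> 6 sectors
echo "raw: $(( padded * sector_k ))k for ${volblock_k}k of data"   # 24k for 16k
echo "efficiency: $(( 100 * data_sectors / padded ))%"             # 66%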

The mean part? You won't see this anywhere. You will just notice that a 100GB VM disk is not using only 100GB on your ZFS but 120GB.

If you are interested in the details https://github.com/jameskimmel/opinions_about_tech_stuff/blob/main/ZFS/The problem with RAIDZ.md#raidz1-with-5-drives

TLDR: Use mirrors for block storage like VMs.
 
Last edited:
  • Like
Reactions: Johannes S and UdoB
Sorry, but IMHO the article you cite is mostly incorrect or at least misleading in practice:

The article is irrelevant because its calculations are based on unrealistic assumptions, treating every write as a tiny 4 kB block and ignoring how ZFS actually combines multiple blocks into stripes.

Practical examples with a 5‑disk RAIDZ1 (1 parity disk):
  • 10 kB file: Smaller than a stripe (16 kB of data blocks). Some local padding occurs, but ZFS packs many small files together into stripes, so the overall pool efficiency is hardly affected.
  • 16 kB file: Fits exactly into a stripe. No padding is needed. Efficiency = classical 80 %.
  • 10,000 kB file: Spans many stripes. Padding per stripe is negligible relative to total size. Efficiency ≈ 80 %.
Conclusion:
  • Padding losses due to small volblock sizes exist only locally, for individual tiny files.
  • Even many small files are packed together, so the overall impact on usable capacity is minimal.
  • The overall usable capacity matches the classical formula (N−1)×disk size for RAIDZ1 (or (N−2)×disk size for RAIDZ2).
  • Therefore, the dramatic efficiency losses claimed in the article (e.g., 66 % for a RAIDZ1 5-disk array) are not realistic in practice.
If it were otherwise, and the numbers stated were realistic, I should see only 66% net capacity of my 8x18 TByte raidz2 array, which would yield 96 TByte. What I do actually see is 107 TByte, which is more in line with the expected 75% of 144 TByte.
 
  • Like
Reactions: Johannes S
The article is irrelevant because its calculations are based on unrealistic assumptions, treating every write as a tiny 4 kB block and ignoring how ZFS actually combines multiple blocks into stripes.
That is not what it does. It unrealistically assumes that a 16k volblock is not compressible and is therefore always a 16k write.
It does not assume every write to be a tiny 4k block; it assumes a 4k sector size.
Some local padding occurs, but ZFS packs many small files together into stripes, so the overall pool efficiency is hardly affected.
I did not know about that. Do you have a link for that?
Are you sure you are not mixing that up with datasets and files being cached in 5s TXG groups?
AFAIK for volblocks no such thing is possible.
  • 16 kB file: Fits exactly into a stripe. No padding is needed. Efficiency = classical 80 %.
Since stripe width is flexible for RAIDZ (not for dRAID) I would say "fit" is the wrong word.
A 4k file also fits exactly into a stripe. The stripe is just shorter.

But yeah, a 16k file would be a 20k stripe (16k data plus 4k parity).
A 20k stripe is 5 sectors and 80%. But 5 sectors is not a multiple of 2, so that does not work.
So even if you only had exactly 16k "files" (which would be true if you only have VMs that use the 16k volblocksize and you disabled compression),
every single 16k "file" would need 24k of raw space. So not 80% but 66%.
10,000 kB file: Spans many stripes.
For a dataset, yes.
But we are talking about VM disks. Unlike datasets which have a MAX recordsize, zvols have a STATIC volblocksize (by default 16k).
A 10000k file in a VM will be stored on the VM's 16k volblocksize zvol. So that file would consist of many 16k volblocks. Each volblock contains 16k of data, and storing each volblock needs a 24k stripe (10000k / 16k = 625 volblocks x 24k = 15000k of raw space). So not 80% but 66%.

This is btw why I am such a proponent of separating VM data and files!

For example, don't store the movies of a Jellyfin VM on the VM disk; use an NFS share pointing to a dataset for that.
Using blockstorage comes with huge disadvantages!
The static volblocksize instead of a maximum recordsize, with all its huge implications for pool geometry, padding, compression and so on. But you also can't easily back up the data from outside the VM; you can't simply access the files and rsync them or something like that.
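A minimal sketch of that approach (dataset name, network and options are just examples; the NFS server has to be installed on the host):
Code:
zfs create -o recordsize=1M -o compression=lz4 DATASTORE/media
zfs set sharenfs="rw=@192.168.1.0/24" DATASTORE/media
# inside the Jellyfin VM, mount it e.g. with:
# mount -t nfs <pve-host>:/DATASTORE/media /mnt/media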
 
Last edited:
  • Like
Reactions: Johannes S
Your clarifications are helpful, and I agree on several important points:
  • We are talking about zvols, not datasets. Zvols have a static volblocksize, and unlike datasets they cannot coalesce multiple logical blocks into a single larger record. TXG batching does not change the on-disk RAIDZ geometry of individual volblocks.
  • With RAIDZ, sector alignment matters. A 16 kB logical block plus parity can indeed result in padding in certain layouts, especially with 4 k sectors and incompressible data.
  • For VM workloads, block storage has real disadvantages compared to datasets, which is why separating VM disks from file data (e.g. media via NFS/SMB datasets) is good practice.
Where I still disagree with the article’s conclusion is the generalization of worst-case behavior:

The math assumes that every volblock is written as an incompressible, isolated unit that always triggers worst-case padding. While this can happen for individual blocks, it does not reflect real VM I/O patterns. Guest filesystems aggregate writes, metadata and zeroed blocks compress extremely well, and large files do not translate into random, independent 16 kB writes at the ZFS layer. As a result, padding losses do not accumulate uniformly across the pool.

So yes, RAIDZ + zvols + static volblocksize can be inefficient in specific cases, and this is a valid warning. But extrapolating that into a blanket claim of ~66 % pool efficiency is misleading and not representative of observed behavior on real systems.

In short:
The problem exists, the worst-case math is correct, but treating that worst case as the norm is not.

P.S.: The way I use HDD raidz1 (or in my case, raidz2) arrays, they are not strictly used for VM disks anyway, with the exception of some larger data-storage ones (for which the guest filesystems will take care of the write aggregation). Thus, the zpool carries not only VM volumes, but datasets as well - in my case, the PVE server also serves SMB shares. Considering this case shifts the storage efficiency even more to the "expected" value.
 
  • Like
Reactions: Johannes S
The math assumes that every volblock is written as an incompressible, isolated unit that always triggers worst-case padding.
That is correct, it is an oversimplification.
But it does not get that much better if the data actually is compressible.

What if a 16k block can be compressed to 4k?
Then you have one parity sector & one data sector = 50% efficiency.
Congrats, you are now down to mirror efficiency.

What if it can be compressed to 8k? Two data sectors plus one parity sector is 3 sectors, which is not a multiple of 2, so a padding sector gets added: 16k of raw space for 8k of data, again 50%. Still far away from the 80% you would expect from a 5 wide RAIDZ1.

What if it can be compressed to 12k? You have one parity sector plus three data sectors, 4 sectors total (a multiple of 2, so no padding needed), which is 75%. Better, but still not your naturally expected 80% for a 5 wide RAIDZ1.
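The same padding rule as a small shell loop, if you want to check these numbers (again my own sketch, assuming 4k sectors and that the data sectors fit into one stripe row):
Code:
parity=1; mult=$(( parity + 1 ))                    # RAIDZ1, pad to multiples of 2
for data_k in 4 8 12 16; do                         # compressed sizes of a 16k volblock
    sectors=$(( data_k / 4 + parity ))              # data sectors plus one parity sector
    padded=$(( (sectors + mult - 1) / mult * mult ))
    echo "${data_k}k -> $(( padded * 4 ))k raw ($(( 100 * data_k / 4 / padded ))% efficiency)"
done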

But extrapolating that into a blanket claim of ~66 % pool efficiency is misleading and not representative of observed behavior on real systems.
As the math shows above, it is not really misleading. You will get between 50% and 75% efficiency.
You will never be able to get your naturally expected 80%; it simply is not possible.
 
Last edited: