Confused About my ZFS Storage

knewman

Member
Dec 30, 2021
Hello, I'm hoping someone more knowledgeable than me can help me understand how much space I actually have left on my Proxmox ZFS pool.

If I do a
Code:
zpool list -v

I get an output saying my pool is 3.62T in size with 2.16T allocated and 1.47T free. Seems great!
Code:
root@hv2:~# zpool list -v
NAME                                                                                                                                            SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
hv2-zpool0                                                                                                                                     3.62T  2.16T  1.47T        -         -    40%    59%  1.00x    ONLINE  -
  raidz1-0                                                                                                                                     3.62T  2.16T  1.47T        -         -    40%  59.5%      -    ONLINE
    nvme-INTEL_SSDPE2MX800G4M___________118000178_CVPD642100VL800U                                                                              745G      -      -        -         -      -      -      -    ONLINE
    nvme-INTEL_SSDPE2MX800G4M___________118000178_CVPD6421014Y800U                                                                              745G      -      -        -         -      -      -      -    ONLINE
    nvme-nvme.8086-43565044363432313030375038303055-494e54454c205353445045324d5838303047344d2020202020202020202020313138303030313738-00000001   745G      -      -        -         -      -      -      -    ONLINE
    nvme-INTEL_SSDPE2MX800G4M___________118000178_CVPD64220032800U                                                                              745G      -      -        -         -      -      -      -    ONLINE
    nvme-INTEL_SSDPE2MX800G4M___________118000178_CVPD642100DU800U                                                                              745G      -      -        -         -      -      -      -    ONLINE

However, if I check the ZFS section in the web UI for this hypervisor, it reads that my total pool size is 3.99T with 2.37T allocated and 1.61T free. Even better!

1778537488897.png

However... if I click the same volume in the left-side nav pane, it reads a total pool size of 3.08T with 2.91T allocated o_O

proxmox.png

Which is accurate, and where's this discrepancy coming from?

Thanks!
 
Those views basically show the same things as zpool list and zfs list: one reports the raw pool capacity, the other the usable space after parity. RAIDZ has padding overhead and the GUI also mixes different units (TB vs TiB), and that's where the discrepancy comes from.
This was talked about here:
- https://forum.proxmox.com/threads/raidz1-shows-wrong-space.125736/
- https://forum.proxmox.com/threads/zfs-proxmox-disk-size-weirdness.168177/
You'll find more if you search for RAIDZ padding. There are also resize/expand/rewrite shenanigans which can make this very confusing, especially once you put ZVOLs into the mix.
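If you want to compare the two views from the CLI yourself, a quick sketch (pool name taken from your output, exact numbers will differ):
Code:
# raw capacity of all disks, before parity is subtracted
zpool list hv2-zpool0

# space as the filesystems see it, after parity
zfs list -o space hv2-zpool0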
 
That is why everybody will recommend mirrors instead of RAIDZ for block storage on hypervisors.
RAIDZ can be good for files, but it is not good for block storage.

Short overview of how this works:
A 4k sector size is the default, so you will write 4k sectors onto your drives.

Imagine a 2GB ISO file you try to save.
It is saved on a dataset with the default 128k recordsize.
128k can be saved without a problem.
Parity + Data + Data + Data + Data is one single stripe.
We see that each stripe has 4 chunks of data and one parity.
4 chunks = 4 times 4k = 16k of data per stripe.
128k / 16k = 8 stripes needed to save a 128k record.
This works out perfectly well.
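You can check both of those defaults on your own pool; a small sketch (pool name from your output, values may differ on your setup):
Code:
zpool get ashift hv2-zpool0       # 12 means 2^12 = 4k sectors
zfs get recordsize hv2-zpool0     # 128K is the dataset default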

Now, to your VMs. These are stored as RAW disks.
And these RAW disks live in ZVOLs with a 16k volblocksize.
Let's see how we store 16k chunks.
Parity + Data + Data + Data + Data is one single stripe.
Looks great! Each SSD is used.
But there is a catch. Padding!
To avoid unusable empty blocks, ZFS needs each allocation to span a multiple of 1 + parity sectors.
Your parity is 1, so you need a multiple of 2 sectors.
But this (Parity + Data + Data + Data + Data) is 5 sectors.
So we add a padding block.
Parity + Data + Data + Data + Data + Padding is 6 sectors. That is a multiple of 2. Great.
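To see which volblocksize your VM disks actually use, you can query the ZVOLs directly (vm-100-disk-0 is just an example name, list your real ones first):
Code:
zfs list -t volume                              # show all ZVOLs in the pool
zfs get volblocksize hv2-zpool0/vm-100-disk-0   # e.g. 16K, as assumed above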

Now the efficiency changed.
You expected 4 (total drives minus parity) / 5 (total drives) * 100 = 80% efficiency.
That is how it works in a traditional RAID.
But what you get is 6 sectors needed for one stripe that contains 4 sectors of data.
4 / 6 = about 66%.

The mean part? ZFS will not directly show you this. It calculates the total storage based on the assumption that you only store 128k records. What you see is typical for the padding problem: a 1TB VM disk can use more than 1TB on your ZFS because of padding and pool geometry.
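You can see this effect on a real ZVOL by comparing the logical size of the disk with what it actually occupies in the pool (again, vm-100-disk-0 is just an example name, and the gap only really shows once the disk is mostly written and not heavily compressed):
Code:
zfs get volsize,logicalused,used hv2-zpool0/vm-100-disk-0
# used can grow noticeably beyond what volsize suggests,
# because every 16k block takes 6 raw sectors instead of the 5 the space accounting assumes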

So you got about 66% efficiency. Not much better than a mirror's 50%. I recommend you switch to mirrors.
Or shrink your block storage usage by offloading files from your VMs to datasets, as sketched below.
This is good practice in general, because storing stuff in block storage comes with many downsides and should only be done if really needed.
Like for the VMs themselves, but not for the files they store.
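The dataset route could look roughly like this (dataset name and the share mechanism are just examples, not a full how-to):
Code:
zfs create hv2-zpool0/files    # a normal dataset, stored as 128k records, no padding penalty
# then expose it to the guests, e.g. via an NFS/SMB share, instead of growing their virtual disks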

If you are interested in ZFS and how it stores data:
https://github.com/jameskimmel/opinions_about_tech_stuff/blob/main/ZFS/The problem with RAIDZ.md
 