ZFS + PVE Disk usage makes no sense

KevinPeters

New Member
Sep 22, 2022
Hi, I am new to Proxmox and migrating from Hyper-V has been something of a pain. I am getting extremely frustrated and considering moving back to Hyper-V. Please, please, please help me understand.

I have 3 x 12TB disks, set up in a RAIDZ-1 array. Assuming I lose one of these disks to parity, that should leave me with 23.8TB of formatted, real, usable capacity.

root@abe:~# zpool status
  pool: zfsdata
 state: ONLINE
config:

        NAME                                    STATE     READ WRITE CKSUM
        zfsdata                                 ONLINE       0     0     0
          raidz1-0                              ONLINE       0     0     0
            ata-WDC_WD120EMFZ-11A6JA0_XJG004GM  ONLINE       0     0     0
            ata-WDC_WD120EMAZ-11BLFA0_5PGW3M9E  ONLINE       0     0     0
            ata-WDC_WD120EDBZ-11B1HA0_5QG4TBGF  ONLINE       0     0     0

In the Proxmox GUI, it correctly shows 23.83 TB of space in the zfsdata pool.

In this zpool I have some smaller disk images and 3 large ones. One is 10TB, one is 1TB and the other is 2.5TB (shown below as using 13.3T, 1.36T and 3.33T):
root@abe:~# zfs list
NAME                    USED  AVAIL  REFER  MOUNTPOINT
zfsdata                18.3T  3.36T   128K  /zfsdata
zfsdata/vm-100-disk-0  43.7G  3.39T  5.71G  -
zfsdata/vm-101-disk-0  13.3T  8.17T  8.51T  -
zfsdata/vm-101-disk-1  43.7G  3.38T  20.9G  -
zfsdata/vm-101-disk-2  3.33M  3.36T   176K  -
zfsdata/vm-101-disk-3  1.36T  3.82T   925G  -
zfsdata/vm-102-disk-0  3.33M  3.36T   229K  -
zfsdata/vm-102-disk-1  43.7G  3.38T  15.1G  -
zfsdata/vm-102-disk-3  3.33T  4.64T  2.05T  -
zfsdata/vm-103-disk-0  3.33M  3.36T   144K  -
zfsdata/vm-103-disk-1  7.33M  3.36T  90.6K  -
zfsdata/vm-103-disk-2   175G  3.51T  16.2G  -

I have no snapshots:
root@abe:~# zfs list -t snapshot
no datasets available

By my calculations, I have used ~14TB of my 23.8TB capacity, so I should have about 9.8TB left.

However, the GUI and the "zfs list" above are showing that I only have about 3.5TB of space left. How can this possibly be? Where has the missing 6TB gone? I've already given up an entire disk for parity (as expected), so it can't possibly be more parity.

Any help gratefully received.
 
Please search the forum for "padding overhead".

With the default ashift=12 + volblocksize=8K and 3x 12TB disks in raidz1, you only get about 14.4TB of usable storage for VM disks:

3x 12TB = 36TB raw storage
-12TB parity data (-33%) = 24TB usable storage
Everything written to a zvol will be ~133% of its size because of padding overhead, since your default volblocksize is too small for this layout, so you indirectly lose another ~17% of your raw storage = 18TB left (you can check your current ashift and volblocksize as shown below)
ZFS also needs about 20% of the storage to stay free for optimal operation, so you lose another 20% of that 18TB and end up with 14.4TB.
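You can check what your pool and zvols are actually using with something like this (the zvol name is just one taken from your listing above):

# pool-wide ashift (12 = 4K sectors)
zpool get ashift zfsdata
# volblocksize of an existing VM disk (fixed when the zvol was created)
zfs get volblocksize zfsdata/vm-101-disk-0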

You could increase the volblocksize to 16K and destroy and recreate your zvols (for example by backing up and restoring your VMs, or by doing a migration); then you would get 19.2TB (24TB minus the 20% that should be kept free) of real usable storage. The downside of course is that all workloads doing reads/writes smaller than 16K (like PostgreSQL) will perform terribly, as ZFS could only work with 16K blocks, so any 4K or 8K IO would cause 16K reads/writes.
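In PVE the block size for newly created zvols is a property of the ZFS storage entry (Datacenter -> Storage), so it can also be changed on the CLI; a sketch, with <storageid> standing in for whatever your ZFS storage is called (existing zvols keep their old volblocksize):

# default volblocksize for new VM disks created on this storage
pvesm set <storageid> --blocksize 16k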

And you might also want to run zfs list -o space and look at the refreservation values to check whether discard/TRIM is working; if it isn't, that can also prevent ZFS from freeing up space.
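For example (again using one of the zvol names from your listing):

# breaks USED down into snapshots, the dataset itself, refreservation and children
zfs list -o space zfsdata
# a refreservation other than "none" means the zvol is thick-provisioned
zfs get refreservation zfsdata/vm-101-disk-0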
 
Thank you Dunuin, much appreciated. I will research accordingly. That's an awful lot of disk space lost!
 
Thanks both for your support; the linked page has helped my understanding. It seems like I have selected just about the worst configuration in terms of space lost. The trouble is, it's taken me a week to transfer the data from NTFS and VHDXs (just on my little home server), and I don't have additional disks to transfer to a new zpool. I originally wanted to do a 4-disk pool but I couldn't afford 4x 12TB drives. I have some thinking to do!
 
A 3-disk raidz1 isn't that bad in general. It really depends on your workload. It would be a totally fine choice for cold storage of medium to big files. It's just bad that you created it with an 8K volblocksize, as this can only be set at creation of the zvols.

Configuration | Real usable capacity for zvols | IOPS performance | Throughput performance (read / write) | Disks allowed to fail | Resilvering time | Expandability
3x 12TB raidz1 (@ashift=12; volblocksize=8K) | 14.4 TB (40%) | 1x | 2x / 2x | 1 | bad | add 3/6/9 more HDDs
3x 12TB raidz1 (@ashift=12; volblocksize=16K) | 19.2 TB (53%) | 1x | 2x / 2x | 1 | bad | add 3/6/9 more HDDs
4x 12TB raidz1 (@ashift=12; volblocksize=8K) | 19.2 TB (40%) | 1x | 3x / 3x | 1 | worse | add 4/8/12 more HDDs
4x 12TB raidz1 (@ashift=12; volblocksize=16K) | 25.3 TB (53%) | 1x | 3x / 3x | 1 | worse | add 4/8/12 more HDDs
4x 12TB raidz1 (@ashift=12; volblocksize=64K) | 28 TB (58%) | 1x | 3x / 3x | 1 | worse | add 4/8/12 more HDDs
4x 12TB raidz2 (@ashift=12; volblocksize=8K) | 12.7 TB (26%) | 1x | 2x / 2x | 2 | worse | add 4/8/12 more HDDs
4x 12TB raidz2 (@ashift=12; volblocksize=16K) | 16.9 TB (35%) | 1x | 2x / 2x | 2 | worse | add 4/8/12 more HDDs
4x 12TB striped mirror aka raid10 (@ashift=12; volblocksize=8K) | 19.2 TB (40%) | 2x | 4x / 2x | 1-2 | good | add 2/4/6 more HDDs

The new draid might also be an alternative to a raidz1/2/3 to get a better resilvering time. You want a short resilvering time because it can take days or weeks to rebuild the pool after you replace a disk, and while the resilvering is running the HDDs are under high stress, so it's more likely that another disk will then fail. And while the resilvering is running, your pool is basically useless because it's too slow.

And to speed up the pool, it's also a good idea to add 2 or 3 small SSDs in a mirror (2x 240GB enterprise SSDs should be fine for a 48TB raidz1) as a special metadata device. This can compensate a bit for the bad IOPS performance of HDDs, as the HDDs then only have to store the data and not data+metadata. So the HDDs get hit by less IO.
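A sketch of what adding such a mirror as a special vdev could look like, with placeholder device paths (keep in mind the special vdev is pool-critical: losing it means losing the whole pool, and zpool may ask for -f because the mirror doesn't match the raidz1 redundancy type):

# add a mirrored special (metadata) vdev to the existing pool
zpool add zfsdata special mirror /dev/disk/by-id/ata-SSD_1 /dev/disk/by-id/ata-SSD_2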
 
And I highly recommend that you set a quota of 90%. You should monitor your pool and keep the used space below 80%, or the pool will become slow and fragment faster, which is bad because ZFS is a copy-on-write filesystem that can't be defragmented (the only way to "defrag" it is to move the data off the pool and write it back). And when the pool hits 100% it becomes inoperable, and you may not even be able to delete anything to free it up: because it's copy-on-write, you need free space to write metadata in order to delete data. So it's a good idea to set a 90% quota, which prevents the pool from accidentally being filled beyond 90%, so this worst case can't happen. If the pool reports 24TB of usable storage, you might want to run something like this: zfs set quota=21.6T zfsdata
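You can then verify the quota and the current usage with:

zfs get quota,used,available zfsdata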
 
Hi Dunuin, one more question. I realise now that the volblocksize can be set on the fly but will only affect new volumes (virtual disks) within the pool. Therefore, at least for the two smaller virtual disks, I could do the following:
1. Change the block size of the pool (Datacenter -> Storage -> myzfsdata -> Edit: Block Size = 64k).
2. Create a new virtual disk.
3. Attach the new disk to the VM.
4. Within the VM, mount the new disk and cp/rsync the files from the old 8K virtual disk to the new 64K virtual disk.
5. Delete the old virtual disk.

All of this could be done without stopping the VM. My understanding is that for this new volume/virtual disk the volblocksize would also be 64k. Is that correct? Or do I need to specify the volblocksize when I create the volume? (I can't see an option for this in the GUI). Is there a better way?
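For reference, I assume I could verify the result after step 2 by checking the new zvol directly, something like this (the disk name is just my guess at what Proxmox would create next):

zfs get volblocksize zfsdata/vm-101-disk-4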

Thanks, Kevin
 
