ZFS + PVE Disk usage makes no sense

KevinPeters

New Member
Sep 22, 2022
Hi, I am new to Proxmox and migrating from Hyper-V has been something of a pain. I am getting extremely frustrated and considering moving back to Hyper-V. Please, please, please help me understand.

I have 3 x 12TB disks, set up in a RAIDZ-1 array. Assuming I lose one of these disks to parity, that should leave me with 23.8TB of formatted, real, usable capacity.

root@abe:~# zpool status
  pool: zfsdata
 state: ONLINE
config:

        NAME                                    STATE     READ WRITE CKSUM
        zfsdata                                 ONLINE       0     0     0
          raidz1-0                              ONLINE       0     0     0
            ata-WDC_WD120EMFZ-11A6JA0_XJG004GM  ONLINE       0     0     0
            ata-WDC_WD120EMAZ-11BLFA0_5PGW3M9E  ONLINE       0     0     0
            ata-WDC_WD120EDBZ-11B1HA0_5QG4TBGF  ONLINE       0     0     0

In the Proxmox GUI, it correctly shows 23.83 TB of space in the zfsdata pool.

In this zpool I have some smaller disk images and 3 large ones. One is 10TB, one is 1TB and the other is 2.5TB (shown below as using 13.3T, 1.36T and 3.33T):
root@abe:~# zfs list
NAME                    USED  AVAIL  REFER  MOUNTPOINT
zfsdata                18.3T  3.36T   128K  /zfsdata
zfsdata/vm-100-disk-0  43.7G  3.39T  5.71G  -
zfsdata/vm-101-disk-0  13.3T  8.17T  8.51T  -
zfsdata/vm-101-disk-1  43.7G  3.38T  20.9G  -
zfsdata/vm-101-disk-2  3.33M  3.36T   176K  -
zfsdata/vm-101-disk-3  1.36T  3.82T   925G  -
zfsdata/vm-102-disk-0  3.33M  3.36T   229K  -
zfsdata/vm-102-disk-1  43.7G  3.38T  15.1G  -
zfsdata/vm-102-disk-3  3.33T  4.64T  2.05T  -
zfsdata/vm-103-disk-0  3.33M  3.36T   144K  -
zfsdata/vm-103-disk-1  7.33M  3.36T  90.6K  -
zfsdata/vm-103-disk-2   175G  3.51T  16.2G  -

I have no snapshots:
root@abe:~# zfs list -t snapshot
no datasets available

By my calculations, I have used ~14TB of my 23.8TB capacity, so I should have about 9.8TB left.

However, the GUI and the "zfs list" above are showing that I only have about 3.5TB of space left. How can this possibly be? Where has the missing 6TB gone? I've already given up an entire disk for parity (as expected), so it can't possibly be more parity.

Any help gratefully received.
 
Please search the forum for "padding overhead".

With the default ashift=12 + volblocksize=8K and 3x 12TB disks in raidz1, you only get about 14.4TB of usable storage for VM disks:

3x 12TB = 36TB raw storage
-12TB parity data (-33%) = 24TB usable storage
Everything written to a zvol will be ~133% of its size because of padding overhead, since your default volblocksize is too small for this layout, so you indirectly lose another ~17% of your raw storage = 18TB left (you can check your current ashift and volblocksize as shown below)
ZFS also needs about 20% of the storage to stay free for optimal operation, so you lose another 20% of that 18TB and end up with 14.4TB.
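You can check what your pool and zvols are actually using with something like this (the zvol name is just one taken from your listing above):

# pool-wide ashift (12 = 4K sectors)
zpool get ashift zfsdata
# volblocksize of an existing VM disk (fixed when the zvol was created)
zfs get volblocksize zfsdata/vm-101-disk-0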

You could increase the volblocksize to 16K and destroy and recreate your zvols (for example by backing up and restoring your VMs, or by doing a migration); then you would get 19.2TB (24TB minus the 20% that should be kept free) of real usable storage. The downside of course is that all workloads doing reads/writes smaller than 16K (like PostgreSQL) will perform terribly, as ZFS could only work with 16K blocks, so any 4K or 8K IO would cause 16K reads/writes.
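In PVE the block size for newly created zvols is a property of the ZFS storage entry (Datacenter -> Storage), so it can also be changed on the CLI; a sketch, with <storageid> standing in for whatever your ZFS storage is called (existing zvols keep their old volblocksize):

# default volblocksize for new VM disks created on this storage
pvesm set <storageid> --blocksize 16k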

And you might also want to run zfs list -o space and look at the refreservation values to check whether discard/TRIM is working; if it isn't, that can also prevent ZFS from freeing up space.
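For example (again using one of the zvol names from your listing):

# breaks USED down into snapshots, the dataset itself, refreservation and children
zfs list -o space zfsdata
# a refreservation other than "none" means the zvol is thick-provisioned
zfs get refreservation zfsdata/vm-101-disk-0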
 
Thank you Dunuin, much appreciated. I will research accordingly. That's an awful lot of disk space lost!
 
Thanks both for your support; the linked page has helped my understanding. It seems like I have selected just about the worst configuration in terms of space lost. The trouble is, it's taken me a week to transfer the data from NTFS and VHDXs (just on my little home server), and I don't have additional disks to transfer to a new zpool. I originally wanted to do a 4-disk pool but I couldn't afford 4x 12TB drives. I have some thinking to do!
 
A 3-disk raidz1 isn't that bad in general. It really depends on your workload. It would be a totally fine choice for cold storage of medium to big files. It's just bad that you created it with an 8K volblocksize, as this can only be set at creation of the zvols.

Configuration | Real usable capacity for zvols | IOPS performance | Throughput performance (read / write) | Disks allowed to fail | Resilvering time | Expandability
3x 12TB raidz1 (@ashift=12; volblocksize=8K) | 14.4 TB (40%) | 1x | 2x / 2x | 1 | bad | add 3/6/9 more HDDs
3x 12TB raidz1 (@ashift=12; volblocksize=16K) | 19.2 TB (53%) | 1x | 2x / 2x | 1 | bad | add 3/6/9 more HDDs
4x 12TB raidz1 (@ashift=12; volblocksize=8K) | 19.2 TB (40%) | 1x | 3x / 3x | 1 | worse | add 4/8/12 more HDDs
4x 12TB raidz1 (@ashift=12; volblocksize=16K) | 25.3 TB (53%) | 1x | 3x / 3x | 1 | worse | add 4/8/12 more HDDs
4x 12TB raidz1 (@ashift=12; volblocksize=64K) | 28 TB (58%) | 1x | 3x / 3x | 1 | worse | add 4/8/12 more HDDs
4x 12TB raidz2 (@ashift=12; volblocksize=8K) | 12.7 TB (26%) | 1x | 2x / 2x | 2 | worse | add 4/8/12 more HDDs
4x 12TB raidz2 (@ashift=12; volblocksize=16K) | 16.9 TB (35%) | 1x | 2x / 2x | 2 | worse | add 4/8/12 more HDDs
4x 12TB striped mirror aka raid10 (@ashift=12; volblocksize=8K) | 19.2 TB (40%) | 2x | 4x / 2x | 1-2 | good | add 2/4/6 more HDDs

The new draid might also be an alternative to a raidz1/2/3 to get a better resilvering time. You want a short resilvering time because it can take days or weeks to rebuild the pool after you replace a disk, and while the resilvering is running the HDDs are under high stress, so it's more likely that another disk will then fail. And while the resilvering is running, your pool is basically useless because it's too slow.

And to speed up the pool, it's also a good idea to add 2 or 3 small SSDs in a mirror (2x 240GB enterprise SSDs should be fine for a 48TB raidz1) as a special metadata device. This can compensate a bit for the bad IOPS performance of HDDs, as the HDDs then only have to store the data and not data+metadata. So the HDDs get hit by less IO.
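A sketch of what adding such a mirror as a special vdev could look like, with placeholder device paths (keep in mind the special vdev is pool-critical: losing it means losing the whole pool, and zpool may ask for -f because the mirror doesn't match the raidz1 redundancy type):

# add a mirrored special (metadata) vdev to the existing pool
zpool add zfsdata special mirror /dev/disk/by-id/ata-SSD_1 /dev/disk/by-id/ata-SSD_2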
 
And I highly recommend that you set a quota of 90%. You should monitor your pool and keep the used space below 80%, or the pool will become slow and fragment faster, which is bad because ZFS is a copy-on-write filesystem that can't be defragmented (the only way to "defrag" it is to move the data off the pool and write it back). And when the pool hits 100% it becomes inoperable, and you may not even be able to delete anything to free it up: because it's copy-on-write, you need free space to write metadata in order to delete data. So it's a good idea to set a 90% quota, which prevents the pool from accidentally being filled beyond 90%, so this worst case can't happen. If the pool reports 24TB of usable storage, you might want to run something like this: zfs set quota=21.6T zfsdata
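You can then verify the quota and the current usage with:

zfs get quota,used,available zfsdata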
 
Hi Dunuin, one more question. I realise now that the volblocksize can be set on the fly but will only affect new volumes (virtual disks) within the pool. Therefore, at least for the two smaller virtual disks, I could do the following:
1. Change the block size of the pool (Datacenter -> Storage -> myzfsdata -> Edit: Block Size = 64k).
2. Create a new virtual disk.
3. Attach the new disk to the VM.
4. Within the VM, mount the new disk and cp/rsync the files from the old 8K virtual disk to the new 64K virtual disk.
5. Delete the old virtual disk.

All of this could be done without stopping the VM. My understanding is that for this new volume/virtual disk the volblocksize would also be 64k. Is that correct? Or do I need to specify the volblocksize when I create the volume? (I can't see an option for this in the GUI). Is there a better way?
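For reference, I assume I could verify the result after step 2 by checking the new zvol directly, something like this (the disk name is just my guess at what Proxmox would create next):

zfs get volblocksize zfsdata/vm-101-disk-4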

Thanks, Kevin
 
