ZFS thin provisioning space usage discrepancies

mnih

Member
Feb 10, 2021
Hi,

I've got a Debian 11 VM which uses three virtio-SCSI disks of 20G/1T/3T size with discard enabled. The first is for the system, the second and third are for storage, all with ext4 file systems.

Code:
Filesystem              Size  Used Avail Use% Mounted on
/dev/sdc1                16G  6.3G  8.6G  43% /
/dev/sda1              1007G  268G  740G  27% /mnt/storage1
/dev/sdb1               3.0T  2.2T  789G  74% /mnt/storage2
...

All disk data is located as raw datasets (zvols) in a thin-provisioned ZFS raidz2 pool.

zfs list output:
Code:
NAME                             USED  AVAIL     REFER  MOUNTPOINT
vmpool/vm-107-disk-0            7.79G  3.30T     7.79G  -
vastank/bulkpool/vm-107-disk-0   350G  6.88T      350G  -
vastank/bulkpool/vm-107-disk-1  2.79T  6.88T     2.79T  -

Disk two shows 350G used/referenced while data usage inside the file system is only 268G; disk three shows 2.79T vs 2.2T.

I tried fstrim -av, writing zeros with dd and deleting the file afterwards, and sync flushes; nothing reduced the 350G ZFS dataset.
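For reference, this is roughly what I ran inside the guest (the zero-fill file path is just an example):

Code:
# trim all mounted filesystems that support discard
fstrim -av

# fallback: fill the free space with zeros, then delete the file and sync
# (example path on the second disk)
dd if=/dev/zero of=/mnt/storage1/zerofill bs=1M status=progress || true
rm /mnt/storage1/zerofill
sync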

Then I moved the second disk to another (also thin-provisioned ZFS) data pool (it's now called disk-1):
Code:
NAME                             USED  AVAIL     REFER  MOUNTPOINT
vmpool/vm-107-disk-1             266G  3.04T      266G  -

As you can see, the additional space was freed in the process.

I'm currently moving the data back to the initial pool (this will take some hours) to see if the usage stays low, to rule out that some pool difference I'm not aware of is causing the overhead.

The virtual disks also get replicated every minute to another node; on the target node they are oversized as well.

I'm asking myself if there is any way to reclaim the unused space without having to migrate the virtual disk between ZFS pools.
 
Search this forum for "padding overhead". You can't use any raidz1/2/3 with the default 8K volblocksize or you will get massive padding overhead, causing everything written to a zvol to consume more space. The smaller your volblocksize or the more disks your raidz1/2/3 consists of, the more space you will waste.
See here for more info: https://web.archive.org/web/2021031...or-how-i-learned-stop-worrying-and-love-raidz
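To see how much of that is padding/parity overhead rather than real data, you can for example compare the logical and physical usage and check the volblocksize of the affected zvols (dataset names taken from your zfs list output):

Code:
zfs get -o name,property,value volblocksize,used,logicalused,referenced \
    vastank/bulkpool/vm-107-disk-0 vastank/bulkpool/vm-107-disk-1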
 
Yes, but then you get the additional overhead of qcow2 (which is also copy-on-write like ZFS, so CoW on top of CoW) as well as the overhead of the additional filesystem on top of the ZFS dataset.

To lower the padding overhead I would either:
A.) Increase the volblocksize in case you only have data that does big async sequential reads/writes. This would be really bad when using DBs like PostgreSQL or MySQL that do small sync reads/writes.
B.) In case you need small reads/writes, buy more disks and use a striped mirror (raid10) instead, which would also improve IOPS performance, as IOPS performance only scales with the number of vdevs and not the number of disks (a 100-disk raidz2 is as slow as a single disk when it comes to IOPS); see the sketch after this list.
C.) Try to use an LXC in case you don't need the additional isolation/security, as LXCs use datasets, and datasets aren't affected by padding overhead.
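For B.), creating such a striped mirror would roughly look like this (just a sketch, the disk IDs are placeholders):

Code:
# 3 mirror vdevs striped together = raid10-like layout
zpool create -o ashift=12 newpool \
    mirror /dev/disk/by-id/ssd-1 /dev/disk/by-id/ssd-2 \
    mirror /dev/disk/by-id/ssd-3 /dev/disk/by-id/ssd-4 \
    mirror /dev/disk/by-id/ssd-5 /dev/disk/by-id/ssd-6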
 
Okay, so it seems qcow2 would be an even bigger mistake then.

The data is bulk storage only, so no database access patterns.

The pool is a raidz2 consisting of 5 × 7.68TB SSDs.

To make the adjustments, I would set a new volblocksize for the same pool, and all subsequently created datasets would then inherit the new setting, correct?

What volblocksize do you recommend?
 
To make the adjustments, I would set a new volblocksize for the same pool, and all subsequently created datasets would then inherit the new setting, correct?
The volblocksize can only be set at creation of a zvol, and you can only set it yourself when manually creating a zvol. But usually PVE will create the zvols for you when you add a new virtual disk to a VM or restore a VM from backups. So you would need to edit your ZFSPool storage in the webUI and set the "Block Size" there. Whatever is set in that "Block Size" textbox will be used by ZFS as the volblocksize for newly created zvols.
Then you would need to destroy and recreate all zvols, which can be done by restoring a backup and overwriting the old zvols.
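As a sketch (the storage name and backup file name are just examples), the same can be done on the CLI:

Code:
# set a 32K block size on the ZFSPool storage, here called "vastank-bulk"
pvesm set vastank-bulk --blocksize 32k

# newly created zvols on that storage will then use volblocksize=32K;
# existing zvols have to be recreated, e.g. by restoring the VM over itself
qmrestore /path/to/vzdump-qemu-107-<timestamp>.vma.zst 107 --force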

A good volblocksize for a 5-disk raidz2 with ashift=12 would be 32K (53% of raw capacity usable) or 128K (59% usable). With the default 8K volblocksize, only 33% of the raw capacity is usable.
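Those percentages come from the usual raidz allocation rule (per-block parity plus padding up to a multiple of parity+1). A rough back-of-the-envelope sketch for 5 disks with ashift=12 (4K sectors):

Code:
#!/bin/bash
# Estimate raidz2 allocation per volume block: 2 parity sectors per stripe of
# up to 3 data sectors, padded up to a multiple of (parity + 1) = 3 sectors.
for vbs in 8192 32768 131072; do
    data=$(( vbs / 4096 ))                      # data sectors per block
    parity=$(( 2 * ((data + 2) / 3) ))          # ceil(data/3) stripes * 2 parity sectors
    alloc=$(( ((data + parity + 2) / 3) * 3 ))  # pad to a multiple of 3 sectors
    echo "$(( vbs / 1024 ))K volblocksize -> $(( alloc * 4 ))K allocated, $(( 100 * data / alloc ))% usable"
done

This prints 33% for 8K, 53% for 32K and 59% for 128K, matching the numbers above.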
 
