ZFS thin provisioning space usage discrepancies

mnih

Member
Feb 10, 2021
Hi,

I've got a Debian 11 VM which uses three virtio-SCSI disks of 20G/1T/3T size with discard enabled. The first is for the system, the second and third are for storage, all with ext4 file systems.

Code:
Filesystem              Size  Used Avail Use% Mounted on
/dev/sdc1                16G  6.3G  8.6G  43% /
/dev/sda1              1007G  268G  740G  27% /mnt/storage1
/dev/sdb1               3.0T  2.2T  789G  74% /mnt/storage2
...

All disk data is located as raw datasets (zvols) in a thin-provisioned ZFS raidz2 pool.

zfs list output:
Code:
NAME                             USED  AVAIL     REFER  MOUNTPOINT
vmpool/vm-107-disk-0            7.79G  3.30T     7.79G  -
vastank/bulkpool/vm-107-disk-0   350G  6.88T      350G  -
vastank/bulkpool/vm-107-disk-1  2.79T  6.88T     2.79T  -

Disk two shows 350G used/referenced while data usage inside the file system is only 268G; disk three shows 2.79T vs 2.2T.

I tried fstrim -av, writing zeros with dd and deleting the file afterwards, and sync flushes; nothing reduced the 350G ZFS dataset.
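For reference, this is roughly what I ran inside the guest (the zero-fill file path is just an example):

Code:
# trim all mounted filesystems that support discard
fstrim -av

# fallback: fill the free space with zeros, then delete the file and sync
# (example path on the second disk)
dd if=/dev/zero of=/mnt/storage1/zerofill bs=1M status=progress || true
rm /mnt/storage1/zerofill
sync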

Then I moved the second disk to another (also thin-provisioned ZFS) data pool (it's now called disk-1):
Code:
NAME                             USED  AVAIL     REFER  MOUNTPOINT
vmpool/vm-107-disk-1             266G  3.04T      266G  -

As you can see, the additional space was freed in the process.

I'm currently moving the data back to the initial pool (this will take some hours) to see if the usage stays low, to rule out that some pool difference I'm not aware of is causing the overhead.

The virtual disks also get replicated every minute to another node; on the target node they are oversized as well.

I'm asking myself if there is any way to reclaim the unused space without having to migrate the virtual disk between ZFS pools.
 
Search this forum for "padding overhead". You can't use any raidz1/2/3 with the default 8K volblocksize or you will get massive padding overhead, causing everything written to a zvol to consume more space. The smaller your volblocksize or the more disks your raidz1/2/3 consists of, the more space you will waste.
See here for more info: https://web.archive.org/web/2021031...or-how-i-learned-stop-worrying-and-love-raidz
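To see how much of that is padding/parity overhead rather than real data, you can for example compare the logical and physical usage and check the volblocksize of the affected zvols (dataset names taken from your zfs list output):

Code:
zfs get -o name,property,value volblocksize,used,logicalused,referenced \
    vastank/bulkpool/vm-107-disk-0 vastank/bulkpool/vm-107-disk-1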
 
Yes, but then you get the additional overhead of qcow2 (which is also copy-on-write like ZFS, so CoW on top of CoW) as well as the overhead of the additional filesystem on top of the ZFS dataset.

To lower the padding overhead I would either:
A.) Increase the volblocksize in case you only have data that does big async sequential reads/writes. This would be really bad when using DBs like PostgreSQL or MySQL that do small sync reads/writes.
B.) In case you need small reads/writes, buy more disks and use a striped mirror (raid10) instead, which would also improve IOPS performance, as IOPS performance only scales with the number of vdevs and not the number of disks (a 100-disk raidz2 is as slow as a single disk when it comes to IOPS); see the sketch after this list.
C.) Try to use an LXC in case you don't need the additional isolation/security, as LXCs use datasets, and datasets aren't affected by padding overhead.
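For B.), creating such a striped mirror would roughly look like this (just a sketch, the disk IDs are placeholders):

Code:
# 3 mirror vdevs striped together = raid10-like layout
zpool create -o ashift=12 newpool \
    mirror /dev/disk/by-id/ssd-1 /dev/disk/by-id/ssd-2 \
    mirror /dev/disk/by-id/ssd-3 /dev/disk/by-id/ssd-4 \
    mirror /dev/disk/by-id/ssd-5 /dev/disk/by-id/ssd-6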
 
Okay, so it seems qcow2 would be an even bigger mistake then.

The data is bulk storage only, so no database access patterns.

The pool is a raidz2 consisting of 5 × 7.68TB SSDs.

To make the adjustments, I would set a new volblocksize for the same pool, and all subsequently created datasets would then inherit the new setting, correct?

What volblocksize do you recommend?
 
To make the adjustments, I would set a new volblocksize for the same pool, and all subsequently created datasets would then inherit the new setting, correct?
The volblocksize can only be set at creation of a zvol, and you can only set it yourself when manually creating a zvol. But usually PVE will create the zvols for you when you add a new virtual disk to a VM or restore a VM from backups. So you would need to edit your ZFSPool storage in the webUI and set the "Block Size" there. Whatever is set in that "Block Size" textbox will be used by ZFS as the volblocksize for newly created zvols.
Then you would need to destroy and recreate all zvols, which can be done by restoring a backup and overwriting the old zvols.
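As a sketch (the storage name and backup file name are just examples), the same can be done on the CLI:

Code:
# set a 32K block size on the ZFSPool storage, here called "vastank-bulk"
pvesm set vastank-bulk --blocksize 32k

# newly created zvols on that storage will then use volblocksize=32K;
# existing zvols have to be recreated, e.g. by restoring the VM over itself
qmrestore /path/to/vzdump-qemu-107-<timestamp>.vma.zst 107 --force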

A good volblocksize for a 5-disk raidz2 with ashift=12 would be 32K (53% of raw capacity usable) or 128K (59% usable). With the default 8K volblocksize, only 33% of the raw capacity is usable.
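Those percentages come from the usual raidz allocation rule (per-block parity plus padding up to a multiple of parity+1). A rough back-of-the-envelope sketch for 5 disks with ashift=12 (4K sectors):

Code:
#!/bin/bash
# Estimate raidz2 allocation per volume block: 2 parity sectors per stripe of
# up to 3 data sectors, padded up to a multiple of (parity + 1) = 3 sectors.
for vbs in 8192 32768 131072; do
    data=$(( vbs / 4096 ))                      # data sectors per block
    parity=$(( 2 * ((data + 2) / 3) ))          # ceil(data/3) stripes * 2 parity sectors
    alloc=$(( ((data + parity + 2) / 3) * 3 ))  # pad to a multiple of 3 sectors
    echo "$(( vbs / 1024 ))K volblocksize -> $(( alloc * 4 ))K allocated, $(( 100 * data / alloc ))% usable"
done

This prints 33% for 8K, 53% for 32K and 59% for 128K, matching the numbers above.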
 
