Understanding disk usage of a VM with ZFS

Hello,

I'm curious about the disk usage of a VM. Here is the config of that VM:

Code:
agent: 1
bootdisk: virtio0
cores: 8
ide2: none,media=cdrom
memory: 12288
name: XXX
net0: virtio=XXX,bridge=vmbr0,rate=30
net1: virtio=XXX,bridge=vmbr0,tag=103
net2: virtio=XXX,bridge=vmbr0,tag=104
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=XXX
sockets: 1
virtio0: local-zfs:vm-133-disk-0,size=10G
virtio1: local-zfs:vm-133-disk-1,size=100G
vmgenid: XXX

As you can see, this VM has two disks (10G and 100G). Now if I check the disk usage of this VM on my ZFS pool:

Code:
NAME                                  USED  AVAIL  REFER  MOUNTPOINT
rpool/data/vm-133-disk-0             10.1G   647G  10.1G  -
rpool/data/vm-133-disk-1              145G   647G   145G  -

So the VM takes 155G?
I've enabled replication between this host and another; here are the snapshots stored on the original host:

Code:
rpool/data/vm-133-disk-0@__replicate_133-0_1551165720__             44.4M      -  10.1G  -
rpool/data/vm-133-disk-1@__replicate_133-0_1551165720__             12.7M      -   145G  -

Can someone explain this storage usage to me? I've seen the same thing on another VM, which is also using double the space it should.

Thank you very much.
 
Please run

Code:
zfs get all rpool/data/vm-133-disk-1 | grep used

and have a look at the used values for each type. Can you also post your ZFS setup:

Code:
zpool status -v rpool
 
Hi!

Thanks, here are the results:

Code:
zfs get all rpool/data/vm-133-disk-1 | grep used
rpool/data/vm-133-disk-1  used                  145G                   -
rpool/data/vm-133-disk-1  usedbysnapshots       453M                   -
rpool/data/vm-133-disk-1  usedbydataset         145G                   -
rpool/data/vm-133-disk-1  usedbychildren        0B                     -
rpool/data/vm-133-disk-1  usedbyrefreservation  0B                     -
rpool/data/vm-133-disk-1  logicalused           95.4G                  -

What's the difference between usedbydataset and logicalused?
And the ZFS setup:

Code:
zpool status -v rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h36m with 0 errors on Sun Feb 10 01:00:46 2019
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sda2    ONLINE       0     0     0
        sdb2    ONLINE       0     0     0
        sdc2    ONLINE       0     0     0
        sdd2    ONLINE       0     0     0
        sde2    ONLINE       0     0     0

errors: No known data errors

Thanks for the hints!
 
What's the difference between usedbydataset and logicalused?

The main problem in interpreting the data is that the used value shown for each snapshot is not the total space consumed by snapshots in general, but only the space that is used exclusively by that one snapshot. This means there can be data that was around, has since been deleted from the dataset, but is still referenced by at least two snapshots, and that data is not attributed to any single snapshot. This is much more involved than with "ordinary stupid" filesystems like ext4.
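To see how the numbers add up in practice, you can compare the per-snapshot used values with the dataset's usedbysnapshots, and ask ZFS how much a destroy would actually reclaim (a small sketch, using the dataset and replication snapshot names from above; -n makes it a dry run):

Code:
# total space held only by snapshots of this zvol
zfs get usedbysnapshots rpool/data/vm-133-disk-1

# dry run: show how much space destroying this snapshot would reclaim
zfs destroy -nv rpool/data/vm-133-disk-1@__replicate_133-0_1551165720__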

Now to your actual question, which can be answered directly by looking into the manpage of zfs:

Code:
     usedbydataset         The amount of space used by this dataset itself, which would be freed if the dataset were destroyed (after first removing any refreservation and destroying any necessary
                           snapshots or descendents).

     logicalused           The amount of space that is "logically" consumed by this dataset and all its descendents.  See the used property.  The logical space ignores the effect of the compression and
                           copies properties, giving a quantity closer to the amount of data that applications see.  However, it does include space consumed by metadata.
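
To put those properties side by side for the zvol in question (a quick sketch; all of these are standard zfs properties):

Code:
zfs get used,usedbydataset,usedbysnapshots,logicalused,compressratio,volblocksize rpool/data/vm-133-disk-1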
 
Thank you for your reply :)
I still don't get exactly why it takes much more than it should or how I can reduce it, but I'm fine with that for now; I just hope it won't double.
 
I still don't get exactly why it takes much more than it should or how I can reduce it

You must know that ZFS is a CoW (copy-on-write) system. For this reason, when you modify a file, the previous data that is already on the pool is never changed in place. Instead, new data blocks are written that contain only the new data. This is very useful, because when you take a new ZFS snapshot, it only has to record the new blocks that are not included in the previous snapshot.

Another reason is that when you delete files inside your VM, ZFS gets no information that the corresponding disk blocks were freed. So you need to tell ZFS that those blocks are free by running fstrim in your VM, as sketched below.

And you must take into account that your pool has parity (RAIDZ), which your VM cannot know anything about. RAIDZ padding will also eat some space ;)
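
A minimal sketch of the trim part (assumptions: the disk is attached through a controller that passes discard through, typically VirtIO-SCSI with discard=on as discussed later in this thread, and the guest is Linux; adjust the VM ID and bus/slot to your setup):

Code:
# on the Proxmox host: attach the disk with discard enabled
# (hypothetical: the disk in this thread is on virtio1 and would have to be
#  detached and reattached via SCSI first)
qm set 133 --scsi1 local-zfs:vm-133-disk-1,discard=on

# inside the guest: release unused filesystem blocks back to the zvol
fstrim -av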

Good luck!
 
I'm having a similar issue and still don't understand the difference between usedbydataset and logicalused. I have a VM with TRIM configured, and I've verified that the ZFS subvol actually reports less data used after issuing an fstrim command in the VM, but the volume is still taking up a lot more space than I anticipated on my Proxmox host. The VM in question has a 2TB partition that currently contains 1.36TB of data, and I've just run fstrim on that volume in the VM. logicalused shows the expected value of 1.36TB, but usedbydataset shows 2.71TB. I also don't have any snapshots for this volume. Below is the output of zfs get all, zfs list, and zpool status:

Code:
root@pve:~# zfs get all rpool/data/vm-106-disk-2 | grep used
rpool/data/vm-106-disk-2  used                  2.71T                  -
rpool/data/vm-106-disk-2  usedbysnapshots       0B                     -
rpool/data/vm-106-disk-2  usedbydataset         2.71T                  -
rpool/data/vm-106-disk-2  usedbychildren        0B                     -
rpool/data/vm-106-disk-2  usedbyrefreservation  0B                     -
rpool/data/vm-106-disk-2  logicalused           1.36T                  -

Code:
root@pve:~# zfs list
NAME                           USED  AVAIL  REFER  MOUNTPOINT
rpool                         5.08T  1.94T   208K  /rpool
rpool/ROOT                    2.23G  1.94T   192K  /rpool/ROOT
rpool/ROOT/pve-1              2.23G  1.94T  2.23G  /
rpool/data                    5.08T  1.94T   224K  /rpool/data
rpool/data/subvol-104-disk-0  1.54G  98.5G  1.54G  /rpool/data/subvol-104-disk-0
rpool/data/subvol-105-disk-0   588M  7.43G   588M  /rpool/data/subvol-105-disk-0
rpool/data/vm-100-disk-0      6.12G  1.94T  6.12G  -
rpool/data/vm-101-disk-0       112K  1.94T   112K  -
rpool/data/vm-101-disk-1      7.59G  1.94T  7.59G  -
rpool/data/vm-106-disk-0      3.47G  1.94T  3.47G  -
rpool/data/vm-106-disk-1      2.34T  1.94T  2.34T  -
rpool/data/vm-106-disk-2      2.71T  1.94T  2.71T  -
rpool/data/vm-111-disk-0      2.21G  1.94T  2.21G  -
rpool/data/vm-113-disk-0      8.53G  1.94T  8.53G  -

Code:
root@pve:~# zpool status -v rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 15h40m with 0 errors on Sun Mar 10 17:04:36 2019
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sdc3    ONLINE       0     0     0
        sdd3    ONLINE       0     0     0
        sde3    ONLINE       0     0     0
        sdf3    ONLINE       0     0     0
        sdg3    ONLINE       0     0     0
        sdh3    ONLINE       0     0     0

errors: No known data errors
 
I guess I'm coming at it from the opposite point of view: I never expected zvols to be "thin provisioned", so a 2TB zvol should take up 2TB in your pool (plus parity). That would explain the ~2.7TB usage on your 4+2 RAIDZ2 pool.

Does Proxmox make "sparse" zvols by default? I think you want to look at the "refreservation" property of your zvol.

Edit: OK, I looked at my Proxmox box and my zvols have a non-zero "refreservation" value.
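
For reference, this is how I'd tell a sparse zvol from a fully reserved one (a small sketch, using the dataset name from the post above):

Code:
zfs get volsize,refreservation,usedbyrefreservation rpool/data/vm-106-disk-2
# refreservation roughly equal to volsize -> fully reserved ("thick") zvol
# refreservation = none                   -> sparse / thin-provisioned zvol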
 
The refreservation property is "none", which seems to me to indicate that the zvol can grow up to the maximum size of the pool. The problem arose because the amount of space used on the server doesn't match what I expected.

I have 6 x 2TB drives in a RAID-Z2 configuration, so I expect to have 4 x 2TB, or 8TB, of space available to work with. The root level of my pool indicates that I have about 7TB available, and I assume the missing ~1TB is consumed by overhead.

I create a few VMs with drives that total well under 7TB, and I see that the available space in the pool is still close to 7TB, so I assume that thin provisioning is in place. Eventually, my system crashes because the pool is completely full. I look at the pool statistics, and the first thing I see is that a zvol backing a 2TB drive I've created on one of the VMs is taking up almost 3TB, even though there is only about 1TB of data stored on that drive. I mistakenly assume that it's a combination of me accidentally over-allocating drive space to my VMs and a lack of TRIM implementation. I free up some space, double-check that TRIM is working properly, and then look over everything closely.

Maybe this is a QEMU issue, but I can't find anything that keeps a zvol that's presented to a VM as a block device from consuming more space than was defined when creating the VM. I also don't know why, if the VM sees a 2TB block device, the underlying zvol would consume almost 3TB.

Also, I'm certain that these zvols are thin-provisioned, because I hadn't created any new VMs for quite a while when this event occurred; I had just been generating more data in the existing VMs. Not only that, but I saw the logicalused parameter decrease on my zvols after running fstrim inside the VMs.

I assume that there's something that I'm missing in how all of this fits together, but I can't for the life of me figure out what's going on.
 
I guess one way to examine what is happening is to make a new disk of a small but round size, fill it from inside your VM, and see what it looks like on the ZFS side.

The other thing you really need to keep track of is base2 vs base10 sizing, which you called "overhead".
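
For example, converting between the two unit systems (a quick sketch, assuming GNU coreutils' numfmt is available):

Code:
# 5 GB as shown in the GUI (base10), expressed in base2 units
numfmt --from=si --to=iec-i 5G    # ~4.7Gi

# a "2TB" drive in base2 units
numfmt --from=si --to=iec-i 2T    # ~1.9Ti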

Here is one example I have; it is a 5GB disk in the Proxmox GUI and 5GB inside the VM:

Code:
root@testnodepve:~# df -h /
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/sda1       4.9G  3.5G   1.1G   77%  /
root@testnodepve:~# df -h --si /
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/sda1       5.3G  3.8G   1.2G   77%  /

on the hypervisor:

Code:
root@amd01:~# zfs list | grep 401
vm01/vm-401-disk-0  6.58G  2.42T  6.58G  -

root@amd01:~# zfs get all | grep vm-401 | grep -i ref
vm01/vm-401-disk-0  referenced            6.58G  -
vm01/vm-401-disk-0  refreservation        5.16G  local
vm01/vm-401-disk-0  usedbyrefreservation  0B     -
vm01/vm-401-disk-0  refcompressratio      1.25x  -
vm01/vm-401-disk-0  logicalreferenced     4.64G
 
I'm curious... your zvol shows a referenced value of 6.58GB when your filesystem is 5GB. Where is the extra ~1.5GB being consumed? On my system, the discrepancy is ~2x. Can I really expect a 2TB partition to take up over 4TB of disk space? Here are the "ref" parameters for my partition:

Code:
rpool/data/vm-106-disk-2  referenced            2.76T                  -
rpool/data/vm-106-disk-2  refreservation        none                   default
rpool/data/vm-106-disk-2  usedbyrefreservation  0B                     -
rpool/data/vm-106-disk-2  refcompressratio      1.00x                  -
rpool/data/vm-106-disk-2  logicalreferenced     1.38T                  -

I tried creating a new partition and rsyncing some of the data across, but it was obvious after only a few hundred GB that the new partition was behaving in exactly the same way. When there was 100GB of data stored on the new partition, the logicalused and logicalreferenced values were about 100GB, but the referenced and usedbydataset values were almost twice as big. At this rate, I'll only be able to store about 3.5TB of data on my 6 x 2TB drive array, which seems a bit absurd.
 
I am not sure how to explain it, but at least we can make careful observations and go from there.

In this case, on my zvol I see I have volblocksize=8K (from zfs get all | grep vm-401 | grep -i block), and inside it is actually an ext4 filesystem with a 4K block size (from tune2fs -l /dev/sda1); both checks are spelled out below.
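
Spelled out, those two checks look like this (a sketch; the dataset and device names are the ones from my example above):

Code:
# volblocksize of the zvol, on the hypervisor
zfs get volblocksize vm01/vm-401-disk-0

# filesystem block size, inside the guest
tune2fs -l /dev/sda1 | grep 'Block size'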

I saw a random post on the zfs-discuss mailing list today about SCSI UNMAP not passing through correctly to zvols; I'm not sure if that is anecdotal or something I misunderstand.

Here is also a possibly related post about "fstrim": https://github.com/zfsonlinux/zfs/issues/6307

When I first started using Proxmox, I expected it to make files inside a ZFS filesystem for disks; I did not expect zvols. There are pros and cons. I went with zvols because they are the default in Proxmox.

But I do agree that the desired behavior is that the zvol does not grow beyond the size that you set, as long as you don't use snapshots.

Edit: maybe we just need to enable discard for the filesystem inside the VM? https://github.com/zfsonlinux/zfs/issues/7722
 
OK, so inside my VM I ran

Code:
root@testnodepve:~# dd if=/dev/urandom of=/tmp/zerofile1G bs=1M count=1024

and the zvol usage grew, but only by a little bit:

Code:
vm01/vm-401-disk-0  referenced  6.65G
 
So, I think I've stumbled upon something. After reading https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz and seeing over and over that 6 drives in RAID-Z2 seems to be a sweet spot, which is exactly what I have, I noticed that at the very end it says:
Note that setting a small recordsize with 4KB sector devices results in universally poor space efficiency -- RAIDZ-p is no better than p-way mirrors for recordsize=4K or 8K. The strongest valid recommendation based on exact fitting of blocks into stripes is the following: If you are using RAID-Z with 512-byte sector devices with recordsize=4K or 8K

My drives have 4K sectors, and by default Proxmox sets the blocksize to 8k. After changing it to 32k and creating a new zvol, my referenced and logicalreferenced values are almost identical and match the amount of data stored in the VM.
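
Roughly what I did, for anyone who wants to try the same (a sketch; the storage name and size are examples, and note that volblocksize can only be set when a zvol is created, so existing disks have to be recreated or moved):

Code:
# set the default block size for new zvols on the Proxmox ZFS storage
pvesm set local-zfs --blocksize 32k

# or create a zvol by hand with a larger volblocksize (hypothetical name)
zfs create -s -V 100G -o volblocksize=32k rpool/data/vm-999-disk-0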
 
When I first started using Proxmox, I expected it to make files inside a ZFS filesystem for disks; I did not expect zvols. There are pros and cons. I went with zvols because they are the default in Proxmox.

But I do agree that the desired behavior is that the zvol does not grow beyond the size that you set, as long as you don't use snapshots.

Edit: maybe we just need to enable discard for the filesystem inside the VM? https://github.com/zfsonlinux/zfs/issues/7722

It looks like there are different defaults for configuring ZFS pools in different versions of Proxmox. I just set up a new v5.3 server from scratch, and when I go to Datacenter -> Storage -> local-zfs and click Edit, I see that "Thin Provision" is checked and the Block Size is 8k.

Based on what you've posted about your system, it appears that your zvols are not thin provisioned, but mine definitely are. If "Thin Provision" is checked, "Discard" is selected when creating a new VirtIO-SCSI volume, *and* discard is supported in the guest OS, then running fstrim in the guest OS will de-allocate space in the zvol (see the sketch below).

There are a lot of pieces to get just right for it to work. And I'm still not sure how to reason about the combination of RAID-Z geometry, block size, and sector size to get sensible disk space consumption without trial and error.
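
Putting the pieces together, this is roughly the chain I ended up checking (a sketch; the VM ID and dataset are the ones from my earlier posts, and the storage.cfg path is the standard Proxmox one):

Code:
# 1. the storage is thin provisioning (sparse) and has a sensible block size
grep -A 5 'zfspool: local-zfs' /etc/pve/storage.cfg

# 2. the disk is attached via VirtIO-SCSI with discard enabled
qm config 106 | grep scsi

# 3. inside the guest: trim the filesystem
fstrim -av

# 4. back on the host: verify that space was actually released
zfs get used,logicalused,refreservation rpool/data/vm-106-disk-2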
 