VM disks growing beyond the allocated size

isantos

Hello,

I have a Proxmox cluster set up with TrueNAS as storage, using ZFS over iSCSI with TheGrandWazoo's plugin: https://github.com/TheGrandWazoo/freenas-proxmox.

The Proxmox cluster is running version 6.4-8 (pve-manager/6.4-8/185e14db, running kernel 5.4.119-1-pve).
The TrueNAS server is running on version 12 (TrueNAS-12.0-U8).

I set up the connection to the storage using the plugin from TheGrandWazoo, and I noticed some weird problems, like snapshots taking forever to finish, or VMs that I deleted not being removed from the storage (the zvol sometimes persists and I have to delete it manually).

But my big concern is the VMs growing beyond the size I specified when I created them. At this point I think my setup is not optimal, given some information I later found about issues between PVE and TrueNAS, but I didn't have that information back then and this setup was created in a rush... unfortunately. I see in other posts that people run ZFS locally on the PVE nodes, but for us it is important that the storage is accessible by our other nodes so we can live-migrate VMs between them. Please share your thoughts about this.

Well, this is what I know at the moment (I will take the worst case as an example, VM ID 100):

Bash:
root@pve08:~# qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID       
       100 XXXXXXXXX running    32768          15360.00 6259     
       102 XXXXXXXXX running    49152           4000.00 6405     
       114 XXXXXXXXX running    16384           2600.00 6473

root@pve08:~# qm config 100
agent: 1
boot: order=ide2;scsi0;net0
cores: 8
ide2: none,media=cdrom
memory: 32768
name: XXXXXXXXX
net0: virtio=XXXXXXXXX,bridge=vmbr0,firewall=1,rate=15,tag=300
net1: virtio=XXXXXXXXX,bridge=vmbr0,firewall=1,rate=70,tag=208
numa: 0
ostype: l26
scsi0: san01-datastore01:vm-100-disk-1,cache=writeback,discard=on,iops_rd=1000,iops_rd_max=2000,iops_wr=1000,iops_wr_max=2000,mbps_rd=100,mbps_rd_max=300,mbps_wr=100,mbps_wr_max=300,size=15T
scsihw: virtio-scsi-pci
smbios1: uuid=3fd78b62-0c60-4b2d-b693-9b408e27a698
sockets: 2
vmgenid: 6ad23a7a-4ec9-4e30-b29a-979c3fa1bead

root@pve08:~# qm listsnapshot 100
`-> current                                             You are here!

The VM is configured with a 15TB virtual disk. No snapshots present.
On my storage, on the other hand, I see a much bigger volume size. How is this possible?

Bash:
root@san01[~]# zfs list pool01/vm-100-disk-1
NAME                   USED  AVAIL     REFER  MOUNTPOINT
pool01/vm-100-disk-1  30.6T  19.1T     30.6T  -

root@san01[~]# zfs list -t snapshot pool01/vm-100-disk-1
no datasets available

Since I haven't found many setups like ours, using PVE with storage on TrueNAS, I'm struggling to find information that would help me figure out the source of this problem. Maybe more experienced admins have some suggestions?

I appreciate your help.
 
Do you have TRIM enabled and run it periodically in the VM?
 
@mira All my VMs have the "discard" option enabled in their configuration. Should I still run something manually in the VM?
 
@mira All my VMs have the "discard" option enabled in their configuration. Should I still run something manually in the VM?
Yes, the TRIM commands need to be sent by your guest OS. So in the case of Linux you need to mount every partition with the 'discard' option set, or set up a cron job to run fstrim -a once per hour or day.
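For example, a minimal sketch of both approaches inside a Linux guest (mount point, filesystem and schedule are placeholders, adjust them to your setup):

Bash:
# Option 1: continuous TRIM - mount the filesystem with the 'discard'
# option, e.g. an /etc/fstab entry like:
#   /dev/mapper/cl-root  /  xfs  defaults,discard  0 0

# Option 2: periodic TRIM - run fstrim from cron, e.g. once per day:
echo '@daily root /sbin/fstrim -a' > /etc/cron.d/fstrim

# Run it once by hand to reclaim already freed space right away:
fstrim -av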
 
I will wait for the next backup to finish and test this suggestion over the holidays, then I will get back to update this topic. Thank you @mira and @Dunuin!
 
Lots of Linux distributions offer an fstrim service, which can be enabled and runs once a week.
This should be enough in most cases.
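On systemd-based guests that is usually the util-linux fstrim.timer (a sketch; availability can vary per distribution):

Code:
systemctl enable --now fstrim.timer
systemctl list-timers 'fstrim*'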
 
I ran fstrim -av on all my Linux servers last weekend and it worked great on most of them; I also enabled fstrim.timer on every VM. Only on this VM ID 100 that I used as an example it didn't work. I repeated the command yesterday and it ran all night long, and when I checked the disk size this morning it still showed 30TB on my storage (it is configured as 15TB on the PVE side):

Code:
[root@mailserver ~]# fstrim -av
/var: 6 GiB (6477033472 bytes) trimmed
/boot: 868,2 MiB (910336000 bytes) trimmed
/: 867,5 GiB (931441553408 bytes) trimmed
[root@mailserver ~]# lsblk
NAME                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                     8:0    0    15T  0 disk
├─sda1                  8:1    0     1M  0 part
├─sda2                  8:2    0     1G  0 part /boot
└─sda3                  8:3    0    15T  0 part
  ├─cl_mailbox01-root 253:0    0    15T  0 lvm  /
  ├─cl_mailbox01-swap 253:1    0   7,9G  0 lvm  [SWAP]
  └─cl_mailbox01-var  253:2    0    10G  0 lvm  /var

The VM's disk is SCSI and the controller is VirtIO SCSI.
Code:
root@pve08:~# cat /etc/pve/qemu-server/100.conf
agent: 1
boot: order=ide2;scsi0;net0
cores: 8
ide2: none,media=cdrom
memory: 32768
name: mailserver.finep.gov.br
net0: virtio=EA:2E:00:9A:D1:F1,bridge=vmbr0,firewall=1,rate=15,tag=300
net1: virtio=F6:A1:66:AF:49:37,bridge=vmbr0,firewall=1,rate=70,tag=208
numa: 0
ostype: l26
scsi0: san01-datastore01:vm-100-disk-1,cache=writeback,discard=on,iops_rd=1000,iops_rd_max=2000,iops_wr=1000,iops_wr_max=2000,mbps_rd=100,mbps_rd_max=300,mbps_wr=100,mbps_wr_max=300,size=15T
scsihw: virtio-scsi-pci
smbios1: uuid=3fd78b62-0c60-4b2d-b693-9b408e27a698
sockets: 2
vmgenid: 6ad23a7a-4ec9-4e30-b29a-979c3fa1bead

What else should I check?
 
No snapshots on this VM

PVE:
Code:
root@pve08:~# qm listsnapshot 100
`-> current                                             You are here!

Storage:
Code:
root@san01[~]# zfs list -t snapshot pool01/vm-100-disk-1
no datasets available
 
Some more information about this:

Code:
root@san01[~]# zfs get all pool01/vm-100-disk-1 | grep used
pool01/vm-100-disk-1  used                     30.4T                    -
pool01/vm-100-disk-1  usedbysnapshots          0B                       -
pool01/vm-100-disk-1  usedbydataset            30.4T                    -
pool01/vm-100-disk-1  usedbychildren           0B                       -
pool01/vm-100-disk-1  usedbyrefreservation     0B                       -
pool01/vm-100-disk-1  logicalused              14.3T                    -
 
Is that "pool01" ZFS pool a raidz1/2/3? Then it might be padding overhead if your volblocksize is too small.

What's the output of zpool list pool01 and zfs get volblocksize pool01/vm-100-disk-1?
 
@Dunuin Yes, it is a RAIDZ2 ZFS pool:

Code:
root@san01[~]# zpool list pool01
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool01   116T  69.6T  46.8T        -         -    26%    59%  1.00x    ONLINE  /mnt

Code:
root@san01[~]# zfs get volblocksize pool01/vm-100-disk-1
NAME                  PROPERTY      VALUE     SOURCE
pool01/vm-100-disk-1  volblocksize  4K        -

Then it might be padding overhead if your volblocksize is too small
I don't know about this; I will do some research on padding overhead and ZFS volblocksize.
If you have any suggestions on this matter, I would appreciate them.
 
What's your zpool status pool01, or how many disks does your raidz2 consist of?

There is a good explanation about padding overhead and volblocksize: https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz

With a 4K volblocksize you will always lose 2/3 of your pool's total raw capacity to padding and parity overhead. So if your pool has 116T raw capacity, you can only store 38.66T of zvols on it. To fix that you would need to destroy and recreate all zvols with a bigger volblocksize, as the volblocksize can only be set at creation. How big your volblocksize needs to be depends on your pool's ashift and the number of drives the pool consists of.
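To make that concrete, here is a rough back-of-the-envelope calculation, assuming ashift=12 (4K sectors) and an 8-disk raidz2 as in your pool; the exact allocation rules are explained in the Delphix post linked below:

Code:
# raidz2 stores 2 parity sectors per data row and pads every allocation
# up to a multiple of (parity + 1) = 3 sectors.
#
# 4K volblocksize:  1 data sector + 2 parity = 3 sectors = 12K on disk
#                   -> only 1/3 of the raw space holds data
# 16K volblocksize: 4 data sectors + 2 parity = 6 sectors = 24K on disk
#                   -> 2/3 of the raw space holds data
#
# For this zvol: 14.3T logicalused x3 is roughly 43T of raw allocations,
# which after ZFS's internal raidz space accounting shows up as the ~30T
# 'used' that zfs list reports.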
 
My RAIDZ2 consists of 8 disks, plus log and cache devices:

Code:
root@san01[~]# zpool status pool01
  pool: pool01
 state: ONLINE
  scan: scrub repaired 0B in 6 days 06:19:08 with 0 errors on Fri Jan 21 06:28:27 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool01                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/6fa898dd-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/701ebef3-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/7000b337-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/70516358-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/70e7d3e3-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/71397881-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/705ab46a-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/71141eed-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
        logs
          mirror-2                                      ONLINE       0     0     0
            gptid/9ce214c4-7164-11eb-94ae-3cecef3d7ef0  ONLINE       0     0     0
            gptid/9cea4619-7164-11eb-94ae-3cecef3d7ef0  ONLINE       0     0     0
        cache
          gptid/92366db8-7162-11eb-b90c-3cecef3d7ef0    ONLINE       0     0     0
          gptid/923e0667-7162-11eb-b90c-3cecef3d7ef0    ONLINE       0     0     0

errors: No known data errors

There is a good explanation about padding overhead and volblocksize: https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz
Thanks!
 
In case your pool was created with ashift=12 and 8 data disks, and the sum of the data disks' raw capacity is 116T, it would look like this:
Code:
Volblocksize     Parity+padding loss of raw capacity   Usable capacity for zvols
4K/8K            67% (25% parity + 42% padding)        30.6T
16K/32K/64K      33% (25% parity + 7% padding)         62.1T
128K             29% (25% parity + 4% padding)         65.8T
256K/512K        26% (25% parity + 1% padding)         68.6T
1M               25% (25% parity + no padding)         69.6T
"Usable capacity for zvols" already takes into account that 20% of a zfs pool always should be kept free because otherwise it will get slow and fragments faster.

So I would use a 16K volblocksize, which in theory should double the space your zvols can use. But keep in mind that this doesn't take block-level compression or deduplication into account, so real-world results may differ a bit.

If you want to run Postgres DBs with their 8K blocksize, you should consider using a striped mirror instead, as writing 8K blocks to a zvol with a 16K volblocksize will halve the performance.
 
@Dunuin So in order to fix this issue in my environment, I should change the volblocksize. But can I just change it in the storage configuration of my current pools? How does this affect my existing VMs?

And I imagine that my VM-100, which is now using 30T instead of the 15T that was allocated, will not magically change its zvol size. But is there any trick I can do to fix its size? This VM is very big and this would require a lot of (offline) time if I had to migrate it.
 
What is the implication of using, for example, a 1M volblocksize? That looks like the best option in your spreadsheet in terms of space savings. Would all data be written in chunks of 1M?
 
@Dunuin So in order to fix this issue in my environment, I should change the volblocksize. But can I just change it in the storage configuration of my current pools? How does this affect my existing VMs?
The volblocksize can't be changed later. It can only be set at creation and is read-only afterwards. You need to back up the data, destroy that zvol, recreate the zvol with the correct volblocksize and restore the data to the new zvol. Not sure how that will work with ZFS over iSCSI.
Is PVE or TrueNAS creating the zvols? If that is handled by PVE just like a local ZFS pool, I would change the blocksize of the ZFS storage (Datacenter -> Storage -> YourZFSStorage -> Edit -> Blocksize: 16K). Then I would create a PBS or vzdump backup of that VM and restore it from that backup. PVE will then delete all existing zvols of that VM, create new ones using the 16K volblocksize and fill them with the data from the backup.
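A possible sequence on the PVE side, assuming PVE creates the zvols and your storage type accepts the blocksize option; the backup storage name 'backup-store' and the archive path are placeholders:

Code:
# Set the blocksize used for newly created zvols on that storage
# (same setting as Datacenter -> Storage -> Edit -> Blocksize):
pvesm set san01-datastore01 --blocksize 16k

# Back up the VM ...
vzdump 100 --storage backup-store --mode stop --compress zstd

# ... then restore it over itself; the restore recreates the zvols with
# the new 16K volblocksize and fills them with the data from the backup:
qmrestore /path/to/vzdump-qemu-100-....vma.zst 100 --force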
And I imagine that my VM-100, which is now using 30T instead of the 15T that was allocated, will not magically change its zvol size. But is there any trick I can do to fix its size? This VM is very big and this would require a lot of (offline) time if I had to migrate it.
Much of that 30T is padding overhead. After increasing the volblocksize to something like 16K it should only use something like 16T or 17T to store those 15T.
What is the implication of using, for example, a 1M volblocksize? That looks like the best option in your spreadsheet in terms of space savings. Would all data be written in chunks of 1M?
Yup, everything will then consume at least 1M, even if you just want to store 4K. So writing 1000x 4K will result in 1G of data instead of just 4M. In general you want the volblocksize to be as low as possible without sacrificing too much storage to padding overhead.
 
