VM disks growing beyond the allocated size

isantos

Hello,

I have a Proxmox cluster set up with TrueNAS as storage, using ZFS over iSCSI with TheGrandWazoo's plugin: https://github.com/TheGrandWazoo/freenas-proxmox.

The Proxmox cluster is running version 6.4-8 (pve-manager/6.4-8/185e14db, running kernel 5.4.119-1-pve).
The TrueNAS server is running on version 12 (TrueNAS-12.0-U8).

I set up the connection to the storage using the plugin from TheGrandWazoo, and I noticed some weird problems, like snapshots taking forever to finish, or VMs that I deleted not being removed from the storage (the zvol sometimes persists and I have to delete it manually).

But my big concern is the VMs growing beyond the size I specified when I created them. At this point I think my setup is not optimal, given some information I later found about issues between PVE and TrueNAS, but I didn't have that information back then and this setup was created in a rush... unfortunately. I see in other posts that people run ZFS locally on the PVE nodes, but for us it is important that the storage is accessible by our other nodes so we can live-migrate VMs between them. Please share your thoughts about this.

Well, this is what I know at the moment (I will take the worst case as an example, VM ID 100):

Bash:
root@pve08:~# qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID       
       100 XXXXXXXXX running    32768          15360.00 6259     
       102 XXXXXXXXX running    49152           4000.00 6405     
       114 XXXXXXXXX running    16384           2600.00 6473

root@pve08:~# qm config 100
agent: 1
boot: order=ide2;scsi0;net0
cores: 8
ide2: none,media=cdrom
memory: 32768
name: XXXXXXXXX
net0: virtio=XXXXXXXXX,bridge=vmbr0,firewall=1,rate=15,tag=300
net1: virtio=XXXXXXXXX,bridge=vmbr0,firewall=1,rate=70,tag=208
numa: 0
ostype: l26
scsi0: san01-datastore01:vm-100-disk-1,cache=writeback,discard=on,iops_rd=1000,iops_rd_max=2000,iops_wr=1000,iops_wr_max=2000,mbps_rd=100,mbps_rd_max=300,mbps_wr=100,mbps_wr_max=300,size=15T
scsihw: virtio-scsi-pci
smbios1: uuid=3fd78b62-0c60-4b2d-b693-9b408e27a698
sockets: 2
vmgenid: 6ad23a7a-4ec9-4e30-b29a-979c3fa1bead

root@pve08:~# qm listsnapshot 100
`-> current                                             You are here!

The VM is configured with a 15TB virtual disk. No snapshots present.
On my storage, on the other hand, I see a much bigger volume size. How is this possible?

Bash:
root@san01[~]# zfs list pool01/vm-100-disk-1
NAME                   USED  AVAIL     REFER  MOUNTPOINT
pool01/vm-100-disk-1  30.6T  19.1T     30.6T  -

root@san01[~]# zfs list -t snapshot pool01/vm-100-disk-1
no datasets available

Since I haven't found many setups like ours, using PVE with storage on TrueNAS, I'm struggling to find information that would help me figure out the source of this problem. Maybe more experienced admins have some suggestions?

I appreciate your help.
 
Do you have TRIM enabled and run it periodically in the VM?
 
@mira All my VMs have the "discard" option enabled in their configuration. Should I still run something manually in the VM?
 
@mira All my VMs have the "discard" option enabled in their configuration. Should I still run something manually in the VM?
Yes, the TRIM commands need to be sent by your guest OS. So in the case of Linux you need to mount every partition with the 'discard' option set, or set up a cron job to run fstrim -a once per hour or day.
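For example, a minimal sketch of both approaches inside a Linux guest (mount point, filesystem and schedule are placeholders, adjust them to your setup):

Bash:
# Option 1: continuous TRIM - mount the filesystem with the 'discard'
# option, e.g. an /etc/fstab entry like:
#   /dev/mapper/cl-root  /  xfs  defaults,discard  0 0

# Option 2: periodic TRIM - run fstrim from cron, e.g. once per day:
echo '@daily root /sbin/fstrim -a' > /etc/cron.d/fstrim

# Run it once by hand to reclaim already freed space right away:
fstrim -av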
 
I will wait for the next backup to finish and test this suggestion over the holidays, then I will get back to update this topic. Thank you @mira and @Dunuin!
 
Lots of Linux distributions offer an fstrim service, which can be enabled and runs once a week.
This should be enough in most cases.
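On systemd-based guests that is usually the util-linux fstrim.timer (a sketch; availability can vary per distribution):

Code:
systemctl enable --now fstrim.timer
systemctl list-timers 'fstrim*'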
 
I ran fstrim -av on all my Linux servers last weekend and it worked great on most of them; I also enabled fstrim.timer on every VM. Only on this VM ID 100 that I used as an example it didn't work. I repeated the command yesterday and it ran all night long, and when I checked the disk size this morning it still showed 30TB on my storage (it is configured as 15TB on the PVE side):

Code:
[root@mailserver ~]# fstrim -av
/var: 6 GiB (6477033472 bytes) trimmed
/boot: 868,2 MiB (910336000 bytes) trimmed
/: 867,5 GiB (931441553408 bytes) trimmed
[root@mailserver ~]# lsblk
NAME                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                     8:0    0    15T  0 disk
├─sda1                  8:1    0     1M  0 part
├─sda2                  8:2    0     1G  0 part /boot
└─sda3                  8:3    0    15T  0 part
  ├─cl_mailbox01-root 253:0    0    15T  0 lvm  /
  ├─cl_mailbox01-swap 253:1    0   7,9G  0 lvm  [SWAP]
  └─cl_mailbox01-var  253:2    0    10G  0 lvm  /var

The VM's disk is SCSI and the controller is VirtIO SCSI.
Code:
root@pve08:~# cat /etc/pve/qemu-server/100.conf
agent: 1
boot: order=ide2;scsi0;net0
cores: 8
ide2: none,media=cdrom
memory: 32768
name: mailserver.finep.gov.br
net0: virtio=EA:2E:00:9A:D1:F1,bridge=vmbr0,firewall=1,rate=15,tag=300
net1: virtio=F6:A1:66:AF:49:37,bridge=vmbr0,firewall=1,rate=70,tag=208
numa: 0
ostype: l26
scsi0: san01-datastore01:vm-100-disk-1,cache=writeback,discard=on,iops_rd=1000,iops_rd_max=2000,iops_wr=1000,iops_wr_max=2000,mbps_rd=100,mbps_rd_max=300,mbps_wr=100,mbps_wr_max=300,size=15T
scsihw: virtio-scsi-pci
smbios1: uuid=3fd78b62-0c60-4b2d-b693-9b408e27a698
sockets: 2
vmgenid: 6ad23a7a-4ec9-4e30-b29a-979c3fa1bead

What else should I check?
 
No snapshots on this VM

PVE:
Code:
root@pve08:~# qm listsnapshot 100
`-> current                                             You are here!

Storage:
Code:
root@san01[~]# zfs list -t snapshot pool01/vm-100-disk-1
no datasets available
 
Some more information about this:

Code:
root@san01[~]# zfs get all pool01/vm-100-disk-1 | grep used
pool01/vm-100-disk-1  used                     30.4T                    -
pool01/vm-100-disk-1  usedbysnapshots          0B                       -
pool01/vm-100-disk-1  usedbydataset            30.4T                    -
pool01/vm-100-disk-1  usedbychildren           0B                       -
pool01/vm-100-disk-1  usedbyrefreservation     0B                       -
pool01/vm-100-disk-1  logicalused              14.3T                    -
 
Is that "pool01" ZFS pool a raidz1/2/3? Then it might be padding overhead if your volblocksize is too small.

What's the output of zpool list pool01 and zfs get volblocksize pool01/vm-100-disk-1?
 
@Dunuin Yes, it is a RAIDZ2 ZFS pool:

Code:
root@san01[~]# zpool list pool01
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool01   116T  69.6T  46.8T        -         -    26%    59%  1.00x    ONLINE  /mnt

Code:
root@san01[~]# zfs get volblocksize pool01/vm-100-disk-1
NAME                  PROPERTY      VALUE     SOURCE
pool01/vm-100-disk-1  volblocksize  4K        -

Then it might be padding overhead if your volblocksize is too small
I don't know about this; I will do some research on padding overhead and ZFS volblocksize.
If you have any suggestions on this matter, I would appreciate them.
 
What's your zpool status pool01, or how many disks does your raidz2 consist of?

There is a good explanation about padding overhead and volblocksize: https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz

With a 4K volblocksize you will always lose 2/3 of your pool's total raw capacity to padding and parity overhead. So if your pool has 116T raw capacity, you can only store 38.66T of zvols on it. To fix that you would need to destroy and recreate all zvols with a bigger volblocksize, as the volblocksize can only be set at creation. How big your volblocksize needs to be depends on your pool's ashift and the number of drives the pool consists of.
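To make that concrete, here is a rough back-of-the-envelope calculation, assuming ashift=12 (4K sectors) and an 8-disk raidz2 as in your pool; the exact allocation rules are explained in the Delphix post linked below:

Code:
# raidz2 stores 2 parity sectors per data row and pads every allocation
# up to a multiple of (parity + 1) = 3 sectors.
#
# 4K volblocksize:  1 data sector + 2 parity = 3 sectors = 12K on disk
#                   -> only 1/3 of the raw space holds data
# 16K volblocksize: 4 data sectors + 2 parity = 6 sectors = 24K on disk
#                   -> 2/3 of the raw space holds data
#
# For this zvol: 14.3T logicalused x3 is roughly 43T of raw allocations,
# which after ZFS's internal raidz space accounting shows up as the ~30T
# 'used' that zfs list reports.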
 
My RAIDZ2 consists of 8 disks, plus log and cache devices:

Code:
root@san01[~]# zpool status pool01
  pool: pool01
 state: ONLINE
  scan: scrub repaired 0B in 6 days 06:19:08 with 0 errors on Fri Jan 21 06:28:27 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool01                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/6fa898dd-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/701ebef3-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/7000b337-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/70516358-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/70e7d3e3-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/71397881-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/705ab46a-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
            gptid/71141eed-60c3-11eb-960c-3cecef3d7ef0  ONLINE       0     0     0
        logs
          mirror-2                                      ONLINE       0     0     0
            gptid/9ce214c4-7164-11eb-94ae-3cecef3d7ef0  ONLINE       0     0     0
            gptid/9cea4619-7164-11eb-94ae-3cecef3d7ef0  ONLINE       0     0     0
        cache
          gptid/92366db8-7162-11eb-b90c-3cecef3d7ef0    ONLINE       0     0     0
          gptid/923e0667-7162-11eb-b90c-3cecef3d7ef0    ONLINE       0     0     0

errors: No known data errors

There is a good explanation about padding overhead and volblocksize: https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz
Thanks!
 
In case your pool was created with ashift=12 and 8 data disks, and the sum of the data disks' raw capacity is 116T, it would look like this:
Code:
Volblocksize     Parity+padding loss of raw capacity   Usable capacity for zvols
4K/8K            67% (25% parity + 42% padding)        30.6T
16K/32K/64K      33% (25% parity + 7% padding)         62.1T
128K             29% (25% parity + 4% padding)         65.8T
256K/512K        26% (25% parity + 1% padding)         68.6T
1M               25% (25% parity + no padding)         69.6T
"Usable capacity for zvols" already takes into account that 20% of a zfs pool always should be kept free because otherwise it will get slow and fragments faster.

So I would use a 16K volblocksize, which in theory should double the space your zvols can use. But keep in mind that this doesn't take block-level compression or deduplication into account, so real-world results may differ a bit.

If you want to run Postgres DBs with their 8K blocksize, you should consider using a striped mirror instead, as writing 8K blocks to a zvol with a 16K volblocksize will halve the performance.
 
@Dunuin So in order to fix this issue in my environment, I should change the volblocksize. But can I just change it in the storage configuration of my current pools? How does this affect my existing VMs?

And I imagine that my VM-100, which is now using 30T instead of the 15T that was allocated, will not magically change its zvol size. But is there any trick I can do to fix its size? This VM is very big and this would require a lot of (offline) time if I had to migrate it.
 
What is the implication of using, for example, a 1M volblocksize? That looks like the best option in your spreadsheet in terms of space savings. Would all data be written in chunks of 1M?
 
@Dunuin So in order to fix this issue in my environment, I should change the volblocksize. But can I just change it in the storage configuration of my current pools? How does this affect my existing VMs?
The volblocksize can't be changed later. It can only be set at creation and is read-only afterwards. You need to back up the data, destroy that zvol, recreate the zvol with the correct volblocksize and restore the data to the new zvol. Not sure how that will work with ZFS over iSCSI.
Is PVE or TrueNAS creating the zvols? If that is handled by PVE just like a local ZFS pool, I would change the blocksize of the ZFS storage (Datacenter -> Storage -> YourZFSStorage -> Edit -> Blocksize: 16K). Then I would create a PBS or vzdump backup of that VM and restore it from that backup. PVE will then delete all existing zvols of that VM, create new ones using the 16K volblocksize and fill them with the data from the backup.
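A possible sequence on the PVE side, assuming PVE creates the zvols and your storage type accepts the blocksize option; the backup storage name 'backup-store' and the archive path are placeholders:

Code:
# Set the blocksize used for newly created zvols on that storage
# (same setting as Datacenter -> Storage -> Edit -> Blocksize):
pvesm set san01-datastore01 --blocksize 16k

# Back up the VM ...
vzdump 100 --storage backup-store --mode stop --compress zstd

# ... then restore it over itself; the restore recreates the zvols with
# the new 16K volblocksize and fills them with the data from the backup:
qmrestore /path/to/vzdump-qemu-100-....vma.zst 100 --force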
And I imagine that my VM-100, which is now using 30T instead of the 15T that was allocated, will not magically change its zvol size. But is there any trick I can do to fix its size? This VM is very big and this would require a lot of (offline) time if I had to migrate it.
Much of that 30T is padding overhead. After increasing the volblocksize to something like 16K it should only use something like 16T or 17T to store those 15T.
What is the implication of using, for example, a 1M volblocksize? That looks like the best option in your spreadsheet in terms of space savings. Would all data be written in chunks of 1M?
Yup, everything will then consume at least 1M, even if you just want to store 4K. So writing 1000x 4K will result in 1G of data instead of just 4M. In general you want the volblocksize to be as low as possible without sacrificing too much storage to padding overhead.
 
