ZFS Issues - trim/pve-zsync

soapee01

Well-Known Member
Sep 7, 2016
I'm seeing some odd things happen on a VM. I've been doing auto snapshots with cv4pve-autosnap, then copying the data across to another location using pve-zsync.

The running VM uses rpool/data/vm-100-disk-0; however, cv4pve-autosnap is taking snapshots on a dataset that appears to be left over from a previous snapshot restore, rpool/data/vm-100-state-BeforeOPUD.

This in turn is causing pve-zsync to take for freaking ever to copy. Days. Historically this ran in a few hours.
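For reference, the copy job is just a periodic pve-zsync sync along these lines (a rough sketch; the destination host, pool and snapshot count here are placeholders, not my real values):
Code:
# sync VM 100's disks to a pool on another host; --maxsnap keeps the last N sync snapshots
pve-zsync sync --source 100 --dest 192.168.1.2:tank/backup --maxsnap 7 --verbose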

I've deleted all snapshots, so I can start fresh, and rebooted the server.

Additionally, this is a Windows Server 2012 R2 VM, and the space shown in ZFS greatly exceeds the space used inside the guest: ZFS shows 1.19TB used, but Windows shows 429GB used.

I worry about deleting what I believe is the unused dataset. Is there a good way to verify that it actually isn't being used? Can the two be combined if they are?
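For what it's worth, these are the checks I can think of running first (just a sketch; I'm assuming the state volume would only ever be referenced from a snapshot section of the VM config or as a clone origin):
Code:
# does the VM config (or any of its snapshot sections) still reference the state volume?
grep -n 'vm-100-state' /etc/pve/qemu-server/100.conf
qm listsnapshot 100
# does anything on the pool still reference it (clone origin, snapshots)?
zfs get origin,creation,used rpool/data/vm-100-state-BeforeOPUD
zfs list -t all -r rpool/data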

More information below.

PVE config
Code:
cat /etc/pve/qemu-server/100.conf


balloon: 0
boot: dcn
bootdisk: virtio0
cores: 6
memory: 32768
name: COMPANY_VM
net0: virtio=EA:CE:C6:F1:13:EB,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win8
parent: BeforeOPUD
protection: 1
scsihw: virtio-scsi-pci
smbios1: uuid=cd6176e8-9f99-4cbe-a263-8fa5ea79590a
sockets: 2
startup: order=2
virtio0: local-zfs:vm-100-disk-0,size=1000G
vmgenid: 4f2bf1b3-01e4-4c28-b690-02918fb33952


ZFS Space:
Code:
~# zfs list -o space
NAME                                AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rpool                               3.27T  1.53T        0B    222K             0B      1.53T
rpool/ROOT                          3.27T  19.0G        0B    205K             0B      19.0G
rpool/ROOT/pve-1                    3.27T  19.0G        0B   19.0G             0B         0B
rpool/data                          3.27T  1.51T        0B    205K             0B      1.51T
rpool/data/vm-100-disk-0            3.27T  1.19T        0B   1.19T             0B         0B
rpool/data/vm-100-state-BeforeOPUD  3.27T  53.9G        0B   53.9G             0B         0B

Snapshots:
Code:
# zfs list -t snapshot
no datasets available

Trying a trim to get some space back:
Code:
zpool trim rpool
cannot trim: no devices in pool support trim operations
autotrim is enabled:
Code:
zpool get autotrim rpool
NAME   PROPERTY  VALUE     SOURCE
rpool  autotrim  on        local
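Whether the underlying devices accept discard at all can also be checked from the host; zeroes in the DISC-GRAN/DISC-MAX columns mean a device does not support TRIM/UNMAP (a quick check, nothing assumed beyond lsblk being available):
Code:
lsblk --discard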
 
What is your pool setup? If you are using, for example, raidz with the default 8K volblocksize, it is not unusual for everything on the pool to need double the space.
 
What is your pool setup? If you are using, for example, raidz with the default 8K volblocksize, it is not unusual for everything on the pool to need double the space.
Code:
zfs get recordsize
NAME                                PROPERTY    VALUE    SOURCE
rpool                               recordsize  128K     default
rpool/ROOT                          recordsize  128K     default
rpool/ROOT/pve-1                    recordsize  128K     default
rpool/data                          recordsize  128K     default
rpool/data/vm-100-disk-0            recordsize  -        -
rpool/data/vm-100-state-BeforeOPUD  recordsize  -        -
rpool/data/vm-101-disk-0            recordsize  -        -
rpool/data/vm-102-disk-0
 
Code:
zfs get recordsize
NAME                                PROPERTY    VALUE    SOURCE
rpool                               recordsize  128K     default
rpool/ROOT                          recordsize  128K     default
rpool/ROOT/pve-1                    recordsize  128K     default
rpool/data                          recordsize  128K     default
rpool/data/vm-100-disk-0            recordsize  -        -
rpool/data/vm-100-state-BeforeOPUD  recordsize  -        -
rpool/data/vm-101-disk-0            recordsize  -        -
rpool/data/vm-102-disk-0
Bah, wrong setting. It is 8K:
Code:
 zfs get volblocksize
NAME                                PROPERTY      VALUE     SOURCE
rpool                               volblocksize  -         -
rpool/ROOT                          volblocksize  -         -
rpool/ROOT/pve-1                    volblocksize  -         -
rpool/data                          volblocksize  -         -
rpool/data/vm-100-disk-0            volblocksize  8K        default
rpool/data/vm-100-state-BeforeOPUD  volblocksize  8K        default
rpool/data/vm-101-disk-0            volblocksize  8K        default
rpool/data/vm-102-disk-0            volblocksize  8K        default
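For completeness, the pool's ashift can be checked as well, since it affects how zvol blocks are laid out on raidz:
Code:
zpool get ashift rpool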
 
But is it just a mirror/striped mirror, or some kind of raidz?
raidz2

Code:
zpool status
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:40:31 with 0 errors on Sun May  9 01:04:32 2021
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            scsi-35000c5003ea1e0c7-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a2d-part3  ONLINE       0     0     0
            scsi-35000c5003ea21aa1-part3  ONLINE       0     0     0
            scsi-35000c5003ea1f1b2-part3  ONLINE       0     0     0
            scsi-35000c5003ea21ccd-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a60-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a44-part3  ONLINE       0     0     0
            scsi-35000c5003ea1f98f-part3  ONLINE       0     0     0
 
PVE config
Code:
cat /etc/pve/qemu-server/100.conf


balloon: 0
boot: dcn
bootdisk: virtio0
cores: 6
memory: 32768
name: COMPANY_VM
net0: virtio=EA:CE:C6:F1:13:EB,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win8
parent: BeforeOPUD
protection: 1
scsihw: virtio-scsi-pci
smbios1: uuid=cd6176e8-9f99-4cbe-a263-8fa5ea79590a
sockets: 2
startup: order=2
virtio0: local-zfs:vm-100-disk-0,size=1000G
vmgenid: 4f2bf1b3-01e4-4c28-b690-02918fb33952
You have chosen VirtIO SCSI as the storage controller, but your disk is "virtio0" and not "scsi0". SCSI supports discard, but not all of the other bus types do. You also need to enable discard in the VM's disk settings, which isn't done here. If you don't enable it, the pool can't receive TRIM commands from the guest and thin provisioning won't work.
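A rough sketch of what that change could look like from the CLI, assuming the disk has already been detached from virtio0 first (via the GUI or by editing 100.conf); the ssd=1 flag is optional:
Code:
qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 local-zfs:vm-100-disk-0,discard=on,ssd=1
qm set 100 --bootdisk scsi0
qm config 100 | grep -E '^(scsihw|scsi0|bootdisk)'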
 
raidz2

Code:
zpool status
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:40:31 with 0 errors on Sun May  9 01:04:32 2021
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            scsi-35000c5003ea1e0c7-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a2d-part3  ONLINE       0     0     0
            scsi-35000c5003ea21aa1-part3  ONLINE       0     0     0
            scsi-35000c5003ea1f1b2-part3  ONLINE       0     0     0
            scsi-35000c5003ea21ccd-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a60-part3  ONLINE       0     0     0
            scsi-35000c5003ea21a44-part3  ONLINE       0     0     0
            scsi-35000c5003ea1f98f-part3  ONLINE       0     0     0
Look at this table. With an 8K volblocksize and 8 drives in raidz2 you are losing 2/3 of your capacity if using an ashift of 12. In other words, everything you write to a zvol wastes an additional 125% of space because of bad padding.

Edit:
So you have 8x 800GB drives, 6400GB of raw capacity in total. Theoretically you only lose 1600GB for parity data, so ZFS will tell you that you have 4800GB of usable space. But because of bad padding, everything you write to the virtual disks will be 125% bigger. So if you write 2133GB of data, the bad padding will waste an additional 2667GB of space and you reach your 4800GB pool limit.

And you should keep in mind that your pool will switch to panic mode if it gets more than 90% full, and it will get slow if you use more than 80% of the capacity. So in reality you only have about 1706GB of usable capacity, not 4800GB.

If you don't want to waste such an amount of space due to bad padding, you should increase the volblocksize, or switch from raidz2 to a striped mirror setup, which also performs better as VM storage.
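To see how much of the used space is padding rather than data the guest actually wrote, the zvol's logical and allocated sizes can be compared (compression also shows up here):
Code:
zfs get used,logicalused,referenced,logicalreferenced,compressratio,volblocksize rpool/data/vm-100-disk-0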
 
Last edited:
You have chosen VirtIO SCSI as the storage controller, but your disk is "virtio0" and not "scsi0". SCSI supports discard, but not all of the other bus types do. You also need to enable discard in the VM's disk settings, which isn't done here. If you don't enable it, the pool can't receive TRIM commands from the guest and thin provisioning won't work.
I'm confused here. I can see the option to enable discard (and I guess I should do that), but virtio disks are recommended everywhere I look for Windows performance. I'm not seeing anything suggesting that the Windows virtio drivers don't support this, but perhaps I'm reading the wrong things. Do you have more information on this? Or does setting the disk to virtio0 override virtio-scsi and use virtio-blk instead?

I have another guest OS that I don't care about here (Win10), so I suppose I can experiment on that one. Copy some large file over, delete it, run a trim, and see if the space goes back down with something like:

Code:
Optimize-Volume -DriveLetter C -ReTrim -Verbose
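And from the Proxmox host, watch the zvol before and after the retrim to see whether the freed blocks actually come back (a sketch; which zvol belongs to the Win10 test VM is my guess):
Code:
# run before and after the in-guest retrim; USED should drop if discard reaches the pool
zfs list -o space rpool/data
zfs get -p used,logicalused rpool/data/vm-101-disk-0   # assuming vm-101 is the Win10 test VM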
 
Last edited:
Look at this table. With an 8K volblocksize and 8 drives in raidz2 you are losing 2/3 of your capacity if using an ashift of 12. In other words, everything you write to a zvol wastes an additional 125% of space because of bad padding.
That's interesting. It looks like I'll be rebuilding the machine during a maintenance window. :-/

It looks like if I had set ashift=0 instead of 12, everything might have been aligned properly (from what I've read today, ashift=0 lets ZFS autodetect the correct sector size), but I don't know how well this works in practice. It's probably better to figure it out manually.
 
If you want thin provisioning you should switch from "VirtIO Block" to "VirtIO SCSI" so discard works. Discard on "VirtIO Block" is a really new feature and I don't know if Proxmox already supports it. If it's greyed out, it isn't supported yet.

For ashift you should look at your physical disks. If they have a logical blocksize of 4K it should be ashift=12, and with a logical blocksize of 512B it should be ashift=9. With a logical blocksize of 4K you need a much bigger volblocksize (at least 16K) to minimize the padding overhead.
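The logical/physical sector sizes can be read straight from the host, for example (the device name is only an example):
Code:
lsblk -o NAME,LOG-SEC,PHY-SEC
smartctl -i /dev/sda | grep -i 'sector size'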
 
Last edited:
First of all Dunuin, thank you very much for the pointers. I greatly appreciate the help.

I just wanted to report back: I tested this on a Win10 Pro virtual machine first, then did it on Server 2012 R2 (two VMs).
1. Added a scsi1 disk to Windows (to make sure the driver was there). I tried without that first, and Windows refused to boot.
2. Changed the device from virtio0 to scsi0 via the CLI (a rough sketch of the CLI side is below the list).
3. Set the scsi0 disk to use discard and set cache to write back.
4. Removed scsi1 via the Proxmox GUI and deleted it.
5. Booted Windows.
6. Via PowerShell, ran Optimize-Volume -DriveLetter C -ReTrim -Verbose.
7. Double-checked that thin provisioning shows up in the disk properties.
8. Checked the space.
9. Added SSD emulation and rebooted the VM (didn't get back any more space).
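Roughly, the CLI side of steps 1-4 looked like this (a sketch from memory; the temporary disk size is arbitrary and the rename in 100.conf was done by hand in an editor):
Code:
# 1. temporary 1G scsi1 disk so Windows loads the virtio-scsi driver
qm set 100 --scsi1 local-zfs:1
# 2./3. with the VM stopped, rename the virtio0 line to scsi0 in /etc/pve/qemu-server/100.conf
#       and add the options, ending up with something like:
#         scsi0: local-zfs:vm-100-disk-0,cache=writeback,discard=on,size=1000G
#       and change "bootdisk: virtio0" to "bootdisk: scsi0"
# 4. detach the temporary disk (the leftover unused volume was then removed via the GUI)
qm set 100 --delete scsi1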

I started with one VM using 1.13TB of storage, and it's now down to 762GB on ZFS. This is better, but the Optimize-Volume report in Windows says it's only using 428.01GB. There are no shadow volume copies on this drive.

Is there anything I can do to further squash some of this unused space? I still don't have any snapshots per se, but zfs list still has me using two datasets for this VM (months ago I restored to a snapshot that's long since been removed).

Code:
NAME                                 USED  AVAIL     REFER  MOUNTPOINT
rpool                               1.06T  3.74T      222K  /rpool
rpool/ROOT                          19.0G  3.74T      205K  /rpool/ROOT
rpool/ROOT/pve-1                    19.0G  3.74T     19.0G  /
rpool/data                          1.04T  3.74T      205K  /rpool/data
rpool/data/vm-100-disk-0             762G  3.74T      762G  -
rpool/data/vm-100-state-BeforeOPUD  53.9G  3.74T     53.9G  -
 
I started with one VM using 1.13TB of storage, and it's now down to 762GB on ZFS. This is better, but the Optimize-Volume report in Windows says it's only using 428.01GB. There are no shadow volume copies on this drive.
Is there anything I can do to further squash some of this unused space?
Did you change the volblocksize to 16K? Without that, everything written to the virtual disk will be much bigger on the pool. The volblocksize can only be set when a virtual disk is created, so every virtual disk needs to be destroyed and recreated. The easiest way to do this is to create a backup, destroy the VM, and restore it from that backup. To change the volblocksize for newly created virtual disks, go to Datacenter -> Storage -> YourZFSstorage -> Edit and replace the default 8K "Block Size" with, for example, "16K".
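For reference, the same steps could roughly be done from the CLI (a sketch; the storage and backup target names are placeholders, and note that the VM's "protection: 1" flag has to be cleared before it can be destroyed):
Code:
# equivalent of Datacenter -> Storage -> Edit: use 16K for newly created zvols on this storage
pvesm set local-zfs --blocksize 16k
# back up, destroy and restore the VM so its disks are recreated with the new volblocksize
vzdump 100 --storage backupstore --mode stop
qm set 100 --protection 0
qm destroy 100
qmrestore /path/to/vzdump-qemu-100.vma.zst 100 --storage local-zfs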
I still don't have any snapshots per se, but zfs list still has me using two datasets for this VM (months ago I restored to a snapshot that's long since been removed).

Code:
NAME                                 USED  AVAIL     REFER  MOUNTPOINT
rpool                               1.06T  3.74T      222K  /rpool
rpool/ROOT                          19.0G  3.74T      205K  /rpool/ROOT
rpool/ROOT/pve-1                    19.0G  3.74T     19.0G  /
rpool/data                          1.04T  3.74T      205K  /rpool/data
rpool/data/vm-100-disk-0             762G  3.74T      762G  -
rpool/data/vm-100-state-BeforeOPUD  53.9G  3.74T     53.9G  -
I'm not sure if you can delete the "vm-100-state-BeforeOPUD".
 
Did you change the volblocksize to 16K? Without that, everything written to the virtual disk will be much bigger on the pool. The volblocksize can only be set when a virtual disk is created, so every virtual disk needs to be destroyed and recreated. The easiest way to do this is to create a backup, destroy the VM, and restore it from that backup. To change the volblocksize for newly created virtual disks, go to Datacenter -> Storage -> YourZFSstorage -> Edit and replace the default 8K "Block Size" with, for example, "16K".
Unfortunately I installed Proxmox on the same pool, so that will have to wait for a full reinstall. Normally I install on a RAID 1 mdadm disk, but on this server I decided to give the full ZFS install a go. Live and learn.

Based on the numbers you were showing and other research I expected the size to be between 500 and 600GB due to the 8k inefficiency. Perhaps it's just greater in my case.

Either way, it's 400GB less that I'm copying across a 100Mbps point-to-point link, so that's significantly better. Thank you.
 
Unfortunately I installed Proxmox on the same pool, so that will have to wait for a full reinstall. Normally I install on a RAID 1 mdadm disk, but on this server I decided to give the full ZFS install a go. Live and learn.
That should be no problem. Volblocksize is only used for virtual disks (zvols), not for datasets. Datasets (like the one your Proxmox is installed on) use a recordsize of 128K instead. And you don't need to destroy the pool or the datasets, just the zvols.
Based on the numbers you were showing and other research I expected the size to be between 500 and 600GB due to the 8k inefficiency. Perhaps it's just greater in my case.
In theory your 428GB of data should create 535GB of padding overhead, so 963GB of space used on the pool. It might be that compression is saving you some space if only 762GB are used. With a volblocksize of 16K you should be able to bring the used space below 500GB.
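For what it's worth, the arithmetic behind that estimate, under the assumptions above (8-wide raidz2, ashift=12, 8K volblocksize):
Code:
# each 8K block = two 4K data sectors + two parity sectors, padded to a multiple of 3 -> 6 sectors (24K raw)
# with the 6/8 raidz2 deflation ratio this is charged as 18K of "usable" space, i.e. 2.25x the logical data
echo '428 * 2.25' | bc    # ~963 GB expected on the pool before compression
echo '428 * 1.25' | bc    # ~535 GB of that is padding overhead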
 
Last edited:
