ProxMox Implementation of ZFS o_O

ErkDog
May 10, 2020
So ProxMox appears to employ some sort of logic or methodology when creating ZFS pools or volumes that consumes SIGNIFICANTLY more space or provides SIGNIFICANTLY less capacity than it should.



I have 8 x 8TB drives in RaidZ1. We'll round each one down to 7TB to more than account for the fuzzy hard drive math HDD buttheads have used for a hundred years.



7 * 7 = 49TB (cause -1 for Z1)



But this:

https://pastebin.com/p3aihnrs (Had to pastebin because post was "greater than 10000 characters")

So the Master pool says 47.5T is "used" as expected when we create a 47.5T volume under it.

The logical used says 23T, which makes sense because Windows is showing this volume has 22.6TB of data on it.

However, when we look at vm-100-disk-0, the actual volume, it says 47.5T is used, 8.79T is 'available', and 38.7T is 'referenced'.

I had to tweak the refreservation values because ProxMox refused to allow me to create a volume anywhere near the 47T I -should- be able to create in a RaidZ1 volume.

This says "dataset" is using 38.7T, and that 8.7T is "available" .... HOW HOW HOW HOW HOW? In previous ZFS Systems I've run, I plain got the X*Y-1 capacity. Based on what I'm reading here, I can only write 8.7T more data to this volume, even though it SHOULD have 23T free because only 23T of the 46T is in use.

Can someone please help me understand this insanity, before what should be my 49T volume crashes like the other one I've been talking about in this thread?

Is there some setting or something I can adjust or tweak to stop ProxMox's implementation of ZFS pools and volumes from consuming significantly more space than actually exists?!

No implementation of ZFS I've dealt with to date has suffered from this confusing, vague, and seemingly senseless consumption of excess space inside the pools/volumes.
 
Regarding the referenced space: did you create snapshots? Can you post the output of "zpool list", "zfs list" and "zfs list -t snapshot"?

You could thin provision for now and create the volume using the -s argument.
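For reference, a rough sketch of what that looks like on the command line (the new volume name below is just a placeholder); -s skips the refreservation, so space is only consumed as blocks are actually written:

Code:
# create a sparse (thin-provisioned) 47T zvol, no refreservation
zfs create -s -V 47T MECH56T/vm-100-disk-1

# an existing zvol can be made effectively sparse by dropping its reservation
zfs set refreservation=none MECH56T/vm-100-disk-0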
 
So ProxMox appears to employ some sort of logic or methodology when creating ZFS pools or volumes that consumes SIGNIFICANTLY more space or provides SIGNIFICANTLY less capacity than it should.

No, Proxmox VE does not. It's stock OpenZFS with some backported patches. What previous implementations did you use?

In previous ZFS Systems I've run, I plain got the X*Y-1 capacity.

In the beginning, the formula holds. Everything depends on the metadata, snapshots, send/receive with different ashift values, etc. Even with refreservation, the consumption of space will increase as you fill up the dataset. In my experience, that was always the case, at least in OpenZFS. You cannot predict the free space, and that makes it very hard and unintuitive. Unfortunately, I haven't found a logic behind this yet.

If this is truly a problem of OpenZFS, we will also see this in the future on FreeBSD and eventually in FreeNAS.
 
In addition to the usual 'snapshots cause additional space to be reserved' issue that people often misinterpret, you also have to keep in mind that zvols can waste a lot of space with raidz if their volblocksize and the pool's ashift don't align well. If you search this forum and other sites, you will find lots of hits for this. Consider using a bigger volblocksize to trade off less used space against more churn/less performance.
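To see whether that is what is happening here, the pool's ashift and the zvol's current volblocksize can be checked with something like this (names taken from the setup above):

Code:
# show the pool's ashift and the zvol's block size
zpool get ashift MECH56T
zfs get volblocksize MECH56T/vm-100-disk-0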
 
Here's the thing: I never configured snapshots, or backups, or anything like that. So if you're saying, @fabian, that by default ProxMox employs snapshots and creates a misinterpretation of available space, perhaps that's a bad default behavior, and/or it should be more clearly explained and be a selectable option to waste hard drive space with 'snapshots'.

I don't want or need snapshots or backups. I back up all of the data and configs in my virtual machines to backblaze b2 buckets. It's a trivial task to restore configurations and data to any other given set of hardware or virtual machine if something bad happens.

My experience with "block level OS Backups" for the last 20 years has been poor at best. Therefore I simply backup data, and configurations themselves, and just reload things from scratch if something bad happens.

To answer your question about my previous ZFS Experience, it WAS Actually FreeNAS that I was using.

I put in 6 x 8TB drives and had 5 x 8TB of space on the volume I created, end of story.
That's how ProxMox should operate by default. When I first provisioned the 8 x 8TB ZFS RaidZ1 pool, it was giving me less than 5 x 8TB of space available for volumes until I messed around with reservation sizes. I went from FreeNAS to ProxMox, added TWO MORE 8TB drives, and had LESS usable space!

Here is the output of the commands requested.

Code:
root@pmox:~# zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
MECH56T    58.2T  46.0T  12.3T        -         -     0%    78%  1.00x    ONLINE  -
SIXTBSATA  7.45T  7.22T   239G        -         -     3%    96%  1.00x    ONLINE  -
SIXTBZ     7.27T  3.74T  3.52T        -         -    15%    51%  1.00x    ONLINE  -
rpool       232G  16.8G   215G        -         -     0%     7%  1.00x    ONLINE  -
Code:
root@pmox:~# zfs list
NAME                      USED  AVAIL     REFER  MOUNTPOINT
MECH56T                  47.5T     0B      162K  /MECH56T
MECH56T/vm-100-disk-0    47.5T  8.79T     38.7T  -
SIXTBSATA                5.25T  1.48M      140K  /SIXTBSATA
SIXTBSATA/vm-100-disk-0  5.25T  1.48M     5.25T  -
SIXTBZ                   4.02T  1.09T      140K  /SIXTBZ
SIXTBZ/vm-100-disk-0      660G  1.09T      660G  -
SIXTBZ/vm-101-disk-0     2.50T  1.52T     2.07T  -
SIXTBZ/vm-102-disk-0      900G  1.96T     6.44G  -
rpool                    16.8G   208G      104K  /rpool
rpool/ROOT               16.8G   208G       96K  /rpool/ROOT
rpool/ROOT/pve-1         16.8G   208G     16.8G  /
rpool/data                 96K   208G       96K  /rpool/data
Code:
root@pmox:~# zfs list -t snapshot
no datasets available
root@pmox:~#
and zpool status
Code:
root@pmox:~# zpool status
  pool: MECH56T
 state: ONLINE
  scan: scrub repaired 0B in 0 days 17:01:04 with 0 errors on Tue May 12 05:02:04 2020
config:

        NAME        STATE     READ WRITE CKSUM
        MECH56T     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdi     ONLINE       0     0     0

errors: No known data errors

  pool: SIXTBSATA
 state: ONLINE
  scan: scrub repaired 0B in 0 days 02:13:33 with 0 errors on Mon May 11 14:14:23 2020
config:

        NAME        STATE     READ WRITE CKSUM
        SIXTBSATA   ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdj     ONLINE       0     0     0
            sdk     ONLINE       0     0     0
            sdl     ONLINE       0     0     0
            sdm     ONLINE       0     0     0

errors: No known data errors

  pool: SIXTBZ
 state: ONLINE
  scan: scrub repaired 0B in 0 days 01:22:01 with 0 errors on Mon May 11 13:22:46 2020
config:

        NAME                           STATE     READ WRITE CKSUM
        SIXTBZ                         ONLINE       0     0     0
          raidz1-0                     ONLINE       0     0     0
            nvme-eui.6479a72ef17983fc  ONLINE       0     0     0
            nvme-eui.6479a72ef1797ddb  ONLINE       0     0     0
            nvme-eui.6479a72ef17973e9  ONLINE       0     0     0
            nvme-eui.6479a72ef179779f  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:00:14 with 0 errors on Mon May 11 12:00:54 2020
config:

        NAME                                 STATE     READ WRITE CKSUM
        rpool                                ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            nvme-eui.0025385a91510da9-part3  ONLINE       0     0     0
            nvme-eui.0025385a9151097a-part3  ONLINE       0     0     0

errors: No known data errors
root@pmox:~#
 
Please post the output of "qm config 100"

Seems like "discard" is not working for you. Your windows vm is not forwarding delete operations to the underlying storage.

You probably did not enable the "discard" option, and now your proxmox host doesnt know about the free space.

Now to fix your situation:
Go to the VM options -> Hardware -> Hard Disk: check "discard".
Shut down the VM for a few seconds and make sure the change is applied.

Log in to your Windows VM and execute as admin: "fsutil behavior set DisableDeleteNotify 0"
Download SDelete: https://docs.microsoft.com/en-us/sysinternals/downloads/sdelete
Execute SDelete: "sdelete.exe -z c:" (this will forward the free space from the Windows VM to the underlying storage)
ZFS will need some minutes to mark the space as free; you can watch it with "watch zfs list".
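As an optional sanity check, you can verify from inside Windows that delete notifications are actually enabled after that step; it should report DisableDeleteNotify = 0:

Code:
fsutil behavior query DisableDeleteNotify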
 
Well, that seems silly. Why in the world would "Windows" need to "forward delete operations" to "underlying storage"?

If a file gets deleted, the hypervisor should see that and figure it out. Why does it have to do some special extra thing?

Anyway, here's the config, help is appreciated

Code:
root@pmox:~# qm config 100
agent: 1,type=virtio
balloon: 0
bios: seabios
bootdisk: sata1
cores: 8
memory: 98304
name: Bulma
net0: virtio=D6:53:83:30:EF:2B,bridge=vmbr0
net1: virtio=22:02:16:79:8F:5F,bridge=vmbr1
numa: 0
ostype: win10
sata1: SIXTBZ:vm-100-disk-0,cache=none,size=512G,ssd=1
scsi4: ASUSRaidR:vm-100-disk-0,backup=0,cache=none,size=223G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=ac066df1-814e-448d-809f-5541180d8126
sockets: 1
unused0: SIXTBSATA:vm-100-disk-0
vga: std
virtio0: MECH56T:vm-100-disk-0,backup=0,cache=none,size=47000G
vmgenid: 5825d45d-b11a-4798-b6e5-af0665328867
 
Well, that seems silly. Why in the world would "Windows" need to "forward delete operations" to "underlying storage"?

If a file gets deleted, the hypervisor should see that and figure it out. Why does it have to do some special extra thing?

Windows has its own filesystem, NTFS; if you delete a file, it just gets marked as deleted in the filesystem's tables.

Windows now knows the space is free and might use it in the future.

How should the underlying storage (ZFS) know that the file got deleted this way?


Your storage has discard disabled; that's the problem you have.

Try the fix I wrote in my last reply and the space should be freed on the host.
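The same change can also be made from the host shell instead of the GUI; roughly like this, re-specifying the disk line from your config above with discard added (adjust if anything in your config differs):

Code:
# re-set the existing disk with the discard flag enabled
qm set 100 --virtio0 MECH56T:vm-100-disk-0,discard=on,cache=none,backup=0,size=47000G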
 
@fabian Just to clarify: Bigger volblocksize uses more space but gives more performance?

No. A bigger volblocksize means that the data-to-parity block ratio is not as skewed, so the overhead of raidz goes down (= "less space used"). It hurts performance, though, if the guest internally uses a smaller block size, since then a (small) guest write will trigger a read, modify, write of a big ZFS block. If your guest is already configured to use big block sizes itself, and your workload matches, it might even improve performance ;) That's why you have to find the sweet spot for your use case and tune the file systems / databases / ... in your guest accordingly.
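In Proxmox VE the volblocksize of newly created zvols comes from the ZFS storage's "blocksize" setting, so tuning it would look roughly like this (the storage ID MECH56T is taken from the config above; 16k is only an example value). Note that existing zvols keep the volblocksize they were created with and would have to be recreated or moved to pick up the new value:

Code:
# make newly allocated disks on this ZFS storage use a 16k volblocksize
pvesm set MECH56T --blocksize 16k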

erkdog said:
Here's the thing: I never configured snapshots, or backups, or anything like that. So if you're saying, @fabian, that by default ProxMox employs snapshots and creates a misinterpretation of available space, perhaps that's a bad default behavior, and/or it should be more clearly explained and be a selectable option to waste hard drive space with 'snapshots'.

I don't want or need snapshots or backups. I back up all of the data and configs in my virtual machines to backblaze b2 buckets. It's a trivial task to restore configurations and data to any other given set of hardware or virtual machine if something bad happens.

I never said any such thing; I did tell you where your used space is most likely coming from though, so please read my post again ;) It seems like you have never worked with zvols before, only plain datasets? The behaviour of recordsize and volblocksize is very different, and the difference between a block device and a file system is also quite big. I suggest you read up on both topics.
 
Well, that seems silly. Why in the world would "Windows" need to "forward delete operations" to "underlying storage"?

If a file gets deleted, the hypervisor should see that and figure it out. Why does it have to do some special extra thing?

No hypervisor I've ever seen "figured this out". This is just a technical limitation, and it's true for Hyper-V, VMware, etc.; all of the hypervisors have tools that do regular unmapping. This is also not a Windows-only problem: no guest OS really frees the space it uses, it just marks it as deleted. Newer operating systems have built-in discard so that they do this for SSDs (TRIM) regularly, and the same can be used for ZFS, but if the disk is not detected as discard-capable inside the VM, it is not used.

To answer your question about my previous ZFS Experience, it WAS Actually FreeNAS that I was using.

So, did you use zvols there, or just file based datasets? There is a huge difference between them.

When I first provisioned the 8 x 8TB ZFS RaidZ1 pool, it was giving me less than 5 x 8TB of space available for volumes until I messed around with reservation sizes.

As I've already said, that is also true for OpenZFS. In the beginning, everything is correct (here 8x 2 TB):

Code:
root@zfs ~ > zpool create -o ashift=12 zpool /dev/sd?

root@zfs ~ > zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zpool  15,9T   480K  15,9T         -     0%     0%  1.00x  ONLINE  -

root@zfs ~ > zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
zpool   396K  15,4T    96K  /zpool
 
Thanks everyone for the suggestions. Should I turn discard on all my hosts then? Just for safety?

The other two are CentOS installations.

Thanks,
Matt
 
Also, sorry, I admittedly missed how to mitigate the problem, @H4R0; I just saw "post config" :-D
 
Yes, it's important to enable discard on every disk if you intend to use thin provisioning and reclaim the space.

In addition to that, for various Linux guests you also have to add the discard option to fstab or to the kernel options to enable it.
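A quick way to check from inside a Linux guest whether the virtual disk even advertises discard (if DISC-GRAN and DISC-MAX show 0, the bus/driver is not passing it through):

Code:
lsblk --discard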
 
Good Gravy H4R0 are you serious? lol. Then why wouldn't "discard" be enabled for every volume created against a zvol anyway?

I hate to be needy, but can you please share how to flag discard in /etc/fstab?

And am I correct in assuming that fstrim in Linux is the equivalent of the SDelete from Sysinternals that you shared?

Thanks,
 
Well, I looked up the flag but saw that's not the best way to do it, and that using fstrim.timer / fstrim.service is better because it doesn't put as much load on the file system, so I -believe- I've enabled those.
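(For reference, what that amounts to on a systemd-based guest is roughly the following, assuming the distro ships the timer unit:)

Code:
# enable the periodic trim job
systemctl enable --now fstrim.timer
# or trim all mounted, discard-capable filesystems once by hand
fstrim -av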

However, I've set up discard, rebooted the system, and observed the cleaning up of free space when deleting files/folders, which is of course the desired effect.

However, "SDelete" doesn't appear to be clearing up any additional file space.

https://puu.sh/FLh4B/b32954bb40.png - So I'm only using a couple hundred gigs on SIXTBZ/vm-100-disk-0

But zfs list still shows this after running sdelete and waiting quite some time:

SIXTBZ/vm-100-disk-0 575G 1.54T

Please advise.
 
I hate to be needy, but can you please share how to flag discard in /etc/fstab?

And am I correct in assuming that fstrim in Linux is the equivalent of the SDelete from Sysinternals that you shared?

Enable the discard option in the VM options on Proxmox and then SSH into your Linux guest.

Edit /etc/fstab
and add "discard,noatime" to the options of your root entry:
Code:
UUID=e23e172c-5bdf-45f9-aeb9-d3b1ec19661d /               ext4    errors=remount-ro,discard,noatime 0       1

I haven't tried fstrim for this purpose.

You can just run the following command on Linux to fix it; it's the equivalent of sdelete.
Just cd to the correct directory beforehand; you can check which filesystem you are on with "df -h .".
It will simply write zeros until the disk is full and then free the space.
Code:
# fill the free space with zeros, then remove the temp file once the disk is full
cat /dev/zero > tmpfile || rm tmpfile
 
Also, never mind about vm-100-disk-0; I just realized the "USED" is the total volume size, and the "REFER" column on the right appears to be how much is actually in use by the FS itself.
 
Well, I looked up the flag but saw that's not the best way to do it, and that using fstrim.timer / fstrim.service is better because it doesn't put as much load on the file system, so I -believe- I've enabled those.

However, "SDelete" doesn't appear to be clearing up any additional file space. zfs list still shows this after running sdelete and waiting quite some time:

SIXTBZ/vm-100-disk-0 575G 1.54T

You should leave it enabled; it doesn't add overhead. If you run trim periodically, that will add overhead.

Your VM has 3 disks attached, each with a different driver...

Please test the sdelete option on all of them and check on which of them it works.

Your drive on SIXTBZ is using the SATA driver, which does not support discard.

Your best option is to use VirtIO SCSI.

MECH56T (virtio): should support discard
ASUSRaidR (SCSI on VirtIO): should support discard
SIXTBZ (SATA): does not support discard
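If you do switch that disk over, it would look roughly like this from the host (the scsi slot number is only an example; since sata1 is also your boot disk, make sure the Windows VirtIO SCSI driver is installed first and update the boot disk afterwards):

Code:
# detach the disk from the SATA bus (it will show up as "unusedX" in the config) ...
qm set 100 --delete sata1
# ... then re-attach it on VirtIO SCSI with discard enabled
qm set 100 --scsi1 SIXTBZ:vm-100-disk-0,discard=on,ssd=1,cache=none
# and point the boot disk at the new slot
qm set 100 --bootdisk scsi1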
 
LOL, shit, of course it doesn't. So basically I have to move that data and recreate it with virtio...

On VM 101 I've got this:
Code:
root@pmox:~# qm config 101
agent: 1,type=virtio
balloon: 0
bios: seabios
bootdisk: sata0
cores: 4
memory: 32768
name: CRM
net0: virtio=F2:F1:62:5F:AD:A6,bridge=vmbr1
numa: 0
ostype: l26
sata0: SIXTBZ:vm-101-disk-0,cache=none,discard=on,size=2548G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=7773017e-338e-4101-a5eb-de42cfe3da7f
sockets: 1
vga: std
vmgenid: 8fe67fd0-bdde-4c38-b24b-68222c6242f9

It's CentOS 7.8; will I have to reload it to respect discard and trim?

Same for this VM:

Code:
root@pmox:~# qm config 102
agent: 1,type=virtio
balloon: 0
bios: seabios
bootdisk: sata0
cores: 9
memory: 93184
name: Goku
net0: virtio=A6:92:0C:58:2D:89,bridge=vmbr1
numa: 0
ostype: l26
sata0: SIXTBZ:vm-102-disk-0,backup=0,cache=none,discard=on,size=869G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=bab463ca-85f4-42a7-ae4b-ea2864527d15
sockets: 1
vga: std
vmgenid: 0acc4ba5-e498-403d-b295-45cef40e7f81

This one is CentOS 8.1... will I have to reload -it- to support trim/discard?

It would be SUPER nice, @fabian, if when you went to create VMs with zvols and stuff, it made a concerted effort to explain all this... don't pick SATA, you should use virtio storage devices, you have to make sure this Windows flag is set, you need to add this to your fstab...

Like how is anyone supposed to just know all this? :(
 
