ZFS counts double the space

Hello,

I'm seeing some weird behavior; either I don't fully understand ZFS or there's something wrong with my setup.

I have a couple of VMs, one with a 450GB disk and another with 32GB.



But Proxmox reports 883G used.

After doing some investigation on the server, I see this:




Does anyone have any idea what might be going on? Why does the VM take up twice as much space?

My pool configuration is as follows:

Code:
zpool status -v rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:01:48 with 0 errors on Sun Jun 14 00:25:49 2020
config:

    NAME                                                     STATE     READ WRITE CKSUM
    rpool                                                    ONLINE       0     0     0
      raidz2-0                                               ONLINE       0     0     0
        ata-Samsung_SSD_850_EVO_500GB_S21JNXAG908145B-part3  ONLINE       0     0     0
        ata-Samsung_SSD_850_EVO_500GB_S21JNXAG578103D-part3  ONLINE       0     0     0
        ata-Samsung_SSD_850_EVO_500GB_S21JNXAG570527A-part3  ONLINE       0     0     0
        ata-Samsung_SSD_850_EVO_500GB_S21JNSAG500441R-part3  ONLINE       0     0     0
        ata-Samsung_SSD_850_EVO_500GB_S21JNSAG121223P-part3  ONLINE       0     0     0
        ata-Samsung_SSD_850_EVO_500GB_S2RBNB0J633402B-part3  ONLINE       0     0     0

errors: No known data errors



Code:
pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-7
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1


Thank you!
 
Do you have some files in /rpool/data?

What is the output of zfs list -t all? It's possible that snapshots are using that space.
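
Something along these lines should show it (just a sketch; adjust the dataset path to your setup):

Code:
# list datasets, zvols and snapshots together with their space usage
zfs list -t all -o name,used,usedbysnapshots,referenced -r rpool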
 

Thank you for your time.

Code:
/rpool/data# ls -lha
total 1.0K
drwxr-xr-x 2 root root 2 Jun 12 12:51 .
drwxr-xr-x 4 root root 4 Jun 12 12:51 ..



Thank you
 
Hmm, so I didn't look closely enough the first time, as the majority is used by VM 114.

Can you run zfs get all rpool/data/vm-114-disk-0? That should show where that space is used. If possible, please copy and paste the output into [code] tags instead of screenshots, as it is easier to copy certain parts from it :)
 

Sure!

Code:
zfs get all rpool/data/vm-114-disk-0
NAME                      PROPERTY              VALUE                  SOURCE
rpool/data/vm-114-disk-0  type                  volume                 -
rpool/data/vm-114-disk-0  creation              Wed Jun 17  1:26 2020  -
rpool/data/vm-114-disk-0  used                  845G                   -
rpool/data/vm-114-disk-0  available             913G                   -
rpool/data/vm-114-disk-0  referenced            845G                   -
rpool/data/vm-114-disk-0  compressratio         1.04x                  -
rpool/data/vm-114-disk-0  reservation           none                   default
rpool/data/vm-114-disk-0  volsize               450G                   local
rpool/data/vm-114-disk-0  volblocksize          8K                     default
rpool/data/vm-114-disk-0  checksum              on                     default
rpool/data/vm-114-disk-0  compression           on                     inherited from rpool
rpool/data/vm-114-disk-0  readonly              off                    default
rpool/data/vm-114-disk-0  createtxg             76379                  -
rpool/data/vm-114-disk-0  copies                1                      default
rpool/data/vm-114-disk-0  refreservation        none                   default
rpool/data/vm-114-disk-0  guid                  5621535973689180047    -
rpool/data/vm-114-disk-0  primarycache          all                    default
rpool/data/vm-114-disk-0  secondarycache        all                    default
rpool/data/vm-114-disk-0  usedbysnapshots       0B                     -
rpool/data/vm-114-disk-0  usedbydataset         845G                   -
rpool/data/vm-114-disk-0  usedbychildren        0B                     -
rpool/data/vm-114-disk-0  usedbyrefreservation  0B                     -
rpool/data/vm-114-disk-0  logbias               latency                default
rpool/data/vm-114-disk-0  objsetid              2565                   -
rpool/data/vm-114-disk-0  dedup                 off                    default
rpool/data/vm-114-disk-0  mlslabel              none                   default
rpool/data/vm-114-disk-0  sync                  standard               inherited from rpool
rpool/data/vm-114-disk-0  refcompressratio      1.04x                  -
rpool/data/vm-114-disk-0  written               845G                   -
rpool/data/vm-114-disk-0  logicalused           443G                   -
rpool/data/vm-114-disk-0  logicalreferenced     443G                   -
rpool/data/vm-114-disk-0  volmode               default                default
rpool/data/vm-114-disk-0  snapshot_limit        none                   default
rpool/data/vm-114-disk-0  snapshot_count        none                   default
rpool/data/vm-114-disk-0  snapdev               hidden                 default
rpool/data/vm-114-disk-0  context               none                   default
rpool/data/vm-114-disk-0  fscontext             none                   default
rpool/data/vm-114-disk-0  defcontext            none                   default
rpool/data/vm-114-disk-0  rootcontext           none                   default
rpool/data/vm-114-disk-0  redundant_metadata    all                    default
rpool/data/vm-114-disk-0  encryption            off                    default
rpool/data/vm-114-disk-0  keylocation           none                   default
rpool/data/vm-114-disk-0  keyformat             none                   default
rpool/data/vm-114-disk-0  pbkdf2iters           0                      default
root@cloud04:/rpool/data#
 
Either I don't fully understand ZFS or there's something wrong with my setup

It's the raidz2, which breaks the intuitive understanding of how much space is used. We have discussed raidz space usage a couple of times on the forums. It's best to use striped mirrors, which report the space as you would expect.
 


Thanks for your response, but...

So the usable space is not really what ZFS says is usable, because writes end up counting for more?

I've read practically all the posts in the forum about ZFS and disk consumption, but none of them left me with a clear understanding of the correct usage.

Thanks for responding. I'm going a little crazy with this; I see my 400GB VM using 800GB and I don't understand anything :)
 

You're not alone. I also did not find a definitive answer, or any answer at all. I read all of the other threads too, and I still don't know what the on-disk layout of a raidz stripe looks like, which is what gets blamed for this. Yet I do know that the space requirements and calculations are much easier with striped mirrors.

The smaller influences are things like metadata (some of it stored multiple times), the pointer tree of all blocks including checksums, etc.
 
raidz needs to store parity information somewhere (in blocks of size at least the pool's ashift, which nowadays is usually 12, so 4k). zvols by default use a very small blocksize (8k). if you write a single 8k block to the zvol and use raidz2, you need to tolerate the loss of 2 disks and still recover this write. so you need to write two parity blocks of 4k each to different disks than the data block. so the total write is 8k+4k+4k = 16k. this is all a bit oversimplified (e.g., metadata, compression, write aggregation, ...) so it's not strictly a 2x overhead, but it's the basic cause of this issue. regular datasets don't have this issue as much, since they have a variable record size (the 'recordsize' property just gives the upper limit, but ZFS can use anything from ashift to that recordsize).
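
to make that simplified math concrete (a rough sketch, reusing the pool and zvol names from this thread):

Code:
# ashift=12 (4k sectors), raidz2 => 2 parity blocks per data block written:
#   volblocksize  8k:  8k data + 4k parity + 4k parity = 16k written  -> ~2.0x
#   volblocksize 16k: 16k data + 4k parity + 4k parity = 24k written  -> ~1.5x
# properties worth checking on your own pool/zvol:
zpool get ashift rpool
zfs get volblocksize,used,logicalused,compressratio rpool/data/vm-114-disk-0

which lines up with the 845G used vs. 443G logicalused reported earlier.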

there are basically three ways to address this:
- increase volblocksize and tune guest accordingly (this improves the data : parity ratio, and thus reduces overhead, but might not work for all use cases - you don't want to pass the hot potato to the guest and have lots of write amplification there as that will hurt performance!) - see the sketch at the end of this post
- use mirror instead of raidz (less space overhead, better performance, but different failure characteristics - for raidz2, any 2 disks can fail, for mirror or stripe of mirrors, 1-N disks can fail depending on which disks fail)
- use ashift=9 (probably bad for performance and life-expectancy of your SSDs, bad for future upgradability as it cannot be changed)

as you can probably tell I'd recommend either of the first two ;)
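
for the first option, the default volblocksize of newly created disks in PVE comes from the storage's blocksize setting - a rough sketch (assuming the ZFS storage is called 'local-zfs'; adjust to your storage name):

Code:
pvesm set local-zfs --blocksize 16k
# this only affects zvols created afterwards - existing disks keep their volblocksize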
 
I have the same problem.
Did i get this right?
Raidz2: I have 10 TB (5x2TB SSDs).
After parity etc., about 5 TB are left.

Now my VM consumes double the space:
For example, a 300 GB VM image uses 600 GB of space.
I thought this was part of the "parity loss" and not on top of it.

So out of the 10 TB I get a usable space of 2.5 TB.

And this is OK? Or did I miss something?

Read a lot, but didn't get a clear answer.

Can someone help me?
 
yeah, this is to be expected when combining raidz with zvols with small block sizes.
 
Ok thanks for your answer.

I changed the block size to 16k, now everything seems to be right with the vm size.

Are there any problems to be expected with 16k block size?
 
I changed the block size to 16k, now everything seems to be right with the vm size.
Keep in mind that the volblocksize can only be set at creation of a zvol, so 16K will only be used for newly created zvols. If you already have existing zvols with an 8K volblocksize, you need to destroy and recreate them.
Are there any problems to be expected with 16k block size?
Probably more write amplification and overhead.
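
To illustrate the volblocksize point above (hypothetical zvol name, just to show the behaviour):

Code:
zfs create -V 300G -b 16K rpool/data/vm-999-disk-0   # new zvol created with 16K volblocksize
zfs get volblocksize rpool/data/vm-999-disk-0        # -> 16K
zfs set volblocksize=16K rpool/data/vm-114-disk-0    # fails on an existing zvol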
 
I destroyed a VM and restored it onto a new 16k zvol; after that: 300 GB instead of 600 GB. Seems OK so far.

When I use an 8K block size and ZFS always writes double the disk space that is actually needed, don't I also have big overhead
and SSD wearout?

Is more write amplification and overhead at 16k really a problem or just a fact?
Because when I only get 2.5 TB out of 10 TB, that is a real problem, or at least no good value for money.

Sorry, just confused, but thankful for further explanation :)
 
space usage (overhead) vs. read/write amplification are two separate issues. if you increase the volblocksize, the former goes down - the ratio of data to parity data gets better, but the latter increases since small reads/writes will now need to read/write 16k of data instead of 8k, which usually means performance will go down (since you waste time reading/writing data you are not even interested in).
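
as a small illustration (simplified, assuming an uncached small random write from the guest):

Code:
# guest writes 4k into the middle of a volblock:
#   volblocksize  8k -> ZFS reads and rewrites an  8k block (plus its parity)
#   volblocksize 16k -> ZFS reads and rewrites a 16k block (plus its parity)
# bigger volblocksize: better data/parity ratio, but more data touched per small I/O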
 
OK, thanks for the explanation. I think I understand. But only about 25-30% of really usable space is not an option - is it, for anybody?

So the options are (as you partly mentioned above):

live with 16k and not-so-good performance
(really bad? seems OK in my test environment) (and maybe earlier SSD failure?)

Or use ZFS RAID10, with the issue that only the "right" combination of disks may fail... also not so "sexy"

Or is there any best practice I missed?

Again thanks for your time
 

pretty much those are the options - which one is the right one for your use case is up to you of course.
 
If you want the performance of a raid10 (striped mirror) and the "any 2 drives may fail" property of a raid6 (raidz2), you could also stripe some 3-way mirrors. But you would lose 66% of your capacity too.
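
A rough sketch of such a layout (hypothetical disk names):

Code:
# stripe of two 3-way mirrors: any two disks may fail, usable capacity = 2 of 6 disks
zpool create tank mirror sda sdb sdc mirror sdd sde sdf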
 