[SOLVED] ZFS replica ~2x larger than original

Chumblys · Dec 16, 2018

On my setup, a VM disk that takes 9.86T of the ZFS pool on the active host grows to 19.6T when replicated to one of the other hosts. On yet another host, it takes 9.86T as expected. All of them have ZFS "copies=1" set. The data is already compressed (it's a nested zpool inside a zvol).

The only thing that seems to be different on the oddly behaving node 3 is that it's using raidz2 instead of JBOD in the zpool for now. I'm planning to turn nodes 1 and 2 into raidz2 as well, though.

Any idea what's going on and how to fix this?

Code:

NODE 1:
hddtank/vmdata/vm-117-nas-files-0                                 9.86T  1.88T  9.86T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-2_1544980802__  5.66M      -  9.86T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-1_1544981040__     0B      -  9.86T  -

    NAME                 STATE     READ WRITE CKSUM
    hddtank              ONLINE       0     0     0
      sdc1               ONLINE       0     0    16
      sdd1               ONLINE       0     0    48
      sdg1               ONLINE       0     0     0
    logs
      zfs-hddpool-slog   ONLINE       0     0     0
    cache
      zfs-hddpool-l2arc  ONLINE       0     0     0

NODE 2:
hddtank/vmdata/vm-117-nas-files-0                                 9.86T  5.47T  9.86T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-1_1544981040__  2.05M      -  9.86T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-2_1544981100__   984K      -  9.86T  -

    NAME                 STATE     READ WRITE CKSUM
    hddtank              ONLINE       0     0     0
      sdg1               ONLINE       0     0     0
      sde1               ONLINE       0     0     0
      sdb1               ONLINE       0     0     0
    logs
      zfs-hddpool-slog   ONLINE       0     0     0
    cache
      zfs-hddpool-l2arc  ONLINE       0     0     0

NODE 3:
hddtank/vmdata/vm-117-nas-files-0                                    19.6T  1.37T  19.6T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-1_1544981040__     3.28M      -  19.6T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-2_1544981100__        0B      -  19.6T  -

    NAME                 STATE     READ WRITE CKSUM
    hddtank              ONLINE       0     0     0
      raidz2-0           ONLINE       0     0     0
        sda              ONLINE       0     0     0
        sdb              ONLINE       0     0     0
        sdc              ONLINE       0     0     0
        sdd              ONLINE       0     0     0
        sde              ONLINE       0     0     0
        sdh              ONLINE       0     0     0
    logs
      zfs-hddpool-slog   ONLINE       0     0     0
    cache
      zfs-hddpool-l2arc  ONLINE       0     0     0

wolfgang · Dec 17, 2018

Hi,

RaidZ2 means you can lose two disks, so the data must be stored redundantly.
It is therefore normal that you use more space on a RaidZ2 than on a stripe.

P.S. When you run the VM on a stripe, you lose all your data if one disk fails.

Chumblys · Dec 17, 2018

The raidz2 pool has 6 x 6T disks, which should yield 66.6% storage efficiency and 24T of usable storage (36T raw minus 12T for parity).
This is what I actually see:

Code:

NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
hddtank  32.5T  29.5T  3.02T         -     0%    90%  1.00x  ONLINE  -


NAME                                                                  USED  AVAIL  REFER  MOUNTPOINT
hddtank                                                              19.6T  1.34T   192K  /hddtank
hddtank/vmdata                                                       19.6T  1.34T   192K  /hddtank/vmdata
hddtank/vmdata/vm-117-nas-files-0                                    19.6T  1.34T  19.6T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-1_1545031802__     3.76M      -  19.6T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-2_1545031834__        0B      -  19.6T  -

Even with parity, 9.86T on raidz2 should take 14.79T, not 19.6T. And even if there was 4.8T(!) of overhead, the whole pool should still report 12.9T of free space, not 3.02T.

These numbers make no sense. Is someone getting useful efficiency from ZFS + raidz2, or should I maybe try switching to ZFS on Linux mdraid (RAID6)?
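
The ideal-case arithmetic behind those expectations can be sketched in shell. This assumes awk is available, ignores padding and metadata overhead entirely, and uses illustrative function names that are not part of any ZFS tool:

```sh
#!/bin/sh
# Ideal RAIDZ arithmetic only: no padding, no metadata overhead.

# usable_capacity NDISKS DISK_SIZE NPARITY -> capacity left after parity
usable_capacity() {
    awk -v n="$1" -v s="$2" -v p="$3" 'BEGIN { printf "%.2f\n", (n - p) * s }'
}

# raw_usage DATA_SIZE NDISKS NPARITY -> raw space for data plus parity
raw_usage() {
    awk -v d="$1" -v n="$2" -v p="$3" 'BEGIN { printf "%.2f\n", d * n / (n - p) }'
}

echo "usable capacity of 6 x 6T raidz2: $(usable_capacity 6 6 2)T"
echo "raw space for 9.86T of data:      $(raw_usage 9.86 6 2)T"
```

With 6 disks and 2 parity disks this gives the 24T and 14.79T figures quoted above.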

guletz · Dec 17, 2018

What volblocksize do you have for vm-117-nas-files-0? Is ashift the same on all pools (9 or 12)?

Chumblys · Dec 17, 2018

Volblocksize reports 8k and ashift 12 on all the pools:

NAME                               PROPERTY      VALUE  SOURCE
hddtank/vmdata/vm-117-nas-files-0  volblocksize  8K     default

NAME     PROPERTY  VALUE  SOURCE
hddtank  ashift    12     local

guletz · Dec 17, 2018

This is the problem: volblocksize=8K. With 8K you get a lot of padding blocks. Each write to your pool is one 8K block. For a 6-disk raidz2 pool, this is 8K / 4 data disks = 2K per disk. But the minimum you can write to each disk is 4K (ashift=12), so in reality you write 4 blocks x 4K = 16K (so it is double). So from this perspective (space usage), you need at least volblocksize=16K.

So, your 9.86T will take 19.6T (around 2 x 9.86T)!
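
A rough way to check this for other block sizes is to compute the allocation per block. The sketch below assumes the allocation rule generally described for RAIDZ (data sectors, plus parity sectors per stripe row, rounded up to a multiple of parity + 1 sectors). The function name raidz_asize and its argument order are made up for this example, and compression and metadata are ignored:

```sh
#!/bin/sh
# raidz_asize BLOCK_BYTES ASHIFT NDISKS NPARITY -> bytes allocated on the vdev
# Assumed allocation rule, not taken from this thread.
raidz_asize() {
    psize=$1; ashift=$2; ndisks=$3; nparity=$4
    sector=$((1 << ashift))
    ndata=$((ndisks - nparity))
    # Data sectors: block size rounded up to whole sectors.
    data=$(( (psize + sector - 1) / sector ))
    # Parity sectors: nparity for every row of up to ndata data sectors.
    rows=$(( (data + ndata - 1) / ndata ))
    total=$(( data + nparity * rows ))
    # Padding: round the total up to a multiple of (nparity + 1) sectors.
    unit=$((nparity + 1))
    total=$(( (total + unit - 1) / unit * unit ))
    echo $(( total * sector ))
}

# 6-disk raidz2 with ashift=12, as on node 3.
for kib in 8 16 32; do
    alloc=$(raidz_asize $((kib * 1024)) 12 6 2)
    ideal=$((kib * 6 / 4))
    echo "volblocksize=${kib}K: ${ideal}K ideal, $((alloc / 1024))K allocated"
done
```

Under these assumptions an 8K block occupies 24K of raw space instead of the ideal 12K, which is the same factor of two as 19.6T versus 9.86T, while 16K and 32K blocks land exactly on the ideal 24K and 48K.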

Chumblys · Dec 17, 2018

Thank you @guletz !

For future reference, this happens with all zvols, so replication or Proxmox are of no consequence here.
An in-depth discussion of the problem: https://github.com/zfsonlinux/zfs/issues/548

The outcome of the GitHub thread is that the ZoL/ZFS devs have no plans to fix this issue (neither by calculating the default block size based on ashift nor by optimizing storage of contiguous writes). Everyone will simply have to know this in advance and do the math manually. (Quite disappointing, IMO, but oh well.)

gkovacs · Dec 17, 2018

guletz said:
This is the problem: volblocksize=8K. With 8K you get a lot of padding blocks. Each write to your pool is one 8K block. For a 6-disk raidz2 pool, this is 8K / 4 data disks = 2K per disk. But the minimum you can write to each disk is 4K (ashift=12), so in reality you write 4 blocks x 4K = 16K (so it is double). So from this perspective (space usage), you need at least volblocksize=16K.

So, your 9.86T will take 19.6T (around 2 x 9.86T)!

That's very interesting... but the problem only affects RAIDZ2?

So if I use a 4 disk RAIDZ, 8k is fine? Or should I set the volblocksize 16k as well?

guletz · Dec 17, 2018

Chumblys said:
For future reference, this happens with all zvols, so replication or Proxmox are of no consequence here.

... if you do not take volblocksize into account and do the math.

Chumblys · Dec 17, 2018

Yeah, this sort of does seem to be a Proxmox problem after all: Since replica volumes are automatically created by Proxmox, the ZFS storage plugin would apparently need to calculate and set the block size. I tried setting volblocksize on the parent volume "vmdata" manually, but it failed:

Code:

# zfs set volblocksize=16K hddtank/vmdata
cannot set property for 'hddtank/vmdata': 'volblocksize' does not apply to datasets of this type

Is there a workaround, or will I have to settle for either non-optimal disk performance with ashift=9 (512) on 4k HDDs or lots of wasted space with ashift=12?

guletz · Dec 17, 2018

gkovacs said:
So if I use a 4 disk RAIDZ, 8k is fine? Or should I set the volblocksize 16k as well?

For raidz2, 8K will be OK.

guletz · Dec 17, 2018

Chumblys said:
Is there a workaround, or will I have to settle for either non-optimal disk performance with ashift=9 (512) on 4k HDDs or lots of wasted space with ashift=12?

This is what I would do in your case (with 16K, or maybe better 32K or greater):
- create a new dataset like hddtank/vmdata-16K
- add this new ZFS dataset as a storage, and set 16K as the volblocksize on it
- then move/migrate your vDisk from hddtank/vmdata (8K) to the new 16K hddtank/vmdata-16K (also make a vzdump backup first, just in case)

In my own case I use 16K datasets as the default minimum (and I also have 32K and 64K ones for different workloads).
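
To pick a minimum for another pool layout, one option is to search for the smallest power-of-two volblocksize whose allocation stays within a few percent of the ideal data-plus-parity cost. This is only a sketch: it assumes the commonly described RAIDZ allocation rule, and raidz_asize and min_volblocksize are hypothetical helper names, not ZFS or Proxmox commands:

```sh
#!/bin/sh
# raidz_asize BLOCK_BYTES ASHIFT NDISKS NPARITY -> bytes allocated (assumed rule)
raidz_asize() {
    psize=$1; ashift=$2; ndisks=$3; nparity=$4
    sector=$((1 << ashift))
    ndata=$((ndisks - nparity))
    data=$(( (psize + sector - 1) / sector ))
    rows=$(( (data + ndata - 1) / ndata ))
    total=$(( data + nparity * rows ))
    unit=$((nparity + 1))
    total=$(( (total + unit - 1) / unit * unit ))
    echo $(( total * sector ))
}

# min_volblocksize ASHIFT NDISKS NPARITY TOLERANCE_PCT
# Prints the smallest power-of-two block size (up to 128K) whose allocation
# is at most TOLERANCE_PCT percent above the ideal size * ndisks / ndata.
min_volblocksize() {
    m_ashift=$1; m_ndisks=$2; m_nparity=$3; m_tol=$4
    m_size=$((1 << m_ashift))
    while [ "$m_size" -le $((128 * 1024)) ]; do
        m_alloc=$(raidz_asize "$m_size" "$m_ashift" "$m_ndisks" "$m_nparity")
        # Integer form of: alloc <= ideal * (100 + tol) / 100
        if [ $((m_alloc * (m_ndisks - m_nparity) * 100)) -le \
             $((m_size * m_ndisks * (100 + m_tol))) ]; then
            echo "$m_size"
            return 0
        fi
        m_size=$((m_size * 2))
    done
    return 1
}

# 6-disk raidz2, ashift=12, allow 5% over ideal.
min_volblocksize 12 6 2 5
```

For the 6-disk raidz2 pool discussed here this prints 16384, in line with the 16K minimum suggested above. Larger values may still be preferable for other reasons.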

Good luck!

guletz · Dec 17, 2018

Chumblys said:
# zfs set volblocksize=16K hddtank/vmdata
cannot set property for 'hddtank/vmdata': 'volblocksize' does not apply to datasets of this type

You cannot set volblocksize on a dataset. It is a property of zvols only. For a dataset you have the recordsize property (128K by default).

Chumblys · Dec 17, 2018

Sorry, not sure I understood @guletz . You mean, in my hddtank/vmdata/vm-117-nas-files-0 case, create a new vmdata with 16/32k volblocksize, and migrate Proxmox VM volumes under it? Or create a new vm-117-nas-files-0 manually and edit qemu config files?

If the former, which property should I set when creating the vmdata? Setting the volblocksize property at creation time gives the same error message as trying to set it afterwards:

# zfs create -o volblocksize=32K hddtank/vmdata32k
cannot create 'hddtank/vmdata32k': 'volblocksize' does not apply to datasets of this type

guletz · Dec 17, 2018

Chumblys said:
You mean, in my hddtank/vmdata/vm-117-nas-files-0 case, create a new vmdata with 16/32k volblocksize, and migrate Proxmox VM volumes under it

zfs create hddtank/vmdata-16K

In /etc/pve/storage.cfg, you will have something like this:

#.... bla bla bla ......
zfspool: vmdata-16k
        pool hddtank/vmdata-16K
        blocksize 16k
        content rootdir,images
        nodes your-nodes-as-you-have
        sparse 0
# end bla-bla-bla

Chumblys · Dec 17, 2018

Ah, ok! So the /etc/pve/storage.cfg "blocksize" option changes default volblocksize for new zvols?
All right, thank you!

guletz · Dec 17, 2018

... some dry theory

ZFS can offer users two things: datasets and zvols.
- A dataset is like a filesystem, with a default recordsize of 128K (this is the maximum record size; most of the time the actual size is variable and depends on the application).
- A zvol is like a partition without any filesystem on it, with a FIXED volblocksize = X, where X is defined at creation time.

Now, a dataset can also be just a container for other datasets and zvols. If you run zfs list, you can see that datasets have a mount point (like any normal FS), but a zvol does not have one. PMX has done a great job and put most of this into a nice interface (and most users say ... this is all I need to know), so when you create a vdisk, PMX creates a new zvol on the fly with the default volblocksize=8K (which is a bad value in most cases). But if you have defined another value in PMX (Datacenter -> Storage -> your-zfs-dataset -> Block Size), then that value will be used (as in the snippet I posted).

guletz · Dec 17, 2018

Chumblys said:
"blocksize" option changes default volblocksize for new zvols?

Yes.

SergeyISP · Dec 17, 2018

You need to delete the replication to node 3 (and "hddtank/vmdata/vm-117-nas-files-0" with its snapshots on node 3), enable the "Thin provision" option in the storage settings menu on node 3, and then enable replication again. If you use the console for zfs create, create a sparse zvol (option -s).
Users post many questions on this forum about replication and running out of space. Time to add this to the wiki?

Chumblys · Dec 26, 2018

Progress report: replicas apparently will always use the same blocksize as the original ZFS volume, so the original volume needs to be (re)created with the desired volblocksize.

Simply changing /etc/pve/storage.cfg to this...

Code:

zfspool: local-zfs-hdd-images
        pool hddtank/vmdata
        content rootdir,images
        blocksize 32k
        sparse

...and replicating still resulted in the old block size:

Code:

NAME                               PROPERTY      VALUE     SOURCE
hddtank/vmdata/vm-117-nas-files-0  volblocksize  8K        default

A newly created & replicated volume shows the new value:

Code:

NAME                          PROPERTY      VALUE     SOURCE
hddtank/vmdata/vm-117-disk-0  volblocksize  32K       -

Making room for an extra 8TB volume is a bit of a pain, but at least it looks like it should work.
EDIT: All copied now, and with 32k block size, storage overhead is <2%, so problem solved.
