[SOLVED] ZFS replica ~2x larger than original

Dec 16, 2018
On my setup, a VM that takes 9.86T of the ZFS pool on the active host grows to 19.6T when replicated to one of the other hosts. On yet another host it takes 9.86T, as expected. Every one of them has ZFS copies=1 set. The data is already compressed (it's a nested zpool inside a zvol).

The only thing that seems to be different on the oddly behaving node 3 is that its zpool uses raidz2 instead of a plain stripe (JBOD) for now. I'm planning to convert nodes 1 and 2 to raidz2 as well, though.

Any idea what's going on and how to fix this?

Code:
NODE 1:
hddtank/vmdata/vm-117-nas-files-0                                 9.86T  1.88T  9.86T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-2_1544980802__  5.66M      -  9.86T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-1_1544981040__     0B      -  9.86T  -

    NAME                 STATE     READ WRITE CKSUM
    hddtank              ONLINE       0     0     0
      sdc1               ONLINE       0     0    16
      sdd1               ONLINE       0     0    48
      sdg1               ONLINE       0     0     0
    logs
      zfs-hddpool-slog   ONLINE       0     0     0
    cache
      zfs-hddpool-l2arc  ONLINE       0     0     0

NODE 2:
hddtank/vmdata/vm-117-nas-files-0                                 9.86T  5.47T  9.86T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-1_1544981040__  2.05M      -  9.86T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-2_1544981100__   984K      -  9.86T  -

    NAME                 STATE     READ WRITE CKSUM
    hddtank              ONLINE       0     0     0
      sdg1               ONLINE       0     0     0
      sde1               ONLINE       0     0     0
      sdb1               ONLINE       0     0     0
    logs
      zfs-hddpool-slog   ONLINE       0     0     0
    cache
      zfs-hddpool-l2arc  ONLINE       0     0     0

NODE 3:
hddtank/vmdata/vm-117-nas-files-0                                    19.6T  1.37T  19.6T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-1_1544981040__     3.28M      -  19.6T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-2_1544981100__        0B      -  19.6T  -

    NAME                 STATE     READ WRITE CKSUM
    hddtank              ONLINE       0     0     0
      raidz2-0           ONLINE       0     0     0
        sda              ONLINE       0     0     0
        sdb              ONLINE       0     0     0
        sdc              ONLINE       0     0     0
        sdd              ONLINE       0     0     0
        sde              ONLINE       0     0     0
        sdh              ONLINE       0     0     0
    logs
      zfs-hddpool-slog   ONLINE       0     0     0
    cache
      zfs-hddpool-l2arc  ONLINE       0     0     0
 
Hi,

RaidZ2 means you can lose two disks and the data must stay redundant,
so it is normal that you use more space on a RaidZ2 than on a stripe.

P.S. When you run the VM on a stripe, you lose all your data if one disk fails.
 
The raidz2 pool has 6 x 6T disks, which should yield roughly 66.6% storage efficiency and 24T of usable storage (36T raw minus 12T of parity).
This is what I actually see:

Code:
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
hddtank  32.5T  29.5T  3.02T         -     0%    90%  1.00x  ONLINE  -


NAME                                                                  USED  AVAIL  REFER  MOUNTPOINT
hddtank                                                              19.6T  1.34T   192K  /hddtank
hddtank/vmdata                                                       19.6T  1.34T   192K  /hddtank/vmdata
hddtank/vmdata/vm-117-nas-files-0                                    19.6T  1.34T  19.6T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-1_1545031802__     3.76M      -  19.6T  -
hddtank/vmdata/vm-117-nas-files-0@__replicate_117-2_1545031834__        0B      -  19.6T  -

Even with parity, 9.86T on raidz2 should take 14.79T, not 19.6T. And even if there were 4.8T(!) of overhead, the whole pool should still report 12.9T of free space, not 3.02T.

These numbers make no sense. Is anyone getting useful efficiency from ZFS + raidz2, or should I maybe try switching to ZFS on top of Linux mdraid (RAID6)?
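
The outputs above are just zpool list and zfs list; as far as I can tell they are not directly comparable on raidz, since zpool list counts raw space (parity included) while zfs list reports space after parity and allocation overhead. Trimmed versions of the same commands:

Code:
# raw pool view: SIZE/ALLOC/FREE include parity space on raidz vdevs
zpool list -o name,size,alloc,free hddtank

# dataset view: USED/AVAIL are what is left after parity and padding
zfs list -o name,used,avail,refer hddtank/vmdata/vm-117-nas-files-0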
 
Volblocksize reports 8K and ashift 12 on all the pools:

Code:
NAME                               PROPERTY      VALUE  SOURCE
hddtank/vmdata/vm-117-nas-files-0  volblocksize  8K     default

NAME     PROPERTY  VALUE  SOURCE
hddtank  ashift    12     local
 
This is the problem: volblocksize=8K. With 8K you will have many padding blocks. Each write to your pool is an 8K block. For a 6-disk pool (raidz2) that is 8K / 4 data disks = 2K per disk, but each disk can write at minimum 4K (ashift=12), so in reality you write 4 blocks x 4K = 16K (so it is double). From this perspective (space usage), you will need at least volblocksize=16K.

So, your 9.86T will take 19.6T (around 2x 9.86T)!
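
A direct way to see this overhead on a single zvol, rather than inferring it from pool totals, is to compare the logical size with what it actually occupies. A generic check, using the name from this thread:

Code:
# logicalreferenced = data as the guest wrote it; referenced = what it costs on this pool.
# On the raidz2 node with volblocksize=8K the ratio should come out around 2:1.
zfs get -o property,value volblocksize,logicalreferenced,referenced hddtank/vmdata/vm-117-nas-files-0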
 
Thank you @guletz !

For future reference, this happens with all zvols, so neither replication nor Proxmox is at fault here.
An in-depth discussion of the problem: https://github.com/zfsonlinux/zfs/issues/548

The outcome of the GitHub thread is that the ZoL/ZFS devs have no plans to fix this issue (neither by basing the default block size on ashift nor by optimizing the storage of contiguous writes). Everyone will simply have to know this in advance and do the math manually. (Quite disappointing, IMO, but oh well.)
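
Since "do the math manually" is the official answer, here is a rough sketch of that math as I understand it from the linked issue: a block costs its data sectors, plus parity sectors per stripe row, rounded up to a multiple of (parity + 1). Treat the script as an illustration of the formula, not an authoritative tool:

Code:
#!/bin/bash
# Rough raidz allocation estimate per block (sketch based on the linked issue).
# usage: ./raidz-est.sh <volblocksize_bytes> <ashift> <disks_in_vdev> <parity>
bs=$1 ashift=$2 disks=$3 par=$4
sector=$((1 << ashift))
data=$(( (bs + sector - 1) / sector ))                 # data sectors per block
rows=$(( (data + disks - par - 1) / (disks - par) ))   # stripe rows needed
total=$(( data + rows * par ))                         # data + parity sectors
alloc=$(( (total + par) / (par + 1) * (par + 1) ))     # round up to multiple of parity+1
echo "$bs bytes logical -> $((alloc * sector)) bytes allocated"

# 6-disk raidz2, ashift=12:
#   ./raidz-est.sh  8192 12 6 2   ->  8192 bytes logical -> 24576 bytes allocated
#   ./raidz-est.sh 16384 12 6 2   -> 16384 bytes logical -> 24576 bytes allocated

The raw cost of an 8K block is actually ~3x here; as far as I can tell, zfs list then divides raw allocations by the pool's nominal raidz overhead (1.5x for a 6-disk raidz2), which is why the reported USED lands around 2x rather than 3x.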
 
This is the problem: volblocksize=8K. With 8K you will have many padding blocks. Each write to your pool is an 8K block. For a 6-disk pool (raidz2) that is 8K / 4 data disks = 2K per disk, but each disk can write at minimum 4K (ashift=12), so in reality you write 4 blocks x 4K = 16K (so it is double). From this perspective (space usage), you will need at least volblocksize=16K.

So, your 9.86T will take 19.6T (around 2x 9.86T)!

That's very interesting... but does the problem only affect RAIDZ2?

So if I use a 4-disk RAIDZ, is 8K fine? Or should I set volblocksize to 16K there as well?
 
Yeah, this sort of does seem to be a Proxmox problem after all: since replica volumes are created automatically by Proxmox, the ZFS storage plugin would apparently need to calculate and set the block size. I tried setting volblocksize on the parent dataset "vmdata" manually, but it failed:

Code:
# zfs set volblocksize=16K hddtank/vmdata
cannot set property for 'hddtank/vmdata': 'volblocksize' does not apply to datasets of this type

Is there a workaround, or will I have to either settle for sub-optimal disk performance with ashift=9 (512-byte sectors) on 4K HDDs, or lots of wasted space with ashift=12?
 
Is there a workaround, or will I have to either settle for sub-optimal disk performance with ashift=9 (512-byte sectors) on 4K HDDs, or lots of wasted space with ashift=12?

This is what I do in your case (for 16K, or maybe better 32K or greater); a rough command-line sketch follows at the end of this post:
- create a new dataset like hddtank/vmdata-16K
- add this new ZFS dataset as a storage in Proxmox, and set 16K as its block size
- then move/migrate (and make a vzdump backup, just in case) your vDisk from hddtank/vmdata (8K) to the new 16K hddtank/vmdata-16K


In my own case I use 16K datasets as a minimum by default (and I also have 32K and 64K datasets for different load cases).

Good luck!
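
To make those steps concrete, here is roughly what they look like on the command line; the storage name, VM ID and disk key below are just examples for this thread (the GUI does the same via Datacenter -> Storage and the VM's Move disk button):

Code:
# 1) new dataset to hold the 16K zvols
zfs create hddtank/vmdata-16K

# 2) register it as a Proxmox storage with a 16K block size
pvesm add zfspool vmdata-16k --pool hddtank/vmdata-16K --blocksize 16k --content images,rootdir

# 3) move the vDisk onto the new storage (the disk key depends on the VM config)
qm move_disk 117 scsi0 vmdata-16k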
 
# zfs set volblocksize=16K hddtank/vmdata
cannot set property for 'hddtank/vmdata': 'volblocksize' does not apply to datasets of this type

You cannot set volblocksize on a dataset; it is a property of zvols only. For a dataset, the corresponding property is recordsize (128K by default).
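
In other words (hypothetical names, purely to illustrate the distinction): volblocksize has to be given when a volume is created and is fixed afterwards, while recordsize is a normal dataset property you can change at any time:

Code:
# zvol: block size must be chosen at creation time
zfs create -V 10G -o volblocksize=16K hddtank/testvol

# dataset: recordsize can be set at creation or changed later
zfs create -o recordsize=128K hddtank/testfs
zfs set recordsize=64K hddtank/testfs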
 
Sorry, not sure I understood, @guletz. You mean, in my hddtank/vmdata/vm-117-nas-files-0 case, create a new vmdata with a 16/32K volblocksize and migrate the Proxmox VM volumes under it? Or create a new vm-117-nas-files-0 manually and edit the qemu config files?

If the former, which property should I set when creating the vmdata? Setting the volblocksize property gives the same error message at creation time as when trying to set it afterwards:

# zfs create -o volblocksize=32K hddtank/vmdata32k
cannot create 'hddtank/vmdata32k': 'volblocksize' does not apply to datasets of this type
 
You mean, in my hddtank/vmdata/vm-117-nas-files-0 case, create a new vmdata with a 16/32K volblocksize and migrate the Proxmox VM volumes under it

zfs create hddtank/vmdata-16K

In /etc/pve/storage.cfg, you will have something like this:

Code:
#.... bla bla bla ......
zfspool: vmdata-16k
        pool hddtank/vmdata-16K
        blocksize 16k
        content rootdir,images
        nodes your-nodes-as-you-have
        sparse 0
# end bla-bla-bla
 
... some dry theory ;)

ZFS offers users two things: datasets and zvols
- a dataset is like a filesystem, with a default recordsize of 128K (this is the maximum record size; most of the time the actual size varies, depending on the application)
- a zvol is like a partition without any filesystem on it, with a FIXED volblocksize=X, where X is defined at creation time

Now, a dataset can also be just a container for other datasets and zvols. If you run zfs list, you can see that datasets have a mount point (like any normal filesystem), but a zvol does not have one! And because Proxmox has done a great job of putting most of this info into a nice interface (and most users say... this is all I need to know), when you create a vDisk, Proxmox creates a new zvol on the fly with the default volblocksize=8K (which is a bad value in most cases). But if you have defined another value in Proxmox (Datacenter -> Storage -> your-zfs-dataset -> Block Size), then that value will be used (as in the snippet I posted).
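
In practice, when you add a vDisk on such a storage, what Proxmox does is roughly like this (size and disk name here are only an example):

Code:
# with blocksize 16k set on the storage entry, a new vDisk roughly corresponds to:
zfs create -V 100G -o volblocksize=16K hddtank/vmdata-16K/vm-117-disk-1

# and you can verify what any given vDisk actually got:
zfs get volblocksize hddtank/vmdata-16K/vm-117-disk-1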
 
You need to delete the replication to node 3 (and "hddtank/vmdata/vm-117-nas-files-0" together with its snapshots on node 3), enable the "Thin provision" option in the storage settings on node 3, and after that enable replication again. If you create the zvol from the console with zfs create, create a sparse one (the -s flag).
Many users on this forum post questions about replication and running out of space. Time to add this to the wiki?
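
For the console route, something like this (the name and size are placeholders taken from this thread; the second command is, as far as I know, how you thin an already-created zvol):

Code:
# "Thin provision" in the GUI = sparse zvols; from the console use the -s flag
zfs create -s -V 8T -o volblocksize=16K hddtank/vmdata/vm-117-nas-files-0

# an existing thick zvol can be thinned by dropping its reservation
zfs set refreservation=none hddtank/vmdata/vm-117-nas-files-0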
 
Progress report: replicas apparently will always use the same blocksize as the original ZFS volume, so the original volume needs to be (re)created with the desired volblocksize.

Simply changing /etc/pve/storage.cfg to this...

Code:
zfspool: local-zfs-hdd-images
        pool hddtank/vmdata
        content rootdir,images
        blocksize 32k
        sparse

...and replicating still resulted in the old block size:

Code:
NAME                               PROPERTY      VALUE     SOURCE
hddtank/vmdata/vm-117-nas-files-0  volblocksize  8K        default

A newly created & replicated volume shows the new value:

Code:
NAME                          PROPERTY      VALUE     SOURCE
hddtank/vmdata/vm-117-disk-0  volblocksize  32K       -

Making room for an extra 8TB volume is a bit of a pain, but at least it looks like it should work.
EDIT: All copied now, and with a 32K block size the storage overhead is <2%, so problem solved.
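
If anyone wants to double-check that figure on their own setup, comparing logical vs. allocated space on the copied zvol is the quickest way (same generic check as earlier in the thread):

Code:
# referenced should now be within a couple of percent of logicalreferenced
zfs list -o name,volblocksize,logicalreferenced,referenced hddtank/vmdata/vm-117-nas-files-0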
 
