[SOLVED] ZFS replica ~2x larger than original

Hello,
I have a similar problem (a big one, given the data size).
I have a two-node PVE 5.3 cluster, each node with 6x12TB disks in a ZFS raidz2 pool, and storage replication between them.

I installed everything following the wiki suggestions (so I agree the wiki is missing some information) and moved the VMs, about 20TB, from an old server on an LVM iSCSI SAN. Now the data has doubled in size and I'm at the limit (performance is also slow, but that is probably a consequence).
My current status is:

Code:
# zpool status
  pool: zfspool1
 state: ONLINE
  scan: scrub repaired 0B in 48h7m with 0 errors on Mon Apr 15 09:42:34 2019
config:

    NAME        STATE     READ WRITE CKSUM
    zfspool1    ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
    logs
      sdb1      ONLINE       0     0     0
    cache
      sdb2      ONLINE       0     0     0

errors: No known data errors

root@nodo1-ced:~# zfs list -o name,avail,used,refer,lused,lrefer,mountpoint,compress,compressratio
NAME                    AVAIL   USED  REFER  LUSED  LREFER  MOUNTPOINT  COMPRESS  RATIO
zfspool1                11.7T  29.6T   192K  13.0T     40K  /zfspool1        lz4  1.00x
zfspool1/test16k        11.7T  5.08G   112K    26K     26K  -                lz4  1.00x
zfspool1/vm-100-disk-0  11.7T   195G   194G   101G    100G  -                lz4  1.03x
zfspool1/vm-100-disk-1  11.7T  5.98T  5.98T  2.99T   2.99T  -                lz4  1.00x
zfspool1/vm-100-disk-2  11.7T  8.43T  8.43T  4.24T   4.24T  -                lz4  1.00x
zfspool1/vm-100-disk-3  11.7T  7.33T  7.33T  3.67T   3.67T  -                lz4  1.00x
zfspool1/vm-101-disk-0  11.8T   285G   181G  91.2G   91.2G  -                lz4  1.00x
zfspool1/vm-101-disk-1  11.8T  61.9G   128K    34K     34K  -                lz4  1.00x
zfspool1/vm-108-disk-0  11.9T   423G   268G   139G    139G  -                lz4  1.03x
zfspool1/vm-108-disk-1  12.8T  1.56T   541G   277G    277G  -                lz4  1.02x
zfspool1/vm-109-disk-0  11.7T  34.5G  34.5G  19.1G   19.1G  -                lz4  1.10x
zfspool1/vm-110-disk-0  11.9T   383G   228G   119G    119G  -                lz4  1.04x
zfspool1/vm-110-disk-1  13.8T  4.70T  2.63T  1.32T   1.32T  -                lz4  1.00x
zfspool1/vm-112-disk-0  11.9T   221G  14.7G  7.61G   7.61G  -                lz4  1.03x

After a lot of digging, I have understood that the problem is the block size.

Questions:
1) Most of the VMs are Windows Server. Is it better to use a 4k block size, as suggested by Proxmox support, or, since these are block volumes, would 32k be better, like in your last example?
2) Can I move VMs between nodes so users can keep working while I rebuild them one by one? If so, could you kindly give me the sequence of steps?
3) I also have a QNAP NAS with plenty of space for backups, but I'm worried about how long the whole job will take and about data safety. Probably because of the wrong block size, performance was poor and I needed more than a week to move all the data, even though I'm on a 10Gbit LAN.
4) Aside from the problem I'm facing now, and I don't know if it's related, one disk of a VM shows the wrong size: instead of 4TB it is displayed in bytes and differs from the original size. Have you seen a similar problem with ZFS?
Migration from the old server was done with vzdump and a restore on the new servers.

Thank you in advance
Francesco
 
Hi,

1. Only from a ZFS perspective (raidz2 x 6 HDD), at minimum any VM needs to use 4 data disks x 4k (ashift=12) = 16k as the zvol/VM block size. For Windows guests 32-64k is OK, at least as far as I can see in my own case.

2. Yes, you can: stop the VM, move it to the other node, start the VM.

3. You can back up a VM using an NFS storage (see the command sketch after this list).
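
For reference, a rough sketch of those steps from the shell (the VM ID, target node name and NFS storage name below are just examples; adjust them to your setup):

Code:
# check the volblocksize of an existing zvol
zfs get volblocksize zfspool1/vm-100-disk-0

# offline migration: stop the VM, then move it to the other node
qm stop 100
qm migrate 100 nodo2-ced
# then start it again from the target node (or from the GUI)

# back up the VM to an NFS-backed storage
vzdump 100 --storage qnap-nfs --mode snapshot --compress lzo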
 
Hi guletz,

thank you for reply.

Sorry, I probably didn't explain myself well.
Regarding point 2, I meant what I have to do after moving all the VMs onto one node.
From the previous replies, mainly Chumblsys' replies, I understood that I have to leave the zpool I already have untouched, since it has the default ashift=12.
Instead, I have to change /etc/pve/storage.cfg to set a 16k blocksize and then recreate every disk volume. Have I understood correctly?
If yes, what are the right steps?
1) back up all VMs with vzdump
2) move all VMs to node 2 except the one I have to work on, which I will leave on node 1
3) on that VM, delete the replication jobs (and thus the related snapshots)

At this point, what is the next step?
A) Is it sufficient, after setting the 16k blocksize in storage.cfg, to create a new replication job and run the first replication? Will ZFS create the replica with the new blocksize on the other node?
B) Or do I have to add a new disk to the VM and then copy all the data internally from Windows?
C) Or, as a last option, do I have to restore from vzdump (but in this case I will surely have to stop the customer for a long time)?

Thank you again for your help.

Francesco
 
have to leave the zpool I already have untouched, since it has the default ashift=12.
Instead, I have to change /etc/pve/storage.cfg to set a 16k blocksize and then recreate every disk volume. Have I understood correctly?

Yes


Steps 1-3 are ok.

Then do the same on the second node (16k). And after both nodes have the same 16k for the zvols, you can re-create the replication task from node A to node B.
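
For reference, a minimal storage.cfg entry with the 16k blocksize could look like this (the storage name "zfs-vmdata" is just an example; keep your existing pool name):

Code:
zfspool: zfs-vmdata
        pool zfspool1
        content rootdir,images
        blocksize 16k
        sparse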

As a side note, you can check whether all the HDDs in each node have a 4k block size (gdisk -l /dev/sdX, or smartctl -a /dev/sdX).

Good luck!
 
Yes


Steps 1-3 are ok.

Hi guletz,
thank you again for your help.

Sorry to ask again (for the last time, I hope), but just as a clarification before I make mistakes. In my previous message I wrote this sentence:

Code:
and then recreate every disk volume

Now, in your reply:

Then do the same on the second node (16k). And after both nodes have the same 16k for the zvols, you can re-create the replication task from node A to node B.

I haven't understood whether, after I have defined the 16k blocksize in storage.cfg on both nodes,
1) I only have to delete the replication jobs and create them again (point A in my previous message), and once the first replication has finished, switch sides and run a second replication the other way, but without creating new disks,
2) or I have to recreate every volume "one by one" for each VM, like point B of my message (add a new disk to the VM and copy the content).


As a side note, you can check whether all the HDDs in each node have a 4k block size (gdisk -l /dev/sdX, or smartctl -a /dev/sdX).

Thanks for the suggestion. I have checked and all disks give these results:

Code:
root@nodo2-ced:~# gdisk -l /dev/sde
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sde: 22961717248 sectors, 10.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): B87C3BDB-BDD1-324A-9104-7BD311645E24
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 22961717214
Partitions will be aligned on 2048-sector boundaries
Total free space is 4029 sectors (2.0 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048     22961698815   10.7 TiB    BF01  zfs-992bbaadc51814a8
   9     22961698816     22961715199   8.0 MiB     BF07

Code:
=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721212AL5205
Revision:             NM02
Compliance:           SPC-4
User Capacity:        11,756,399,230,976 bytes [11.7 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca26f1717f0
Serial number:        8CGDPR9H
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu May  2 17:53:00 2019 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

so it seems that I have a 4k block size, doesn't it?
 
Hi again,

Sorry for the not-so-clear response (lack of time, and my bad English).

When you modify the zvol block size in PVE, this applies only to new vdisks. So the safe path will be (a command sketch follows after this list):
- set the new 16k zvol blocksize in storage.cfg on both nodes
- but because your existing VMs already have a different block size, that value will stay the same whenever you run pve-zsync or ZFS replication
- so the simple way is to create a backup file (vzdump is OK), restore a new VM from this dump file (so you will get 16k zvols), stop the old (non-16k) VM, verify that everything is OK for 1-2 days, and only when you are 100% sure that all is OK, destroy the old VM
- do the same for all your (non-16k) VMs
- after each VM is successfully migrated to 16k, destroy the old replication task and create a new one (16k storage -> 16k storage)
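
A rough sketch of that backup/restore cycle (VM ID 100 restored as a new VM ID 200; the backup storage name and dump file name are just examples):

Code:
# back up the old VM
vzdump 100 --mode stop --compress lzo --storage backup-nfs

# restore it as a new VM onto the storage that now has blocksize 16k
qmrestore /mnt/pve/backup-nfs/dump/vzdump-qemu-100-<timestamp>.vma.lzo 200 --storage zfs-vmdata

# verify the new zvol really got the 16k volblocksize
zfs get volblocksize zfspool1/vm-200-disk-0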

Regarding your HDDs, the block size is not 4k native (see the SMART info): the logical size is 512 bytes but the physical size is 4k. In this situation, from any application's perspective the block size is 512, but behind the scenes the HDD will write a 4k physical block for every 512-byte block (not so good). There is nothing you can do about it in this case (only in the future, read the data sheet of the new HDDs you want, so you can choose 4k native / 4Kn disks).
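
A quick way to check this for all disks at once (standard lsblk columns; util-linux ships this on any recent distribution):

Code:
# logical vs. physical sector size per disk:
# 512/512 = 512n, 512/4096 = 512e, 4096/4096 = 4Kn
lsblk -d -o NAME,LOG-SEC,PHY-SEC,MODEL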

Another idea is to think carefully about whether you really need a raidz2 pool. I can see that you have 6 HDDs. Maybe you can add one more HDD to your server(s) and create a new pool using RAID 10 (3 pairs of disks in mirror, and all 3 mirrors striped together), with 1 HDD as a spare disk (so your pool is still OK if you lose one HDD). And since you have replication from node A to node B and in reverse, I think that will be OK (if you lose 2 HDDs at the same time, you can start all of your VMs on the second node). Think about this from a probability point of view and consider how much data you could lose between two consecutive replication runs without big problems (my English again).

But with RAID 10 (3 striped mirrors), your performance will be far better compared with raidz2: IOPS will be about 3x better.
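
A sketch of such a pool layout, assuming one extra disk is added (the pool name and device names are placeholders; on a real system use /dev/disk/by-id paths):

Code:
# 3 mirrored pairs striped together, plus one hot spare
zpool create -o ashift=12 tank \
    mirror sdc sdd \
    mirror sde sdf \
    mirror sdg sdh \
    spare sdi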

Good luck!
 
Hi again,

Sorry for the not-so-clear response (lack of time, and my bad English).

Hi,
thank you very much for the detailed reply. My English is bad too, so sorry for asking several times, but I have to be sure since I'm managing so much data.
I hope to handle this problem correctly and solve it.

UPDATE: I'm reading your message carefully. Dell has removed these disks from the catalogue (I bought the servers 2 months ago!) so I have to check whether they are available as spare parts, otherwise it will be a problem again in the future.

Thank you again.
Francesco
 
Hi Francesco,

In the future, any HDD with 4Kn will be good, so it will not be a problem. As a suggestion, try to get HDDs from different manufacturers, or at least from different batches/models, so the chance of having many broken HDDs at the same time is lower.

Good luck!
 
Progress report: replicas apparently will always use the same blocksize as the original ZFS volume, so the original volume needs to be (re)created with the desired volblocksize.

Simply changing /etc/pve/storage.cfg to this...

Code:
zfspool: local-zfs-hdd-images
        pool hddtank/vmdata
        content rootdir,images
        blocksize 32k
        sparse

...and replicating still resulted in the old block size:

Code:
NAME                               PROPERTY      VALUE     SOURCE
hddtank/vmdata/vm-117-nas-files-0  volblocksize  8K        default

A newly created & replicated volume shows the new value:

Code:
NAME                          PROPERTY      VALUE     SOURCE
hddtank/vmdata/vm-117-disk-0  volblocksize  32K       -

Making room for an extra 8TB volume is a bit of a pain, but at least it looks like it should work.
EDIT: All copied now, and with 32k block size, storage overhead is <2%, so problem solved.
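For anyone wanting to verify the same thing, comparing allocated vs. logical space per zvol shows the remaining overhead (dataset name as in this thread):

Code:
# USED close to LUSED means little padding overhead
zfs list -o name,volblocksize,used,lused -r hddtank/vmdata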
If your target is to save space, you may consider using ZFS compression and zeroing the free space inside the VM. It will help even if your zvol is not sparse.
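
For example, inside a Windows guest the free space can be zeroed with Sysinternals SDelete, and inside a Linux guest with a plain zero-fill file (rough sketch; the drive letter and file path are just examples):

Code:
:: Windows guest (Sysinternals SDelete): zero the free space on C:
sdelete.exe -z C:

# Linux guest: fill the free space with zeros, then remove the file
# (dd stops on its own with "No space left on device")
dd if=/dev/zero of=/zerofile bs=1M
rm -f /zerofile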
 
This is the problem: volblocksize=8K. When using 8K you will have many padding blocks. Each time, you will write an 8K block to your pool. For a 6-disk pool (raidz2), this will be 8K / 4 data disk = 0.5 K per disk. But each disk can write at minimum 4K (ashift=12), so in reality you will write 4 blocks x 4K = 16K (so it is double). So from this perspective (space usage), you will need at least volblocksize=16K.

So your 9.86T will take 19.6T (around 2x 9.86T)!
@guletz Can you explain why you think the block size is divided by the number of data disks? Data does not have to be spread across every disk in the vdev.

The only requirement is that the write size must be a multiple of (number of parity disks + 1).
https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz
If ashift=12 then 4k is the smallest I/O possible. The write size should be a multiple of 12k for raidz2 (3 x 4k sectors).

A 6-disk pool (raidz2) will have 2 parity + 1 = 3, and this means no padding is needed as long as the total size is a multiple of 12k. So an 8k volblocksize will waste space because 8k parity + 8k data = 16k, which is not a multiple of 12k; it needs another 8k of padding. Writing two blocks (16k) is fine because 8k parity + 16k data totals 24k. Writing 24k requires 16k parity + 24k data = 40k, and this also requires 8k of padding.

If one has a 16k volblocksize, then 8k parity + 16k data = 24k total, which is a multiple of 12k. Writing 32k of data is also fine: 16k parity + 32k data = 48k total. Writing 48k of data is also good: 24k parity + 48k data = 72k total. In all these cases the totals are multiples of 12k.

Of course if one is using compression, the results may vary...
 
Can you explain why you think the block size is divided by the number of data disks? Data does not have to be spread across every disk in the vdev.

OK, think about this:
With a 16k volblocksize you have 8k parity + 16k data, and, like you say, the 16k of data is written on only one disk and the parity on the other disks.
Now imagine that the disk where the 16k was written breaks, so you lose 16k of data but you only have 8k of parity.
How do you think you can recover ALL 16k of data using only 8k of parity? You can recover the data only if ALL 16k are spread over all the vdev disks (excluding the parity disks), so by losing only one disk you do not lose ALL 16k of data!


Good luck / Bafta!
 
@guletz No, I never said 16k is written on one disk. I only said data does not have to spread to "every" disk in the vdev. Please read the linked article; it explains how this works in detail.

With raidz2 using 6 disks, 16K will be written like this:

Disk1 | Disk2 | Disk3 | Disk4 | Disk5 | Disk6
P0    | P1    | D0    | D1    | D2    | D3
P2    | P3    | D4    | D5    | D6    | D7

However, if you write 8K (D0, D1) and then write 32k (D2, D3, D4, D5, D6, D7, D8, D9), it would look like this:
Disk1 | Disk2 | Disk3 | Disk4 | Disk5 | Disk6
P0    | P1    | D0    | D1    | P2    | P3
D2    | D3    | D4    | D5    | P4    | P5
D6    | D7    | D8    | D9    |       |

As you can see, there is no requirement for the data to be on every disk. If you check the article I linked, you can see a similar table there. If you do not believe me, believe the guy who wrote that article, because he is one of the original architects of ZFS.

What I am saying is that your math is wrong when you write something like:
8K / 4 data disk = 0.5 K.
because 8k does NOT spread over 4 data disks when ashift=12.
 
Thanks for the article, really appreciated! It was my mistake, so what I wrote was wrong on my part!
In my country we have a saying: "We learn during our entire life"!

Good luck / Bafta !
 