Tuning ZFS 4+2 RAIDZ2 parameters to avoid size multiplication

Hi,

On a machine with 6x 4 TB HDDs I installed PVE 6.4 (up to date), choosing RAIDZ2 (ashift left at the default of 12); this should leave 4x4 = 16 TB, or 14.1 TiB, usable.

Code:
# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 1 days 00:35:33 with 0 errors on Mon Sep 13 00:59:36 2021
config:

    NAME                                               STATE     READ WRITE CKSUM
    rpool                                              ONLINE       0     0     0
      raidz2-0                                         ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PBKYN8HT-part3  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCG0XTRB-part3  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCG1H6TB-part3  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PBKYBWET-part3  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PBKY3SAT-part3  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PBKV0PRT-part3  ONLINE       0     0     0

On this PVE I created a single VM running Debian with its default ext4 and fstrim, and a 10000 GB (9.77 TiB) disk (discard checked).

No snapshots or anything else, just this particular VM running, and it was fine for a while.

But when the VM reached 7.05 TiB of data (reported by df -h), it suddenly started getting I/O errors reported by PVE, and after checking various things I noticed that PVE reported the ZFS pool as completely full at 14.1 TiB (!).

I cleaned up a few things so it got back to a (very small) 4.12 GiB free:


Code:
# zfs list -t all -o space
NAME                      AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rpool                     4.12G  14.1T        0B    208K             0B      14.1T
rpool/ROOT                4.12G  2.00G        0B    192K             0B      2.00G
rpool/ROOT/pve-1          4.12G  2.00G        0B   2.00G             0B         0B
rpool/data                4.12G  14.1T        0B    192K             0B      14.1T
rpool/data/vm-100-disk-0  4.12G  14.1T        0B   14.1T             0B         0B

zfs get all (see below) says 14.1 TiB is allocated for this 9.77 TiB disk while "only" 7.05 TiB is effectively used, meaning I lost exactly half of the usable disk space with the default PVE ZFS parameters.

I read a few things about ZFS, and it looks like for zvols the volblocksize parameter (8k by default) can be tuned to achieve better disk utilization (and recordsize for datasets).

Maybe I missed other important parameters.

However, I'm no ZFS expert; what would you recommend in my case?

And what if I upgrade to an 8x 8 TB disk RAIDZ2?

Code:
# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.140-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-6
pve-kernel-helper: 6.4-6
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-1
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1

Code:
# zfs get all rpool/data/vm-100-disk-0
NAME                      PROPERTY              VALUE                  SOURCE
rpool/data/vm-100-disk-0  type                  volume                 -
rpool/data/vm-100-disk-0  creation              Fri Feb 19 10:02 2021  -
rpool/data/vm-100-disk-0  used                  14.1T                  -
rpool/data/vm-100-disk-0  available             4.12G                  -
rpool/data/vm-100-disk-0  referenced            14.1T                  -
rpool/data/vm-100-disk-0  compressratio         1.00x                  -
rpool/data/vm-100-disk-0  reservation           none                   default
rpool/data/vm-100-disk-0  volsize               9.77T                  local
rpool/data/vm-100-disk-0  volblocksize          8K                     default
rpool/data/vm-100-disk-0  checksum              on                     default
rpool/data/vm-100-disk-0  compression           on                     inherited from rpool
rpool/data/vm-100-disk-0  readonly              off                    default
rpool/data/vm-100-disk-0  createtxg             423                    -
rpool/data/vm-100-disk-0  copies                1                      default
rpool/data/vm-100-disk-0  refreservation        none                   default
rpool/data/vm-100-disk-0  guid                  17557223982948174465   -
rpool/data/vm-100-disk-0  primarycache          all                    default
rpool/data/vm-100-disk-0  secondarycache        all                    default
rpool/data/vm-100-disk-0  usedbysnapshots       0B                     -
rpool/data/vm-100-disk-0  usedbydataset         14.1T                  -
rpool/data/vm-100-disk-0  usedbychildren        0B                     -
rpool/data/vm-100-disk-0  usedbyrefreservation  0B                     -
rpool/data/vm-100-disk-0  logbias               latency                default
rpool/data/vm-100-disk-0  objsetid              145                    -
rpool/data/vm-100-disk-0  dedup                 off                    default
rpool/data/vm-100-disk-0  mlslabel              none                   default
rpool/data/vm-100-disk-0  sync                  standard               inherited from rpool
rpool/data/vm-100-disk-0  refcompressratio      1.00x                  -
rpool/data/vm-100-disk-0  written               14.1T                  -
rpool/data/vm-100-disk-0  logicalused           7.05T                  -
rpool/data/vm-100-disk-0  logicalreferenced     7.05T                  -
rpool/data/vm-100-disk-0  volmode               default                default
rpool/data/vm-100-disk-0  snapshot_limit        none                   default
rpool/data/vm-100-disk-0  snapshot_count        none                   default
rpool/data/vm-100-disk-0  snapdev               hidden                 default
rpool/data/vm-100-disk-0  context               none                   default
rpool/data/vm-100-disk-0  fscontext             none                   default
rpool/data/vm-100-disk-0  defcontext            none                   default
rpool/data/vm-100-disk-0  rootcontext           none                   default
rpool/data/vm-100-disk-0  redundant_metadata    all                    default
rpool/data/vm-100-disk-0  encryption            off                    default
rpool/data/vm-100-disk-0  keylocation           none                   default
rpool/data/vm-100-disk-0  keyformat             none                   default
rpool/data/vm-100-disk-0  pbkdf2iters           0                      default
 
Yes, this is where I read about volblocksize.

However it's not clear what value is optimal in my case: 128k? 1 MB? 4 MB (like Ceph does, I think)?

Also, regarding the following advice: "When doing this, the guest needs to be tuned accordingly and, depending on the use case, the problem of write amplification is just moved from the ZFS layer up to the guest." How does that translate to a Debian guest with ext4?

I assume the disk in the guest is seen with traditional 512 B sectors, and ext4 has "blocks" of 8 sectors, hence 4k by default, to match the memory page size. I haven't seen recommendations to change this value; the ext4 wiki says "You may experience mounting problems if block size is greater than page size (i.e. 64KiB blocks on a i386 which only has 4KiB memory pages)".

Note: the data stored on the VM is already compressed.

Thanks again!
 
The usual advice is something in the ballpark of 64k, but you can play around with that yourself. ext4 has a bigalloc feature (it doesn't change the block size itself, but changes allocation and some metadata structures to use clusters of blocks instead; see the mkfs.ext4 / ext4 manpages for bigalloc/cluster-size). stripe_width and stride might also help.

But mostly it's up to your workload; e.g., your application(s) might have their own tuning parameters for things like journaling files, ... if they are I/O heavy.
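As a rough illustration of the bigalloc part (an untested sketch; the device name and the 64 KiB cluster size are placeholders to adapt), formatting the guest's data disk could look something like this:

Code:
# Untested sketch: create an ext4 filesystem with the bigalloc feature and a
# 64 KiB (65536 byte) cluster size. The block size stays at 4k, but space is
# allocated in 64k clusters. /dev/sdX1 is a placeholder for the guest disk.
mkfs.ext4 -O bigalloc -C 65536 /dev/sdX1
# stride / stripe_width hints can additionally be passed via -E; see mke2fs(8).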
 
@guerby Go to this spreadsheet: https://docs.google.com/spreadsheet...jHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=1576424058
Select the tab "Raidz2 total parity cost in % of total storage" and scroll down to the bottom, where the power-of-2 block sizes that you can actually use are listed.

Then divide your volblocksize by your sector size (4k for ashift 12) to get your sectors per block and look up your overhead in the table.

For example, with a volblocksize of 8k you have 8k/4k = 2 sectors per block, and with 6 disks the table gives a space overhead of 67%. In other words, you need 3 times the storage (+200%) compared to having no parity.

But with a volblocksize of 64k you have 64k/4k = 16 sectors per block, and with 6 disks the table gives a space overhead of 33% of the whole. In other words, you only need 50% more storage than with no parity.

You can see from the table that with raidz2, ashift=12 and just 6 disks you don't get any further storage efficiency benefit above a volblocksize of 16k.

How do I know this? Because I had the same question/problem before: https://www.reddit.com/r/zfs/comments/opu43n/zvol_used_size_far_greater_than_volsize/
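If you don't want to dig through the spreadsheet, here is a small shell sketch of the same calculation I put together (an approximation that assumes ashift=12, i.e. 4K sectors, and RAIDZ2, where each allocation is padded up to a multiple of 3 sectors):

Code:
# Approximate RAIDZ2 allocation per volblocksize (volblocksize in KiB, second
# argument is the number of disks; assumes ashift=12 / 4K sectors).
raidz2_overhead() {
    local vbs_k=$1 ndisks=$2 data parity total
    data=$(( vbs_k / 4 ))                                      # data sectors per block
    parity=$(( 2 * ( (data + ndisks - 3) / (ndisks - 2) ) ))   # 2 parity sectors per data stripe
    total=$(( data + parity ))
    total=$(( (total + 2) / 3 * 3 ))                           # pad up to a multiple of 3
    echo "volblocksize=${vbs_k}k allocated=$(( total * 4 ))k overhead=$(( (100 * (total - data) + total / 2) / total ))%"
}
raidz2_overhead  8 6   # -> allocated=24k, overhead=67% (the current setup)
raidz2_overhead 16 6   # -> allocated=24k, overhead=33%
raidz2_overhead 64 6   # -> allocated=96k, overhead=33%
raidz2_overhead 16 8   # the 8-disk pool you mentioned

For the 8x 8 TB RAIDZ2 you asked about, a 16k volblocksize also comes out at 4 data + 2 parity sectors per block, i.e. 33% overhead.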
 
I would also go with a 16K volblocksize. Especially if you plan to run some MySQL DBs you don't want a blocksize above 16K, because otherwise you get terrible write amplification and a performance loss. It's always bad to write with a smaller blocksize onto a bigger blocksize, especially for sync writes. And MySQL does sync writes of 16K blocks, so a volblocksize of 32K or higher would be bad.
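For what it's worth, if you go with 16K, I believe the blocksize used for newly created disks can be set on the ZFS storage entry in PVE; something along these lines (double-check the storage name and the PVE storage documentation for your version):

Code:
# Assumption: the ZFS-backed storage is called "local-zfs"; adjust to your setup.
pvesm set local-zfs --blocksize 16k
# This only affects newly created zvols; existing disks keep their volblocksize
# and have to be recreated or moved (e.g. "Move disk") to pick up the new value.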
 
Your problem is volblocksize. Because you use RAIDZ2 with 6 disks and ashift=12 (4k blocks), your minimum volblocksize is 4x4k (4 data + 2 parity blocks).
 
Because you use RAIDZ2 with 6 disks and ashift=12 (4k blocks), your minimum volblocksize is 4x4k (4 data + 2 parity blocks).
That's wrong. The minimum volblocksize for ashift 12 is 4k. But you are right that the ideal volblocksize here is 16K, both from a storage efficiency perspective (not taking compression into account) and from the perspective of VM workloads. But if in doubt, benchmark your specific workload.
 
That's wrong. The minimum volblocksize for ashift 12 is 4k. But you are right that the ideal volblocksize here is 16K, both from a storage efficiency perspective (not taking compression into account) and from the perspective of VM workloads. But if in doubt, benchmark your specific workload.
You can use a 4k volblocksize, but ZFS will write 4x 4k for every 4k from the zvol. A zvol with 4k blocks will use 4 GB on the zpool for every 1 GB of size. The thread creator has a zvol with the default 8k blocks, and his zvol uses 2 GB for every 1 GB of size.
 
You can use a 4k volblocksize, but ZFS will write 4x 4k for every 4k from the zvol. A zvol with 4k blocks will use 4 GB on the zpool for every 1 GB of size. The thread creator has a zvol with the default 8k blocks, and his zvol uses 2 GB for every 1 GB of size.
That's not correct. Padding blocks are only added to fill up the allocation to a multiple of parity + 1, so for raidz2 it has to be 3/6/9/12/... blocks.
So with a 4K volblocksize, a zvol write will always result in 1x 4K of data + 2x 4K of parity and no padding, no matter what the number of drives is. You lose 2/3 of your storage to parity; there is no padding because 3 is a multiple of 3.
If you use a volblocksize of 8K you are writing 2x 4K of data + 2x 4K of parity. 4 is not a multiple of 3, so it's rounded up to 6 blocks by adding 2x 4K padding blocks. Now you lose 1/3 to parity and 1/3 to padding, so in total you lose 2/3, just like with the 4K volblocksize before.
With a 16K volblocksize and 6 disks you write 4x 4K of data + 2x 4K of parity. 6 is a multiple of 3, so no padding blocks are required. Now you only lose 1/3 to parity and nothing to padding.

So with an 8K volblocksize it looks like everything is double the size, because for every 2 blocks of data another 2 blocks of padding are required, and those just waste space.

Here is a nice explanation by the OpenZFS head engineer of how raidz works and why there is padding overhead.
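Following that explanation, the numbers in this thread add up (a quick back-of-the-envelope check, assuming the default 8K volblocksize and ashift=12):

Code:
# Per 8K block: 2x 4K data + 2x 4K parity + 2x 4K padding. As described above,
# the zvol ends up charged roughly (data + padding) / data = 2x its logical data.
awk 'BEGIN { printf "expected used: %.2f TiB\n", 7.05 * (2 + 2) / 2 }'
# -> expected used: 14.10 TiB, matching used=14.1T vs logicalused=7.05T above.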
 
