[SOLVED] Unexpected pool usage of ZVOL created on RAIDZ3 Pool vs. Mirrored Pool

apoc

Hello all,

I hope someone can bring some light into my confusion and help me understand some (from my perspective) odd behavior I am seeing.

Some background information: My Proxmox setup had been running fine for almost three years. I was using a ZPOOL made of 5 individual mirrored vDevs.
While I thought I understood and accepted the risk, namely that the two disks of a vDev would never fail simultaneously, I was proven wrong last week.
One disk died. The other one in that mirror then started throwing a lot of read errors, so my pool was effectively gone, corrupted all over the place.
So far so good (or bad). Backups were in place, I tried to recover the pool, wasted a lot of time and finally dumped it completely.

I thought I would use the opportunity (and the lesson) to redesign my storage pool(s). Since the large HDD-backed pool primarily stores backups, I thought: give RAIDZ3 a try. Any 3 of the 8 disks I was planning to use could die...
I went off upgrading to PVE 6.1 and ZoL 0.8.3, creating the new pool containing one vDev. I added a mirrored SLOG and two L2ARC devices to speed things up a little. This is how it looks:
Code:
    NAME             STATE     READ WRITE CKSUM
    HDD-POOL-RAIDZ3  ONLINE       0     0     0
      raidz3-0       ONLINE       0     0     0
        C0-S0        ONLINE       0     0     0
        C0-S1        ONLINE       0     0     0
        C0-S2        ONLINE       0     0     0
        C0-S3        ONLINE       0     0     0
        C0-S4        ONLINE       0     0     0
        C0-S5        ONLINE       0     0     0
        C0-S6        ONLINE       0     0     0
        C0-S7        ONLINE       0     0     0
    logs  
      mirror-1       ONLINE       0     0     0
        ON-S3-part1  ONLINE       0     0     0
        C1-S3-part1  ONLINE       0     0     0
    cache
      ON-S3-part2    ONLINE       0     0     0
      C1-S3-part2    ONLINE       0     0     0
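
For reference, building this layout takes roughly the following commands (reconstructed from the status output and the pool properties further down; the C0-S*/C1-S*/ON-S3 names are, I assume, vdev_id aliases, so treat this as a sketch rather than my literal shell history):
Code:
zpool create -o ashift=12 HDD-POOL-RAIDZ3 raidz3 C0-S0 C0-S1 C0-S2 C0-S3 C0-S4 C0-S5 C0-S6 C0-S7
zpool add HDD-POOL-RAIDZ3 log mirror ON-S3-part1 C1-S3-part1
zpool add HDD-POOL-RAIDZ3 cache ON-S3-part2 C1-S3-part2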

Checking the size of the pool shows 5.45 TB in size. That is expected, since I am using 8x 750 GB disks, which is roughly 8x 700 GiB (about 5,600 GiB).
Code:
NAME              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
HDD-POOL-RAIDZ3  5.45T  1.28T  4.17T        -         -     0%    23%  1.00x    ONLINE  -
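
A quick sanity check on that figure (decimal gigabytes converted to the binary TiB that zpool reports):
Code:
# 8 x 750 GB (decimal) expressed in TiB
echo "scale=2; 8 * 750 * 10^9 / 2^40" | bc
# -> 5.45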

A "zfs list" reported a little over 3TB available space (used + avail in the following output), which was logical to me because the parity will eat up 3 drives. So my capacity is around 5x 700 GB = 3.500 GB
Code:
NAME                                                      USED  AVAIL     REFER  MOUNTPOINT
HDD-POOL-RAIDZ3                                           758G  2.27T      219K  /HDD-POOL-RAIDZ3
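
The same back-of-the-envelope maths for the nominally usable part (5 of the 8 disks carry data):
Code:
# nominal usable capacity: 5/8 of the raw 5.45 TiB
echo "scale=2; 5.45 * 5 / 8" | bc
# -> 3.40

The roughly 3.0 TiB that zfs list actually reports is a bit less than that, presumably because ZFS reserves some slop space and uses its own (more pessimistic) raidz space accounting.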

I started to create my volumes on top of the pool, and this is where it gets really weird: the system told me "there is not enough space" for my largest backup volume (after I had only copied a few hundred GB).
This is the behavior I see (on a smaller disk):

On the RAIDZ3 pool creation of a 5GB volume results in the following consumption on the pool:
Code:
NAME                                                      USED  AVAIL     REFER  MOUNTPOINT
HDD-POOL-RAIDZ3/vm-1002-disk-0                           11.6G  2.28T      128K  -

Yes. Exactly. It eats up roughly 12 GB in the pool. What the heck is going on here? That is
  • more than twice the space I asked for (5 GB)
  • on top of the 3 disks already set aside for parity
Adding the same 5 GB disk to another ZFS pool based on mirrored vDevs shows the kind of overhead I am used to:
Code:
NAME                                                      USED  AVAIL     REFER  MOUNTPOINT
SSD-POOL/vm-1002-disk-0                                  5.16G   354G       56K  -

I can see the same behavior at a larger scale: a 1205 GB volume eats up 2.29 TB in the pool!
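Putting rough numbers on the overhead factor (USED reported by zfs list divided by the nominal volume size, assuming binary units throughout):
Code:
echo "scale=2; 11.6 / 5" | bc              # RAIDZ3 pool, 5G zvol    -> 2.32
echo "scale=2; 5.16 / 5" | bc              # mirror pool, 5G zvol    -> 1.03
echo "scale=2; 2.29 * 1024 / 1205" | bc    # RAIDZ3 pool, 1205G zvol -> 1.94

So the RAIDZ3 pool consistently charges roughly twice the nominal size, while the mirror pool charges almost exactly what I asked for.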

<edit>Forgot to mention: the behavior is exactly the same regardless of whether I use the UI or the CLI.</edit>

I am totally confused/puzzled about my "understanding" of RAIDZ3 and ZPOOLs in general. I couldn't find anything via Google, Bing and so on. People explain a "hidden cost" when growing such a RAIDZ pool (in terms of vDevs needing to be the same), but no one mentions that you can apparently only use about half of your storage (which is what I seem to be experiencing).

This is my first try at anything other than mirrored vDevs, so someone might just tell me: "It's expected, mate." If that is the case, could you please explain?

Thanks for your help.
 
Some more information:
I found that a filesystem (dataset) exported from the RAIDZ3 pool seems to only consume the storage that is actually placed in it.

The share:
Code:
root@proxmox:/HDD-POOL-RAIDZ3/ISO-IMAGES/template/iso$ du -h
16G

"zfs list" reports
Code:
NAME                                                      USED  AVAIL     REFER  MOUNTPOINT
HDD-POOL-RAIDZ3/ISO-IMAGES                               15.1G  2.26T     15.1G  /HDD-POOL-RAIDZ3/ISO-IMAGES

That seems to be fairly consistent, so my observation must be related to ZFS block volumes (zvols). I still don't see the reasoning behind it.
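For reference, the per-zvol properties that feed into that USED number can be inspected like this (using the 5G test volume from above); for a thick-provisioned zvol I would expect most of the USED to sit in the refreservation rather than in actually written data:
Code:
zfs get volsize,volblocksize,refreservation,usedbydataset,usedbyrefreservation HDD-POOL-RAIDZ3/vm-1002-disk-0
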
Anyone else please?
 
I still have some hope that someone can help me understand this. Here is something I have thought of:
I read yesterday that ZFS uses variable-width stripes for data and parity, so it cannot be compared to classic RAID behavior (e.g. RAID 5/6/7) per se.

I thought I'd check the "block sizes" of my volumes.
According to the "zfs get all" command, the default volblocksize of a zvol is 8K:

Code:
NAME                                         PROPERTY            VALUE              SOURCE
SSD-POOL/vm-4000-disk-1-8g-boot              volblocksize        8K                 default

Whenever I use the 8K default on the RAIDZ3 pool, I observe the odd behavior.
So I thought, let's see how it behaves with other block sizes.
I used the following commands to create some ZVOLs:
Code:
sudo zfs create -V 10gb -b 512b HDD-POOL-RAIDZ3/test-512b-bs
sudo zfs create -V 10gb -b 1k HDD-POOL-RAIDZ3/test-1k-bs
sudo zfs create -V 10gb -b 2k HDD-POOL-RAIDZ3/test-2k-bs
sudo zfs create -V 10gb -b 4k HDD-POOL-RAIDZ3/test-4k-bs
sudo zfs create -V 10gb -b 8k HDD-POOL-RAIDZ3/test-8k-bs
sudo zfs create -V 10gb -b 16k HDD-POOL-RAIDZ3/test-16k-bs
sudo zfs create -V 10gb -b 32k HDD-POOL-RAIDZ3/test-32k-bs
sudo zfs create -V 10gb -b 64k HDD-POOL-RAIDZ3/test-64k-bs
sudo zfs create -V 10gb -b 128k HDD-POOL-RAIDZ3/test-128k-bs

This is the result:
Code:
NAME                                                      USED  AVAIL     REFER  MOUNTPOINT
HDD-POOL-RAIDZ3/test-512b-bs                              188G  2.09T      128K  -
HDD-POOL-RAIDZ3/test-1k-bs                               93.9G  2.00T      128K  -
HDD-POOL-RAIDZ3/test-2k-bs                               47.0G  1.95T      128K  -
HDD-POOL-RAIDZ3/test-4k-bs                               23.5G  1.93T      128K  -
HDD-POOL-RAIDZ3/test-8k-bs                               23.2G  1.93T      128K  -
HDD-POOL-RAIDZ3/test-16k-bs                              11.6G  1.92T      128K  -
HDD-POOL-RAIDZ3/test-32k-bs                              11.5G  1.92T      128K  -
HDD-POOL-RAIDZ3/test-64k-bs                              10.0G  1.92T      128K  -
HDD-POOL-RAIDZ3/test-128k-bs                             10.0G  1.92T      128K  -
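
To make the pattern easier to spot, here are the same numbers expressed as overhead factors (USED divided by the nominal 10G, values taken from the output above):
Code:
for pair in 512b:188 1k:93.9 2k:47.0 4k:23.5 8k:23.2 16k:11.6 32k:11.5 64k:10.0 128k:10.0; do
    bs=${pair%%:*}; used=${pair##*:}
    printf '%-6s %sx\n' "$bs" "$(echo "scale=1; $used/10" | bc)"
done
# 512b 18.8x, 1k 9.3x, 2k 4.7x, 4k 2.3x, 8k 2.3x, 16k 1.1x, 32k 1.1x, 64k 1.0x, 128k 1.0x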

As we can clearly see, 8K seems to be a "bad spot". It gets worse with smaller blocks, but it can be improved significantly by using larger block sizes.
So why is that?

sudo zpool get all HDD-POOL-RAIDZ3 shows no obvious pool-level block size I could relate things to and start doing some maths with.
Code:
NAME             PROPERTY                       VALUE                          SOURCE
HDD-POOL-RAIDZ3  size                           5.45T                          -
HDD-POOL-RAIDZ3  capacity                       20%                            -
HDD-POOL-RAIDZ3  altroot                        -                              default
HDD-POOL-RAIDZ3  health                         ONLINE                         -
HDD-POOL-RAIDZ3  guid                           2262302775495901210            -
HDD-POOL-RAIDZ3  version                        -                              default
HDD-POOL-RAIDZ3  bootfs                         -                              default
HDD-POOL-RAIDZ3  delegation                     on                             default
HDD-POOL-RAIDZ3  autoreplace                    off                            default
HDD-POOL-RAIDZ3  cachefile                      -                              default
HDD-POOL-RAIDZ3  failmode                       wait                           default
HDD-POOL-RAIDZ3  listsnapshots                  off                            default
HDD-POOL-RAIDZ3  autoexpand                     off                            default
HDD-POOL-RAIDZ3  dedupditto                     0                              default
HDD-POOL-RAIDZ3  dedupratio                     1.00x                          -
HDD-POOL-RAIDZ3  free                           4.33T                          -
HDD-POOL-RAIDZ3  allocated                      1.12T                          -
HDD-POOL-RAIDZ3  readonly                       off                            -
HDD-POOL-RAIDZ3  ashift                         12                             local
HDD-POOL-RAIDZ3  comment                        -                              default
HDD-POOL-RAIDZ3  expandsize                     -                              -
HDD-POOL-RAIDZ3  freeing                        0                              -
HDD-POOL-RAIDZ3  fragmentation                  0%                             -
HDD-POOL-RAIDZ3  leaked                         0                              -
HDD-POOL-RAIDZ3  multihost                      off                            default
HDD-POOL-RAIDZ3  checkpoint                     -                              -
HDD-POOL-RAIDZ3  load_guid                      17941881024137393072           -
HDD-POOL-RAIDZ3  autotrim                       off                            default
HDD-POOL-RAIDZ3  feature@async_destroy          enabled                        local
HDD-POOL-RAIDZ3  feature@empty_bpobj            active                         local
HDD-POOL-RAIDZ3  feature@lz4_compress           active                         local
HDD-POOL-RAIDZ3  feature@multi_vdev_crash_dump  enabled                        local
HDD-POOL-RAIDZ3  feature@spacemap_histogram     active                         local
HDD-POOL-RAIDZ3  feature@enabled_txg            active                         local
HDD-POOL-RAIDZ3  feature@hole_birth             active                         local
HDD-POOL-RAIDZ3  feature@extensible_dataset     active                         local
HDD-POOL-RAIDZ3  feature@embedded_data          active                         local
HDD-POOL-RAIDZ3  feature@bookmarks              enabled                        local
HDD-POOL-RAIDZ3  feature@filesystem_limits      enabled                        local
HDD-POOL-RAIDZ3  feature@large_blocks           enabled                        local
HDD-POOL-RAIDZ3  feature@large_dnode            enabled                        local
HDD-POOL-RAIDZ3  feature@sha512                 enabled                        local
HDD-POOL-RAIDZ3  feature@skein                  enabled                        local
HDD-POOL-RAIDZ3  feature@edonr                  enabled                        local
HDD-POOL-RAIDZ3  feature@userobj_accounting     active                         local
HDD-POOL-RAIDZ3  feature@encryption             enabled                        local
HDD-POOL-RAIDZ3  feature@project_quota          active                         local
HDD-POOL-RAIDZ3  feature@device_removal         enabled                        local
HDD-POOL-RAIDZ3  feature@obsolete_counts        enabled                        local
HDD-POOL-RAIDZ3  feature@zpool_checkpoint       enabled                        local
HDD-POOL-RAIDZ3  feature@spacemap_v2            active                         local
HDD-POOL-RAIDZ3  feature@allocation_classes     enabled                        local
HDD-POOL-RAIDZ3  feature@resilver_defer         enabled                        local
HDD-POOL-RAIDZ3  feature@bookmark_v2            enabled                        local
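
The closest thing to a "block size" in that output is the ashift value: ashift=12 means the pool allocates in 4 KiB units, which would be the number any per-block maths has to start from.
Code:
zpool get ashift HDD-POOL-RAIDZ3
NAME             PROPERTY  VALUE   SOURCE
HDD-POOL-RAIDZ3  ashift    12      local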
 
Thanks DanielB for pointing that out. I think we can conclude that what I see is expected.
The padding explanation makes total sense.
I have now chosen a 64K volume blocksize, as it is the lowest blocksize that does not create the additional overhead.
Hopefully this thread is useful to others as well.
 
Dunuin shared something which I would like to cross-reference here for documentation purposes (note the link!):

"The wasted space by padding overhead is also interesting to see. If you are using ashift=13 and want to test a raidz1 with 5 SSDs you should always waste space if your volblocksize is below 64K (see here) and that wasted space should increase the write amplfication too."
 
