I reverse engineered that formula once, but can't find the result. We can try that again:
Code:
B4 Raidz1 formula:
=((CEILING($A4+$A$3*FLOOR(($A4+B$3-$A$3-1)/(B$3-$A$3)),2))/$A4-1)/((CEILING($A4+$A$3*FLOOR(($A4+B$3-$A$3-1)/(B$3-$A$3)),2))/$A4)
B4 Raidz2 formula:
=((CEILING($A4+$A$3*FLOOR(($A4+B$3-$A$3-1)/(B$3-$A$3)),3))/$A4-1)/(((CEILING($A4+$A$3*FLOOR(($A4+B$3-$A$3-1)/(B$3-$A$3)),3))/$A4))
B4 Raidz3 formula:
=((CEILING($A4+$A$3*FLOOR(($A4+B$3-$A$3-1)/(B$3-$A$3)),4))/$A4-1)/((CEILING($A4+$A$3*FLOOR(($A4+B$3-$A$3-1)/(B$3-$A$3)),4))/$A4)
So the only difference between the formulas is that raidz1 will round up to a multiple of 2, raidz2 to a multiple of 3 and raidz3 to a multiple of 4 (so always ParityDisks + 1). Let's call this number CeilFactor.
"$A$3" is the number of parity disks of the vdev. Let us call it
ParityDisks
"$A4" is then number of sectors, or in other words "volblocksize / 2^ashift". Let us call it
Sectors.
"B$3" is the total number of disks of the vdev. Let us call it
TotalDisks
Let's have a look at the B4 Raidz1 formula and make it a bit more readable:
Code:
(
(
CEILING( Sectors + ParityDisks *
FLOOR( ( Sectors + TotalDisks - ParityDisks - 1) / ( TotalDisks - ParityDisks ) )
, CeilFactor)
) / Sectors - 1
)
/
(
(
CEILING ( Sectors + ParityDisks *
FLOOR( ( Sectors + TotalDisks - ParityDisks - 1) / ( TotalDisks - ParityDisks ) )
, CeilFactor)
) / Sectors
)
To make it even more readable we could shorten it by replacing "TotalDisks - ParityDisks" with DataDisks:
Code:
(
(
CEILING( Sectors + ParityDisks *
FLOOR( ( Sectors + DataDisks - 1) / DataDisks )
, CeilFactor)
) / Sectors - 1
)
/
(
(
CEILING ( Sectors + ParityDisks *
FLOOR( ( Sectors + DataDisks - 1) / DataDisks )
, CeilFactor)
) / Sectors
)
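As a side note, here is that formula as a minimal Python sketch (the function and variable names are my own, not from the spreadsheet). Excel's CEILING(x, n) rounds x up to a multiple of n, and the FLOOR is just an integer floor division:
Code:
import math

def raidz_overhead(sectors, data_disks, parity_disks):
    """Combined parity+padding overhead fraction of a raidz vdev."""
    # sectors      = volblocksize / 2^ashift
    # data_disks   = total disks - parity disks
    # parity_disks = 1, 2 or 3 for raidz1/2/3
    ceil_factor = parity_disks + 1  # 2 for raidz1, 3 for raidz2, 4 for raidz3
    # parity sectors are needed once per started row of data_disks data sectors:
    rows = (sectors + data_disks - 1) // data_disks
    # data + parity sectors, rounded up to a multiple of ceil_factor (the padding):
    allocated = math.ceil((sectors + parity_disks * rows) / ceil_factor) * ceil_factor
    return (allocated / sectors - 1) / (allocated / sectors)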
With that you can calculate the parity+padding overhead for any number of disks and any number of sectors (in other words, any volblocksize) of a raidz1.
Let's for example use Sectors = 4 and DataDisks = 8. ParityDisks is always 1 and CeilFactor always 2 for a raidz1:
Code:
(
(
CEILING( 4 + 1 *
FLOOR( ( 4 + 8 - 1) / 8 )
, 2)
) / 4 - 1
)
/
(
(
CEILING ( 4 + 1 *
FLOOR( ( 4 + 8 - 1) / 8 )
, 2)
) / 4
)
becomes...
(
(
CEILING( 4 + 1 *
FLOOR( 1.375 )
, 2)
) / 4 - 1
)
/
(
(
CEILING ( 4 + 1 *
FLOOR( 1.375 )
, 2)
) / 4
)
becomes...
(
CEILING( 4 + 1 * 1 , 2) / 4 - 1
)
/
(
CEILING ( 4 + 1 * 1 , 2) / 4
)
becomes...
(
CEILING( 5 , 2) / 4 - 1
)
/
(
CEILING ( 5 , 2) / 4
)
becomes...
( 6 / 4 - 1 ) / ( 6 / 4 )
becomes...
0.5 / 1.5
results in...
0.3333333
And if you look at the table for raidz1 and 9 total disks and 4 sectors you will also see 33% combined parity+padding loss.
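Plugging the same numbers into the Python sketch from above gives the same result:
Code:
raidz_overhead(sectors=4, data_disks=8, parity_disks=1)  # -> 0.3333...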
Parity loss is always "ParityDisks / TotalDisks" or "ParityDisks / (DataDisks + ParityDisks)".
So in our example above that would be:
Code:
ParityDisks / (DataDisks + ParityDisks)
becomes...
1 / (8 + 1)
results in...
0.111111
If you now want to find out what just the padding loss is, you subtract the parity loss from the combined parity+padding loss:
0.3333333 - 0.111111 = 0.222222
So there is 11% parity loss and 22% padding loss, adding up to 33% total capacity loss.
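Written out with the Python sketch from above (again, the names are mine):
Code:
combined = raidz_overhead(sectors=4, data_disks=8, parity_disks=1)  # 0.333333
parity_loss = 1 / (8 + 1)                                           # 0.111111
padding_loss = combined - parity_loss                               # 0.222222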
And padding loss is indirect. Your pool won't show it, as the pool doesn't become smaller; instead the zvols become bigger. The result is the same, you can store less on your pool, but most people just don't get that padding overhead, as ZFS won't show it anywhere when reporting the total or available pool size.
If you want to know how much bigger your zvols will get, you can calculate this:
Code:
ZvolSize = (1 - ParityLoss) / (1 - ParityLoss - PaddingLoss)
In our example this would result in:
Code:
ZvolSize = (1 - 0.111111) / (1 - 0.111111 - 0.222222)
Results in...
ZvolSize = 1.333333
So all zvols will be 133% in size, meaning that storing 1TB of data on a zvol will cause the zvol to consume 1.33TB of the pool's capacity, as for every 1TB of data blocks another 333GB of empty padding blocks have to be stored.
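Or, continuing the Python sketch:
Code:
zvol_factor = (1 - parity_loss) / (1 - parity_loss - padding_loss)  # 1.333333
# every 1 TB of data written to a zvol consumes 1.33 TB of pool capacity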
Now let us say the 9 disks from our example are each 2TB in size.
So we got a total raw capacity (which zpool list will report) of 18TB (because 9 * 2TB).
zfs list will report a capacity of 16TB, as it already subtracted the capacity used to store parity data (so 9 * 2TB raw storage - 2TB parity data).
For datasets that would be true, as datasets have no padding overhead. We could indeed store 16TB of files in datasets on that pool.
But this isn't true for zvols, as our zvols would be 133% in size. We can only store 12TB of zvols, because 12TB of zvols would consume 16TB of space. This is why I told you that padding overhead is indirect.
And then keep in mind that a ZFS pool should always have 20% of free space to operate optimally. So you actually have to subtract an additional 20% when calculating your real available capacity. So for 100% datasets this would be 12.8TB of real usable capacity. For 100% zvols it would be just 9.6TB of real usable capacity. For 50% datasets + 50% zvols it would be 11.2TB of real usable capacity and so on.
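Putting the whole example together as a rough Python sketch (the 0.8 is the 20% free space rule of thumb):
Code:
raw = 9 * 2.0                                        # 18 TB raw (zpool list)
after_parity = raw * 8 / 9                           # 16 TB (zfs list)
datasets_usable = after_parity * 0.8                 # 12.8 TB if 100% datasets
zvols_usable = after_parity / zvol_factor * 0.8      #  9.6 TB if 100% zvols
mixed_usable = (datasets_usable + zvols_usable) / 2  # 11.2 TB at 50%/50%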