Block Size for PBS/ZFS

Mar 20, 2023
Disclaimer: I tried my best to search for an answer to this, but couldn't find anything, even though I'm sure the discussion has come up already.

So I have just created a PBS server with:
  • 6x 20TB HDDs (mirrored) as the 'backups' pool
  • 6x 4TB SSDs (mirrored) as the special devices for said pool
I wanted to run some test backups to see what kind of `special_small_blocks` setting I could/should set for the opt-in file storage on the special devices. I'm trying 16K initially, and on first impression, the speeds on the backup runs are looking very good for my use case!

However, one thing I noticed is that when I try to configure the record size of the pool, it seems to be limited to a maximum of 1MB? I'm quite sure I read that this limit was increased in the past, as I've seen people discuss using up to ~16MB record sizes elsewhere. Am I missing something?

I expected that setting it to 4MB might be ideal, since as far as I'm aware, PBS stores data in (up to?) 4MB chunks? I'd just like to avoid read/write amplification as best I can, and also still benefit from a high compression ratio.

Any advice to help me tune these settings would be appreciated! :)
 
1MB is the maximum (with the large_blocks feature enabled) - see the "zfsprops" and "zpool-features" man pages. That should already help a lot, since even big chunks (chunks are zstd-compressed, so a 4MB input size usually means a smaller actual chunk size on disk) should normally only need 1-3 records, improving the data:metadata ratio.
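For reference, checking the feature and setting the record size looks roughly like this (pool/dataset names below are just placeholders):

Code:
# check that the pool has the large_blocks feature enabled
zpool get feature@large_blocks tank
# set a 1M record size on the datastore dataset (only affects newly written data)
zfs set recordsize=1M tank/datastore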
 
OK thanks @fabian ... I had wondered if compression might take away some of the need for a 4MB ceiling. Since I have so much special device space, I'm not too worried about more metadata anyway.

Are there any other tips you might have with regard to tuning specifically for PBS? Other than the 1MB record size, I configured:
  • LZ4 Compression (not sure if this affects PBS after your `zstd` comment)
  • `xattr=sa`
  • `acltype=posixacl`
  • `atime=off`
  • `special_small_blocks=16K` (will review/change this later)

Most of these settings are what I use on my pure-SSD pools on PVE hosts, so I'm not sure if there's anything to add/remove given that HDDs are the base of this particular pool. I left the metaslab allocator option enabled also.
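For reference, these were just applied with plain zfs set commands along these lines (dataset name is a placeholder):

Code:
zfs set compression=lz4 tank/datastore
zfs set xattr=sa tank/datastore
zfs set acltype=posixacl tank/datastore
zfs set atime=off tank/datastore
zfs set special_small_blocks=16K tank/datastore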
 
LZ4 is still needed for ZFS to make files sparse, and it basically costs nothing. I'd recommend setting atime=on relatime=on (the current GC implementation should work without, but again, relatime costs almost nothing). The right special_small_blocks value probably depends on the size of the indices. It should provide some speedup for operations like snapshot listing if your backup metadata (the files outside of the .chunks dir) mostly fits on the special vdevs, if you have enough space for that. The size of the indices is directly correlated with the logical size of the disks/archives they represent.
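A minimal sketch of that suggestion, plus a rough way to see how much datastore metadata (everything outside .chunks) would need to fit on the special vdevs - dataset and path names are placeholders:

Code:
# relatime instead of disabling atime entirely
zfs set atime=on tank/datastore
zfs set relatime=on tank/datastore
# rough split between chunk data and everything else (indices, notes, logs)
du -sh /mnt/datastore/backups/.chunks
du -sh --exclude=.chunks /mnt/datastore/backups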
 
Thanks for your help so far @fabian, I've done some further tests over the past day, and I'm a little confused by some of my findings.

1. It seems that the backup task always takes the same amount of time to complete, no matter if pushing to a pure-SSD pool or an SSD-backed HDD pool. I'm guessing this bottleneck is likely the CPU in the PBS host (E5-2499 v4). It's fast enough though, so no big problem.
2. The PBS host has plenty of RAM to spare, yet an L2ARC sped up the verify tasks by approx 25%. Kind of unexpected.
3. The biggest gain I saw was going from 16K to 512K for the special_small_blocks setting.

However, I'm still a little perplexed and not sure if this is just me lacking knowledge of either PBS or ZFS.
I noticed that after a backup, 51G out of 412G (asize) was in the 512K-or-smaller block sizes according to the histogram.

Code:
  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:    776   388K   388K    776   388K   388K      0      0      0
     1K:    514   596K   984K    514   596K   984K      0      0      0
     2K:    132   350K  1.30M    132   350K  1.30M      0      0      0
     4K:   367K  1.43G  1.43G    252  1.41M  2.71M  1.41K  5.65M  5.65M
     8K:  19.4K   187M  1.62G    472  5.44M  8.15M   385K  3.04G  3.04G
    16K:  16.8K   369M  1.98G   135K  2.14G  2.14G  16.0K   335M  3.37G
    32K:  21.9K  1.09G  3.07G  10.8K   615M  2.74G  17.0K   775M  4.13G
    64K:  20.2K  1.88G  4.95G  15.0K  1.43G  4.18G  26.5K  2.30G  6.43G
   128K:  8.04K  1.44G  6.39G   238K  29.9G  34.1G  8.11K  1.45G  7.88G
   256K:  12.2K  4.57G  11.0G  1.92K   720M  34.8G  12.2K  4.57G  12.5G
   512K:  52.7K  40.0G  51.0G  9.42K  8.44G  43.2G  51.2K  38.6G  51.1G
     1M:   359K   359G   410G   466K   466G   509G   360K   360G   412G
     2M:      0      0   410G      0      0   509G      0      0   412G
     4M:      0      0   410G      0      0   509G      0      0   412G
     8M:      0      0   410G      0      0   509G      0      0   412G
    16M:      0      0   410G      0      0   509G      0      0   412G


Running zpool iostat -v also showed me that there was 399G stored on the HDDs, but only 12.6G on the special devices. This adds up fine with the above figure of 412G, but I actually expected the special usage to be higher based on what I saw in the histogram output (the pool was wiped beforehand, to make this accurate per backup task).

Code:
pool-hdd                                                  412G  64.6T     36     63  19.4M  14.8M
  mirror-0                                                133G  18.1T      6      5  6.30M  4.69M
    wwn-0x5000cca2c7627e00                                   -      -      3      2  3.15M  2.35M
    wwn-0x5000cca2c761d000                                   -      -      3      2  3.15M  2.35M
  mirror-1                                                133G  18.1T      6      5  6.30M  4.68M
    wwn-0x5000cca2c761399c                                   -      -      3      2  3.15M  2.34M
    wwn-0x5000cca2ed00c4e8                                   -      -      3      2  3.15M  2.34M
  mirror-2                                                133G  18.1T      6      5  6.28M  4.67M
    wwn-0x5000cca2c761d664                                   -      -      3      2  3.14M  2.34M
    wwn-0x5000cca2c76200e8                                   -      -      3      2  3.14M  2.34M
special                                                      -      -      -      -      -      -
  mirror-3                                               4.18G  3.48T      5     15   165K   266K
    wwn-0x5002538b71501f50                                   -      -      2      7  82.1K   133K
    wwn-0x5002538b715015a0                                   -      -      2      7  82.9K   133K
  mirror-4                                               4.18G  3.48T      5     16   164K   267K
    wwn-0x5002538b71501540                                   -      -      2      8  81.9K   134K
    wwn-0x5002538b715015e0                                   -      -      2      8  81.9K   134K
  mirror-5                                               4.19G  3.48T      5     15   164K   262K
    wwn-0x5002538b71501690                                   -      -      2      7  81.9K   131K
    wwn-0x5002538b71501660                                   -      -      2      7  81.7K   131K
logs                                                         -      -      -      -      -      -
  nvme0n1p1                                                  0   127G      0      0     99    505
cache                                                        -      -      -      -      -      -
  nvme0n1p2                                              4.75G  1.62T      0      2     99  1021K

I then tried to increase the special_small_blocks setting to 1M, purely as a test to see if the special devices got filled up more (obviously I ultimately need a value below the pool record size, like 512K). But they didn't fill up any more than at the 512K setting, which is not what I expected at all.

I'm not sure if there's a way for me to push any more data to the special devices than I already am, which I'd like to do, as there's quite a lot of space available there and any extra IOPS would be welcome.

I'm a bit confused and wondering if I'm just observing things incorrectly, since I'm not well-versed in PBS yet and fairly new to the ZFS world too. :)

Any help would be appreciated.
 
Another point I just thought of:

Before I turned this host from a PVE host into a PBS host, I was actually testing PBS on it as a VM. Which, now that I think about it, makes me curious why backups to a pure-SSD pool are so slow.
  • The above numbers were working out to about 1h40min per TB (even to a pure-SSD pool).
  • When it was virtualized on the host, it was pushing approx 1.2TB an hour.
So something definitely seems very wrong at the moment; surely, if anything, it should run better un-virtualized?
I found some mention of a benchmark tool and decided to run that (though it's against the HDD/SSD pool listed above).
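For reference, this is presumably the client's built-in benchmark, invoked roughly like this (repository string is a placeholder):

Code:
proxmox-backup-client benchmark --repository root@pam@<pbs-host>:<datastore>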

Code:
Time per request: 4558 microseconds.
TLS speed: 920.11 MB/s
SHA256 speed: 459.59 MB/s
Compression speed: 452.08 MB/s
Decompress speed: 638.00 MB/s
AES256/GCM speed: 1413.26 MB/s
Verify speed: 265.02 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 920.11 MB/s (75%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 459.59 MB/s (23%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 452.08 MB/s (60%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 638.00 MB/s (53%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 265.02 MB/s (35%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1413.26 MB/s (39%) │
└───────────────────────────────────┴────────────────────┘

Spec:
E5-2499 v4, 256GB RAM, pool listed in first post above + running off 3x 9300-8i HBAs.


I have to say the numbers do seem a bit underwhelming when I compare them to other people's benchmarks, but again, this all worked faster when virtualized with a ZVOL as the datastore, when I compare like-for-like testing with backups to pure-SSD pools. It doesn't make sense, does it?
 
How did you generate that first histogram?
Thanks for your help so far @fabian, I've done some further tests over the past day, and I'm a little confused by some of my findings.

1. It seems that the backup task always takes the same amount of time to complete, no matter if pushing to a pure-SSD pool or an SSD-backed HDD pool. I'm guessing this bottleneck is likely the CPU in the PBS host (E5-2499 v4). It's fast enough though, so no big problem.

Either the CPU, or the network, or the source storage :) Did you do your benchmark locally (client == server), or from some other machine to your PBS server?

2. The PBS host has plenty of RAM spare, yet an L2ARC sped up the verify tasks by approx 25%. Kind of unexpected.

Plenty of RAM to spare, or still plenty of ARC growth potential? By default, the ARC (ZFS's own cache) only grows up to 50% of your server's RAM.
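A quick way to check and (if needed) raise that limit - the value below is just an example for 192 GiB:

Code:
cat /sys/module/zfs/parameters/zfs_arc_max
# raise at runtime (bytes)
echo 206158430208 > /sys/module/zfs/parameters/zfs_arc_max
# make it persistent via /etc/modprobe.d/zfs.conf:
# options zfs zfs_arc_max=206158430208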

3. The biggest gain I saw was going from 16K to 512K for the special_small_blocks setting.

That lines up with the distribution below, I'd say.

However, I'm still a little perplexed and not sure if this is just me lacking knowledge of either PBS or ZFS.
I noticed that after a backup, 51G out of 412G (asize) was in the 512K-or-smaller block sizes according to the histogram.

Code:
  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:    776   388K   388K    776   388K   388K      0      0      0
     1K:    514   596K   984K    514   596K   984K      0      0      0
     2K:    132   350K  1.30M    132   350K  1.30M      0      0      0
     4K:   367K  1.43G  1.43G    252  1.41M  2.71M  1.41K  5.65M  5.65M
     8K:  19.4K   187M  1.62G    472  5.44M  8.15M   385K  3.04G  3.04G
    16K:  16.8K   369M  1.98G   135K  2.14G  2.14G  16.0K   335M  3.37G
    32K:  21.9K  1.09G  3.07G  10.8K   615M  2.74G  17.0K   775M  4.13G
    64K:  20.2K  1.88G  4.95G  15.0K  1.43G  4.18G  26.5K  2.30G  6.43G
   128K:  8.04K  1.44G  6.39G   238K  29.9G  34.1G  8.11K  1.45G  7.88G
   256K:  12.2K  4.57G  11.0G  1.92K   720M  34.8G  12.2K  4.57G  12.5G
   512K:  52.7K  40.0G  51.0G  9.42K  8.44G  43.2G  51.2K  38.6G  51.1G
     1M:   359K   359G   410G   466K   466G   509G   360K   360G   412G
     2M:      0      0   410G      0      0   509G      0      0   412G
     4M:      0      0   410G      0      0   509G      0      0   412G
     8M:      0      0   410G      0      0   509G      0      0   412G
    16M:      0      0   410G      0      0   509G      0      0   412G

So this is with recordsize 1M, and special_small_blocks 512K?

If you keep in mind that a record will never span the data of multiple files, that seems entirely plausible. You'll have chunks (and potentially the occasional index) that are over 1M, and thus occupy multiple 1M records. But you'll also have indices and smaller chunks < 1M that might occupy 512K records. And then you have a bit of metadata in even smaller records, but that doesn't amount to much (e.g., anything up to 128K is only 6.4G, or 1.5% of your total used space!). The block histogram is also a bit confusing - a file consisting of a single 1000K record is counted under 512K ;)
Running zpool iostat -v also showed me that there was 399G stored on the HDDs, but only 12.6G on the special devices. This adds up fine with the above figure of 412G, but I actually expected the special usage to be higher based on what I saw in the histogram output (the pool was wiped beforehand, to make this accurate per backup task).

Code:
pool-hdd                                                  412G  64.6T     36     63  19.4M  14.8M
  mirror-0                                                133G  18.1T      6      5  6.30M  4.69M
    wwn-0x5000cca2c7627e00                                   -      -      3      2  3.15M  2.35M
    wwn-0x5000cca2c761d000                                   -      -      3      2  3.15M  2.35M
  mirror-1                                                133G  18.1T      6      5  6.30M  4.68M
    wwn-0x5000cca2c761399c                                   -      -      3      2  3.15M  2.34M
    wwn-0x5000cca2ed00c4e8                                   -      -      3      2  3.15M  2.34M
  mirror-2                                                133G  18.1T      6      5  6.28M  4.67M
    wwn-0x5000cca2c761d664                                   -      -      3      2  3.14M  2.34M
    wwn-0x5000cca2c76200e8                                   -      -      3      2  3.14M  2.34M
special                                                      -      -      -      -      -      -
  mirror-3                                               4.18G  3.48T      5     15   165K   266K
    wwn-0x5002538b71501f50                                   -      -      2      7  82.1K   133K
    wwn-0x5002538b715015a0                                   -      -      2      7  82.9K   133K
  mirror-4                                               4.18G  3.48T      5     16   164K   267K
    wwn-0x5002538b71501540                                   -      -      2      8  81.9K   134K
    wwn-0x5002538b715015e0                                   -      -      2      8  81.9K   134K
  mirror-5                                               4.19G  3.48T      5     15   164K   262K
    wwn-0x5002538b71501690                                   -      -      2      7  81.9K   131K
    wwn-0x5002538b71501660                                   -      -      2      7  81.7K   131K
logs                                                         -      -      -      -      -      -
  nvme0n1p1                                                  0   127G      0      0     99    505
cache                                                        -      -      -      -      -      -
  nvme0n1p2                                              4.75G  1.62T      0      2     99  1021K

I then tried to increase the special_small_blocks setting to 1M, purely as a test to see if the special devices got filled up more (obviously I ultimately need a value below the pool record size, like 512K). But they didn't fill up any more than at the 512K setting, which is not what I expected at all.

Could you give the full output of zdb -PbbbLs run on your pool? The part above the block size histogram is also quite interesting :)

I'm not sure if there's a way for me to push any more data to the special devices than I already am, which I'd like to do, as there's quite a lot of space available there and any extra IOPS would be welcome.

I'm a bit confused and wondering if I'm just observing things incorrectly, since I'm not well-versed in PBS yet and fairly new to the ZFS world too. :)

Any help would be appreciated.
 
How did you generate that first histogram?

zdb -Lbbbs pool-hdd


Either the CPU, or the network, or the source storage :) Did you do your benchmark locally (client == server), or from some other machine to your PBS server?

Actually, just now was my first time hearing about the tool, so the above post was a local test on the PBS host while it was idle. I then did another test from a PVE host while the PBS host was doing a verify task, and yet somehow the numbers look... better? Are the compression and other tests performed by the PVE host you're backing up from? That would make more sense, as the PVE host is a 64-core EPYC, a much higher spec than the PBS host.

Code:
Time per request: 5538 microseconds.
TLS speed: 757.29 MB/s
SHA256 speed: 1632.31 MB/s
Compression speed: 516.72 MB/s
Decompress speed: 755.06 MB/s
AES256/GCM speed: 1882.98 MB/s
Verify speed: 528.42 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 757.29 MB/s (61%)  │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 1632.31 MB/s (81%) │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 516.72 MB/s (69%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 755.06 MB/s (63%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 528.42 MB/s (70%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1882.98 MB/s (52%) │
└───────────────────────────────────┴────────────────────┘


Plenty of RAM to spare, or still plenty of ARC growth potential? By default, the ARC (ZFS's own cache) only grows up to 50% of your server's RAM.

Both... I think? :)

The output of arcstat looks good?
Code:
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
09:50:32     1     0      0     0    0     0    0     0    0  128G  128G   108G

I usually set the ARC RAM cap manually on my hosts when testing; it just happens to be 50% of RAM that I picked on this occasion.


So this is with recordsize 1M, and special_small_blocks 512K?

I actually get identical output with special_small_blocks set to 1M or 512K.


If you keep in mind that a record will never span the data of multiple files, that seems entirely plausible. You'll have chunks (and potentially the occasional index) that are over 1M, and thus occupy multiple 1M records. But you'll also have indices and smaller chunks < 1M that might occupy 512K records. And then you have a bit of metadata in even smaller records, but that doesn't amount to much (e.g., anything up to 128K is only 6.4G, or 1.5% of your total used space!). The block histogram is also a bit confusing - a file consisting of a single 1000K record is counted under 512K ;)

Oh OK, this starts to make a little more sense now. :oops:


Could you give the full output of zdb -PbbbLs run on your pool? The part above the block size histogram is also quite interesting :)


... need to split into 2 posts due to character limit...
 
Code:
Traversing all blocks ...

 401G completed (16791MB/s) estimated time remaining: 0hr 00min 00sec
        bp count:                901889
        ganged count:                 0
        bp logical:        547063890432      avg: 606575
        bp physical:       440280750592      avg: 488176     compression:   1.24
        bp allocated:      441912238080      avg: 489985     compression:   1.24
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:      428425646080     used:  0.71%
        Special class       13486321664     used:  0.12%
        Embedded log class              0     used:  0.00%

        additional, non-pointer bps of type 0:       2695
         number of (compressed) bytes:  number of bps
                         25:      2 *
                         26:      0
                         27:      1 *
                         28:    137 ***
                         29:   2090 ****************************************
                         30:      0
                         31:      0
                         32:      0
                         33:      3 *
                         34:      0
                         35:      0
                         36:      0
                         37:      0
                         38:      0
                         39:      0
                         40:      0
                         41:      2 *
                         42:      3 *
                         43:      0
                         44:      0
                         45:      0
                         46:      2 *
                         47:      0
                         48:      0
                         49:      0
                         50:      1 *
                         51:      2 *
                         52:      0
                         53:      2 *
                         54:      0
                         55:      0
                         56:      0
                         57:      2 *
                         58:      0
                         59:      0
                         60:      6 *
                         61:      0
                         62:      0
                         63:      0
                         64:      0
                         65:      3 *
                         66:      5 *
                         67:      0
                         68:      0
                         69:      0
                         70:      0
                         71:      0
                         72:      0
                         73:      2 *
                         74:      1 *
                         75:      0
                         76:      3 *
                         77:      1 *
                         78:      0
                         79:      0
                         80:      0
                         81:      0
                         82:     17 *
                         83:      3 *
                         84:      0
                         85:      5 *
                         86:     58 **
                         87:      5 *
                         88:     35 *
                         89:      6 *
                         90:     24 *
                         91:     22 *
                         92:     15 *
                         93:     15 *
                         94:     13 *
                         95:     31 *
                         96:     10 *
                         97:      2 *
                         98:     12 *
                         99:     17 *
                        100:     12 *
                        101:     22 *
                        102:     13 *
                        103:      8 *
                        104:      7 *
                        105:     10 *
                        106:      8 *
                        107:      7 *
                        108:     15 *
                        109:     13 *
                        110:     12 *
                        111:      4 *
                        112:      6 *
        Dittoed blocks on same vdev: 6565

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2  32768    8192   24576   12288    4.00     0.00  object directory
     6   3072    1536   36864    6144    2.00     0.00  object array
     2  32768    8192   24576   12288    4.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     -      -       -       -       -       -        -  bpobj
     -      -       -       -       -       -        -  bpobj header
     -      -       -       -       -       -        -  SPA space map header
    12  196608  49152   147456  12288    4.00     0.00      L1 SPA space map
   130  17039360        2519040 7557120 58131    6.76     0.00      L0 SPA space map
   142  17235968        2568192 7704576 54257    6.71     0.00  SPA space map
     -      -       -       -       -       -        -  ZIL intent log
     3  393216  12288   24576    8192   32.00     0.00      L5 DMU dnode
     3  393216  12288   24576    8192   32.00     0.00      L4 DMU dnode
     3  393216  12288   24576    8192   32.00     0.00      L3 DMU dnode
     3  393216  12288   24576    8192   32.00     0.00      L2 DMU dnode
    12  1572864 458752  925696  77141    3.43     0.00      L1 DMU dnode
  9068  148570112       38379520        76914688         8481    3.87     0.02      L0 DMU dnode
  9092  151715840       38887424        77938688         8572    3.90     0.02  DMU dnode
     4  16384   16384   36864    9216    1.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
     -      -       -       -       -       -        -  DSL directory child map
     -      -       -       -       -       -        -  DSL dataset snap map
     6   3072     512   12288    2048    6.00     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
175928  23059234816     720601088       1441202176       8192   32.00     0.33      L1 ZFS plain file
523893  513448082432    438735887872    438819536896    837612   1.17    99.30      L0 ZFS plain file
699821  536507317248    439456488960    440260739072    629104   1.22    99.63  ZFS plain file
 63335  8301445120      259444736       518889472        8192   32.00     0.12      L1 ZFS directory
129393  2084866048      523211776       1046437888       8087    3.98     0.24      L0 ZFS directory
192728  10386311168     782656512       1565327360       8121   13.27     0.35  ZFS directory
     3   1536    1536   24576    8192    1.00     0.00  ZFS master node
     -      -       -       -       -       -        -  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     1  131072   4096   12288   12288   32.00     0.00  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     -      -       -       -       -       -        -  DSL dataset next clones
     -      -       -       -       -       -        -  scan work queue
     -      -       -       -       -       -        -  ZFS user/group/project used
     -      -       -       -       -       -        -  ZFS user/group/project quota
     -      -       -       -       -       -        -  snapshot refcount tags
     -      -       -       -       -       -        -  DDT ZAP algorithm
     -      -       -       -       -       -        -  DDT statistics
     -      -       -       -       -       -        -  System attributes
     -      -       -       -       -       -        -  SA master node
     3   4608    4608   24576    8192    1.00     0.00  SA attr registration
     6  98304   24576   49152    8192    4.00     0.00  SA attr layouts
     -      -       -       -       -       -        -  scan translations
     -      -       -       -       -       -        -  deduplicated block
     -      -       -       -       -       -        -  DSL deadlist map
     -      -       -       -       -       -        -  DSL deadlist map hdr
     -      -       -       -       -       -        -  DSL dir clones
     -      -       -       -       -       -        -  bpobj subobj
     -      -       -       -       -       -        -  deferred free
     -      -       -       -       -       -        -  dedup ditto
    40  969728  79872   282624   7065   12.14     0.00  other
     3  393216  12288   24576    8192   32.00     0.00      L5 Total
     3  393216  12288   24576    8192   32.00     0.00      L4 Total
     3  393216  12288   24576    8192   32.00     0.00      L3 Total
     3  393216  12288   24576    8192   32.00     0.00      L2 Total
239287  31362449408     980553728       1961164800       8195   31.98     0.44      L1 Total
662590  515699868160    439300147712    439950974976    663986   1.17    99.56      L0 Total
901889  547063890432    440280750592    441912238080    489985   1.24   100.00  Total

Block Size Histogram

block   psize                   lsize                     asize
size    Count   Size    Cum.    Count   Size    Cum.    Count   Size    Cum.
512     776     397312  397312  776     397312  397312  0       0       0
1024    514     610304  1007616 514     610304  1007616 0       0       0
2048    132     357888  1365504 132     357888  1365504 0       0       0
4096    375057  1536681472      1538046976      252     1480704 2846208 1446    5922816 5922816
8192    20161   198182400       1736229376      472     5699072 8545280 394125  3261546496      3267469312
16384   17157   387037184       2123266560      138339  2293458944      2302004224      16606   355098624       3622567936
32768   22387   1171398144      3294664704      11022   645164544       2947168768      17420   812883968       4435451904
65536   20691   2022990848      5317655552      15359   1540564992      4487733760      27160   2468466688      6903918592
131072  8229    1547141120      6864796672      243223  32074616832     36562350592     8304    1560375296      8464293888
262144  12495   4903541248      11768337920     1966    754613760       37316964352     12523   4910641152      13374935040
524288  53922   42979329024     54747666944     9649    9060992000      46377956352     52460   41455472640     54830407680
1048576 367673  385533083648    440280750592    477490  500684554240    547062510592    369150  387081830400    441912238080
2097152 0       0       440280750592    0       0       547062510592    0       0       441912238080
4194304 0       0       440280750592    0       0       547062510592    0       0       441912238080
8388608 0       0       440280750592    0       0       547062510592    0       0       441912238080
16777216        0       0       440280750592    0       0       547062510592    0       0       441912238080

                            capacity   operations   bandwidth  ---- errors ----
description                used avail  read write  read write  read write cksum
pool-hdd                   412G 64.6T 9.71K     0 39.2M     0     0     0     0
  mirror                   133G 18.1T     0     0 37.8K     0     0     0     0
    /dev/disk/by-id/wwn-0x5000cca2c7627e00-part1                0     0 18.9K     0     0     0     0
    /dev/disk/by-id/wwn-0x5000cca2c761d000-part1                0     0 18.9K     0     0     0     0
  mirror                   133G 18.1T     0     0 37.9K     0     0     0     0
    /dev/disk/by-id/wwn-0x5000cca2c761399c-part1                0     0 19.0K     0     0     0     0
    /dev/disk/by-id/wwn-0x5000cca2ed00c4e8-part1                0     0 18.9K     0     0     0     0
  mirror                   133G 18.1T     0     0 37.8K     0     0     0     0
    /dev/disk/by-id/wwn-0x5000cca2c761d664-part1                0     0 18.9K     0     0     0     0
    /dev/disk/by-id/wwn-0x5000cca2c76200e8-part1                0     0 18.9K     0     0     0     0
  mirror (special)        4.18G 3.48T 3.23K     0 13.0M     0     0     0     0
    /dev/disk/by-id/wwn-0x5002538b71501f50-part1            1.62K     0 6.55M     0     0     0     0
    /dev/disk/by-id/wwn-0x5002538b715015a0-part1            1.61K     0 6.48M     0     0     0     0
  mirror (special)        4.18G 3.48T 3.26K     0 13.1M     0     0     0     0
    /dev/disk/by-id/wwn-0x5002538b71501540-part1            1.63K     0 6.57M     0     0     0     0
    /dev/disk/by-id/wwn-0x5002538b715015e0-part1            1.63K     0 6.58M     0     0     0     0
  mirror (special)        4.19G 3.48T 3.21K     0 12.9M     0     0     0     0
    /dev/disk/by-id/wwn-0x5002538b71501690-part1            1.61K     0 6.47M     0     0     0     0
    /dev/disk/by-id/wwn-0x5002538b71501660-part1            1.61K     0 6.47M     0     0     0     0
  hole                                    0     0     0     0     0     0     0
  /dev/nvme0n1p1 (log)        0  127G     0     0 18.9K     0     0     0     0
pthread_mutex_lock(&mp->m_lock) == 0 (0x16 == 0)
ASSERT at kernel.c:178:mutex_enter()Aborted
 
Had to spill into 3 posts here; I had some formatting issues with the post above.

I wanted to say: I'm not sure if that last line (ASSERT at kernel.c:178:mutex_enter()Aborted) is indicating some issue? I've noticed a few oddities in these zdb outputs here and there before now (a segfault once, also).

Thanks again!
 
Many parts of zdb assume they operate on an exported pool, and choke on concurrent changes that might happen if the pool is currently imported :)
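If you want a clean zdb run without those assertions, one option (assuming the datastore can briefly be taken offline) is roughly:

Code:
zpool export pool-hdd
zdb -eLbbbs pool-hdd   # -e operates on the exported pool
zpool import pool-hdd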

Code:
523893  513448082432    438735887872    438819536896    837612   1.17    99.30      L0 ZFS plain file

This is basically the part that is eligible for special_small_blocks treatment. You can see that psize and asize almost line up, so very little wasted space on that front (good!). The average block size is "just" 837612. I would expect that bumping up to 1M should increase the usage of the special vdev, but it does require rewriting all the data to take effect. As long as you have the free space, you can test this with send/receive as well:

Code:
zfs snapshot pool/source_dataset@snapshot
zfs send -pL pool/source_dataset@snapshot | zfs recv -o special_small_blocks=NEW_VALUE pool/target_dataset

and compare the usage of the special device before and after. For example, with the source dataset using a special threshold of 512K and two 1000K files, the special usage was almost zero (and regular usage ~2M) before sending; after sending with a threshold of 1M on the target dataset, only the special usage jumped by ~2M, since those two 1000K blocks are now fully stored on the special vdev.
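A quick way to do that comparison is via the per-vdev allocation, e.g.:

Code:
zpool list -v pool-hdd   # ALLOC column per vdev, including the special mirrors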

zdb can also tell you where a given block is stored (on which vdev), e.g., if I compare the output of zdb -dddd -e pool/dataset looking for the file in question (there's also the '-O' option to map paths within a dataset to objects, but that has more limitations on when it can be used):

file with block size > special threshold:
Code:
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    1   128K  1000K  1000K     512  1000K  100.00  ZFS plain file (K=inherit) (Z=inherit=uncompressed)
                                               176   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /testfile
        uid     0
        gid     0
        atime   Tue May 16 13:17:04 2023
        mtime   Tue May 16 13:17:04 2023
        ctime   Tue May 16 13:17:04 2023
        crtime  Tue May 16 13:17:04 2023
        gen     118
        mode    100644
        size    1024000
        parent  34
        links   1
        pflags  840800000004
Indirect blocks:
               0 L0 0:180019000:fa000 fa000L/fa000P F=1 B=118/118 cksum=1f46917bf7d50:d0ca2f258972ca67:acb46707ae9129e1:422342dddf0e0708

                segment [0000000000000000, 00000000000fa000) size 1000K

Here we see the 1000K L0 block that actually contains the data, including the information about where on disk it is stored: 0:180019000:fa000. This is a non-redundant pool, so this is the only DVA for this block in this case. The first part is the vdev index (zdb -C POOLNAME prints the config of the pool, including the vdev tree), the second one is the offset within that vdev, and the last one is the size (this is relevant when there are holes in a file ;))

The same thing repeated, but with the receiving dataset using the 1M threshold:
Code:
 Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    1   128K  1000K  1000K     512  1000K  100.00  ZFS plain file (K=inherit) (Z=inherit=uncompressed)
                                               176   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /testfile
        uid     0
        gid     0
        atime   Tue May 16 13:17:04 2023
        mtime   Tue May 16 13:17:04 2023
        ctime   Tue May 16 13:17:04 2023
        crtime  Tue May 16 13:17:04 2023
        gen     118
        mode    100644
        size    1024000
        parent  34
        links   1
        pflags  840800000004
Indirect blocks:
               0 L0 1:19000:fa000 fa000L/fa000P F=1 B=239/239 cksum=1f46917bf7d50:d0ca2f258972ca67:acb46707ae9129e1:422342dddf0e0708

                segment [0000000000000000, 00000000000fa000) size 1000K

Note how the DVA points at vdev number '1' now, which is my special vdev (also non-redundant in this case ;)). This should allow you to quickly verify whether the special allocation works as you expect it to.
 
Phew, I think I need a nap and a coffee before I get through that, but I really appreciate the help! :D

Before I do more digging though, I just wondered - you suggested the "snapshot/send/receive" steps above to rewrite the data.

But what I did between each backup/verify run was to:
  1. Either destroy the pool and remake it entirely with all new settings, or...
  2. Just delete the backup on PBS, run GC, change the special_small_blocks setting, then back up again (which is what I did in the 512K > 1M test case).
I'm guessing from what I'm seeing now that #1 will certainly ensure the data is rewritten, but I guess #2 maybe doesn't? Just curious for future reference.
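(For reference, a GC run like the one in step 2 can also be kicked off from the CLI, roughly like this - the datastore name is a placeholder:)

Code:
proxmox-backup-manager garbage-collection start backups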
 
OK, answering my own question above, it seems my thoughts were correct. I did a backup of a different set of data, to avoid deduplication against chunks of the same data already existing, and all the data did indeed land on the SSD special devices with the 1MB block setting (matching the HDD pool's record size). So it's good to know it's working as I expected; I'll play around a little more now to test further.

Could you expand on what you said above about the pool being "non-redundant"? These are all mirrored pairs - by "redundant" are you talking about another sense, like the entire pool being redundant and not just an individual disk or mirror? (Replication? HA?)
 
No, I was talking about my test pool not being redundant. With a redundant pool the data is actually stored multiple times (on multiple vdevs); I wasn't sure how zdb would display the location then. It seems it prints the top-level vdev anyway, so the output would still look like that.

Regarding your point number 2 above - possibly GC didn't clean up (all) the chunks because they were too new? That would have been noted in the GC log. Then the chunks are still there and, if possible, would be re-used for new incoming backups even if the client doesn't know they exist. Only new chunks would be newly written in that case.
 