The 1 millionth thread about L2ARC

Hello,

I stumbled across a Kingston DC1000B the other day, which I formatted to 4K sectors. I have no use for it since I already run mirrored SLOGs in my main PVE, but my PBS server is pretty slow, so I thought "why not".
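For anyone curious, this is roughly how I set it up - from memory, so treat the device name, the LBA format index and the partition sizes as examples rather than a recipe:
Bash:
# find the LBA format whose "lbads" is 12 (= 4096-byte sectors)
nvme id-ns /dev/nvme0n1
# reformat the namespace to 4K sectors - this wipes the drive!
nvme format /dev/nvme0n1 --lbaf=1
# one small partition for the SLOG, the rest for L2ARC
sgdisk -n1:0:+16G -n2:0:0 /dev/nvme0n1
# attach both partitions to the pool
zpool add backupstorage log /dev/disk/by-id/nvme-KINGSTON_SEDC1000BM8240G_50redacted644-part1
zpool add backupstorage cache /dev/disk/by-id/nvme-KINGSTON_SEDC1000BM8240G_50redacted644-part2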

These are some of the stats right now, from arc_summary:
Bash:
L2ARC size (adaptive):                                         120.8 GiB
        Compressed:                                     6.7 %    8.1 GiB
        Header size:                                    0.1 %  101.0 MiB
        MFU allocated size:                            58.1 %    4.7 GiB
        MRU allocated size:                            41.8 %    3.4 GiB
        Prefetch allocated size:                        0.2 %   16.4 MiB
        Data (buffer content) allocated size:           0.0 %    0 Bytes
        Metadata (buffer content) allocated size:     100.0 %    8.1 GiB

L2ARC breakdown:                                                    1.9M
        Hit ratio:                                     49.6 %     944.4k
        Miss ratio:                                    50.4 %     958.5k
    
ARC total accesses:                                                18.3M
        Total hits:                                    89.5 %      16.4M
        Total I/O hits:                                 0.1 %      11.8k
        Total misses:                                  10.5 %       1.9M

Bash:
NAME                        PROPERTY        VALUE           SOURCE
backupstorage               secondarycache  metadata        local
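For completeness, that property gets set with nothing fancier than:
Bash:
# only cache metadata, not file data, on the L2ARC device
zfs set secondarycache=metadata backupstorage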

And this is my setup:
Bash:
    NAME                                                     STATE     READ WRITE CKSUM
    backupstorage                                            ONLINE       0     0     0
      raidz2-0                                               ONLINE       0     0     0
        ata-HGST_HUS726T6TALN6L4_V8redacted                  ONLINE       0     0     0
        ata-HGST_HUS726T6TALN6L4_V8redacted                  ONLINE       0     0     0
        ata-HGST_HUS726T6TALN6L4_V8redacted                  ONLINE       0     0     0
        ata-HGST_HUS726T6TALN6L4_V8redacted                  ONLINE       0     0     0
        ata-HGST_HUS726T6TALN6L4_V8redacted                  ONLINE       0     0     0
        ata-HGST_HUS726T6TALN6L4_V8redacted                  ONLINE       0     0     0
        ata-HGST_HUS726T6TALN6L4_V9redacted                  ONLINE       0     0     0
        ata-HGST_HUS726T6TALN6L4_V9redacted                  ONLINE       0     0     0
        ata-HGST_HUS726T6TALN6L4_V9redacted                  ONLINE       0     0     0
        ata-HGST_HUS726T6TALN6L4_V9redacted                  ONLINE       0     0     0
    logs
      nvme-KINGSTON_SEDC1000BM8240G_50redacted644-part1      ONLINE       0     0     0
    cache
      nvme-KINGSTON_SEDC1000BM8240G_50redacted644-part2      ONLINE       0     0     0

Now I'm trying to figure out if it's better, the same, or worse. But a whole day of reading (this was very nice: https://klarasystems.com/articles/openzfs-all-about-l2arc/) didn't make me an expert.
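For anyone who wants to watch this live, something like the following should show the hit rates as they happen (field names can differ a bit between OpenZFS versions):
Bash:
# live ARC and L2ARC counters, one line every 5 seconds
arcstat -f time,read,miss,miss%,l2read,l2hits,l2miss,l2size 5
# or just the summary counters shown above
arc_summary | grep -A 3 "L2ARC breakdown"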

Server has 32 GB RAM, and holds this much data:
Bash:
NAME            SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
backupstorage  54.6T  24.9T  29.6T        -         -     5%    45%  1.00x    ONLINE  -


So, what do you guys say about this? I know, "buy more RAM!!", but if this is just sliiiightly better I think it's worth it. I got the NVMe cheap.
 
Your I/O problem is the pool design: raidz2 with just one vdev of 10 disks. By adding an L2ARC cache device (which also needs RAM for its housekeeping) you made your I/O even worse, because filling and updating the cache permanently costs IOPS that your normal pool usage needs. Whether you get a benefit or lose more than you gain depends on how large the pool's active data set is.
If you want the best performance you should have a pool of mirrored HDDs plus a special device (see the sketch below); maybe also add a SLOG, and only in rare cases a cache device on top. But as can be seen, if you rebuilt your pool as mirrors with your current amount of data on it, it would be unusably full.
You could think about a 2-vdev raidz2, but you won't be happy with that performance either: even if it were twice as fast as today, it would still be lacking.
You have too much data, disks that are too small, or too few disks for a mirrored performance pool right now, sorry.
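For reference, such a mirror + special device layout would be created roughly like this; the disk names are only placeholders and the log/cache lines are the optional part:
Code:
zpool create tank \
  mirror disk1 disk2 \
  mirror disk3 disk4 \
  mirror disk5 disk6 \
  mirror disk7 disk8 \
  special mirror ssd1 ssd2 \
  log ssd3 \
  cache ssd4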
But the best thing in this thread is the title!! :cool:
 
Thanks for your pointers! Yeah, a special device would be the best option - I had already figured that out, actually. And maybe also split the large pool into two vdevs... But I have no room for any more disks, I'm afraid, and the special device would be a crucial part of the pool, so I'd better mirror the heck out of it - but I can't, since I'm out of slots to put any disks in. I've been thinking of getting a PCIe adapter with 2 M.2 slots for a special device, but then the adapter card itself would be a single point of failure...

That's also the reason why I only cache metadata on L2. I read in some thread, deep down the internet L2ARC rabbit hole, that this works well.

Since PBS is "just a backup" and I have an offsite PBS as well, I thought I'd play with my main backup a little bit. Also, since I only have one NVMe and neither SLOG nor L2ARC is a critical device, I thought it was at least worth trying to partition that single device for both.

So just to get this straight: I see something like 23 µs latency on the NVMe and around 2 ms on the rust when running iostat, but you still think that's worse than without L2ARC? I also just did a scrub, and it was reading at about 1.3 GB/s.
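(In case it matters, zpool iostat can also show the per-vdev latencies as ZFS sees them, which keeps the cache device and the raidz2 vdev apart:)
Bash:
# per-vdev latency statistics, refreshed every 5 seconds
zpool iostat -v -l backupstorage 5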

My problem is verification, which takes around 2 days!
 
> My problem is verification, which takes around 2 days!

Wait, what? Define "verification" - I have a 14x4TB DRAID (SAS disks + shelf) that takes less than 6-8 hours to scrub, and it has ~23TB allocated
 
> > My problem is verification, which takes around 2 days!
>
> Wait, what? Define "verification" - I have a 14x4TB DRAID (SAS disks + shelf) that takes less than 6-8 hours to scrub, and it has ~23TB allocated
It's the PBS "Verify" that takes more than two days (48 hours) for one of the datasets. Getting a special device here would help a lot, or at least that's what I've read. But I've also read that L2ARC would help with this.

Scrubbing the pool "only" takes one day.

Bash:
scan: scrub repaired 0B in 1 days 00:31:57 with 0 errors on Thu Dec 19 00:49:17 2024
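If I ever manage to fit a special device I'll try to size it first; as far as I understand, something like this gives a rough idea of how much metadata the pool actually holds (it walks every block, so it can run for hours on a pool this size):
Bash:
# per-type block statistics; the metadata totals hint at how big
# a special vdev would need to be
zdb -Lbb backupstorage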
 
Yeah, if you have lots of little files, that could be part of it.

Code:
 pool: zshelf15
 state: ONLINE
  scan: scrub repaired 0B in 03:53:50 with 0 errors on Fri Oct  4 23:35:30 2024

config:
        NAME                                                   STATE     READ WRITE CKSUM
        zshelf15                                               ONLINE       0     0     0
          draid2:7d:14c:1s-0                                   ONLINE       0     0     0
            wwn-0x5000cca07321bea8                             ONLINE       0     0     0
            wwn-0x5000cca07325f6b0                             ONLINE       0     0     0
            wwn-0x5000cca05d546dcc                             ONLINE       0     0     0
            wwn-0x5000cca05d54a848                             ONLINE       0     0     0
            wwn-0x5000cca03b6d92ec                             ONLINE       0     0     0
            wwn-0x5000cca03b6be528                             ONLINE       0     0     0
            wwn-0x5000cca03b6f6090                             ONLINE       0     0     0
            scsi-35000cca244360c24                             ONLINE       0     0     0
            wwn-0x5000cca25c099f8c                             ONLINE       0     0     0
            wwn-0x5000cca244389584                             ONLINE       0     0     0
            wwn-0x5000cca25d57dc60                             ONLINE       0     0     0
            wwn-0x5000cca25d555290                             ONLINE       0     0     0
            wwn-0x5000cca03b6f6c84                             ONLINE       0     0     0
            wwn-0x5000cca03b6fc490                             ONLINE       0     0     0
        special 
          mirror-1                                             ONLINE       0     0     0
            ata-THNSF8800CCSE_57FS104YTBUT-part1               ONLINE       0     0     0
            wwn-0x58ce38ee2032d0bd-part1                       ONLINE       0     0     0
        cache
          ata-Samsung_SSD_860_PRO_512GB_S5GBNS0NB01060P-part4  ONLINE       0     0     0
          wwn-0x58ce38ee2032d0bd-part2                         ONLINE       0     0     0
          ata-THNSF8800CCSE_57FS104YTBUT-part2                 ONLINE       0     0     0
        spares
          draid2-0-0                                           AVAIL   
errors: No known data errors


zpool iostat -v zshelf15
                                                         capacity     operations     bandwidth 
pool                                                   alloc   free   read  write   read  write
-----------------------------------------------------  -----  -----  -----  -----  -----  -----
zshelf15                                               25.6T  21.8T      1    565  25.2K   185M
  draid2:7d:14c:1s-0                                   25.5T  21.8T      0    509  13.1K   184M
 
> draid2:7d:14c:1s-0

I am just curious: "draid2" means this one has two parity drives, i.e. two drives can be lost without affecting data, right?

The special device is crucial: if it is lost, the pool is toast. Yours is a two-way mirror, so the special device can only lose one drive --> it does not have the same redundancy level as the dRAID. Is there a reason for this, or did you just not have the option of a triple mirror?

Disclaimer: I have never used dRAID - I am too small for that.
 
I see no reason to triple-mirror a Special device on what is essentially only a tertiary backup. It only gets turned on once a month (or two) for updates and dedup.

The mirror is already made up of 2 different brands/models, so the odds of both failing at the same time are pretty low.

Technically my 14-disk DRAID is "too small" as well, according to some, but it works fine for me. I needed free space over redundancy, and the drives are only 4TB. Just don't try to run an interactive VM off of 14 spinning disks - I learned that the hard way.
 
Code:
        special
          mirror-1                                             ONLINE       0     0     0
            ata-THNSF8800CCSE_57FS104YTBUT-part1               ONLINE       0     0     0
            wwn-0x58ce38ee2032d0bd-part1                       ONLINE       0     0     0
That right there is what I would need, but I don't have any room for it currently. I'm on a Fujitsu RX2540 M2, so no NVMe for me. :/

Just installed Netdata yesterday, and well, look at this. :)

ARC: [Netdata graph - 1734642205511.png]

L2ARC: [Netdata graph - 1734642054917.png]

I can answer my own question: no need for L2ARC at all. Just add more RAM. Now it's confirmed.
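If/when I pull it back out, removing a cache device should be as simple as this (device name as in my zpool status above):
Bash:
# cache (and log) devices can be removed from a pool at any time
zpool remove backupstorage nvme-KINGSTON_SEDC1000BM8240G_50redacted644-part2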
 