Low ZFS read performance Disk->Tape

Lephisto

hi,

I need some input for troubleshooting a performance issue I have on a Proxmox Backup Server I run.

Basically it's an off-site system that syncs with a production PBS. The specs are as follows:

- Supermicro board + chassis
- Dual Xeon E5-2680 v4 (2x 14c)
- 128GB RAM
- 2x 480GB Seagate Nytro SSD (boot)
- 12x 20TB Seagate Exos / SAS
- 3x 2TB Samsung PM9A3
- Quantum Superloader 3 / LTO9

Now, the datastore is configured as ZFS RAIDZ2 and has a ZFS special device in the form of a 3-way NVMe mirror:

Code:
  pool: datastore
 state: ONLINE
config:

    NAME         STATE     READ WRITE CKSUM
    datastore    ONLINE       0     0     0
      raidz2-0   ONLINE       0     0     0
        sdc      ONLINE       0     0     0
        sdd      ONLINE       0     0     0
        sde      ONLINE       0     0     0
        sdf      ONLINE       0     0     0
        sdg      ONLINE       0     0     0
        sdh      ONLINE       0     0     0
        sdi      ONLINE       0     0     0
        sdj      ONLINE       0     0     0
        sdk      ONLINE       0     0     0
        sdl      ONLINE       0     0     0
        sdm      ONLINE       0     0     0
        sdn      ONLINE       0     0     0
    special
      mirror-1   ONLINE       0     0     0
        nvme0n1  ONLINE       0     0     0
        nvme1n1  ONLINE       0     0     0
        nvme2n1  ONLINE       0     0     0

Thing is: I can't seem to achieve a sustained read rate that saturates the LTO9 drive, which should be able to write at 300MB/s. I seem to max out at ~150MB/s, and IOPS shouldn't be the problem.

[screenshot attachment: 1704827093992.png]

System load also shouldn't be much of an issue:

[screenshot attachment: 1704827261526.png]

The interesting thing is: writing to the ZFS pool (while restoring) works at a sustained 300MB/s.

ZFS params are PBS defaults, no compression.

These are the benchmark values:

Code:
SHA256 speed: 424.56 MB/s
Compression speed: 408.38 MB/s
Decompress speed: 615.06 MB/s
AES256/GCM speed: 1225.13 MB/s
Verify speed: 248.88 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ not tested         │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 424.56 MB/s (21%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 408.38 MB/s (54%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 615.06 MB/s (51%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 248.88 MB/s (33%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1225.13 MB/s (34%) │
└───────────────────────────────────┴────────────────────┘

Where to dig?

greetings.
 
Hi,

you could install a tool like 'atop' that monitors CPU/memory/disk speed etc. while making a tape backup and check where the bottleneck might be.

e.g. it seems your CPU is on the lower side, maybe that is the issue?
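For example, something like this in a second shell while the tape backup job is running (a minimal sketch; 'atop' comes from the atop package, 'iostat' from sysstat):

Code:
apt install atop sysstat
atop 2          # overall view: watch whether the disks show up as busy while the CPU stays mostly idle
iostat -xm 2    # per-disk view: %util close to 100% on the sdX devices points at a disk/IOPS limit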
 
Code:
Linux 6.5.11-6-pve (pbs-ba50-1)     01/10/24     _x86_64_    (56 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.35    0.00    0.92    0.63    0.00   98.10



avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.05    0.00    0.58    0.84    0.00   98.53



avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.70    1.43    0.00   97.80



avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.75    1.41    0.00   97.78



avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.77    1.39    0.00   97.77

and:

Code:
cat /proc/pressure/io
some avg10=48.26 avg60=44.90 avg300=42.09 total=318108937759
full avg10=48.26 avg60=44.88 avg300=42.05 total=317466677296
 
Code:
cat /proc/pressure/io
some avg10=48.26 avg60=44.90 avg300=42.09 total=318108937759
full avg10=48.26 avg60=44.88 avg300=42.05 total=317466677296

seems rather IO starved... so maybe the disks after all? Note that PBS has to read a lot of "smaller" chunks (between 128 KiB and 16 MiB; most often between 2-4 MiB) from random places, so the read throughput is not really sequential, and the special device only helps for metadata access and very small files, so the spinning disks can be the bottleneck here.
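To get a feeling for the chunk sizes involved, one can look at the datastore's chunk directory directly (a rough sketch; adjust /datastore to the actual datastore path):

Code:
find /datastore/.chunks -type f -printf '%s\n' \
  | awk '{ sum += $1; n++ } END { printf "chunks: %d  avg size: %.1f MiB\n", n, sum / n / 1048576 }'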
 
But still, I can write at a sustained 300MB/s. 150MB/s really seems rather low.
 
I'm only guessing... but shouldn't writes be more sequential and reads more random because of the deduplication? That could then explain why reads hit the IOPS bottleneck of the HDDs earlier.
 
Okay,

this:

Code:
    NAME         STATE     READ WRITE CKSUM
    zfs          ONLINE       0     0     0
      mirror-0   ONLINE       0     0     0
        sdc      ONLINE       0     0     0
        sdd      ONLINE       0     0     0
      mirror-1   ONLINE       0     0     0
        sde      ONLINE       0     0     0
        sdf      ONLINE       0     0     0
      mirror-2   ONLINE       0     0     0
        sdg      ONLINE       0     0     0
        sdh      ONLINE       0     0     0
      mirror-3   ONLINE       0     0     0
        sdi      ONLINE       0     0     0
        sdj      ONLINE       0     0     0
      mirror-4   ONLINE       0     0     0
        sdk      ONLINE       0     0     0
        sdl      ONLINE       0     0     0
      mirror-5   ONLINE       0     0     0
        sdm      ONLINE       0     0     0
        sdn      ONLINE       0     0     0
    special   
      mirror-6   ONLINE       0     0     0
        nvme0n1  ONLINE       0     0     0
        nvme1n1  ONLINE       0     0     0
        nvme2n1  ONLINE       0     0     0

is even slower, like 81MB/s...

Meh. I would really, really like to max out the write performance of the LTO9 drive.
 
If IOPS performance were the problem, this should have helped, since it increases the IOPS performance by a factor of 6.

Maybe you could optimize ZFS a bit? Enabling LZ4 compression and increasing the recordsize to something like 4M, for example, could help, so it's doing fewer 128K IOs and more 1-4M IOs.
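Something along these lines (a sketch only, assuming the pool/dataset is called "datastore"; note that recordsize only affects data written after the change):

Code:
zfs set compression=lz4 datastore
zfs set recordsize=4M datastore    # values above 1M may require raising the zfs_max_recordsize module parameter
zfs get compression,recordsize datastore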
 
So after some weeks/months and a lot of testing I have gained some insights:

- When replicating one PBS to the other (both 12-disk RAIDZ3 + special device) I easily get 300-400MB/s read/write performance on the source PBS and can easily write at the same speed on the off-site PBS.
- When writing to tape I am stuck at read speeds of 100 to 150MB/s, while the tape could write at 300MB/s.

The suspicion is that for replicating backups it doesn't matter in which order the chunks are read, but for writing to tape the order in which the chunks are read into memory does matter, and maybe there is no read-ahead buffer there.
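One way to quantify the gap between sequential and random reads on the pool itself would be a quick fio run (hypothetical example; assumes fio is installed, the datastore is mounted under /datastore, and there is room for an 8G test file):

Code:
mkdir -p /datastore/fio-test
fio --name=seqread  --directory=/datastore/fio-test --rw=read     --bs=4M --size=8G --group_reporting
fio --name=randread --directory=/datastore/fio-test --rw=randread --bs=4M --size=8G --group_reporting

If the randread numbers land near the 100-150MB/s seen during tape backup, that would support the read-order theory.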

Can we have any clarification from the devs? Maybe a hidden performance tuning option somewhere?

greetings.
 
no offense, but well yeah, sorry that answer sounds a bit too easy.

I could imagine that a smarter read-ahead buffer could also do the "trick". Having close to one PB of flash for backup storage is just not very cost effective, and also quite unusual; spinning rust is still the default in backup servers today, I assume.

I mean, it's not the hardware per se that isn't capable enough. It's the chosen chunking method combined with inefficient read procedures that, in their current implementation, don't max out the hardware's capabilities. Maybe the developers can look into it?

regards
 
well, there is an internal read-ahead buffer, but that is maybe not enough (I don't remember the specifics atm)

there is one tuning option that might help: on the datastore you can tune the 'chunk order' to sort by inode or not at all, but the default for that is already 'inode'.
depending on the underlying storage, though, it might not help to sort by inode; in that case setting it to 'none' could mean a speedup, since we don't waste time listing/sorting the chunks' inode numbers.
(you can set this in the GUI under Datastore -> Options)
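For reference, the same tuning option should also be settable on the command line, something like this (assuming the datastore is named "datastore"):

Code:
proxmox-backup-manager datastore update datastore --tuning 'chunk-order=none'
proxmox-backup-manager datastore show datastore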

aside from that, I know where we have some potential for performance increases in the tape backup code, but sadly I had no time to implement them yet ;)
 
@dcsapak thanks for the elaborate answer. I'm really trying to figure out how to scale PBS out. I will try setting it to "none"; still, the documentation suggests that it might have no effect or even be slower with spinners.

I guess it performs a lot better when doing a sync because reads/writes are heavily parallelized and the chunk order does not matter?

regards
 
technically there should not be that much difference between a tape backup and a sync, we basically iterate over the snapshots and for each we write/pull the relevant chunks from the datastore

there might be some issues with not enough buffer for the tape so that it has to slow down, which of course won't be a problem for the regular sync

i have some tape work to do soon anyways, maybe i can find time to check if we have some improvements
 
@dcsapak that'd be awesome. I was wondering all the time how LTO handles constant buffer underruns. Will it just pause and wait until the buffer is filled, or will the constant tape speed just be tapered down?

I already looked into the code, but I am not a Rust guru :(. If I can still be helpful in some way, I would be happy to.
 
@Lephisto @dcsapak For the record: on our Dell TL4000 LTO7 tape library with three drives (internally an IBM 3573-TL + 3 ULT3580-HH7 drives), running three parallel tape backup jobs with three different media pools on three different drives, we've seen up to 900 MByte/s of tape writes, with an aggregate average of 600+ MByte/s over a few hours.

Our PBS is full SSD (RAIDZ3 9+3); the server is an HPE DL385 with AMD EPYC, 2 PCIe dual external SAS HBAs, and three SAS cables for the three drives.

It would be nice to have the notion of a "drive set" for a job and have PBS do the balancing between the tape drives; for now the only option I see with PBS is to set everything up manually.
 
technically there should not be that much difference between a tape backup and a sync, we basically iterate over the snapshots and for each we write/pull the relevant chunks from the datastore

there might be some issues with not enough buffer for the tape so that it has to slow down, which of course won't be a problem for the regular sync

i have some tape work to do soon anyways, maybe i can find time to check if we have some improvements
hey @dcsapak,

let me bump this up. Is there any news regarding tape write performance?

thanks in advance.
 
