Low ZFS read performance Disk->Tape

Lephisto

hi,

I need some input for troubleshooting a performance issue I have on a Proxmox Backup Server I run.

Basically it's an off-site system that syncs with a production PBS. The specs are as follows:

- Supermicro board + chassis
- Dual Xeon E5-2680 v4 (2x 14c)
- 128GB RAM
- 2x 480GB Seagate Nytro SSD (boot)
- 12x 20TB Seagate Exos / SAS
- 3x 2TB Samsung PM9A3
- Quantum Superloader 3 / LTO9

Now, the datastore is configured as ZFS RAIDZ2 and has a ZFS special device in the form of a 3-way NVMe mirror:

Code:
  pool: datastore
 state: ONLINE
config:

    NAME         STATE     READ WRITE CKSUM
    datastore    ONLINE       0     0     0
      raidz2-0   ONLINE       0     0     0
        sdc      ONLINE       0     0     0
        sdd      ONLINE       0     0     0
        sde      ONLINE       0     0     0
        sdf      ONLINE       0     0     0
        sdg      ONLINE       0     0     0
        sdh      ONLINE       0     0     0
        sdi      ONLINE       0     0     0
        sdj      ONLINE       0     0     0
        sdk      ONLINE       0     0     0
        sdl      ONLINE       0     0     0
        sdm      ONLINE       0     0     0
        sdn      ONLINE       0     0     0
    special
      mirror-1   ONLINE       0     0     0
        nvme0n1  ONLINE       0     0     0
        nvme1n1  ONLINE       0     0     0
        nvme2n1  ONLINE       0     0     0

Thing is: I can't seem to achieve a sustained read rate that saturates the LTO9 drive, which should be able to write at 300MB/s. I seem to max out at ~150MB/s, and IOPS shouldn't be the problem.

[attachment: 1704827093992.png]

Also, system load shouldn't be much of an issue:

[attachment: 1704827261526.png]

Interesting thing is: writing to the ZFS pool (while restoring) works at a sustained 300MB/s.

ZFS params are PBS defaults, no compression.

These are the benchmark values:

Code:
SHA256 speed: 424.56 MB/s
Compression speed: 408.38 MB/s
Decompress speed: 615.06 MB/s
AES256/GCM speed: 1225.13 MB/s
Verify speed: 248.88 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ not tested         │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 424.56 MB/s (21%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 408.38 MB/s (54%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 615.06 MB/s (51%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 248.88 MB/s (33%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1225.13 MB/s (34%) │
└───────────────────────────────────┴────────────────────┘

Where to dig?

greetings.
 
Hi,

you could install a tool like 'atop' that monitors CPU/memory/disk speed etc. while making a tape backup and check where the bottleneck might be.

e.g. it seems your CPU is on the lower side, maybe that is the issue?
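
for example, something like this in a second shell while the tape backup runs (just a sketch, 'datastore' being the pool name from your first post):

Code:
# live view of cpu/memory/disk load, 2 second interval
atop 2

# per-disk utilization, request sizes and wait times (needs the sysstat package)
iostat -xm 5

# per-vdev throughput and latency of the zfs pool
zpool iostat -vl datastore 5

# pressure stall information for io
watch -n 5 cat /proc/pressure/io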
 
Code:
Linux 6.5.11-6-pve (pbs-ba50-1)     01/10/24     _x86_64_    (56 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.35    0.00    0.92    0.63    0.00   98.10



avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.05    0.00    0.58    0.84    0.00   98.53



avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.70    1.43    0.00   97.80



avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.75    1.41    0.00   97.78



avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.77    1.39    0.00   97.77

and:

Code:
cat /proc/pressure/io
some avg10=48.26 avg60=44.90 avg300=42.09 total=318108937759
full avg10=48.26 avg60=44.88 avg300=42.05 total=317466677296
 
Code:
cat /proc/pressure/io
some avg10=48.26 avg60=44.90 avg300=42.09 total=318108937759
full avg10=48.26 avg60=44.88 avg300=42.05 total=317466677296

seems rather IO starved (avg10=48 on the "some" line means tasks were stalled waiting for IO roughly 48% of the time over the last 10 seconds)... so maybe the disks after all? note that PBS has to read a lot of "smaller" chunks (between 128 KiB and 16 MiB, most often 2-4 MiB) from random places, so the read throughput is not really sequential, and the special device only helps for metadata access and very small files, so the spinning disks can be the bottleneck here
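
if you want to see what chunk sizes your datastore actually holds on disk, a rough sketch like this works (assuming the datastore is mounted at /datastore - adjust the path; it walks every chunk file, so it can take a while on a big datastore):

Code:
# size distribution of the chunk store (the .chunks directory inside the datastore)
find /datastore/.chunks -type f -printf '%s\n' \
  | awk '{ n++; s+=$1; if (n==1 || $1<min) min=$1; if ($1>max) max=$1 }
         END { if (n) printf "chunks: %d  avg: %.2f MiB  min: %.1f KiB  max: %.2f MiB\n",
               n, s/n/1048576, min/1024, max/1048576 }'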
 
But still, I can write at a sustained 300MB/s. 150MB/s for reads really seems rather low..
 
I'm only guessing... but shouldn't writes be more sequential and reads more random because of the deduplication? That could explain why reads are hitting the IOPS bottleneck of the HDDs earlier.
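
If you want to check that on the pool itself, outside of PBS, something like this fio sketch could show the difference (hypothetical scratch directory /datastore/fio-test; pick a size well above the RAM/ARC size, otherwise the reads mostly come from cache):

Code:
# sequential reads with large blocks - roughly the friendly case
fio --name=seq --directory=/datastore/fio-test --rw=read --bs=1M \
    --size=256G --ioengine=psync --direct=0

# random reads with chunk-sized blocks - closer to reading backup chunks in hash order
# (direct=0 since O_DIRECT is not reliably supported on ZFS datasets)
fio --name=rand --directory=/datastore/fio-test --rw=randread --bs=4M \
    --size=256G --ioengine=psync --direct=0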
 
Okay,

this:

Code:
    NAME         STATE     READ WRITE CKSUM
    zfs          ONLINE       0     0     0
      mirror-0   ONLINE       0     0     0
        sdc      ONLINE       0     0     0
        sdd      ONLINE       0     0     0
      mirror-1   ONLINE       0     0     0
        sde      ONLINE       0     0     0
        sdf      ONLINE       0     0     0
      mirror-2   ONLINE       0     0     0
        sdg      ONLINE       0     0     0
        sdh      ONLINE       0     0     0
      mirror-3   ONLINE       0     0     0
        sdi      ONLINE       0     0     0
        sdj      ONLINE       0     0     0
      mirror-4   ONLINE       0     0     0
        sdk      ONLINE       0     0     0
        sdl      ONLINE       0     0     0
      mirror-5   ONLINE       0     0     0
        sdm      ONLINE       0     0     0
        sdn      ONLINE       0     0     0
    special   
      mirror-6   ONLINE       0     0     0
        nvme0n1  ONLINE       0     0     0
        nvme1n1  ONLINE       0     0     0
        nvme2n1  ONLINE       0     0     0

is even slower, like 81MB/s..

meeh. I would really, really like to max out the write performance of the LTO9 drive.
 
If IOPS were the problem, this should have helped by increasing the IOPS performance by a factor of 6.

Maybe you could optimize ZFS a bit? Enabling LZ4 compression and increasing the recordsize to something like 4M, for example, could help, so it's doing less 128K IO and more 1-4M IO.
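
Roughly like this (just a sketch; 'datastore' is the pool/dataset name from your first post, both settings only affect newly written data, and a recordsize above 1M needs the zfs_max_recordsize module parameter raised first, depending on the OpenZFS version):

Code:
# cheap lz4 compression
zfs set compression=lz4 datastore

# bigger records, so large chunks need fewer IOs per read
zfs set recordsize=1M datastore

# for recordsize=4M the module limit may have to be raised first:
# echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize
# zfs set recordsize=4M datastore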
 
So after some weeks/months and after a lot of testing I gained some insights:

- When replicating one PBS to the other (both 12-disk RAIDZ3 + special device) I easily get 300-400MB/s read/write performance on the source PBS and can easily write at the same speed on the off-site PBS.
- When writing to tape I am stuck at read speeds of 100 to 150MB/s while the tape could write at 300MB/s.

My suspicion is that for replicating backups it doesn't matter in which order the chunks are read, whereas for writing to tape the order in which chunks are read into memory does matter, and maybe there is no read-ahead buffer there.

Can we have any clarification from the devs? Maybe a hidden performance tuning option somewhere?

greetings.
 
no offense, but well yeah, sorry that answer sounds a bit too easy.

I could imagine that a smarter read-ahead buffer feature could also do the "trick". Having close to one PB of flash for backup storage is just not very cost effective, and also quite unusual. Spinning rust is still the default in backup servers today, I assume.

I mean it's not the hardware per se that's not capable enough. It's the chosen method of chunking combined with inefficient read procedures that are not maxing out the hardware capabilities in their current implementation. Maybe the developers can look into it?

regards
 
well there is an internal read ahead buffer, but that is maybe not enough (i don't remember the specifics atm)

there is one tuning option that might help: on the datastore you can tune the 'chunk order', either by inode or not sorted at all, but the default for that is already 'inode'.
depending on the underlying storage it might not help to sort by inode though, so setting it to 'none' could mean a speedup, since we don't waste time listing/sorting the chunks' inode numbers
(you can set this on the gui in datastore -> options)
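
on the cli that would be something along these lines (a sketch, with 'datastore' as the datastore name):

Code:
# show the configured datastores
proxmox-backup-manager datastore list

# sort chunks by inode number before reading (the default)
proxmox-backup-manager datastore update datastore --tuning 'chunk-order=inode'

# don't sort at all
proxmox-backup-manager datastore update datastore --tuning 'chunk-order=none'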

aside from that, i know where we have some potential for performance increase in the tape backup code, but sadly had no time to implement them yet ;)
 
@dcsapak thanks for the elaborate answer. I'm really trying to figure out how to scale PBS out.. I will try setting it to "none"; still, the documentation suggests that it might have no effect or even be slower with spinners..

I guess it performs a lot better when doing a sync because reads/writes are heavily parallelized and the chunk order does not matter?

regards
 
technically there should not be that much difference between a tape backup and a sync, we basically iterate over the snapshots and for each we write/pull the relevant chunks from the datastore

there might be some issues with not enough buffer for the tape so that it has to slow down, which of course won't be a problem for the regular sync

i have some tape work to do soon anyways, maybe i can find time to check if we have some improvements
 
@dcsapak that'd be awesome. I've been wondering all along how LTO handles constant buffer underruns. Will it just pause and wait until the buffer is filled, or will the constant tape speed just be tapered down?

I already looked into the code, but I am not a Rust guru :(. If I can still be helpful in some way, I would be happy to.
 
@Lephisto @dcsapak For the record: on our Dell TL4000 LTO7 tape library with three drives (internally an IBM 3573-TL + 3 ULT3580-HH7 drives), when setting up three parallel tape backup jobs with three different media pools on three different drives, we've seen up to 900 MByte/s of tape write, with an aggregate average of 600+ MByte/s over a few hours.

Our PBS is full SSD (RAIDZ3 9+3); the server is an HPE DL385 with AMD EPYC, 2 PCIe dual external SAS HBAs, and three SAS cables for the three drives.

It would be nice to have the notion of a "drive set" for a job and have PBS do the balancing between the tape drives; for now the only option I see with PBS is to set everything up manually.
 
technically there should not be that much difference between a tape backup and a sync, we basically iterate over the snapshots and for each we write/pull the relevant chunks from the datastore

there might be some issues with not enough buffer for the tape so that it has to slow down, which of course won't be a problem for the regular sync

i have some tape work to do soon anyways, maybe i can find time to check if we have some improvements
hey @dcsapak,

let me bump this up. Is there any news regarding tape write performance?

thanks in advance.
 
