PBS Remote Restore Too Slow

crashnightmare

We are testing our business continuity planning.

We have a local PBS instance that syncs to a remote PBS instance. We are trying to benchmark restores from that remote PBS instance back to our datacenter's PVE hosts.

The host node specs are as follows:

2x 4 TB Samsung NVMe (VM storage)
AMD Ryzen 7950X3D
128 GB RAM

The PBS instance has:
4 GB RAM
4 vCPU (the physical host CPUs are Xeon E5-2670 @ 2.6 GHz)
64 GB HD
Networked TrueNAS datastore of 30 TB storage, RAIDZ2

CPUs on both sides have AES support.

When we do a restore we get about 7 MB/s transfer speed. Restoring VMs is painfully slow.

What we have tried (rough command lines follow the list):

- Ran the benchmark (output below)
- Checked atop on the host and inside the VM while a restore is running (no CPU pegging, plenty of idle and free resources)
- Checked iostat (no bottlenecks)
- Used iperf3 to test transfer speed between the host machine and the VM (output below)
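
For reference, these are roughly the commands we used (hostnames and the repository name here are just examples):

root@host:~# proxmox-backup-client benchmark --repository root@pam@remote-pbs:store1
root@host:~# atop
root@pbs:~# iostat -xm 2
root@pbs:~# iperf3 -c host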

Uploaded 104 chunks in 55 seconds.
Time per request: 535207 microseconds.
TLS speed: 7.84 MB/s
SHA256 speed: 2353.60 MB/s
Compression speed: 728.08 MB/s
Decompress speed: 934.53 MB/s
AES256/GCM speed: 5154.45 MB/s
Verify speed: 668.44 MB/s
┌───────────────────────────────────┬─────────────────────┐
│ Name                              │ Value               │
╞═══════════════════════════════════╪═════════════════════╡
│ TLS (maximal backup upload speed) │ 7.84 MB/s (1%)      │
├───────────────────────────────────┼─────────────────────┤
│ SHA256 checksum computation speed │ 2353.60 MB/s (116%) │
├───────────────────────────────────┼─────────────────────┤
│ ZStd level 1 compression speed    │ 728.08 MB/s (97%)   │
├───────────────────────────────────┼─────────────────────┤
│ ZStd level 1 decompression speed  │ 934.53 MB/s (78%)   │
├───────────────────────────────────┼─────────────────────┤
│ Chunk verification speed          │ 668.44 MB/s (88%)   │
├───────────────────────────────────┼─────────────────────┤
│ AES256 GCM encryption speed       │ 5154.45 MB/s (141%) │
└───────────────────────────────────┴─────────────────────┘

root@pbs:~# iperf3 -c host
Connecting to host host, port 5201
[  5] local 192.168.10.24 port 34596 connected to x.x.x.x port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  19.4 MBytes   163 Mbits/sec   34    112 KBytes
[  5]   1.00-2.00   sec  20.0 MBytes   168 Mbits/sec    1    178 KBytes
[  5]   2.00-3.00   sec  21.1 MBytes   177 Mbits/sec   26    167 KBytes
[  5]   3.00-4.00   sec  16.5 MBytes   139 Mbits/sec   37    130 KBytes
[  5]   4.00-5.00   sec  25.6 MBytes   215 Mbits/sec   18    102 KBytes
[  5]   5.00-6.00   sec  28.8 MBytes   242 Mbits/sec    0    235 KBytes
[  5]   6.00-7.00   sec  27.0 MBytes   227 Mbits/sec   37   80.6 KBytes
[  5]   7.00-8.00   sec  14.5 MBytes   122 Mbits/sec    4    148 KBytes
[  5]   8.00-9.00   sec  24.7 MBytes   207 Mbits/sec   13    147 KBytes
[  5]   9.00-10.00  sec  29.6 MBytes   249 Mbits/sec    2    197 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   227 MBytes   191 Mbits/sec  172             sender
[  5]   0.00-10.00  sec   226 MBytes   190 Mbits/sec                  receiver

iperf Done.


Any ideas on what we could check or improve?
 
Look, you have put together a great post with lots of data for people to tear apart.
Let me just point out ... you don't really know _where_ the problem is unless you've tried a restore with a host local to that PBS instance.
Do you have a PVE host at the remote site? Or can you create one for the purposes of testing how restores work out locally at that site?
If you were able to test a restore onsite and it was sufficient, then it's the network.
If you tested a restore onsite and it was insufficient, it's your PBS deployment.
This is, of course, obvious. Have you tried it?
 
@tcabernoch made a good point. After you've followed his recommendations, I would like to add:
Networked TrueNAS datastore of 30 TB storage, RAIDZ2
That is by far the worst construction possible.

Are you the only user or is it shared? Is that rotating rust or SSDs? If HDDs: destroy the pool and rebuild it with striped mirrors; the more vdevs the better. Add a "Special Device" for metadata, using a mirror of SSDs/NVMe.
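
A rough sketch of what that rebuild could look like (pool and device names are made up; adapt them to your hardware):

# striped mirrors instead of RAIDZ2 (this destroys the pool, so move the data off first)
zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf
# mirrored special vdev on SSD/NVMe for metadata
zpool add tank special mirror nvme0n1 nvme1n1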

root@pbs:~# iperf3 -c host
I cannot tell the "ping" distance from this. Low latency is a requirement for good performance. (But I have no numbers to compare against.)
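
Something like this, run from the PVE host against the remote PBS (hostname is an example), would show the round-trip time:

root@host:~# ping -c 10 remote-pbs

On a chunk-by-chunk restore, a high RTT adds up quickly.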
 
Thank you both... I will build a PVE node local to the remote PBS and try a restore directly on that device. It might take me a few days to get it done, but it's an excellent testing step. It appears obvious now that it was mentioned... it didn't cross my mind until you challenged me. (We don't have a spare server around, but we do have a tower that I can repurpose for a few days.)

The TrueNAS questions: we are the only user, and it's rotating rust.

I'll look to see if we can reconfigure or add a special device for metadata to an already running pool.
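
(From what I've read so far, a special vdev can apparently be added to a running pool, for example:

zpool add tank special mirror nvme0n1 nvme1n1

but existing metadata stays on the HDDs until the data is rewritten, so it mainly helps new writes.)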
 
Hey crash ... Search this forum for NFS mounts and PBS.

You're gonna see lots of posts about how to do it. Don't read those.

You'll also find posts where the nerds get down and dirty about how much worse NFS mounts are than any other storage option.
You should read those.

In particular, look for @Der Harry's team's post where they tested various storage options, for the sole purpose of proving that NFS is terrible.
This is the core of what Udo was telling you ... although he got into ZFS tuning immediately afterwards, and it may be a bit difficult to untangle.

To be specific, the very worst PBS storage option is to mount an NFS share in the PBS operating system and then use that NFS mount as if it were local storage, deploying a PBS datastore on it.
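
To illustrate, the anti-pattern looks roughly like this (paths and names invented):

root@pbs:~# mount -t nfs truenas:/mnt/tank/pbs /mnt/nfs-store
root@pbs:~# proxmox-backup-manager datastore create store1 /mnt/nfs-store

Every chunk read and write then pays NFS round-trip overhead, across a very large number of small files.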

(Yes, Udo and I interpret for each other. It's getting to be a habit.)

---------------------------------
Some references for you.

ZFS special vdev
https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954

More on ZFS special vdevs. This one is gold.
https://klarasystems.com/articles/openzfs-understanding-zfs-vdev-types/

PBS .chunks Abuse Script - Der Harry
 