PBS Remote Restore Too Slow

crashnightmare

We are testing our business continuity planning.

We have a local PBS instance that syncs to a remote PBS instance. We are trying to benchmark restores from that remote PBS instance back to our datacenter's PVE hosts.

The host node specs are as follows:

2x 4 TB Samsung NVMe (VM storage)
AMD Ryzen 7950X3D
128 GB RAM

The PBS instance has:
4 GB RAM
4 vCPU (the physical host CPUs are Xeon E5-2670 @ 2.6 GHz)
64 GB HD
Networked TrueNAS datastore of 30 TB storage, RAIDZ2

CPUs on both sides have AES support.

When we do a restore we get about 7 MB/s transfer speed. Restoring VMs is painfully slow.

What we have tried (rough command lines follow the list):

- Ran the benchmark (output below)
- Checked atop on the host and inside the VM while a restore is running (no CPU pegging, plenty of idle and free resources)
- Checked iostat (no bottlenecks)
- Used iperf3 to test transfer speed between the host machine and the VM (output below)
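
For reference, these are roughly the commands we used (hostnames and the repository name here are just examples):

root@host:~# proxmox-backup-client benchmark --repository root@pam@remote-pbs:store1
root@host:~# atop
root@pbs:~# iostat -xm 2
root@pbs:~# iperf3 -c host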

Uploaded 104 chunks in 55 seconds.
Time per request: 535207 microseconds.
TLS speed: 7.84 MB/s
SHA256 speed: 2353.60 MB/s
Compression speed: 728.08 MB/s
Decompress speed: 934.53 MB/s
AES256/GCM speed: 5154.45 MB/s
Verify speed: 668.44 MB/s
┌───────────────────────────────────┬─────────────────────┐
│ Name                              │ Value               │
╞═══════════════════════════════════╪═════════════════════╡
│ TLS (maximal backup upload speed) │ 7.84 MB/s (1%)      │
├───────────────────────────────────┼─────────────────────┤
│ SHA256 checksum computation speed │ 2353.60 MB/s (116%) │
├───────────────────────────────────┼─────────────────────┤
│ ZStd level 1 compression speed    │ 728.08 MB/s (97%)   │
├───────────────────────────────────┼─────────────────────┤
│ ZStd level 1 decompression speed  │ 934.53 MB/s (78%)   │
├───────────────────────────────────┼─────────────────────┤
│ Chunk verification speed          │ 668.44 MB/s (88%)   │
├───────────────────────────────────┼─────────────────────┤
│ AES256 GCM encryption speed       │ 5154.45 MB/s (141%) │
└───────────────────────────────────┴─────────────────────┘

root@pbs:~# iperf3 -c host
Connecting to host host, port 5201
[  5] local 192.168.10.24 port 34596 connected to x.x.x.x port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  19.4 MBytes   163 Mbits/sec   34    112 KBytes
[  5]   1.00-2.00   sec  20.0 MBytes   168 Mbits/sec    1    178 KBytes
[  5]   2.00-3.00   sec  21.1 MBytes   177 Mbits/sec   26    167 KBytes
[  5]   3.00-4.00   sec  16.5 MBytes   139 Mbits/sec   37    130 KBytes
[  5]   4.00-5.00   sec  25.6 MBytes   215 Mbits/sec   18    102 KBytes
[  5]   5.00-6.00   sec  28.8 MBytes   242 Mbits/sec    0    235 KBytes
[  5]   6.00-7.00   sec  27.0 MBytes   227 Mbits/sec   37   80.6 KBytes
[  5]   7.00-8.00   sec  14.5 MBytes   122 Mbits/sec    4    148 KBytes
[  5]   8.00-9.00   sec  24.7 MBytes   207 Mbits/sec   13    147 KBytes
[  5]   9.00-10.00  sec  29.6 MBytes   249 Mbits/sec    2    197 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   227 MBytes   191 Mbits/sec  172             sender
[  5]   0.00-10.00  sec   226 MBytes   190 Mbits/sec                  receiver

iperf Done.


Any ideas on what we could check or improve?
 
Look, you have put together a great post with lots of data for people to tear apart.
Let me just point out ... you don't really know _where_ the problem is unless you've tried a restore with a host local to that PBS instance.
Do you have a PVE host at the remote site? Or can you create one for the purposes of testing how restores work out locally at that site?
If you were able to test a restore onsite and it was sufficient, then it's the network.
If you tested a restore onsite and it was insufficient, it's your PBS deployment.
This is, of course, obvious. Have you tried it?
 
@tcabernoch made a good point. After you've followed his recommendations, I would like to add:
Networked TrueNAS datastore of 30 TB storage, RAIDZ2
That is by far the worst construction possible.

Are you the only user or is it shared? Is that rotating rust or SSDs? If HDDs: destroy the pool and rebuild it with striped mirrors; the more vdevs the better. Add a "Special Device" for metadata, using a mirror of SSDs/NVMe.
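
A rough sketch of what that rebuild could look like (pool and device names are made up; adapt them to your hardware):

# striped mirrors instead of RAIDZ2 (this destroys the pool, so move the data off first)
zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf
# mirrored special vdev on SSD/NVMe for metadata
zpool add tank special mirror nvme0n1 nvme1n1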

root@pbs:~# iperf3 -c host
I cannot tell the "ping" distance from this. Low latency is a requirement for good performance. (But I have no numbers to compare against.)
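
Something like this, run from the PVE host against the remote PBS (hostname is an example), would show the round-trip time:

root@host:~# ping -c 10 remote-pbs

On a chunk-by-chunk restore, a high RTT adds up quickly.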
 
Thank you both... I will build a PVE node local to the remote PBS and try a restore directly on that device. It might take me a few days to get it done, but it's an excellent testing step. It appears obvious now that it was mentioned... it didn't cross my mind until you challenged me. (We don't have a spare server around, but we do have a tower that I can repurpose for a few days.)

The TrueNAS questions: we are the only user, and it's rotating rust.

I'll look to see if we can reconfigure or add a special device for metadata to an already running pool.
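
(From what I've read so far, a special vdev can apparently be added to a running pool, for example:

zpool add tank special mirror nvme0n1 nvme1n1

but existing metadata stays on the HDDs until the data is rewritten, so it mainly helps new writes.)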
 
Hey crash ... Search this forum for NFS mounts and PBS.

You're gonna see lots of posts about how to do it. Don't read those.

You'll also find posts where the nerds get down and dirty about how much worse NFS mounts are than any other storage option.
You should read those.

In particular, look for @Der Harry's team's post where they tested various storage options, for the sole purpose of proving that NFS is terrible.
This is the core of what Udo was telling you ... although he got into ZFS tuning immediately afterwards, and it may be a bit difficult to untangle.

To be specific, the very worst PBS storage option is to mount an NFS share in the PBS operating system and then use that NFS mount as if it were local storage, deploying a PBS datastore on it.
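
To illustrate, the anti-pattern looks roughly like this (paths and names invented):

root@pbs:~# mount -t nfs truenas:/mnt/tank/pbs /mnt/nfs-store
root@pbs:~# proxmox-backup-manager datastore create store1 /mnt/nfs-store

Every chunk read and write then pays NFS round-trip overhead, across a very large number of small files.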

(Yes, Udo and I interpret for each other. It's getting to be a habit.)

---------------------------------
Some references for you.

ZFS special vdev
https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954

More on ZFS special vdevs. This one is gold.
https://klarasystems.com/articles/openzfs-understanding-zfs-vdev-types/

PBS .chunks Abuse Script - Der Harry
 