Why is restore from PBS so slow, and what counts as slow/fast?

Aug 2, 2022
Hello

1. What speed should one expect from PBS (restore/backup)? Is it normal that restoring a 60GB VM took us 10 minutes?

2. If it's not slow by design, why is it so painfully slow in our situation?
PVE nodes are 2x10G LACP with a separate 10G link for Proxmox operations (Xeon 2680v3, 512GB DDR4 and 2 x SSD in RAID1 for VMs).
PBS is a single Xeon X5650, 96GB RAM, 2x10G and 12 x SAS HGST He10 disks in RAID 10.

The restore operation doesn't consume any resources at all.
CPU on both PVE and PBS is <10% busy, load average <1. Disk on PBS at the time of restore: ~100 read IOPS, utilization <10%, latency <5ms.

During the restore we are seeing a 200-400 Mbit restore speed with next to no resources consumed (less than 10% of CPU/RAM/disk IO/network).

iperf3 from PVE to PBS shows 8-9 Gbit in both directions.

P.S. Restoring from Acronis Cloud over WAN from a DC in another country/continent is faster than this.
P.P.S. Why does the PBS client open a new connection for every chunk download?
 
PBS isn't slow by design. It's designed for local SSDs as the datastore and you are using HDDs:

Recommended Server System Requirements

...
  • Backup storage:
    • Use only SSDs, for best results
    • If HDDs are used: Using a metadata cache is highly recommended, for example, add a ZFS special device mirror.
Everything is stored as millions of small ~2MB files, so a lot of random IO performance is needed, which HDDs are terrible at.
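
For reference, the "ZFS special device mirror" mentioned in the quoted requirements can be added to an existing pool roughly like this (pool and device names are placeholders; the special vdev should be mirrored, since losing it loses the whole pool):

Code:
# Hypothetical example: add a mirrored SSD special vdev to a pool named "tank".
zpool add tank special mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B
# Optionally let small blocks (not only metadata) land on the SSDs as well:
zfs set special_small_blocks=4K tank

Note that only newly written metadata/blocks land on the special device; existing chunks stay on the HDDs until they are rewritten.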
 
You really only need SSDs for the metadata cache.
Then, you will see reasonable speeds while restoring from spinning disks.
 
PBS isn't slow by design. It's designed for local SSDs as the datastore and you are using HDDs:

Everything is stored as millions of small ~2MB files, so a lot of random IO performance is needed, which HDDs are terrible at.
Thank you for the reply, but please don't start with "you are using HDDs" )
Seriously, we sell storage systems (it's part of our business).
I already wrote in the main post that disk IO stays idle: no latency spikes, no high IO usage, nothing.
This is a stable 1k IOPS system with ~5ms latency (10ms max).


This system is not some kind of desktop with WD Green drives in RAID1 parking their heads )
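
If anyone wants to verify such IOPS/latency numbers themselves, a random-read fio run against the datastore directory is a quick sanity check (the path and job parameters below are just an example):

Code:
# Hypothetical 4k random-read test; point --directory at the actual datastore mount.
fio --name=randread --directory=/mnt/datastore --ioengine=libaio \
    --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
    --size=4G --runtime=60 --time_based --group_reporting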
 
Also, is the PBS benchmark only meant to check upload? Backup speed (just tested with a VM backup) is a stable 130-150MB/s, which is 1.3Gbit+.
So why is restore only ~200-400Mbit?

Uploaded 471 chunks in 5 seconds.
Time per request: 10650 microseconds.
TLS speed: 393.82 MB/s
SHA256 speed: 214.14 MB/s
Compression speed: 327.04 MB/s
Decompress speed: 516.87 MB/s
AES256/GCM speed: 628.96 MB/s
Verify speed: 165.04 MB/s
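
For reference, those numbers come from the client-side benchmark, invoked roughly like this (the repository string below is a placeholder):

Code:
# PBS client benchmark; replace the repository with your own user@realm@host:datastore.
proxmox-backup-client benchmark --repository backupuser@pbs@pbs.example.com:store1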
 
The PBS benchmark also doesn't benchmark any storage speeds, as it just uploads to RAM. So it only benchmarks CPU and network, and storage can still be the bottleneck even when that benchmark shows great numbers.

Your biggest problem will likely be garbage collection (GC). With HDDs only, it might run for days, and while it is running your HDDs will be 100% utilized, slowing down all backup/restore/verify jobs.
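
If you want to see whether the disks are the limiting factor, kick off a GC run and watch per-device utilization while it runs (the datastore name below is a placeholder; iostat is part of the sysstat package):

Code:
# Start garbage collection for a datastore (name is a placeholder) ...
proxmox-backup-manager garbage-collection start store1
# ... and in another shell watch per-disk utilization/latency every 5 seconds.
iostat -dx 5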
 
PBS isn't slow by design. It's designed for local SSDs as the datastore and you are using HDDs:

Everything is stored as millions of small ~2MB files, so a lot of random IO performance is needed, which HDDs are terrible at.
Even with SSDs I have encountered the same issue as the OP.
 
Resurrecting this thread as I have the same problem. Performance of backup and restore is terrible, even with 10G network and all flash source/destination servers.

See here:

https://forum.proxmox.com/threads/pbs-performance-improvements-enterprise-all-flash.150514/

I am also running a secondary PBS with spinners; I can see 140MB/s per node to that. Yet my primary PBS with 24 x 7.68TB SAS12 SSDs and 2 x 40Gb NICs still only nets me 200-250MB/s.

Restore performance is absolutely terrible. 3 1/2 minutes for a 35 gig VM.

Xeon Platinum 8268s in the PVE nodes, Xeon Gold 6144s in the PBS.
 
Resurrecting this thread as I have the same problem. Performance of backup and restore is terrible, even with 10G network and all flash source/destination servers. [...]
Hello

I believe that the main issue is the backup process on PBS itself: it is single-threaded. Start a backup and check top on the PBS host (bash/CLI) and you will find a single process pegging just one core at 100%. I'd call that a questionable design.

P.S. I wrote in another similar forum thread that our new PBS runs on Xeon Scalable v1 (81**) with NVMe disks, and the speed is still the same (low).
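
For example, to watch per-thread CPU usage of the PBS daemon while a backup or restore is running (the process name proxmox-backup-proxy is an assumption about which daemon handles the data path; adjust to your setup):

Code:
# Show per-thread CPU usage of the running PBS proxy (process name assumed).
top -H -p "$(pgrep -d, -f proxmox-backup-proxy)"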
 
I believe that the main issue is the backup process on PBS itself: it is single-threaded. [...]

I have tested using my secondary PBS, which is all spinning disks, vs my primary, which is all SSD. Performance is very close, almost the same.

There is a MASSIVE CPU limitation here.
 
So I'm seeing a massive drop in performance when restoring over a high-bandwidth, high-latency link. Could an implementation that downloads chunks in parallel instead of in sequence help a lot here? The bandwidth is there, but due to the high latency the TCP throughput of a single connection is massively reduced.

Even increasing the TCP window sizes so that the proxmox-backup-client benchmark TLS speed increases does _not_ alleviate this problem, because it's simply one connection after the other. Parallel chunk downloads would potentially massively speed up recovery here.
 
So I'm seeing a massive drop in performance when restoring over a high-bandwidth, high-latency link. Could an implementation that downloads chunks in parallel instead of in sequence help a lot here? The bandwidth is there, but due to the high latency the TCP throughput of a single connection is massively reduced.

the chunks are already downloaded in parallel; the problem is that HTTP/2 has a limit on the amount of "in-flight" data, and with a high-latency link, throughput suffers as a result.
 
Is there a way to get an answer regarding this matter from a Proxmox representative?
we are currently going through the code to identify potential bottlenecks for fast/highly parallel hardware.
 
the chunks are already downloaded in parallel; the problem is that HTTP/2 has a limit on the amount of "in-flight" data, and with a high-latency link, throughput suffers as a result.
Would a move to QUIC, a.k.a. HTTP/3, improve this somewhat?

I'm asking because, in terms of maximum bandwidth used, we're far from the maximum possible (about 1/6 of the link bandwidth).

I've run iperf over the same link and I am seeing a much higher bandwidth.

For clarification, this is a 1 Gbit link with a latency of 120ms (cross-continent). The bandwidth reachable for a single connection without TCP window tuning was about 250 Mbit, which was roughly the speed the restore operation was running at (about 28 MB/s ~ 228 Mbit).
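
As a rough back-of-the-envelope check (my own arithmetic based on the numbers above), the bandwidth-delay product shows how much data a single connection must keep in flight to fill this link:

Code:
# BDP for a 1 Gbit/s (~125 MB/s) link with 120 ms RTT, and the in-flight
# data implied by the observed ~28 MB/s (illustrative arithmetic only):
echo "125 * 0.120" | bc -l   # ~15 MB must be in flight to fill the pipe
echo "28 * 0.120"  | bc -l   # ~3.4 MB is actually in flight at 28 MB/s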

For parallel iperf connections I could saturate the full Gbit link.

After playing around with TCP window size defaults in the kernels on both machines I've been able to reach a throughput for a single iperf connection of about 800 Mbit.

The proxmox-backup-client benchmark TLS max upload speed has increased accordingly, from 30 MB/s previously to 78 MB/s (which still isn't the maximum, but it gets much closer).

However, running an actual restore operation still caps out at 28 MB/s. I get that the recommendation is to run a PBS instance locally if possible, and that is probably what we'll do; I'm just wondering whether this problem will also exist for PBS sync between two locations, or whether the mechanism used there is different.

In general I'm seeing improved bandwidth usage with parallel operations, so I'm wondering whether multiple HTTP/2 connections wouldn't also increase the throughput when a single connection cannot saturate the link due to the higher latency.
 
Would a move to QUIC, a.k.a. HTTP/3, improve this somewhat?

something that we want to evaluate at some point ;)

I'm asking because, in terms of maximum bandwidth used, we're far from the maximum possible (about 1/6 of the link bandwidth).

I've run iperf over the same link and I am seeing a much higher bandwidth.

For clarification, this is a 1 Gbit link with a latency of 120ms (cross-continent). The bandwidth reachable for a single connection without TCP window tuning was about 250 Mbit, which was roughly the speed the restore operation was running at (about 28 MB/s ~ 228 Mbit).

For parallel iperf connections I could saturate the full Gbit link.

After playing around with TCP window size defaults in the kernels on both machines I've been able to reach a throughput for a single iperf connection of about 800 Mbit.

The proxmox-backup-client benchmark TLS max upload speed has increased accordingly, from 30 MB/s previously to 78 MB/s (which still isn't the maximum, but it gets much closer).

this

However, running an actual restore operation still caps out at 28 MB/s. I get that the recommendation is to run a PBS instance locally if possible, and that is probably what we'll do; I'm just wondering whether this problem will also exist for PBS sync between two locations, or whether the mechanism used there is different.

and this makes me wonder whether there is still something lurking there, since the benchmark and a restore use the same connection mechanism (just the bulk of the data is transferred in the opposite direction)..

could you maybe post the full benchmark output?

In general I'm seeing improved bandwidth usage with parallel operations, so I'm wondering whether multiple HTTP/2 connections wouldn't also increase the throughput when a single connection cannot saturate the link due to the higher latency.

yes, multiple connections in parallel would help solve the problem for such setups, but they add complexity (and also overhead at scale; connections aren't an unlimited resource, after all). That doesn't mean we won't add such support at some point..

FWIW, a sync is pretty much equivalent to a restore from a connection standpoint (both use a "reader" session over HTTP/2)
 
could you maybe post the full benchmark output?
Before adjusting TCP Window size:

Iperf:
Code:
iperf -c ping.online.net -t 40
------------------------------------------------------------
Client connecting to ping.online.net, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  1] local 10.0.1.75 port 41798 connected with 51.158.1.21 port 5001 (icwnd/mss/irtt=14/1448/86464)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-40.1162 sec  2.56 GBytes   549 Mbits/sec


Proxmox:
Code:
Uploaded 46 chunks in 6 seconds.
Time per request: 132006 microseconds.
TLS speed: 31.77 MB/s
SHA256 speed: 1618.10 MB/s
Compression speed: 387.27 MB/s
Decompress speed: 504.57 MB/s
AES256/GCM speed: 1088.26 MB/s
Verify speed: 383.82 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 31.77 MB/s (3%)    │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 1618.10 MB/s (80%) │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 387.27 MB/s (51%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 504.57 MB/s (42%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 383.82 MB/s (51%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1088.26 MB/s (30%) │
└───────────────────────────────────┴────────────────────┘

After adjusting TCP window size and maximum buffer (which actually improved more than the window size change alone):
Code:
sysctl net.ipv4.tcp_wmem="4096 349520 25165824"
net.ipv4.tcp_wmem = 4096 349520 25165824
sysctl net.ipv4.tcp_rmem="4096 349520 25165824"
net.ipv4.tcp_rmem = 4096 349520 25165824
sysctl net.core.wmem_max=25165824
net.core.wmem_max = 25165824
sysctl net.core.rmem_max=25165824
net.core.rmem_max = 25165824
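
If these values help, they can be persisted across reboots with a sysctl drop-in file; the file name below is arbitrary and the values simply mirror the ones tested above:

Code:
# Hypothetical persistence of the tuned buffer sizes (adjust values as needed).
cat > /etc/sysctl.d/90-tcp-buffers.conf <<'EOF'
net.core.rmem_max = 25165824
net.core.wmem_max = 25165824
net.ipv4.tcp_rmem = 4096 349520 25165824
net.ipv4.tcp_wmem = 4096 349520 25165824
EOF
sysctl --system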



Iperf:
Code:
iperf -c ping.online.net -t 40
------------------------------------------------------------
Client connecting to ping.online.net, TCP port 5001
TCP window size:  341 KByte (default)
------------------------------------------------------------
[  1] local 10.0.1.75 port 37962 connected with 51.158.1.21 port 5001 (icwnd/mss/irtt=14/1448/86296)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-40.1872 sec  3.39 GBytes   725 Mbits/sec

Proxmox:
Code:
Uploaded 91 chunks in 5 seconds.
Time per request: 57930 microseconds.
TLS speed: 72.40 MB/s
SHA256 speed: 1615.10 MB/s
Compression speed: 385.80 MB/s
Decompress speed: 500.64 MB/s
AES256/GCM speed: 1098.43 MB/s
Verify speed: 381.71 MB/s
┌───────────────────────────────────┬────────────────────┐
│ Name                              │ Value              │
╞═══════════════════════════════════╪════════════════════╡
│ TLS (maximal backup upload speed) │ 72.40 MB/s (6%)    │
├───────────────────────────────────┼────────────────────┤
│ SHA256 checksum computation speed │ 1615.10 MB/s (80%) │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 compression speed    │ 385.80 MB/s (51%)  │
├───────────────────────────────────┼────────────────────┤
│ ZStd level 1 decompression speed  │ 500.64 MB/s (42%)  │
├───────────────────────────────────┼────────────────────┤
│ Chunk verification speed          │ 381.71 MB/s (50%)  │
├───────────────────────────────────┼────────────────────┤
│ AES256 GCM encryption speed       │ 1098.43 MB/s (30%) │
└───────────────────────────────────┴────────────────────┘

FWIW, a sync is pretty much equivalent to a restore from a connection standpoint (both use a "reader" session over HTTP/2)
Is the sync parallelized across different backups or does it do them one by one? I'm wondering whether the bandwidth we have over this link will suffice or whether we're going to have to do something about this.
 
sync will go snapshot by snapshot (although here it would be fairly trivial to do X groups in parallel, I guess), but you can set up multiple sync jobs in parallel and use filters to split the load (e.g., groups 1-10 in one job, 11-20 in another).

edit: parallel syncing is tracked here: https://bugzilla.proxmox.com/show_bug.cgi?id=4182 so if you want to get updates on that, please subscribe to that bug!
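
Purely as an illustration, splitting the load across two pull-style sync jobs could look roughly like this; the job IDs, remote name, store names and group filters are all placeholders, and the exact options should be checked with "proxmox-backup-manager sync-job create --help" on your version:

Code:
# Hypothetical example: two sync jobs pulling different backup groups in parallel.
proxmox-backup-manager sync-job create sync-a --store local-store \
    --remote offsite --remote-store main \
    --group-filter group:vm/100 --group-filter group:vm/101
proxmox-backup-manager sync-job create sync-b --store local-store \
    --remote offsite --remote-store main \
    --group-filter group:vm/102 --group-filter group:vm/103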
 
