GC, sync timeline really slow

Sep 26, 2023
Hello.
Here's my environment:
Corp office - PVE and PBS.
DR side - PVE/PBS combo.
Network at corp office - 300/300 Mbps.
DR side - 2 Gbps/2 Gbps, limited to 1 Gbps as that's all I have for networking currently.
Speed test at DR side - 970 down / 785 up. Could be networking, but 'big enough' for fast speeds.
Backup strategy is every hour on specific servers, daily on non-essential servers, and once a week on a few servers without much changed data.
Sync schedule at DR is set to run every 2 hours.

A typical sync job covering many servers transfers around 5.14 GB but takes 2 hours. This shouldn't take more than about 15 minutes, max.
A copy via Windows of a 1.4 TB ISO file takes 7.47 minutes. Ran it twice, corp to DR, from different machines - approximately the same speed.

Corp side - GC takes about 3 hours.
DR side - GC takes about 3 hours.
Verify at corp takes about 10 hours.
Verify at DR side takes about 22.5 hours.

Yes - I back up at the corp office and verify there; the DR side pulls via sync and verifies the data there as well.

Currently on the corp side, running 2 jobs (verify and database read objects) shows 40% CPU and 30% I/O.
Currently on the DR side, running verify and sync jobs shows 40% CPU and 42% I/O.

I am trying to purchase another dedicated PBS for the DR side and offload to that, instead of running a 'nested' PBS on my PVE.
The current RAID 10 at the DR side is all 'rust' drives.
The production box at corp is all SSD drives.
The 2nd PVE/PBS box at the corp office has 'rust' drives on the data pool.

I haven't 'tweaked' any settings on either side, but I shouldn't be experiencing all this latency, should I? Especially with the sync jobs - I'd expect them to be rather quick. The other, smaller sync jobs only take about 10-15 minutes to complete.

I can post any job details needed to help clarify anything about the 'times', if that's helpful.
I know having a 'nested' PBS isn't recommended or good, but that's what I've got to work with.
 
How is the latency between the two sides? What if you run "proxmox-backup-client benchmark --repository ..." on one site, pointed at the other?
 
How is the latency between the two sides? What if you run "proxmox-backup-client benchmark --repository ..." on one site, pointed at the other?
I hadn't run that yet, but I can if I understand correctly what/where it's to be run.
I presume this is run on the DR side, which is scheduled to pull from the corp office? And is the 'repository' the backup datastore on the remote side that it is pulling from?

Meaning, I would run this 'task' on the remote side, which is scheduled to pull from the corp side, and specify the corp-side datastore name?
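If I've got the syntax right, something like this (run from the DR-side PBS, with the corp-side datastore as the repository) is what I'm picturing - the user, host and datastore names below are just placeholders for my setup:

# run on the DR-side PBS, pointed at the corp-side datastore
# repository format: user@realm@host:datastore
proxmox-backup-client benchmark --repository 'sync-user@pbs@corp-pbs.example.com:corp-datastore'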

From a ping/tracert command, latency is 83 ms.

Not great, if I'm reading this correctly.
 
From a ping/tracert command, latency is 83 ms.

Then this is the reason why sync is slow - it's using HTTP/2 for transferring chunks, and the high latency reduces the throughput significantly.
 
Then this is the reason why sync is slow - it's using HTTP/2 for transferring chunks, and the high latency reduces the throughput significantly.
OK - then how can it be sped up?
What protocol/system changes need to be made to adjust this?

You can't just say that without offering other options or solutions, can you?

The 2 sites are connected via a site-to-site VPN connection.
All servers are running the latest supported code as well. My licenses are 'paid' and not community, if that matters, and as such I'd have thought this might have been addressed in a 'patch' update to the code?

I presume you are referring to this article: https://pbs.proxmox.com/docs/backup-protocol.html
If I understand this correctly, then although I am using the GUI to schedule the backups on the DR side, the protocol needs to be upgraded to -v1. Is this correct? And if so, should this request - GET /api2/json/backup HTTP/1.1 with UPGRADE: proxmox-backup-protocol-v1 - be run? Or is the documentation saying that the GET /api2/json/backup is done and the HTTP/1.1 answer comes back - and then what is the command to upgrade?

Sorry, but it's a little confusing what needs to be done to check this, and what the command is to 'set' the upgraded protocol.
 
OK - then how can it be sped up?
What protocol/system changes need to be made to adjust this?

You can't just say that without offering other options or solutions, can you?

The 2 sites are connected via a site-to-site VPN connection.
All servers are running the latest supported code as well. My licenses are 'paid' and not community, if that matters, and as such I'd have thought this might have been addressed in a 'patch' update to the code?
You could try adapting the congestion control algorithm and the TCP buffer sizes; a user reported seeing significant throughput improvements with these changes, see https://forum.proxmox.com/threads/backup-sync-performance-over-vpn.164450/post-761198
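For example, something along these lines - the exact values are only an illustrative starting point, not PBS-specific recommendations, and need testing on your own link (BBR also requires the tcp_bbr module):

# switch the congestion control algorithm (BBR tends to cope better with high latency)
modprobe tcp_bbr
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
# raise the maximum TCP buffers so more data can be in flight
# (~83 ms RTT at 1 Gbit/s needs a window of roughly 10 MB)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"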
 
You could try adapting the congestion control algorithm and the TCP buffer sizes; a user reported seeing significant throughput improvements with these changes, see https://forum.proxmox.com/threads/backup-sync-performance-over-vpn.164450/post-761198
Thanks Chris for the other link. As the PBS is a VM running off of the PVE (not recommended, I know): when entering these commands on the PBS, are there any changes that would also need to be made on the PVE hosting it, or are they done only on the PBS server? I presume that by rebooting the server and not entering the changes into the actual sysctl.conf file, the settings would revert back to what I have now?
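My understanding of the runtime vs. persistent split - and my assumption is that these settings belong inside the PBS VM, since the two PBS instances are the TCP endpoints of the sync connection, not the PVE host bridging the traffic (happy to be corrected). The file name below is just an example:

# applied at runtime inside the PBS VM - reverts on reboot
sysctl -w net.ipv4.tcp_congestion_control=bbr
# to persist it, put the same setting (without 'sysctl -w') in a drop-in file and reload
echo 'net.ipv4.tcp_congestion_control = bbr' > /etc/sysctl.d/90-pbs-tuning.conf
sysctl --system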
 
Part of what I guess I'm still struggling with is WHY the transfer rate is so low. I can probably tweak the TCP settings as mentioned in other posts (and above), but if I only have to transfer about 5.2 GB of data and my rates above are showing 187 MB/sec, then the whole process shouldn't take that long - and certainly not hours. It seems like there's still something going on in some of the steps, because if I can transfer a 1.4 TB file (many times the size of a 5 GB file) in about 7 minutes, I'd expect the transfer and process to complete much faster.
 
HTTP/2 allows using multiple streams within a single connection. For that to work, it needs to manage in-flight data, and that gets slow when latency is high, because before the "slot" used for a certain piece of data is usable again, the sender needs to wait for the receiver to acknowledge having received it, which means a full round trip. Simpler protocols just let the lower levels of the stack take care of that (via back pressure) and are less affected by this, and as a result achieve higher throughput.
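To put rough numbers on that: per stream, throughput is bounded by roughly window ÷ round-trip time. The window size below is an assumption (64 KiB is HTTP/2's default initial per-stream flow-control window; how the sync path actually configures it isn't shown here), but it illustrates why 83 ms hurts so much:

# throughput upper bound per stream ≈ in-flight window / RTT
echo "65536 / 0.083 / 1000000" | bc -l     # ≈ 0.79 MB/s with an assumed 64 KiB window
echo "4194304 / 0.083 / 1000000" | bc -l   # ≈ 50 MB/s if 4 MiB could be kept in flight

The first figure is in the same ballpark as the observed 5.14 GB over 2 hours (~0.7 MB/s), which is why raising how much data can be in flight (bigger windows, better congestion control, or more parallel streams) matters far more here than raw link speed.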