Help: PBS speed stuck at 1 Gbps

Testani

Hi everyone, I installed a bare-metal PBS on a server with dual 4110 CPUs, 128 GB of RAM and SSDs.
I tried every possible configuration, switching from ZFS to other layouts; as an extreme test I even created a datastore on a single SSD. I'm on a full 10 Gb network: the Proxmox servers have two 10 Gb NICs and the PBS has two 10 Gb NICs, with MTU and network speed tested with iperf.
I attach screenshots of the PBS configuration and the compression used. I cannot reach a backup speed higher than 90-100 MB/s in any way. Where can I check and understand where the bottleneck is? Does PBS have a 1 Gbps limit somewhere?
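A rough way to watch for the bottleneck while a backup runs (a sketch only, assuming the sysstat package is installed; device and interface names will differ on each setup):

# On the PVE node while a backup job is running
iostat -x 2            # per-disk utilisation and latency
sar -n DEV 2           # per-NIC throughput
top                    # check whether a single core is pegged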
 

Attachments

  • p1.png
    p1.png
    182.1 KB · Views: 35
  • p2.png
    p2.png
    158.2 KB · Views: 34
  • p3.png
    p3.png
    145.2 KB · Views: 34
What does your bond status say? What do your ports themselves say? Is the switch configured incorrectly? Is there perhaps a limiter? What does iperf say?
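Something along these lines should show it (a sketch; interface names like bond0/ens1f0 are just examples):

cat /proc/net/bonding/bond0        # bond mode, LACP partner state, per-slave link speed
ethtool ens1f0 | grep -i speed     # negotiated link speed of each port
ip -s link show bond0              # error and drop counters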
 
This is the network test:

[  1] local 10.12.0.249 port 47152 connected with 10.12.0.37 port 5001 (icwnd/mss/irtt=14/1448/106)
[ ID] Interval            Transfer     Bandwidth
[  1] 0.0000-10.0134 sec  6.15 GBytes  5.27 Gbits/sec

root@pbs09:~# iperf -c 10.12.0.37
------------------------------------------------------------
Client connecting to 10.12.0.37, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  1] local 10.12.0.249 port 41052 connected with 10.12.0.37 port 5001 (icwnd/mss/irtt=14/1448/106)
[ ID] Interval            Transfer     Bandwidth
[  1] 0.0000-10.0141 sec  5.80 GBytes  4.98 Gbits/sec
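(These are single-stream runs; with LACP a single TCP flow is hashed onto one physical link, so a parallel run would show whether the bond adds up. A sketch, assuming iperf3 is installed on both ends:)

iperf3 -c 10.12.0.37 -P 4        # four parallel streams
iperf3 -c 10.12.0.37 -P 4 -R     # same, reverse direction (server -> client)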
 
Is it possible that your PVE node is simply at its limit and can't deliver any more? Or have you set a limit on the backup bandwidth so as not to overload your drives? Have you ever run a benchmark on the PBS datastore?
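Two quick things to check for a configured limit (a sketch; paths as on a current PVE install, and traffic control is available in recent PBS versions):

# On the PVE node: any bandwidth limit for vzdump / backup jobs?
grep -i bwlimit /etc/vzdump.conf /etc/pve/jobs.cfg 2>/dev/null

# On the PBS host: any traffic-control rules?
proxmox-backup-manager traffic-control list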
 
Here are screenshots of the speed and the PBS load.
 

Attachments

  • s7.png
    s7.png
    95.9 KB · Views: 27
  • s8.png
    s8.png
    135 KB · Views: 28
It's difficult to help you if you ignore half the questions.

You could have thousands of PVE nodes; what would that change? Nothing.
You have to provide a few more facts, answer the questions, check something or try something out.
 
Thanks, I have done all the tests. What am I missing? There are no speed limits on the nodes, no I/O delay, the VMs run smoothly with 1 GB/s of read/write on local SSD disks, the network bandwidth is fine, and the PBS benchmark returns these values:

Time per request: 13590 microseconds.
TLS speed: 308.61 MB/s
SHA256 speed: 215.52 MB/s
Compression speed: 373.76 MB/s
Decompress speed: 575.69 MB/s
AES256/GCM speed: 1246.04 MB/s
Verify speed: 161.82 MB/s

+===================================+====================+
| Name                              | Value              |
+===================================+====================+
| TLS (maximal backup upload speed) | 308.61 MB/s (25%)  |
+-----------------------------------+--------------------+
| SHA256 checksum computation speed | 215.52 MB/s (11%)  |
+-----------------------------------+--------------------+
| ZStd level 1 compression speed    | 373.76 MB/s (50%)  |
+-----------------------------------+--------------------+
| ZStd level 1 decompression speed  | 575.69 MB/s (48%)  |
+-----------------------------------+--------------------+
| Chunk verification speed          | 161.82 MB/s (21%)  |
+-----------------------------------+--------------------+
| AES256 GCM encryption speed       | 1246.04 MB/s (34%) |
+===================================+====================+

What can I check?
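(For scale: the TLS value of 308 MB/s is the benchmark's per-connection ceiling, roughly 2.5 Gbit/s, so the observed 90-100 MB/s is well below it. A crude sequential-write spot check directly on the datastore would be a next step - a sketch, with the mount point /mnt/datastore/store1 only as an example; note that ZFS compression can inflate a zero-filled test:)

dd if=/dev/zero of=/mnt/datastore/store1/ddtest bs=1M count=4096 conv=fdatasync
rm /mnt/datastore/store1/ddtest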
 
For example, you could investigate why your PBS obviously can't reach 20 Gbit/s via iperf.

Maybe this could help you: https://forum.proxmox.com/threads/lacp-bonding-with-2-x10g-nic-are-giving-10g-traffic-only.111428/

Check your vzdump config to make sure there really is no limit set: https://pve.proxmox.com/pve-docs/chapter-vzdump.html#vzdump_configuration

You haven't told us yet which switches you use. You should check their config to see whether the port is limited by something and whether it is configured correctly.
I also don't know whether you can actually achieve 20 Gbit/s between two nodes.
You also didn't reveal which MTU you set, whether the switch supports it, and whether it is configured correctly on each node. Wrong MTU settings can cause various problems.

There is also no configuration or information about how you integrated the datastore on the PVE side, e.g. whether you use encryption or not.
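A quick way to verify the MTU end-to-end (a sketch; bond0 and the target IP are examples, and -s 8972 assumes an MTU of 9000):

ip -d link show bond0              # MTU actually set on the bond and its slaves
ping -M do -s 8972 10.12.0.37      # don't-fragment ping; fails if any hop has a smaller MTU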
 
Did you find a solution for this?

I've got 2 PBS instances running as VMs on PVE.
Both are hooked up with 2x 25 Gbit/s NICs,
bonded via LACP layer 3+4 and passed through via virtio to the PBS VMs.

I'm observing the following:
verify jobs and sync jobs are somehow throttled to 1 Gbit/s.

I checked the switch interface status, which is up and running at 25 Gbit/s.
ethtool on the host also reports a 25 Gbit/s connection.

When running iperf3 between those 2 PBS instances, I get a solid 10 Gbit/s (as the virtio network adapter correctly supports).
But when PBS is doing its job, I'm running at something like 0.7 to 1 Gbit/s.

vzdump checked, everything is commented out.
BWLimit was never set manually; I checked PBS itself and the jobs to find anything that could throttle... Didn't find anything.

When running proxmox-backup-client against PBS, I get these results:
1741208246212.png

That's not what I'm seeing in real-world performance :(

When observing the traffic on the switches, I can clearly say that it's using the correct QSFP interfaces (MikroTik switches).


Another example: VM 114

Backup running at nearly 2 Gbit/s.
1741208714660.png

But looking at the verify performance:
1741208784528.png


There is a huge performance gap.

This happens when verifying and syncing... as if the datastore would throttle those "secondary tasks" somehow...

Any ideas? :S
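One thing that could still be checked (a sketch, assuming the datastore sits on a ZFS pool on the PVE host; the pool name is an example):

# On the PVE host carrying the PBS VM's disks, while a verify job runs
zpool iostat -v tank 5       # per-vdev read throughput and IOPS
iostat -x 5                  # per-disk utilisation, to see if the HDDs are saturated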
 

Attachments

  • 1741208740851.png
    1741208740851.png
    44.6 KB · Views: 0
Last edited:
This happens when verifying and syncing... as if the datastore would throttle those "secondary tasks" somehow...
What kind of storage are you using as datastore? The disk type and local versus network can make a huge difference
 
Last edited:
Is reading the source disk faster than your network? What are your read/write disk speeds on both sides for a single thread? If you run fio, you often end up testing multi-core performance, while something like PBS only uses a single thread per job. You can schedule multiple jobs simultaneously to use more CPU/bandwidth, or you could (like I did once) schedule them all at the same time and have 21 nodes trying to stream over 2x 10 Gbps, at which point less than a gigabit per node is expected; I have since spread my schedule out better.

The other thing I noticed is that backups only back up changes, so while it is looping over your disks it may not need to stream more than a gigabit's worth of data over the network. You may also need to enable fleecing, depending on your disk workloads.

So there are lots of variables; I would benchmark both sides first and set expectations accordingly.
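If it helps, a minimal single-thread test along those lines might look like this (a sketch only - the datastore path is an example, 4M roughly matches PBS chunk sizes, and the numbers will differ on every pool):

# Single writer, roughly what one backup/verify worker sees
fio --name=pbs-single --directory=/mnt/datastore/store1 --rw=write --bs=4M \
    --size=8G --numjobs=1 --iodepth=1 --end_fsync=1

# Eight writers, to see how much parallelism the pool actually offers
fio --name=pbs-multi --directory=/mnt/datastore/store1 --rw=write --bs=4M \
    --size=8G --numjobs=8 --iodepth=1 --end_fsync=1 --group_reporting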
 
Last edited:
  • Like
Reactions: Johannes S
Thank you everyone for your replies <3

Let me answer your questions:

1. Yes, you are right: it was 1.8 GiB while backing up, not Gbit/s. Sorry :)

2. Both storage nodes are running 24x Exos enterprise SAS HDDs each in a ZFS RAIDZ2, totalling nearly 330 TB + 290 TB (yes, I know, only enterprise SSDs are recommended). The performance should be well over 1 Gbit/s? From the experience of a friendly company of mine running the same setup, I know it should read/write at around 200-250 MB/s, which would be the observed 1.8 Gbit/s.

Why would the backup process achieve that bandwidth, but the verify process be limited to 1 Gbit/s?

3. PBS is running in a VM; the ZFS RAID is hosted by the PVE host and passed through via virtio to the local PBS VM on that host. Therefore PBS is accessing the disks "locally"?

4. @guruevi: I don't understand yet why you assume that the read speed is faster than the network.
Your point was good that multiple tasks at the same time could limit the bandwidth of each process - but while I observe the verify process at 1 Gbit/s, there is nothing else running on or to that PBS :(

If I run all backup jobs at the same time, the Ceph cluster muscles up and pumps the data - but due to retransmissions and the lack of write speed on the PBS nodes (they can't handle the NVMe + SSD traffic), I timed the backups to have pauses in between.

Could you give me a hand and explain how I could do the benchmarks according to your instructions? :) <3


Thank you everyone, I'm lost without you!
 
Last edited:
24 disks in a single RAIDZ2 is not recommended; I would split them up into 3x 8-disk RAIDZ2 vdevs and, if possible, add a spare or two. A single vdev will have roughly the throughput of its slowest disk, which is truthfully about 0.8-1.6 Gbps (100 MB/s for 5400 RPM, 200 MB/s for 15k RPM) on spinning disks; three vdevs will have the speed of three vdevs, and so on.
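A layout like that could look roughly like this (a sketch only - pool and device names are placeholders, and the spares obviously need extra drives beyond the 24):

zpool create tank \
  raidz2 sda sdb sdc sdd sde sdf sdg sdh \
  raidz2 sdi sdj sdk sdl sdm sdn sdo sdp \
  raidz2 sdq sdr sds sdt sdu sdv sdw sdx \
  spare  sdy sdz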
 
Last edited:
  • Like
Reactions: UdoB and Johannes S
HDDs are too slow for PBS, even more so with ZFS.
Perhaps RAID10 with mdadm or hardware RAID10 will give some boost,
but verify jobs and garbage collection will still be slow.
BTW, double check that there are no overlapping tasks, e.g. a backup during a GC or verify task...
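To rule out overlapping tasks, the task list can be checked on the PBS host (a sketch; the GUI task view shows the same information):

proxmox-backup-manager task list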
 
  • Like
Reactions: Johannes S
You can set up mirrors with ZFS as well; as I said before, a single vdev will have the speed of the slowest disk. Spreading your load over more vdevs will improve it, so yes, having 12 mirror vdevs will be faster than a single 24-disk RAIDZ2. Over 3x 8-disk RAIDZ2 on 7200 RPM drives you can definitely push ~2 Gbps; add a few SSDs for cache and SLOG. I was able to get 14 nodes backed up by spreading the load over 24 hours.
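Adding those devices would look roughly like this (a sketch only - pool and device names are placeholders, and an SLOG only helps synchronous writes, so it's worth testing whether it changes anything for PBS's mostly asynchronous chunk writes):

zpool add tank log mirror nvme0n1 nvme1n1     # mirrored SLOG
zpool add tank cache nvme2n1                  # L2ARC read cache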
 
Last edited: