PBS Performance improvements (enterprise all-flash)

thanks for the additional numbers! that does indeed look like there is some severe bottleneck happening that we should get to the bottom of..
I suspect that the bottleneck is somewhere in the compression pipeline.
Because, like I mentioned in the bug report, even though backups to PBS work differently from local-storage backups, the limitations are exactly the same.

In other words: the speeds are identical between local and PBS as soon as compression is enabled (no matter which compression, or how many zstd worker threads).
With compression disabled, it feels like every limitation is removed; backup speeds go up to 5 GB/s here.
With compression enabled (no matter which algorithm, multithreaded or not), local or PBS, I can't pass 1 GB/s.
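For what it's worth, single-thread compression throughput is easy to sanity-check outside of PBS. A minimal sketch in Python (stdlib zlib standing in for zstd, so absolute numbers won't match PBS, but it shows that one compression stream is capped by one core no matter how many others sit idle):

```python
import time
import zlib

# Build ~64 MiB of moderately compressible data.
block = b"some moderately compressible payload "
data = block * ((64 << 20) // len(block))

# Compress it on a single core at level 1 and report throughput;
# a single compression stream can never exceed this figure.
t0 = time.perf_counter()
compressed = zlib.compress(data, 1)
elapsed = time.perf_counter() - t0
mib_per_s = len(data) / (1 << 20) / elapsed
print(f"compressed {len(data) >> 20} MiB at {mib_per_s:.0f} MiB/s on one core")
```

If the zstd CLI is installed, its built-in benchmark (`zstd -b1`) gives comparable single-stream numbers for the actual algorithm.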

And in my opinion it has nothing to do with clock speeds either, because I monitored CPU utilization during backups (local and PBS) and I don't see any cores reaching 100%.
But I'm not entirely sure, since on the other hand clock speeds are definitely important: servers that reach higher clock speeds give me faster backup speeds.

It took me three days of trying to find the bottleneck when I created my bug report...
I tried PBS VM instances and even reinstalled one of the two Genoa servers with PBS, to have the fastest possible PVE+PBS on the planet; tried local storage and every possible tuning option for compression, etc...
Monitored CPU and IO etc...
And couldn't find the bottleneck, so I gave up.

On the flip side, it seems I'm still achieving the fastest PBS backup speeds at 1 GB/s even with SAS (HDD) drives on PBS, while others use NVMe drives and struggle to reach 500 MB/s.

Cheers
 
Just as bad, or worse.

Code:
Upload image '/dev/mapper/pve-vm--100--disk--0' to 'root@pam@10.226.10.10:8007:pbs-primary' as tomtest.img.fidx
tomtest.img: had to backup 62.832 GiB of 80 GiB (compressed 42.404 GiB) in 673.35s
tomtest.img: average backup speed: 95.552 MiB/s
tomtest.img: backup was done incrementally, reused 17.168 GiB (21.5%)
Duration: 676.53s
End Time: Tue Jul  9 12:59:22 2024
FWIW, I very likely found the issue with this particular invocation, and might have some test package (if you are willing!) to try for image backups using proxmox-backup-client.
 
Any news here? We also have a performance issue with NVMe disks on Proxmox Backup Server.
And what will you tell us?
Only 10 Mb/s write throughput?
"Use other enterprise NVMe disks in a ZFS RAID-10-like setup on fast new hardware."
 
There was some change in git regarding the "input buffer size".
The question is whether this was changed generally in later PBS versions, or whether it is something we must change manually to see if it solves our problem.
 
There was some change in git regarding the "input buffer size"
I don't know what you are referring to. Do you have a link to the change you mention? Also, resurrecting a year+ old post without providing details isn't that useful. In the meantime there have been improvements for both restores and verification tasks, and you don't mention what kind of "performance issue" you have.

The changes mentioned above [1] were released with PBS 3.3 [2]

[1] https://forum.proxmox.com/threads/p...ments-enterprise-all-flash.150514/post-688915
[2] https://pbs.proxmox.com/wiki/Roadmap#Proxmox_Backup_Server_3.3
 
For my part, or as far as the bug report I mentioned goes...

Nothing has been fixed since this thread was opened. PBS backups are still single-core TLS limited.

That was an eternity ago and it's still the same issue today.
9374F -> 1 GB/s limit
Xeon 4210R -> 200 MB/s limit

So absolutely nothing in this thread has been fixed. I think people just gave up on this.
 
We're in the process of implementing PVE and PBS in our environment and have been stuck on this issue for a few days. Here's a quick writeup of our experience and testing. TLDR at the end.

The setup

PVE hosts are Dell R640's with these specs:
  • 2x Xeon Gold 6132
  • 512GB RAM
  • 4x 10G interfaces (2 for data in a bond, 2 for multipathed storage traffic)

PBS is run as a VM on one of the PVE hosts with this setup:
  • 8 vCPU
  • 200GB RAM
  • Multiple network interfaces so backup & storage traffic bypasses firewalls
Storage:
  • VM uses all flash block storage via NVMe-TCP (Pure Storage FlashArray)
  • PBS datastore is backed by all flash object/file storage via NFS (Pure Storage FlashBlade)

Diagram:
Code:
     +--------> PVE (SRC)
     |NVMe-TCP    ^
     V            |
FlashArray        |PBS Backup
     ^            |
     |NVMe-TCP    V            NFS
     +--------> PVE [PBS VM] <-----> FlashBlade

Backup Performance
  • from a separate PVE host to PBS was running at ~130MB/s
  • from the same PVE host that PBS runs on was running at ~190MB/s

Troubleshooting & changes
fio from PVE to the FlashArray was getting ~2GB/s as expected
fio from PBS to the FlashBlade was getting ~1GB/s as expected for a single thread
iperf from PVE source host to PBS was getting ~1GB/s as expected for a single thread

PBS VM changes:
  • went from 2 sockets/4 cores to 1 socket/8 cores -> no change
  • disabled NUMA -> no change
  • Spectre mitigations -> no change
  • reverted kernel from 6.17 to 6.14 -> no change

PBS benchmark
Code:
root@pbs-01:~# proxmox-backup-client benchmark --repository purefb-02-pbs-nfs
Uploaded 319 chunks in 5 seconds.
Time per request: 15778 microseconds.
TLS speed: 265.83 MB/s   
SHA256 speed: 459.42 MB/s   
Compression speed: 380.41 MB/s   
Decompress speed: 540.43 MB/s   
AES256/GCM speed: 446.25 MB/s   
Verify speed: 246.51 MB/s   
┌───────────────────────────────────┬───────────────────┐
│ Name                              │ Value             │
╞═══════════════════════════════════╪═══════════════════╡
│ TLS (maximal backup upload speed) │ 265.83 MB/s (22%) │
├───────────────────────────────────┼───────────────────┤
│ SHA256 checksum computation speed │ 459.42 MB/s (23%) │
├───────────────────────────────────┼───────────────────┤
│ ZStd level 1 compression speed    │ 380.41 MB/s (51%) │
├───────────────────────────────────┼───────────────────┤
│ ZStd level 1 decompression speed  │ 540.43 MB/s (45%) │
├───────────────────────────────────┼───────────────────┤
│ Chunk verification speed          │ 246.51 MB/s (33%) │
├───────────────────────────────────┼───────────────────┤
│ AES256 GCM encryption speed       │ 446.25 MB/s (12%) │
└───────────────────────────────────┴───────────────────┘
Compared the numbers here to the benchmark wiki page and was surprised our ~5-year-old CPUs were performing on par with ~10-year-old CPUs.

That realisation sent me looking into CPU instruction sets.
Changed the PBS CPU type to host -> AES256 GCM speed increased from 446 MB/s to ~3300 MB/s! Great! No change to SHA256 speeds, though, which are running at about 20% of what I would expect...

Looked into SHA instruction sets and found that our Intel Cascade Lake CPUs don't have the SHA extensions... Apparently they only became generally available with Ice Lake.

So now I'm on the hunt for a physical host with new CPUs that support SHA instruction sets...


I think this is a very important piece of information that should be added to the PBS system requirements documentation, but I'm not sure how to get it in there?


TLDR: The CPUs in the physical servers we run PBS on don't have the instruction sets to accelerate SHA calculations, so hashing is done entirely in software. This bottlenecks the entire backup traffic path, since the SHA256 calculations are used for deduplication.
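If you want to check this on your own hosts: on Linux the relevant CPU flag is sha_ni, and a short hashlib loop gives a rough single-core SHA256 number. A hedged sketch (OpenSSL's assembly paths and PBS's chunk sizes mean real figures will differ somewhat):

```python
import hashlib
import time

# 1) Does this CPU advertise the SHA extensions? (Linux-only check)
try:
    with open("/proc/cpuinfo") as f:
        print("sha_ni present:", " sha_ni" in f.read())
except OSError:
    print("could not read /proc/cpuinfo")

# 2) Rough single-core SHA256 throughput: hash 128 MiB in 4 MiB updates.
buf = bytes(4 << 20)
h = hashlib.sha256()
t0 = time.perf_counter()
for _ in range(32):
    h.update(buf)
elapsed = time.perf_counter() - t0
mib_per_s = (32 * len(buf)) / (1 << 20) / elapsed
print(f"SHA256: {mib_per_s:.0f} MiB/s on one core")
```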
 
So it seems I was only looking at half the picture. I happened to run a backup of another VM in another cluster and suddenly the backup was peaking at ~500MB/s and averaging about 300MB/s.

This got me looking closer at the source PVE node, and I found it's running a Xeon Gold 6526Y, which has Intel's SHA extensions.

PBS benchmark numbers for reference:



Code:
Host                      | PBS VM               | Physical node where PBS VM runs | Physical node where source VM runs
CPU                       | Intel Xeon Gold 6132 | Intel Xeon Gold 6132            | Intel Xeon Gold 6526Y
Chunks uploaded in 5s     | 922                  | 232                             | 252
Time per request (us)     | 5433                 | 21894                           | 20523
TLS speed (MB/s)          | 772                  | 192                             | 204
SHA256 speed (MB/s)       | 459                  | 462                             | 1688
Compression speed (MB/s)  | 440                  | 424                             | 519
Decompress speed (MB/s)   | 702                  | 627                             | 814
AES256/GCM speed (MB/s)   | 3324                 | 3370                            | 10131
Verify speed (MB/s)       | 275                  | 268                             | 548


You can see (6132 vs 6526Y):
SHA256 increased by ~3.6x
AES increased by ~3x
TLS was basically unchanged

More digging with my good friend Mr GPT suggests:

For backups, the PVE node with the source VM is responsible for:
  • chunking the VM data
  • SHA256 hashing of chunks
  • compression (zstd)
  • encryption (AES-GCM, if enabled)
  • sending only chunks that don't already exist on PBS over TLS
and the PBS server is responsible for:
  • storing incoming chunks
  • updating indexes
  • verifying chunks (SHA256)

During restores, the roles flip. The PBS server is responsible for:
  • reads the chunk from the datastore
  • decrypts the chunk
  • decompresses the chunk
  • verifies the chunk
  • reassembles the data
  • send data to the target PVE node over TLS
and the PVE node is responsible for:
  • receiving the data stream
  • writing the data to VM disk

Again, this was AI generated, but it all seems to make sense. I haven't found any documentation confirming it, and I'm not inclined to scour the source code at the moment to validate it.

Happy to be corrected by anyone who knows better.

TLDR: During backup, the CPU on the source PVE node is more important than CPU on PBS for throughput and vice versa during restores.
 
Hello,
this is exactly the problem we had with our PBS. The only solution is to run PBS on a dedicated server with no hardware RAID; you must use ZFS with the physical disks. PBS must be able to handle/address the physical disks directly. You can use a ZFS "raid". This is the only way PBS can run with good performance. It fully solved our issue and we never had backup performance problems again.

Best Regards
Marco
 
So it seems I was only looking at half the picture. [...] TLDR: During backup, the CPU on the source PVE node is more important than CPU on PBS for throughput and vice versa during restores.
Yes, hardware acceleration helps. The TLS speed is probably affected by latency a lot -> that is solved by buffering/ doing more work in parallel if you can. The benchmark is only a small part of the whole picture, don't rely on those numbers too much.
 
During restores, the roles flip. The PBS server is responsible for:

  • reads the chunk from the datastore
  • decrypts the chunk
  • decompresses the chunk
  • verifies the chunk
  • reassembles the data
  • send data to the target PVE node over TLS
and the PVE node is responsible for:

  • receiving the data stream
  • writing the data to VM disk

that's not quite true, it really looks more like this:

The PBS server is responsible for:
  • reads the chunk from the datastore
  • send data to the target PVE node over TLS
The client/PVE node is responsible for:

  • receiving the data stream
  • decrypts the chunk
  • decompresses the chunk
  • verifies the chunk
  • reassembles the data
  • writing the data to VM disk
the PBS server does parse each index (the fidx/didx files) once when restoring (to know which chunks are part of this backup snapshot and thus allowed to be accessed), but that is fairly cheap.

when doing a backup, both sides will construct the indices in parallel and verify the results match. and the server will calculate the CRC and, if unencrypted, verify the digest of the uploaded chunks, so yes, for a backup the PBS server has to do a bit more compute work.
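To make that division of labour concrete, here is a hypothetical per-chunk sketch of the backup path as described above (illustration only, not the actual PBS code; zlib stands in for zstd and the wire protocol is heavily simplified):

```python
import hashlib
import zlib

def client_prepare(chunk: bytes, known_digests: set) -> tuple:
    """Client side: hash every chunk, but compress and upload only
    chunks the server does not already have (deduplication)."""
    digest = hashlib.sha256(chunk).hexdigest()
    if digest in known_digests:
        return digest, None  # dedup hit: only the digest enters the index
    return digest, zlib.compress(chunk, 1)

def server_accept(digest: str, payload: bytes, crc: int, encrypted: bool) -> bool:
    """Server side: always check the CRC of the uploaded blob; re-verify
    the SHA256 digest only when the chunk is unencrypted."""
    if zlib.crc32(payload) != crc:
        return False
    if not encrypted:
        data = zlib.decompress(payload)
        if hashlib.sha256(data).hexdigest() != digest:
            return False
    return True
```

For example, uploading the same chunk twice would hash it twice but compress and transfer it only once, while the server-side checks catch corrupted uploads.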
 
Yes, hardware acceleration helps. The TLS speed is probably affected by latency a lot -> that is solved by buffering/ doing more work in parallel if you can. The benchmark is only a small part of the whole picture, don't rely on those numbers too much.

Yes, I appreciate that the benchmark targets very specific stages of the backup data pipeline; however, I believe the underperformance of backups in my environment lies in these stages, so I'm using it to help guide my investigation.


that's not quite true, it really looks more like this: [...] for a backup the PBS server has to do a bit more compute work.

Thanks for the insight Fabian!

Because I'm not 100% clear from your post, could you help shed light on which host (the source PVE or the target PBS) performs the SHA256 hashing during a backup of VM data from the source PVE node to the target PBS?

As in my previous post, backup transfer rates improved drastically when backing up a VM from a PVE node whose CPU has the SHA instruction sets vs a PVE node without (which therefore calculates SHA hashes in software).
That, along with the difference in benchmark numbers, indicates to me that SHA performance is the primary bottleneck in my environment. Therefore I'm trying to understand whether there would likely be any benefit to SHA hashing performance from moving PBS to a server whose CPUs do hardware-accelerated SHA hashing.