PBS Performance improvements (enterprise all-flash)

Thanks for the additional numbers! That does indeed look like there is some severe bottleneck happening that we should get to the bottom of.
I suspect that there is something in the compression pipeline.
Because, like I mentioned in the bug report, even though backups work differently with PBS than with local-storage backup, the limitations are all exactly the same.

In other words, the speeds are identical between local and PBS as soon as compression is enabled (no matter which compression method or how many zstd threads).
With compression disabled, it feels like every limitation is removed; backup speeds go up to 5 GB/s here.
With compression enabled (no matter which method, multithreading, etc.), local or PBS, I can't pass 1 GB/s.
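One way to test whether a single compression stream is the ceiling is to benchmark single-threaded compression in isolation and compare the number to the observed backup speed. A rough sketch, using Python's stdlib zlib since no zstd module ships with it (PBS actually uses zstd, so absolute numbers will differ; the buffer here is synthetic, not real VM data):

```python
import time
import zlib

# Build a 64 MiB synthetic buffer that is partly compressible,
# loosely mimicking mixed VM disk data (an assumption, not real PBS chunks).
block = (b"A" * 3072 + bytes(range(256)) * 4) * 16
data = block * (64 * 1024 * 1024 // len(block))

start = time.perf_counter()
compressed = zlib.compress(data, 1)   # one stream, one core, fast level
elapsed = time.perf_counter() - start

mib_per_s = len(data) / elapsed / 2**20
print(f"single-stream compression: {mib_per_s:.0f} MiB/s "
      f"(ratio {len(data) / len(compressed):.2f}x)")
```

If the backup speed with compression enabled roughly matches a number like this while disabling compression removes the cap, that points at one serial compression stream per backup as the bottleneck.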

And in my opinion it has nothing to do with clock speeds either, because I monitored CPU utilization during backups (local and PBS) and I don't see any cores reaching 100%.
But I'm not entirely sure, since on the other hand clock speeds are definitely important: on servers that can reach higher clock speeds I get faster backup speeds.

It took me 3 days of trying to find the bottleneck when I created my bug report...
I tried creating PBS VM instances and even reinstalled one of our two Genoa servers with PBS, to have the fastest possible PVE+PBS combination, tried local storage and every tuning option possible for compression, etc...
Monitored CPU and I/O, etc...
And I couldn't find out what the bottleneck is, so I gave up.

On the flip side, it seems like I'm still achieving the fastest PBS backup speeds with 1 GB/s, even with SAS (HDD) drives on PBS, while others are using NVMe drives and struggle to reach 500 MB/s.

Cheers
 
Just as bad, or worse.

Code:
Upload image '/dev/mapper/pve-vm--100--disk--0' to 'root@pam@10.226.10.10:8007:pbs-primary' as tomtest.img.fidx
tomtest.img: had to backup 62.832 GiB of 80 GiB (compressed 42.404 GiB) in 673.35s
tomtest.img: average backup speed: 95.552 MiB/s
tomtest.img: backup was done incrementally, reused 17.168 GiB (21.5%)
Duration: 676.53s
End Time: Tue Jul  9 12:59:22 2024
FWIW, I very likely found the issue with this particular invocation, and might have some test package (if you are willing!) to try for image backups using proxmox-backup-client.
 
Any news here? We also have a performance issue with NVMe disks on Proxmox Backup Server.
And what will you tell us?
Only 10 MB/s write throughput?
"Use different enterprise NVMe disks in a ZFS RAID 10-like setup on fast new hardware"?
 
There was some change in git regarding the "input buffer size".
The question is whether this was changed generally in later PBS versions, or whether it is something we must change manually to see if it solves our problem.
 
There was some change in git regarding the "input buffer size"
I don't know what you are referring to. Do you have a link to the change you mention? Also, resurrecting a year+ old post without providing details isn't that useful. In the meantime there have been improvements for both restores and verification tasks, and you don't mention what kind of "performance issue" you have.

The changes mentioned above [1] were released with PBS 3.3 [2].

[1] https://forum.proxmox.com/threads/p...ments-enterprise-all-flash.150514/post-688915
[2] https://pbs.proxmox.com/wiki/Roadmap#Proxmox_Backup_Server_3.3
 
From my part, regarding the mentioned bug report...

Nothing has been fixed since this thread was opened. PBS backups are still single-core TLS limited.

It was an eternity ago and it's still the same issue today.
9374F -> 1 GB/s limit
Xeon 4210R -> 200 MB/s limit

So there is absolutely nothing fixed in this thread. I think people just gave up on this.
 
We're in the process of implementing PVE and PBS in our environment and have been stuck on this issue for a few days. Here's a quick writeup of our experience and testing. TLDR at the end.

The setup

PVE hosts are Dell R640's with these specs:
  • 2x Xeon Gold 6132
  • 512GB RAM
  • 4x 10G interfaces (2 for data in a bond, 2 for multipathed storage traffic)

PBS is run as a VM on one of the PVE hosts with this setup:
  • 8 vCPU
  • 200GB RAM
  • Multiple network interfaces so backup & storage traffic bypasses firewalls
Storage:
  • VM uses all flash block storage via NVMe-TCP (Pure Storage FlashArray)
  • PBS datastore is backed by all flash object/file storage via NFS (Pure Storage FlashBlade)

Diagram:
Code:
     +--------> PVE (SRC)
     |NVMe-TCP    ^
     V            |
FlashArray        |PBS Backup
     ^            |
     |NVMe-TCP    V            NFS
     +--------> PVE [PBS VM] <-----> FlashBlade

Backup Performance
  • from a separate PVE host to PBS: ~130 MB/s
  • from the same PVE host that PBS runs on: ~190 MB/s

Troubleshooting & changes
fio from PVE to the FlashArray was getting ~2 GB/s, as expected
fio from PBS to the FlashBlade was getting ~1 GB/s, as expected for a single thread
iperf from the PVE source host to PBS was getting ~1 GB/s, as expected for a single thread

PBS VM changes:
  • from 2 sockets/4 cores to 1 socket/8 cores -> no change
  • disabled NUMA -> no change
  • Spectre mitigations -> no change
  • reverted kernel from 6.17 to 6.14 -> no change

PBS benchmark
Code:
root@pbs-01:~# proxmox-backup-client benchmark --repository purefb-02-pbs-nfs
Uploaded 319 chunks in 5 seconds.
Time per request: 15778 microseconds.
TLS speed: 265.83 MB/s   
SHA256 speed: 459.42 MB/s   
Compression speed: 380.41 MB/s   
Decompress speed: 540.43 MB/s   
AES256/GCM speed: 446.25 MB/s   
Verify speed: 246.51 MB/s   
┌───────────────────────────────────┬───────────────────┐
│ Name                              │ Value             │
╞═══════════════════════════════════╪═══════════════════╡
│ TLS (maximal backup upload speed) │ 265.83 MB/s (22%) │
├───────────────────────────────────┼───────────────────┤
│ SHA256 checksum computation speed │ 459.42 MB/s (23%) │
├───────────────────────────────────┼───────────────────┤
│ ZStd level 1 compression speed    │ 380.41 MB/s (51%) │
├───────────────────────────────────┼───────────────────┤
│ ZStd level 1 decompression speed  │ 540.43 MB/s (45%) │
├───────────────────────────────────┼───────────────────┤
│ Chunk verification speed          │ 246.51 MB/s (33%) │
├───────────────────────────────────┼───────────────────┤
│ AES256 GCM encryption speed       │ 446.25 MB/s (12%) │
└───────────────────────────────────┴───────────────────┘
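These per-stage numbers suggest a ceiling: if the client pushes each chunk through hashing, compression, encryption, and TLS upload serially on one thread (my assumption about the pipeline model, not confirmed from PBS source), the effective rate is the harmonic combination of the stage rates, which ends up well below even the slowest single stage. A back-of-envelope sketch using the benchmark numbers above:

```python
# Stage throughputs in MB/s, taken from the proxmox-backup-client benchmark above.
stages = {
    "sha256": 459.42,
    "zstd level 1": 380.41,
    "aes256-gcm": 446.25,
    "tls upload": 265.83,
}

# If one worker runs every stage back to back per chunk, the per-byte times add,
# so the effective rate is the harmonic combination of the stage rates.
effective = 1 / sum(1 / v for v in stages.values())
print(f"serial single-thread ceiling: {effective:.0f} MB/s")
print(f"hard upper bound (slowest stage): {min(stages.values()):.0f} MB/s")
```

With these numbers the serial model lands near the ~95 MiB/s seen in the backup log earlier in the thread, though that could be coincidence.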
Compared the numbers here to the benchmark wiki page and was surprised our ~5-year-old CPUs were performing on par with ~10-year-old CPUs.

That realisation sent me looking into CPU instruction sets.
Changed the PBS VM CPU type to host -> AES256 GCM speed increased from 446 MB/s to ~3300 MB/s! Great! No change to SHA256 speed, which is running at about 20% of what I would expect...

Looked into SHA instruction sets and found our Intel Cascade Lake CPUs don't have the SHA extensions... Apparently they became generally available with Ice Lake CPUs.
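Whether a CPU exposes the SHA extensions can be checked from the flags line in /proc/cpuinfo (the sha_ni flag on x86). A small sketch of that check; the sample flag strings below are illustrative, not full cpuinfo output:

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    return flags

def has_sha_ni(cpuinfo_text: str) -> bool:
    """True if the SHA extensions (sha_ni flag) are present."""
    return "sha_ni" in cpu_flags(cpuinfo_text)

# On a live Linux box you would read the real file:
#   with open("/proc/cpuinfo") as f: print(has_sha_ni(f.read()))

cascade_lake = "flags\t: fpu sse2 aes avx512f"         # illustrative, no sha_ni
ice_lake     = "flags\t: fpu sse2 aes avx512f sha_ni"  # illustrative
print(has_sha_ni(cascade_lake), has_sha_ni(ice_lake))  # → False True
```

Note that inside a VM the flags reflect the virtual CPU model, which is why switching the VM CPU type to host can change what the guest sees.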

So now I'm on the hunt for a physical host with newer CPUs that support the SHA instruction sets...


I think this is a very important piece of information that should be added to the PBS system requirements documentation, but I'm not sure how to get it in there.


TLDR: The CPUs in the physical servers we run PBS on don't have the instruction sets to accelerate SHA calculations, meaning SHA256 is computed entirely in software. This bottlenecks the entire backup traffic path, as the SHA256 calculations are used for deduplication.
 
So it seems I was only looking at half the picture. I happened to run a backup of another VM in another cluster, and suddenly the backup was peaking at ~500 MB/s and averaging about 300 MB/s.

This got me looking closer at the source PVE node, and I found it's running a Xeon Gold 6526Y, which has Intel's SHA extensions.

PBS benchmark numbers for reference:



Code:
Host                      PBS VM                 Physical node          Physical node
                                                 (runs the PBS VM)      (runs the source VM)
CPU                       Intel Xeon Gold 6132   Intel Xeon Gold 6132   Intel Xeon Gold 6526Y
Chunks uploaded in 5 s    922                    232                    252
Time per request (us)     5433                   21894                  20523
TLS speed (MB/s)          772                    192                    204
SHA256 speed (MB/s)       459                    462                    1688
Compression speed (MB/s)  440                    424                    519
Decompress speed (MB/s)   702                    627                    814
AES256/GCM speed (MB/s)   3324                   3370                   10131
Verify speed (MB/s)       275                    268                    548


You can see (comparing the two physical nodes):
  • SHA256 increased by ~3.6x
  • AES increased by ~3x
  • TLS was basically unchanged

More digging with my good friend Mr GPT suggests:

For backups, the PVE node with the source VM is responsible for:
  • chunking the VM data
  • SHA256 hashing of chunks
  • compression (zstd)
  • encryption (AES-GCM, if enabled)
  • sending only chunks that don't already exist on PBS over TLS
and the PBS server is responsible for:
  • storing incoming chunks
  • updating indexes
  • verifying chunks (SHA256)
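The backup-side steps above can be sketched as a toy pipeline. Fixed 4 MiB chunking and stdlib zlib stand in for PBS's chunker and zstd, and a plain set of digests stands in for the server's known-chunks index; all of this is illustrative, not PBS's actual implementation:

```python
import hashlib
import zlib

CHUNK_SIZE = 4 * 1024 * 1024  # fixed-size chunks for image backups (assumption)

def backup(image: bytes, known_digests: set) -> list:
    """Return the compressed chunks that actually need uploading (toy model)."""
    uploads = []
    for off in range(0, len(image), CHUNK_SIZE):
        chunk = image[off:off + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).digest()   # dedupe key
        if digest in known_digests:               # chunk already on the server
            continue
        known_digests.add(digest)
        uploads.append(zlib.compress(chunk, 1))   # zlib stands in for zstd
    return uploads

seen = set()
disk_a = bytes(8 * 1024 * 1024)                   # two identical zero chunks
print(len(backup(disk_a, seen)))                  # → 1 (zero chunks dedupe)
disk_b = b"\x01" + bytes(8 * 1024 * 1024 - 1)     # only the first chunk differs
print(len(backup(disk_b, seen)))                  # → 1 (only the changed chunk)
```

The point of the sketch: every byte of the source disk passes through SHA256 on the client before dedupe can decide anything, which is why a client CPU without SHA acceleration caps the whole path.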

During restores, the roles flip. The PBS server is responsible for:
  • reading the chunks from the datastore
  • decrypting the chunks
  • decompressing the chunks
  • verifying the chunks
  • reassembling the data
  • sending the data to the target PVE node over TLS
and the PVE node is responsible for:
  • receiving the data stream
  • writing the data to the VM disk

Again, this was AI generated, but it all seems to make sense. I haven't found any documentation which confirms it, and I'm not inclined to scour the source code at the moment to validate it.

Happy to be corrected by anyone who knows better.

TLDR: During backups, the CPU on the source PVE node matters more for throughput than the CPU on PBS, and vice versa during restores.
 