We're in the process of implementing PVE and PBS in our environment and have been stuck on this issue for a few days. Here's a quick writeup of our experience and testing. TLDR at the end.
The setup
PVE hosts are Dell R640s with these specs:
- 2x Xeon Gold 6132
- 512GB RAM
- 4x 10G interfaces (2 for data in a bond, 2 for multipathed storage traffic)
PBS is run as a VM on one of the PVE hosts with this setup:
- 8 vCPU
- 200GB RAM
- Multiple network interfaces so backup & storage traffic bypasses firewalls
Storage:
- VMs use all-flash block storage via NVMe-TCP (Pure Storage FlashArray)
- PBS datastore is backed by all-flash object/file storage via NFS (Pure Storage FlashBlade)
Diagram:
Code:
      +--------> PVE (SRC)
      |NVMe-TCP   ^
      V           |
 FlashArray       |PBS Backup
      ^           |
      |NVMe-TCP   V             NFS
      +--------> PVE [PBS VM] <-----> FlashBlade
Backup Performance
- Backups from a separate PVE host to PBS ran at ~130MB/s
- Backups from the same PVE host that PBS runs on ran at ~190MB/s
Troubleshooting & changes
fio from PVE to the FlashArray was getting ~2GB/s as expected
fio from PBS to the FlashBlade was getting ~1GB/s as expected for a single thread
iperf from PVE source host to PBS was getting ~1GB/s as expected for a single thread
PBS VM
changed CPU topology from 2 sockets/4 cores to 1 socket/8 cores -> no change
disabled NUMA -> no change
Spectre mitigations -> no change
reverted kernel from 6.17 to 6.14 -> no change
PBS benchmark
Code:
root@pbs-01:~# proxmox-backup-client benchmark --repository purefb-02-pbs-nfs
Uploaded 319 chunks in 5 seconds.
Time per request: 15778 microseconds.
TLS speed: 265.83 MB/s
SHA256 speed: 459.42 MB/s
Compression speed: 380.41 MB/s
Decompress speed: 540.43 MB/s
AES256/GCM speed: 446.25 MB/s
Verify speed: 246.51 MB/s
┌───────────────────────────────────┬───────────────────┐
│ Name                              │ Value             │
╞═══════════════════════════════════╪═══════════════════╡
│ TLS (maximal backup upload speed) │ 265.83 MB/s (22%) │
├───────────────────────────────────┼───────────────────┤
│ SHA256 checksum computation speed │ 459.42 MB/s (23%) │
├───────────────────────────────────┼───────────────────┤
│ ZStd level 1 compression speed    │ 380.41 MB/s (51%) │
├───────────────────────────────────┼───────────────────┤
│ ZStd level 1 decompression speed  │ 540.43 MB/s (45%) │
├───────────────────────────────────┼───────────────────┤
│ Chunk verification speed          │ 246.51 MB/s (33%) │
├───────────────────────────────────┼───────────────────┤
│ AES256 GCM encryption speed       │ 446.25 MB/s (12%) │
└───────────────────────────────────┴───────────────────┘
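A side note on reading the table: assuming each percentage is the measured speed relative to a fixed reference system (that is an assumption about the output, not something confirmed here), the implied 100% figures can be backed out with a bit of arithmetic. A rough sketch; implied_reference is a helper name made up for this post, and the percentages are rounded, so the results are approximate:

```shell
#!/bin/sh
# Back out the implied 100% reference speed from a "value MB/s (pct%)" pair.
# Assumption: pct = measured speed / reference speed * 100.
implied_reference() {
    # $1 = measured MB/s, $2 = percentage shown in the benchmark table
    awk -v v="$1" -v p="$2" 'BEGIN { printf "%.0f\n", v * 100 / p }'
}

echo "SHA256 implied reference: $(implied_reference 459.42 23) MB/s"
echo "AES256 implied reference: $(implied_reference 446.25 12) MB/s"
echo "TLS implied reference:    $(implied_reference 265.83 22) MB/s"
```

Read this way, SHA256 at 459.42 MB/s and 23% points to a reference of roughly 2GB/s, which is the gap that stands out in the rest of this post.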
Compared the numbers here to the benchmark wiki page and was surprised that our ~5-year-old CPUs were performing on par with ~10-year-old CPUs.
That realisation sent me looking into CPU instruction sets.
changed PBS VM CPU type to host -> AES256 GCM speed increased from 446MB/s to ~3300MB/s! Great! No change to SHA256 speeds, though, which are running at about 20% of what I would expect...
Looked into SHA instruction sets and found our Intel Xeon Gold 6132 CPUs (Skylake-SP generation; the same applies to Cascade Lake) don't have SHA extensions... Apparently, on Intel server CPUs, they only became generally available with Ice Lake.
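A quick way to check this on any Linux box is to look at the CPU flags in /proc/cpuinfo, where the kernel lists the SHA extensions as sha_ni and AES-NI as aes. A minimal sketch (has_cpu_flag is a made-up helper name, not a PVE/PBS tool). It is worth running both on the PVE host and inside the PBS VM, since a non-host CPU type can hide flags from the guest:

```shell
#!/bin/sh
# Report whether a CPU flag appears in a cpuinfo-style file.
# Usage: has_cpu_flag FLAG [FILE]   (FILE defaults to /proc/cpuinfo)
has_cpu_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    # Take the first "flags" line, put one flag per line,
    # then require an exact whole-line match on the flag name.
    grep -m1 '^flags' "$file" | tr ' \t' '\n\n' | grep -x "$flag" >/dev/null
}

for f in aes sha_ni; do
    if has_cpu_flag "$f"; then
        echo "$f: present"
    else
        echo "$f: missing"
    fi
done
```

If aes shows up on the host but not in the VM, the VM's CPU type is masking it. If sha_ni is missing on the host itself, no VM setting will bring it back.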
So now I'm on the hunt for a physical host with new CPUs that support SHA instruction sets...
I think this is a very important piece of information that should be added to the PBS system requirements documentation, but I'm not sure how to get it in there.
TLDR: The CPUs on the physical servers we run PBS on don't have the instruction set extensions that accelerate SHA calculations, meaning SHA256 is being processed entirely in software. This is bottlenecking the entire backup traffic path, as the SHA256 calculations are used for dedupe.