Performance Expectations

I'm not an expert on PBS disk performance profiling, and I have not seen any official recommendations on what FIO tests to run to model real-world PBS performance, so I can't really comment on that.

I do have some experience with bulk storage and this seems slightly absurd:
Drive layout is raid-5 managed by a hardware controller. It has 23 drives in the configuration.
Considering how PBS divides data into chunks as part of deduplication, this seems like a problematic configuration if you want performance: a wide single-parity array of spinners is poorly suited to the many small, scattered reads a restore generates.
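To make that concrete: each backed-up disk image ends up as many individual chunk files (4 MiB fixed-size chunks for VM images, before compression), named by their SHA-256 digest, and a restore has to read a large number of them in an essentially random order. A quick way to see that on an existing datastore (the path here is only an example, adjust it to yours):

# count and sample the chunk files of a datastore; the .chunks directory
# sits at the datastore root and holds one small file per deduplicated chunk
find /mnt/datastore/.chunks -type f | wc -l
find /mnt/datastore/.chunks -type f | head -n 3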

If you want to test the maximum possible single-VM operation speeds with your current hardware (CPU/memory/network), I would suggest using a pair of fast enterprise NVMe drives in a mirrored configuration, either with mdadm or ZFS, and then testing backup/restore with that as your datastore.
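A minimal sketch of the ZFS variant; the pool name, device names and datastore name are placeholders, adjust them to your system:

# mirrored NVMe pool for a throw-away test datastore
zpool create -o ashift=12 nvtest mirror /dev/nvme0n1 /dev/nvme1n1
zfs create nvtest/pbs-test

# register it as a PBS datastore
proxmox-backup-manager datastore create nvme-test /nvtest/pbs-test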

For bulk storage at that scale, I would also look at running one (or several) ZFS pool(s) with multiple RAIDZ2 vdevs of hard drives, and (this is the important bit) adding special metadata AND SLOG devices. The special metadata devices must be mirrored in a configuration like that, and they must be high-quality/high-endurance enterprise SSDs; the SLOG has less stringent requirements.
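Roughly along these lines; device names, vdev widths and the pool name are only examples, size the vdevs and the special mirror for your own capacity and failure-domain needs (and use /dev/disk/by-id paths in practice):

# pool with two RAIDZ2 HDD vdevs, a mirrored special (metadata) vdev and a SLOG
zpool create -o ashift=12 tank \
    raidz2 sda sdb sdc sdd sde sdf \
    raidz2 sdg sdh sdi sdj sdk sdl \
    special mirror nvme0n1 nvme1n1 \
    log nvme2n1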
 

But this "primer on ZFS" does not explain how he is getting fraction of what is possible on that HW RAID today already.
 
I also pointed out how he could test restore performance with SSDs to rule out the HW RAID. That said, on re-reading, the FIO 2 MB block-size, 64 GB random-read performance he mentions isn't half bad. I do wonder how much extra random access the restore process adds because of the chunk + index layout PBS uses...
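For reference, a test like the one he describes would look roughly like this; the queue depth is taken from the FIO case mentioned further down, while the file path and runtime are assumptions:

# approximate the mentioned test: 2 MiB random reads over a 64 GiB file
fio --name=pbs-randread --filename=/mnt/datastore/fio-test.bin \
    --rw=randread --bs=2M --size=64G \
    --ioengine=libaio --direct=1 --iodepth=128 --runtime=60 --time_based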

I wonder what the CPU load on this host looks like during restores; it would not surprise me if we're hitting some sort of performance ceiling simply because a single restore thread is maxed out.
 
Spinners on slow RAID configs will never get very fast; their benchmark numbers seldom reflect an IO pattern that is achievable by real-world access. And increasing parallelization on spinner-backed disks, which carry a big penalty for random IO, most often makes things worse; we tested this quite a bit. Different VM restore tasks already run in parallel, so for your storage, where random IO gets penalized, it could even be better to reduce parallelism. Or what exactly did you mean here?

Note also that the target storage and the network between PBS and PVE have some say as well, as they can slow down the whole restore. Especially if multiple VMs are restored, the IO queues will never get filled as deep, and certainly not with ordered requests, as in your FIO case: a 128-deep queue can be sorted by the IO scheduler of the controller or the HDD itself and is then not so random anymore, especially when laid out on a file that's just 64 GB big. So it is not the best comparison with the achievable performance of a multi-TB restore.

FWIW, you could try switching the "chunk-order" tuning option to "none". While sorting by inode should make restores a bit more performant on spinning disks, as the data should then be more local, meaning fewer movements of the read/write heads, on some systems, especially proprietary RAID controllers (a black box that's not really predictable) and storage where metadata access is much slower than reading the data itself, the ordering might make things worse.

https://pbs3.work.tlmp.it:8007/docs/storage.html#datastore-tuning-options
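Per those docs, the option can be set on an existing datastore roughly like this (the datastore name is a placeholder):

# disable inode-based chunk ordering for restores/verification on this datastore
proxmox-backup-manager datastore update <datastore-name> --tuning 'chunk-order=none'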

Note that modern enterprise SSDs are not only much more reliable and longer-living than spinners, meaning less redundancy is required for the same statistical safety, they are also available in 30 TB sizes for 3200 to 4000 € (excl. VAT) here, i.e. bigger than current HDDs, and the trend is still pointing upward. Sure, that clocks in at about 8 times the raw €/TB price of spinning HDDs, but they are also hardly comparable in terms of performance and, as mentioned, reliability. Thus, they can save a lot in operating costs, both in less maintenance and waiting time required from admins and, more importantly, in much shorter restore times, which, depending on the business, can often recoup the higher initial investment by a wide margin.

My point is, there's more than just the bucks-per-TB cost to look at when setting up a backup server. And FWIW, combinations are always possible, like a smaller SSD-backed datastore for incoming and latest backups and a bigger HDD- or even tape-backed one for long-term archival.

I mention that because we spent a significant amount of time looking into improving throughput on spinners, and while it's certainly not impossible that there are changes that could bring a dramatic improvement, it's rather likely that, if they exist, they are quite elaborate, which normally means high maintenance cost and regression potential for other things.
That said, we occasionally look into this, especially if we get a case in enterprise support, and if we find anything we will naturally improve on it; but it's unlikely that the storage system you describe will become highly performant anytime soon.

I wonder what the CPU load on this host looks like during restores; it would not surprise me if we're hitting some sort of performance ceiling simply because a single restore thread is maxed out.
That would be rather surprising; the CPU doesn't need to do that much work here, certainly not enough to be a bigger bottleneck than IO served from spinning storage.

But that would be easy to check by looking at top to see if one thread is maxed out, and also at the Pressure Stall Information, e.g.:
head /proc/pressure/*

The latter shows the time some or all (full) processes had to wait because they were starved of IO, CPU or memory; ideally all numbers would be zero. If IO is non-zero during a restore, it's very unlikely that increasing parallelism would help; rather, the access pattern would need changing, and that's far from easy with the PBS chunk-store content-addressable storage design (which brings many benefits, so IMO it is still a good trade-off).
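To check both suspicions while a restore is running, something along these lines should do; these are plain standard tools, nothing PBS-specific:

# show per-thread CPU usage to spot a single maxed-out restore thread
top -H

# sample the IO pressure numbers once per second during the restore
watch -n1 cat /proc/pressure/io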
 
I would like to thank everybody who has responded to my question so far. Unfortunately, the thread has gotten off-track from the original question, which was how fast PBS can go with appropriate hardware.

I did originally specify that all-flash wasn't reasonable; however, after some price comparisons I found that it can be considered a reasonable incremental upgrade rather than the 10x+ price increase that my preferred hardware vendor is advertising.

The reason I needed the information is that I'm working on a proposal to implement PBS in an enterprise environment. I must be able to show that the required recovery point objectives (RPO) and recovery time objectives (RTO) can be met with room for future growth.

I have concluded that PBS will meet our recovery point objectives. This is due to the incremental nature of the backups, which reduces the amount of data to be processed; even our old/slow testing repo was fast enough to meet our needs.

Recovery times are more concerning because a disaster will likely require tape recovery, which consumes much of the recovery window. Average-sized VMs aren't expected to be an issue, but we do have a few oversized VMs which are considered critical. My testing so far has not been able to confirm that recovery performance would be sufficient to meet our recovery time objectives for those oversized VMs.

However, I realized that I could wipe one of my PVE hosts and install PBS on it. It meets the requirements for PBS and has more than enough NVMe storage to test a larger VM.
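The test itself should be as simple as timing a full restore of one of the oversized VMs onto fast local storage, something like this; the VMID, storage name and backup volume ID are placeholders:

# time a full restore of a large VM from PBS onto local NVMe storage
# (get the exact backup volume ID from: pvesm list <pbs-storage>)
time qmrestore <backup-volid> 9999 --storage <fast-local-storage>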

If anybody is interested, I'll be happy to post the results once that testing is complete.
 
