Abysmally slow restore from backup

That's exactly why we went with flash, so we could do live restores and still have a functioning VM. We were running them off a 10GbE bonded connection to a RAIDz2 pool of about 16 hard drives - the live restore would take so long to boot that it was functionally useless. Flash? No problems.

So while we are unlikely to actually restore in the traditional sense, it's still concerning that restores take 2x longer than Veeam on the exact same hardware, and with none of the actual hardware bottlenecks the Proxmox team seems to imply are the issue. Clearly there are some software optimizations that need to be done.
 
Oh, to add: after updating the Proxmox plugin for Veeam, the backup speeds are basically on par for fulls. They are both maxing out the SATA SSD. I will see if I can find an NVMe to throw into the server to get a more accurate backup result.
 
The question is what you are trading away with Veeam in regards to performance and quality of the backups. I have a hard time keeping my PBS busy, but it can back up and restore 21 PVE servers; during the CrowdStrike problem, we restored dozens of Windows machines simultaneously using live-restore. It can also do very fast incremental backups and individual file restores, because it doesn't need to access an entire disk image, and it has significant de-duplication, which is important both on spinning disk and flash. Spinning disk has become more expensive in the data center because flash is faster and requires less space and energy for the same IOPS.

To me 23 minutes to restore is slow. PBS live restore is measured in seconds, and most of the time the transfer to Ceph completes within 10-15 minutes.
Yes, valid points. PBS has very strong points, I agree.
Incremental backups are very fast (well, unless you reboot a VM or a host - then it loses the dirty bitmaps and has to re-read every drive; Veeam's CBT - Changed Block Tracking - is better in that regard).
Live restore is a great feature, but I'm not sure how well it would work if the PBS datastore is not all flash. I wouldn't dare use it on my (HDD-backed ZFS datastore) setups.
Deduplication with PBS is... perfect, really. I don't think it gets any better than what they have done.
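The dirty-bitmap behavior described above can be sketched as a toy model (this is illustrative only, not actual QEMU/PBS code; the chunk size and class names are made up):

```python
# Toy model of dirty-bitmap incremental backup (not real QEMU/PBS internals).
# Each flag marks whether a chunk of the disk was written since the last backup.

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB chunks (illustrative value)

class DirtyBitmap:
    def __init__(self, disk_size):
        self.nchunks = (disk_size + CHUNK_SIZE - 1) // CHUNK_SIZE
        # No bitmap yet (e.g. after a reboot): everything counts as dirty,
        # so the next backup must re-read the whole disk.
        self.dirty = [True] * self.nchunks

    def record_write(self, offset, length):
        # The hypervisor flags every chunk touched by a guest write.
        first = offset // CHUNK_SIZE
        last = (offset + length - 1) // CHUNK_SIZE
        for i in range(first, last + 1):
            self.dirty[i] = True

    def chunks_to_back_up(self):
        return [i for i, d in enumerate(self.dirty) if d]

    def clear(self):
        # Called after a successful backup. A VM/host reboot discards this
        # in-memory state entirely - hence the full re-read afterwards.
        self.dirty = [False] * self.nchunks

bm = DirtyBitmap(disk_size=64 * 1024 * 1024)  # 64 MiB toy disk -> 16 chunks
bm.clear()                                    # pretend a full backup just ran
bm.record_write(offset=5 * 1024 * 1024, length=1024)  # guest writes to chunk 1
print(bm.chunks_to_back_up())  # -> [1]
```

The point of the trade-off: the bitmap lives in memory, so a reboot degenerates the next "incremental" into a full-disk read, whereas a CBT-style mechanism persists the change tracking across reboots.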

Parametrizing your infrastructure is also important when you need to reason about performance. "It's slow" is not a problem statement when you can't tell me where your bottlenecks are.
On the contrary, I spent quite some time proving/explaining what I see: the bottlenecks are actually coming from PBS. Please read through my previous posts in this thread. Restore and verify speed is very slow in my case - and hardware is not the bottleneck. I even bought two servers with much newer gen CPUs, only to get about the same results as with the old machines... And 10x slower restores with PBS than with Veeam, on the same hardware (actually Veeam was running in a VM, compared to PBS running on the host).

update: benchmarking my setup - backup is 6-7 Gbps on the wire (this is between data centers ~100 km apart with 4x10 Gbps in a bond), while writing to storage fluctuates between a few Mbps and ~2 Gbps. I think that mostly has to do with the fact that it is compressing and deduplicating (it uses 3-6 cores for this). I guess Veeam compresses and deduplicates on the client side, which would have a significant performance impact on the hypervisor side.
The same happens in reverse: it starts pretty high (I am booting live, which is barely noticeable, as if it were local), then the "write" to storage slows down to about 1-2 Gbps - but a lot of the disk is 'empty blocks' being thin-provisioned, and I get the expected wire speed for a single connection.
Those are nice numbers!
Veeam has several modes of operation. Unless you have a supported SAN (in which case the host doesn't spend any resources), the next best way is to use a proxy VM. In that case host resources are being used.
I believe PBS also uses host resources, as compression and hashing are done on the client side. Deduplication, as PBS has implemented it, is a simple task: it just compares the hashes of chunks/blobs and skips saving (and/or sending) data the server already has. So processing is mostly reading the snapshot plus hashing. Sending through HTTP/2 via TLS is also something that uses CPU (that could be skipped if there were an option to turn it off in local/securely configured network environments, but the option doesn't exist in PBS).
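The hash-compare-and-skip behavior described above can be sketched like this (a deliberately simplified model of content-addressed chunk dedup, not the actual PBS implementation; PBS does use SHA-256 digests over chunks, but the storage layout here is invented):

```python
import hashlib

# Simplified content-addressed chunk store (illustrative, not PBS code).
class ChunkStore:
    def __init__(self):
        self.chunks = {}   # digest -> chunk data actually kept on disk
        self.uploaded = 0  # chunks stored/sent
        self.skipped = 0   # chunks deduplicated away

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.chunks:
            self.skipped += 1          # already known: store/send nothing
        else:
            self.chunks[digest] = data
            self.uploaded += 1
        return digest  # the backup index only needs to record digests

store = ChunkStore()
image = [b"A" * 4096, b"B" * 4096, b"A" * 4096]  # third chunk repeats the first
index = [store.put(chunk) for chunk in image]
print(store.uploaded, store.skipped)  # -> 2 1
```

This is why the client-side cost is dominated by reading plus hashing: deciding whether to transmit a chunk is just a dictionary lookup on its digest.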

But all in all: backups are fast, no problem there. It's the restores (and verify operations) that could get some attention.
My biggest problem is restore times - the RTO for multi-TB VMs, to be precise. And verify speeds. It just takes way too long on hardware that can deliver much more. Everything else is solid. I tried with NVMe, but got no better restore speeds than with HDDs... read my previous posts, it's all there.
 
Alrighty, found a faster SSD. Still not the fastest, but PBS is maxing out the read on it, which is good.

Test results:
(screenshot of benchmark results)

And if you compare them to the last results from this post:

The speed of the PBS server or the host doesn't seem to matter; restores just max out at around 160 MB/s.

And after updating Veeam, it's getting close to closing the gap between backup and restore time. The fact that it can restore almost 65% faster than PBS is kinda silly too.