Resuming broken PBS restore process

kocio

Hello,

I have a big VM snapshot to restore and it sometimes breaks at different points for some reason (probably the disk not keeping up, or temporary network problems).

I'd like to know if it's possible to resume it, to avoid going through this time-consuming and not fully reliable process again.

Does PBS retry reading when data are not ready? Is it possible to define how many times it should try?
 
Wow, I really wish someone would help in cases like this. This is the only Google result for the problem we are experiencing, and not a single person has replied.
 
Wow, I really wish someone would help in cases like this. This is the only Google result for the problem we are experiencing, and not a single person has replied.
We really try to help, but you have to keep in mind that maybe no one has "the answer"... Also, it's extremely hard to guess what's going on without logs or any technical details about the issue.

I'd like to know if it's possible to resume it, to avoid going through this time-consuming and not fully reliable process again.
There's no automatic way to resume a broken restore.

Does PBS retry reading when data are not ready? Is it possible to define how many times it should try?
It depends on what caused the restore to fail, because maybe the error wasn't recoverable at all (e.g. a missing chunk in the PBS datastore). Check the logs both on the PBS server and on PVE and report more details.
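
For example, something along these lines (the UPID and paths below are just placeholders, adjust to your setup):

Code:
# On the PBS server: list recent tasks and dump the log of the failed one
proxmox-backup-manager task list
proxmox-backup-manager task log 'UPID:pbs:...'

# On the PVE node: finished task logs are kept under /var/log/pve/tasks/
grep qmrestore /var/log/pve/tasks/index
less /var/log/pve/tasks/<subdir>/<UPID-of-the-restore-task>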
 
Thanks for the response!

There's no automatic way to resume a broken restore.
I'm fine with something more manual if it's available. Are you aware of anything like that, or of any plans to address this problem in the future? Maybe an existing bug report I could keep an eye on?

I imagine something like "an error occurred during recovery, missing chunk(s) will be checked again at the end of the process/now" or "would you like to try the missing chunks again or delete the image? (y/n)". The first and most obvious obstacle is that the incomplete image is automatically deleted, no matter what.

It depends on what caused the restore to fail, because maybe the error wasn't recoverable at all (e.g. a missing chunk in the PBS datastore). Check the logs both on the PBS server and on PVE and report more details.
Nothing fancy: since it went OK on the next try, there was no hard failure, just a temporary loss of communication. It just took a lot of time.
 
In my case it's usually a network cut, since the PBS server is off-site over a VPN and the VM has a 12TB disk. My particular log was so big it would crash the page trying to load it, because the restore had been running for a week before some Ceph OSDs went down, causing writes to slow, and it never started again. If it happens again, what logs would be useful to help with diagnosis? I ended up nuking pretty much everything Ceph-wise and restarting, and am starting a new live-recovery. Thank you for taking the time to help, it really does mean a lot!
 
There is no "incremental restore", unfortunately. That would help in cases like this, or when doing daily restores (e.g. in a DR site to keep some "warm" VMs).

I'm fine with something more manual if it's available. Are you aware of anything like that, or of any plans to address this problem in the future? Maybe an existing bug report I could keep an eye on?

I imagine something like "an error occurred during recovery, missing chunk(s) will be checked again at the end of the process/now" or "would you like to try the missing chunks again or delete the image? (y/n)". The first and most obvious obstacle is that the incomplete image is automatically deleted, no matter what.
If the disk(s) already restored are still in the destination PVE, you may be able to move them (how exactly depends on which storage backend you use in PVE) and then manually restore the other disk(s) from the backup using the CLI [2]. Cleaning up the half-restored disk when the restore fails is the right thing to do, otherwise you'd be left with a bunch of unusable 0's and 1's wasting storage in your PVE.
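
As a rough sketch of that manual route (VM ID, timestamp, archive name, repository and storage below are just placeholders; take the real archive names from the snapshot's file list, see [2]):

Code:
# Restore a single disk image from the backup snapshot into a raw file
proxmox-backup-client restore vm/100/2022-01-01T00:00:00Z drive-scsi0.img.fidx /tmp/drive-scsi0.raw \
    --repository restoreuser@pbs@pbs.example.org:datastore

# Import the raw image into the (re)created VM; newer PVE versions call this "qm disk import"
qm importdisk 100 /tmp/drive-scsi0.raw local-lvm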

It would be nice, though, to have some way to set retries in PBS to make it more resilient to network glitches... Feel free to file an enhancement request in the bugzilla [1].

before some Ceph OSDs went down, causing writes to slow, and it never started again
So you know what happened and you can take measures to avoid it in the future.

the VM has a 12TB disk. [...] am starting a new live-recovery
IMHO live restore for such a big VM through a VPN is not a good idea: blocks not yet restored to PVE storage have to be read through the VPN from the PBS, restored ASAP in PVE and then read by the VM, which is slow because you are effectively using your remote PBS as the storage backend to run your VM. Also, any changes made to the VM's disks while the live restore is running will be lost if the restore fails.
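
If you restore offline from the CLI instead, it's basically one command (storage names and the timestamp below are placeholders; the restore is offline by default unless you pass --live-restore):

Code:
# Offline restore of the whole VM from a PBS-backed storage
qmrestore my-pbs:backup/vm/100/2022-01-01T00:00:00Z 100 --storage local-lvm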


[1] https://bugzilla.proxmox.com/describecomponents.cgi?product=pbs
[2] https://forum.proxmox.com/threads/restore-single-virtual-disk-from-pbs.95868/post-415847
 
In addition to the advice already provided in this thread: to avoid the VPN bottleneck during restore, as a workaround you might set up a local PBS instance and pull the contents of that VM from the remote datastore to the local instance via a sync job. See https://pbs.proxmox.com/docs/managing-remotes.html#sync-jobs.
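
Roughly, the CLI side of that would look something like this (remote name, auth-id, fingerprint and datastore names are just placeholders):

Code:
# On the local PBS: register the off-site PBS as a remote
proxmox-backup-manager remote create offsite-pbs --host pbs.example.org \
    --auth-id 'sync@pbs' --password 'xxxxx' --fingerprint '<remote-cert-fingerprint>'

# Create a sync job that pulls the remote datastore into a local datastore
proxmox-backup-manager sync-job create pull-offsite --store local-store \
    --remote offsite-pbs --remote-store remote-store --schedule daily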
 
I've done some further looking into this with the advice given here. My setup is as follows:

3 local nodes connected at 10 Gb/s, with a total of 11 Ceph SSD OSDs and 7 Ceph HDD OSDs. This is the only storage in my "datacenter" aside from the mandatory boot disk, which usually has about 400 GB available on each node. I run 16 VMs that do various things: 2 are camera servers that write about 6 MB/s each to disk, and 1 is that massive Plex server with relatively low disk usage. There are also some DNS servers, Hass.io, a torrent tracker cluster, web servers, and a couple of other niche things, all mostly compute and RAM based. I have them all on the same subnet, and they all have at least 2 different links to the network via either 802.3ad LACP, active-backup, or some combo.

The off-site link is a 500 Mb/s pfSense VPN tunnel with 30 ms of latency, with fiber on both sides. The remote server is a single Proxmox node (it used to be part of the local cluster) that runs some Linux VMs, PBS, and another Hass.io instance. It has 4x Seagate Exos 8TB HDDs in a ZFS RAID 10 and a 1TB NVMe boot SSD.

I am able to back up to the PBS servers and saturate the link, but restoring (which I just learned is not ideal with HDDs) runs at about 8-10 MB/s, or about 80 Mb/s. I am able to yoink network interfaces and interrupt the backups and they will usually continue, but my problem stems from the unreliability of the restore. I can wait 3 weeks for it to restore, as once about 100 GB of data is restored the Plex server runs like a champ. I am just struggling with why something like a temporary disk pool slowdown (an OSD dying on the HDD pool and the pool recovering relatively quickly), or something as simple as reconnecting the VPN tunnel (simulating a very brief network interruption), is enough to have the task continue to run and write log entries but refuse to keep pulling from the PBS server, which is still available. I only need to pull from this server once, and that's because I lost too much data to recover my Plex server (yay Ceph). That's why we have backups, etc. Does my question of why restoring is a one-shot operation, instead of being resilient like the backup process, make sense?

Thanks in advance,
Cody
 
Has anyone had a chance to look at this? It just broke again, for the third time, after waiting for a week, seemingly for no reason.
 
Was this last failure during a Live Restore or an offline restore? In the former, any quirk in the chain will break the restore, as explained before:

IMHO live restore for such a big VM through a VPN is not a good idea: blocks not yet restored to PVE storage have to be read through the VPN from the PBS, restored ASAP in PVE and then read by the VM, which is slow because you are effectively using your remote PBS as the storage backend to run your VM. Also, any changes made to the VM's disks while the live restore is running will be lost if the restore fails.
 
Was this last failure during a Live Restore or an offline restore? In the former, any quirk in the chain will break the restore, as explained before:
The latest one was live, but I've had problems with both types. In my latest (currently running) attempt, I physically went and got my backup server from off-site to see if that would help, but aside from a slightly higher speed it seems about the same, at 20 MB/s.
 
I am able to back up to the PBS servers and saturate the link, but restoring (which I just learned is not ideal with HDDs) runs at about 8-10 MB/s, or about 80 Mb/s. I am able to yoink network interfaces and interrupt the backups and they will usually continue, but my problem stems from the unreliability of the restore. I can wait 3 weeks for it to restore, as once about 100 GB of data is restored the Plex server runs like a champ.

This is exactly what I would like to have eventually in PBS - reliability for restore, just like we have reliability for backup, even at the expense of a long wait, because I can always cancel it if I don't want to wait.

In my case adding a 3rd HDD to the ZFS pool did the job (no more problems with restore), so it was not a network glitch but a disk one.
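
For reference, that was just something like this (pool and device names are placeholders; which command applies depends on the pool layout):

Code:
# Add the new disk as an additional vdev (data gets striped across 3 disks)
zpool add tank /dev/sdc
# Or, to grow an existing mirror into a 3-way mirror instead:
# zpool attach tank /dev/sda /dev/sdc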
 
This is exactly what I would like to have eventually in PBS - reliability for restore, just like we have reliability for backup, even at the expense of a long wait, because I can always cancel it if I don't want to wait.

In my case adding a 3rd HDD to the ZFS pool did the job (no more problems with restore), so it was not a network glitch but a disk one.
I ended up doing the figurative act of blowing everything up and putting it back together. I destroyed my entire hard drive pool, gave every new hard drive OSD an SSD DB/WAL device of appropriate size, and physically went and got my backup server so that it was local and didn't have that 30 milliseconds of latency. I had been running into permanent DB corruption on the hard drive OSDs, and I assume it came from limited IOPS causing sync issues when the DB, write-ahead log, and actual drive data were all on a spinner; eventually it led to corruption. This has bitten me more times than I want to admit, so I fixed it. I am now pushing about 55-60 megabytes per second from the backup server instead of 10, and my HDD OSD latencies have gone from hundreds/thousands of milliseconds to tens. It doesn't fix the reliability issue with PBS, but at least my 12 terabytes will copy in less than 4 weeks, which means less chance to fail. For reference, my PBS is a ZFS RAID 10 with 4 drives; it runs 8TB Exos 7200 RPM bad-a$$ datacenter drives, so it *shouldn't* be a performance bottleneck. Glad to hear adding that drive fixed yours though! What numbers are you pushing?
 
About my numbers - it was just a one-time general recovery test in a cloud environment and I don't remember exactly, but it was similar, about 60 megabytes per second from another physical server after expanding the ZFS pool.

With just a 2-disk mirror, the speed was about 40 MB/s (which would still be OK for me if not for the restore process halting abruptly).
 