Resuming broken PBS restore process

kocio · Member · May 4, 2021 · Warsaw
Hello,

I have a big VM snapshot to restore, and it sometimes breaks for some reason at different points (probably the disk can't keep up, or there are temporary network problems).

I'd like to know if it's possible to resume it, to avoid going through this time-consuming and not fully reliable process again.

Does PBS retry reading when data is not ready? Is it possible to define how many times it should try?
 
wow I really wish someone would help in cases like this. This is the only result for the problem we are experiencing on Google and not a single person has replied
 
wow I really wish someone would help in cases like this. This is the only result for the problem we are experiencing on Google and not a single person has replied
We really try to help, but keep in mind that maybe no one has "the answer". Also, it's extremely hard to guess what's going on without logs or any technical details about the issue.

I'd like to know if it's possible to resume it, to avoid going through this time-consuming and not fully reliable process again.
There's no automatic way to resume a broken restore.

Does PBS retry reading when data is not ready? Is it possible to define how many times it should try?
It depends on what caused the restore to fail, because the error may not have been recoverable at all (e.g. a missing chunk in the PBS datastore). Check the logs both on the PBS server and on PVE and report more details.
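For reference, these are the usual places to look on each side (unit names are the standard ones, but exact paths may vary slightly between versions):

```shell
# On the PBS server: logs from the backup API daemon and proxy
journalctl -u proxmox-backup.service --since "1 day ago"
journalctl -u proxmox-backup-proxy.service --since "1 day ago"

# On the PVE node: daemon logs and per-task logs
journalctl -u pvedaemon.service --since "1 day ago"
ls /var/log/pve/tasks/    # task logs, indexed by UPID
```

The per-task log under /var/log/pve/tasks/ is usually the same text shown in the web UI's task viewer, so it's a way to get at logs too big for the page to load.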
 
Thanks for the response!

There's no automatic way to resume a broken restore.
I'm fine with something more manual if it's available. Are you aware of anything like that, or of any plans to address this problem in the future? Maybe an existing bug report to keep an eye on?

I imagine something like "an error occurred during recovery, missing chunk(s) will be checked again at the end of the process/now" or "would you like to try the missing chunks again or delete the image? (y/n)". The first and most obvious obstacle is that the incomplete image is automatically deleted, no matter what.

It depends on what caused the restore to fail, because the error may not have been recoverable at all (e.g. a missing chunk in the PBS datastore). Check the logs both on the PBS server and on PVE and report more details.
Nothing fancy - since it went OK on the next try, there was no permanent problem, just a temporary loss of communication. It just took a lot of time.
 
In my case it's usually a network cut, since the PBS server is off-site over a VPN and the VM has a 12 TB disk. My particular log was so big it would crash the page trying to load it, because the restore had been running for a week before some Ceph OSDs went down, slowing writes, and it never resumed. If it happens again, what logs would be useful for diagnosis? I ended up nuking pretty much everything Ceph-wise, restarting, and am now starting a new live restore. Thank you for taking the time to help, it really does mean a lot!
 
There is no "incremental restore", unfortunately. That would help in cases like this, or when doing daily restores (e.g. to a DR site to keep some "warm" VMs).

I'm fine with something more manual if it's available. Are you aware of anything like that, or of any plans to address this problem in the future? Maybe an existing bug report to keep an eye on?

I imagine something like "an error occurred during recovery, missing chunk(s) will be checked again at the end of the process/now" or "would you like to try the missing chunks again or delete the image? (y/n)". The first and most obvious obstacle is that the incomplete image is automatically deleted, no matter what.
If the disk(s) already restored are still in the destination PVE, you may be able to move them (how exactly depends on which storage backend you use in PVE) and then manually restore the other disk(s) from the backup using the CLI [2]. Cleaning up the half-restored disk when the restore fails is the right thing to do; otherwise you'd be left with a bunch of unusable 0s and 1s wasting storage in your PVE.
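As a rough sketch, a single disk can be pulled out of a PBS snapshot with proxmox-backup-client and then attached to the VM (the repository, snapshot, archive, VMID, and storage names below are placeholders for illustration):

```shell
# List snapshots in the datastore to find the right one
proxmox-backup-client snapshot list \
    --repository 'user@pbs@pbs.example.com:datastore1'

# Restore just one disk archive from the snapshot to a local raw image
proxmox-backup-client restore \
    'vm/100/2021-05-04T10:00:00Z' drive-scsi1.img.fidx /tmp/drive-scsi1.raw \
    --repository 'user@pbs@pbs.example.com:datastore1'

# Import the raw image as a new disk of VM 100 on a PVE storage
qm importdisk 100 /tmp/drive-scsi1.raw local-lvm
```

If the target file exists locally from an earlier attempt, the restore still rewrites it in full; this only avoids re-restoring the disks that already completed.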

It would be nice, though, to have some way to set retries in PBS to make it more resilient to network glitches... Feel free to file an enhancement request in the Bugzilla [1].

before some Ceph OSDs went down, slowing writes, and it never resumed
So you know what happened and you can take measures to avoid it in the future.

the VM has a 12 TB disk. [...] am now starting a new live restore
IMHO, live restore for such a big VM over a VPN is not a good idea: blocks not yet restored to PVE storage have to be read over the VPN from the PBS, written to PVE as fast as possible, and then read by the VM. That's slow, because you're effectively using your remote PBS as the storage backend for your running VM. Also, any changes made to the VM's disks while the live restore is in progress will be lost if the restore fails.


[1] https://bugzilla.proxmox.com/describecomponents.cgi?product=pbs
[2] https://forum.proxmox.com/threads/restore-single-virtual-disk-from-pbs.95868/post-415847
 
In addition to the advice already provided in this thread: to avoid the VPN bottleneck during restore, as a workaround you might set up a local PBS instance and pull the contents of that VM from the remote datastore to the local instance via a sync job. See https://pbs.proxmox.com/docs/managing-remotes.html#sync-jobs.
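A minimal sketch of that setup on the local PBS instance (remote name, host, credentials, datastore names, and schedule are all placeholders; --group-filter needs a reasonably recent PBS version):

```shell
# Register the off-site PBS as a remote
proxmox-backup-manager remote create offsite-pbs \
    --host pbs.remote.example.com \
    --auth-id 'sync@pbs' \
    --password 'secret' \
    --fingerprint '<remote certificate fingerprint>'

# Create a sync job that pulls the remote datastore into a local one;
# the group filter limits the transfer to a single VM's backup group
proxmox-backup-manager sync-job create pull-vm100 \
    --store local-store \
    --remote offsite-pbs \
    --remote-store main \
    --schedule 'hourly' \
    --group-filter 'group:vm/100'
```

Sync jobs transfer chunks incrementally and can be re-run after an interruption, so the flaky VPN link only has to survive the sync; the actual restore then runs against the local datastore at LAN speed.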
 
I've done some further looking with the advice given here. My setup goes as follows:

3 local nodes connected at 10 Gb/s, with a total of 11 Ceph SSD OSDs and 7 Ceph HDD OSDs. This is the only storage in my "datacenter" aside from the mandatory boot disk, which usually has about 400 GB available on each node. I run 16 VMs that do various things: 2 are camera servers that write about 6 MB/s each to disk, and 1 is that massive Plex server with relatively low disk usage. There are also some DNS servers, Hass.io, a torrent tracker cluster, web servers, and a couple of other niche things, all mostly compute- and RAM-bound. They are all on the same subnet, and each has at least 2 different links to the network via 802.3ad LACP, active-backup, or some combination.

The off-site link is a 500 Mb/s pfSense VPN tunnel with 30 ms of latency, with fiber on both sides. The remote server is a single Proxmox node (it used to be part of the local cluster) that runs some Linux VMs, PBS, and another Hass.io instance. It has 4x Seagate Exos 8 TB HDDs in a ZFS RAID 10 and a 1 TB NVMe boot SSD.

I am able to back up to the PBS server and saturate the link, but restoring (which I just learned is not ideal with HDDs) runs at about 8-10 MB/s, or about 80 Mb/s. I can yank network interfaces and interrupt backups and they will usually continue, but my problem stems from the unreliability of the restore. I can wait 3 weeks for it to finish, since once about 100 GB of data is restored the Plex server runs like a champ. What I'm struggling with is why something like a temporary disk-pool slowdown (an OSD dying on the HDD pool, with the pool recovering relatively quickly), or something as simple as reconnecting the VPN tunnel (simulating a very brief network interruption), is enough to make the task keep running and writing logs but refuse to keep pulling from the still-available PBS server. I only need to pull from this server once, because I lost too much data to recover my Plex server any other way (yay Ceph). That's why we have backups, etc. Does my question make sense: why is restoring a one-shot operation, unlike the backup process, which is very resilient?

Thanks in advance,
Cody
 