[solved, not reproducible] Tape backup restore behaviour

wbk
Active Member · Oct 27, 2019
Hi all,

TL;DR:
  • I had a snapshot that failed verification
  • Restore from tape did not correct any chunks
  • On re-verification it turned out that the verification had incorrectly flagged a number of chunks as failed
  • Lessons learned: use server-grade hardware, or exercise patience (i.e., don't run multiple I/O-intensive tasks in parallel on a single SAS HDD when combined with an underpowered CPU and non-ECC memory)
~~~~~~~~~~~~~~~~~~~~~~~~~~
Original post:
~~~~~~~~~~~~~~~~~~~~~~~~~~

I am restoring from tape. I think it is the whole media set, but I am not quite sure.

I ran the task from the GUI, by selecting the date and then clicking "Restore" :

[Screenshot: 1695939587668.png]

The restore-button opens a window to first select a media set; the single available snapshot is selected.

There are 5 backups of the container, only the first of which is written to tape.

I had expected the chunks to be read and synchronously written to PBS. Instead, the tape spins through at relatively high speed, but the task monitor only says "File n, restored 0 B, register nnnn chunks":

[Screenshot: 1695940051810.png]

Disk writes are only about 100 kB/s, with hardly any CPU involvement.

What should I expect here?

I only thought of checking the documentation on restoring after the process did not match my expectations. The documentation mentions options for retrieving only partial data from tape, but I think in my case it is the whole media set.

My expectation for this media set with only this single backup of one container:
  • Press restore
  • Select the media set to restore (all contents, which is only one item in this case)
  • See chunks being read from tape
  • See chunks being written to storage
  • Switch tapes when needed
  • ... repeat ...
  • Until the last chunk is read and written
  • Done
Seeing what is happening, and faintly remembering the actual writing to tape, I guess the chunks are written to a temporary location and only afterwards used to overwrite the existing first snapshot of the container in question.

At what data speed is the tape running now? Because the PBS tape driver runs separately from the default Linux tape driver, I am not sure whether I can use tape tools while PBS is operating the tape. The backup itself ran at 30-40 MB/s:

[Screenshot: 1695941347934.png]

I had the option of replacing the drive with an LTO4 model, so I'd expect at least comparable performance. The restore-task window (PBS 2.4) shows the number of chunks and the timestamp, but not the data rate. Comparing the screenshot above (about 110 seconds for 1000 chunks, giving roughly 35 MB/s) with the earlier screenshot, I'd expect at least comparable data rates.
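As a rough sanity check of that 35 MB/s figure, one can work backwards from the chunk count. This assumes PBS's average dynamic chunk size of about 4 MiB (my assumption here; actual chunk sizes vary per backup), so treat the result as an estimate only:

```python
# Rough throughput estimate from the task log: ~1000 chunks per tape
# file at roughly 110 s per file. Assumes an average chunk size of
# ~4 MiB (an assumption; PBS dynamic chunks vary in size).

chunks_per_file = 1000
avg_chunk_bytes = 4 * 1024 * 1024     # ~4 MiB average (assumption)
seconds_per_file = 110

rate = chunks_per_file * avg_chunk_bytes / seconds_per_file
print(f"{rate / 1e6:.0f} MB/s")       # prints "38 MB/s"
```

That lands in the same ballpark as the 30-40 MB/s seen during the backup, which is why a much slower restore stands out.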

Seeing there's a mismatch between my expectation and the actual process: what should I expect to see while restoring?

My main 'worry' now is that I have to sit and wait an indefinite time for the tapes to be fully scanned/indexed, switch tapes a couple of times, and then go through the tapes again for the actual restore.

The actual reason for the restore will get its own thread (looking at the clock, probably tomorrow ;-) ): my four earlier backups of the container, as well as the backup I just made, all fail verification (where those four used to pass). I hope this backup from tape will overwrite the first. If a deduplicated chunk in the first backup got corrupted, will it cause all subsequent backups to fail?
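To illustrate the deduplication question: in a content-addressed store, snapshots only reference chunks by digest, so a single corrupt shared chunk makes every snapshot that references it fail verification. A minimal sketch (illustrative only; real PBS chunks are ~4 MiB and the on-disk format differs):

```python
import hashlib

chunk_store = {}            # digest -> chunk bytes

def store_chunk(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    chunk_store.setdefault(digest, data)   # dedup: stored only once
    return digest

# Two backups of the same container share their chunks.
snapshot_1 = [store_chunk(b"config"), store_chunk(b"rootfs-block-0")]
snapshot_2 = [store_chunk(b"config"), store_chunk(b"rootfs-block-0")]

# Corrupt the shared chunk on disk ...
shared = snapshot_1[1]
chunk_store[shared] = b"bit-rotted data"

def verify(snapshot) -> bool:
    return all(hashlib.sha256(chunk_store[d]).hexdigest() == d
               for d in snapshot)

# ... and every snapshot referencing it now fails verification.
print(verify(snapshot_1), verify(snapshot_2))   # prints "False False"
```

So yes, in this model one bad deduplicated chunk is enough to fail every snapshot that references it.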
 
did you restore to a datastore where the backups still exist? if the chunks already exist on the datastore, there is no need to read them from tape or write them to disk.
i agree we could be more verbose here in the logs if that happens
 
Hi dcsapak,

Thanks again for helping me out!

did you restore to a datastore where the backups still exist? if the chunks already exist on the datastore, there is no need to read them from tape or write them to disk.

Yes, I did. The existing snapshot does not pass verification anymore; I hoped that by restoring from tape I would overwrite the broken chunk. ("all four snapshots fail verification", my other thread.)

Do I understand correctly that tape restore will only have any effect when writing to a datastore that does not contain the backup (or snapshot included) I have on that tape / media set?

Unfortunately, the restore finished with an error:

[Screenshot: 1695978188540.png]

The logged in user (root@pam) does not match the owner of the snapshot (verjaardag@pbs). Should I have checked this beforehand? Could PBS check this for me when initiating the restore, or are these details not available at that moment?

I am not sure whether this error occurred only when writing some metadata, or whether it tried to write other data after registering all chunks. The backup spans 3 tapes. Below is the last bit of the last tape, where you can see the number of registered chunks 'tapering off':

[Screenshot: 1696009241967.png]

I'd say the tapering off implies PBS restored whatever it intended to restore from this media set. What is it supposed to write in this case?

Sorry for opening up another can of worms, thank you for your patience!
 
Do I understand correctly that tape restore will only have any effect when writing to a datastore that does not contain the backup (or snapshot included) I have on that tape / media set?
it should restore chunks that are missing/broken on the target datastore, regardless if the snapshot still exists or not

The logged in user (root@pam) does not match the owner of the snapshot (verjaardag@pbs). Should I have checked this beforehand? Could PBS check this for me when initiating the restore, or are these details not available at that moment?
because tape restore can take a long time, checking this at the beginning still has the possibility that during restore such an error happens, and so we skip the initial check for that

I am not sure whether this error occurred only when writing some metadata, or that it tried to write other data after registering all chunks. The backup consists of 3 tapes. Below is the last bit of the last tape, where you see the number of registered chunks 'tapering off':
it happened during the restore of the snapshot indices (which are separate from the chunks) so the restore stopped at that point in time
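The skip-if-intact behaviour described above ("restore chunks that are missing/broken on the target datastore") can be sketched like this. Helper names and the simplified `.chunks/<prefix>/` layout are illustrative, not the actual PBS implementation:

```python
import hashlib
import os

def chunk_path(store: str, digest: str) -> str:
    # PBS-style layout: <datastore>/.chunks/<first 4 hex digits>/<digest>
    return os.path.join(store, ".chunks", digest[:4], digest)

def restore_chunk(store: str, digest: str, data: bytes) -> bool:
    """Write a chunk read from tape only if it is missing or corrupt
    on the target datastore. Returns True if it was actually written."""
    path = chunk_path(store, digest)
    if os.path.exists(path):
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() == digest:
                return False        # already present and intact: skip
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:     # missing or corrupt: (re)write
        f.write(data)
    return True
```

With every chunk already intact on disk, a full pass over the tape writes nothing, which would match the repeated "restored 0 B" lines in the task log.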
 
Hi Dominic,

Sorry for not getting back any sooner. Backups are important, but sometimes life pays no attention to that and throws urgent matters in the way ;-)

because tape restore can take a long time, checking this at the beginning still has the possibility that during restore such an error happens, and so we skip the initial check for that
I had to read that twice. Do I understand correctly that you mean this case: upon starting the restore, a check is (now not) made, which shows that the restore user matches the backup owner. A day later, the first restore cycle has passed, but in the meantime the owner of the backup has changed, creating a mismatch (no matter what the now-skipped initial check found).

I intended to run the restore once more, now through the `correct` login. I realize that I use one login for the GUI, while the client systems use their own logins to create the backups.

Which path is preferable: changing ownership of the backup (possibly preventing a client PVE system from building on that snapshot?), or using a client login for PBS?

I just noticed the `Owner` field in the `Target` tab of the `Restore Media-Set` popup. It has `Current Auth ID` as the default selection, and allows me to switch to a PVE system client ID when logged in to PBS as a user with sufficient rights.

I will initiate a tape restore with the Auth ID of the owner of the backup, i.e., the PBS user that is reserved for that PVE system.

While I'm listening to the steady acceleration and deceleration of the tape drive: what is it supposed to do? I was unable to deduce the steps from the documentation at https://pbs.proxmox.com/docs-2/tape-backup.html#restore-from-tape. Will it go through each of the tapes twice,
  • once to build an index (like it did last week),
  • then match the index with the snapshot to be restored (which hopefully succeeds, now that I chose the correct owner and I am not changing it on the datastore side),
  • and finally request the tapes that are needed to create / fill in the snapshot in the requested datastore?

I am not trying to restore only a single snapshot or specific namespace, as I only have a tape backup of the first (corrupt) snapshot of my backup. So I can restore the whole media set.

The process starts like this:
Code:
2023-10-09T12:51:34+02:00: Mediaset '8e8a53fb-7b10-4ca1-a224-b57b451d59b7'
2023-10-09T12:51:34+02:00: Pool: osba
2023-10-09T12:51:34+02:00: Datastore(s): bak_sdc
2023-10-09T12:51:34+02:00: Drive: hp_lto4_ext
2023-10-09T12:51:34+02:00: Required media list: osbas0;osbas2;osbas3
2023-10-09T12:51:34+02:00: Checking for media 'osbas0' in drive 'hp_lto4_ext'
2023-10-09T12:52:30+02:00: found media label osbas0 (7a83d40e-75f9-4db9-8b8a-afa5061b7b7c)
2023-10-09T12:52:30+02:00: File 2: chunk archive for datastore 'bak_sdc'
2023-10-09T12:53:46+02:00: restored 0 B (0 B/s)
2023-10-09T12:53:46+02:00: register 1981 chunks
2023-10-09T12:53:46+02:00: File 3: chunk archive for datastore 'bak_sdc'
2023-10-09T12:55:02+02:00: restored 0 B (0 B/s)

It has read 13 files at about 1 file per minute, each of them 1000-2000 chunks, all of them restoring 0 B. Since I restore to an existing snapshot that failed verification, should it recognize and correct a corrupt chunk while performing this task?

There are three tapes with fewer than 100 files per tape, so the first round of reading the set should take some 5 hours, but chances are I can only look at the results tomorrow night. I'll post back 'ASAP'!
 
ok, so there are generally 2 types of archives ("files") on the tape:

chunk archives and snapshot archives (both contain what you would assume, chunks and snapshots respectively)
where a chunk archive bundles chunks together into larger blocks and snapshot archives contain a single snapshot

during backup, for each snapshot, the chunks will be written first and then the snapshot, so when restoring
we can restore the chunk archives and when we encounter a snapshot archive we can also simply restore that, because all necessary chunks were restored before that
we hold a chunk index on disk on the pbs, so we know which chunks are where, and thus the bulk of the chunks are at the beginning of the tape
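This ordering means a single front-to-back pass over the tape is enough: by the time a snapshot archive goes past the head, every chunk it references has already been seen. A toy sketch of that invariant (simplified stand-in records, not the real archive format):

```python
# Simplified tape layout: chunk archives always precede the snapshot
# archive that references them, so one sequential pass suffices.
tape = [
    ("chunk_archive",    ["c1", "c2", "c3"]),
    ("chunk_archive",    ["c4"]),
    ("snapshot_archive", {"name": "ct/104/2023-01-27",
                          "refs": ["c1", "c2", "c3", "c4"]}),
]

restored_chunks = set()

for kind, payload in tape:               # read strictly front to back
    if kind == "chunk_archive":
        restored_chunks.update(payload)
    else:
        # every referenced chunk must already have passed the head
        missing = [c for c in payload["refs"] if c not in restored_chunks]
        assert not missing, f"dangling refs: {missing}"
        print(f"snapshot {payload['name']} restored")
```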

While I'm listening to the steady acceleration and deceleration of the tape drive
this is because it reads the chunks but has to verify whether they already exist on disk, so how fast it can restore depends on your storage speed

does that make it clearer?
 
Hi Dominic,

Thank you for your explanation and for replying to my question so quickly after I didn't have an update for a week!

we can restore the chunk archives and when we encounter a snapshot archive we can also simply restore that
Do I understand correctly that a snapshot archive is a bundling of (a subset of) chunks from the chunk archive? In that case it is in line with my expectation / understanding from the docs.

we hold a chunk index on disk on the pbs
I expect the same chunks to be part of regular (disk) backups, that is, as long as they remain in datastore on disk. Is that correct?

(acceleration) because it reads the chunks but has to verify if they exist on disk
It sounds more as if it reads a single loop of tape before slowing down, turning around, and reading the next loop. It's an LTO4 drive, so I think the demands on storage are not too heavy. The tapes for this dataset are even LTO3.

The restore finished while having dinner. This is the tail of the task output,


Code:
2023-10-09T18:48:51+02:00: File 75: chunk archive for datastore 'bak_sdc'
2023-10-09T18:49:59+02:00: restored 0 B (0 B/s)
2023-10-09T18:49:59+02:00: register 2903 chunks
2023-10-09T18:49:59+02:00: File 76: chunk archive for datastore 'bak_sdc'
2023-10-09T18:50:04+02:00: restored 0 B (0 B/s)
2023-10-09T18:50:04+02:00: register 82 chunks
2023-10-09T18:50:04+02:00: File 77: snapshot archive bak_sdc:ct/104/2023-01-27T21:41:19Z
2023-10-09T18:50:04+02:00: File 78: skip catalog '908bf295-ebc8-4d46-b7c8-58cde215dd93'
2023-10-09T18:50:19+02:00: detected EOT after 79 files
2023-10-09T18:50:19+02:00: Restore mediaset '8e8a53fb-7b10-4ca1-a224-b57b451d59b7' done
2023-10-09T18:50:19+02:00: TASK OK

Compared to the previous run (screenshot earlier in the thread): that time it ended with an error after file 77; this time it got to file 78 of tape 3, where it skipped the catalog. The media set seems to be restored.

Apart from parsing the task output for details, is there an overview of changes made to the backup on disk by the restore from tape?
Using ctrl-f on the downloaded task log, I got:
  • 243 occurrences of "chunk archive for datastore 'bak_sdc'"
  • 243 occurrences of "chunks"
  • 243 occurrences of "restored 0 B"

Apart from the happy ending, the results are frighteningly similar to the previous restore run for this backup.

I am running a new verify task on the on-disk backup (which failed verification last week). Judging from last week's log, it will run for around 12 hours. Am I wrong to suspect that the tape restore will not have had any impact on the corrupt chunks in the datastore, since only 0 B were restored?

Any chance the "digest store" itself got corrupted, with the chunks as such still being fine (and thus no reason to restore any chunk from tape)? The backup has passed verification previously, after the tape backup was made (so either that verification passed when it should not have, or the chunks on tape have had the same degradation as the chunks on disk).

By the way, the verification log from last week mentions the 8 corrupted chunks that were renamed to hex.0.bad; can I cherry-pick those from tape?

I am looking forward to your interpretation. Please let me know whether I overlooked something that could cause this behaviour or when I can supply information that shines a light on the situation :)
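For reference, renamed chunks like those can be listed directly on the datastore. A hypothetical helper (the datastore path is a placeholder, and while the `.chunks/<prefix>/` layout matches what PBS uses on disk, verify it against your own setup):

```python
import glob
import os

def list_bad_chunks(datastore: str):
    """List chunk files a verify task renamed to <digest>.<n>.bad
    inside a PBS-style datastore's .chunks directory."""
    pattern = os.path.join(datastore, ".chunks", "*", "*.bad")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))

if __name__ == "__main__":
    # Placeholder path -- substitute your own datastore mount point.
    for name in list_bad_chunks("/mnt/datastore/bak_sdc"):
        print(name)
```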
 
Do I understand correctly that a snapshot archive is a bundling of (a subset of) chunks from the chunk archive? In that case it is in line with my expectation / understanding from the docs.
a snapshot is a collection of indices, which are files referencing the chunks by their hash (iow. the id), so it's only metadata
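In other words, an index is just an ordered list of chunk digests, and restoring the data means concatenating the referenced chunks from the shared store. A toy illustration (not the real .fidx/.didx on-disk format):

```python
import hashlib

chunk_store = {}   # shared content-addressed store: digest -> bytes

def put(data: bytes) -> str:
    d = hashlib.sha256(data).hexdigest()
    chunk_store[d] = data
    return d

# A 12-byte "image" made of 3 chunk references; the repeated chunk
# is stored only once, the index is pure metadata.
index = [put(b"AAAA"), put(b"BBBB"), put(b"AAAA")]

# Restoring the image = concatenating the referenced chunks in order.
image = b"".join(chunk_store[d] for d in index)
print(image)       # prints b'AAAABBBBAAAA'
```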

I expect the same chunks to be part of regular (disk) backups, that is, as long as they remain in datastore on disk. Is that correct?
yes the chunks on disk are the same as on tape, we simply restructure the overall layout a bit (because of tape limitations, i.e. no fast random access)

It sounds more as if it reads a single loop of tape before slowing down, turning around, and reading the next loop. It's an LTO4 drive, so I think the demands on storage are not too heavy. The tapes for this dataset are even LTO3.
on a normal (whole media-set) restore, the tapes will only be read once, but as i said how fast that can be done depends on the disk speed (maybe also on cpu + memory)

Apart from parsing the task output for details, is there an overview of changes made to the backup on disk by the restore from tape?
not yet, but this sounds like a worthwhile improvement to have at the end of the task log (and probably not too hard to do), would you mind opening an enhancement request for that: https://bugzilla.proxmox.com

Apart from the happy ending, the results are frighteningly similar to the previous restore run for this backup.

I am running a new verify task on the on-disk backup (that last week failed verification). Looking back at the log from last week, it will run for around 12 hours. Am I wrong to suspect that the tape restore will not have had any impact on the corrupt chunks in the data store, since only 0 B have been restored?

Any chance the "digest store" itself got corrupted, with the chunks as such still being fine (and thus no reason to restore any chunk from tape)? The backup has passed verification previously, after the tape backup was made (so either that verification passed when it should not have, or the chunks on tape have had the same degradation as the chunks on disk).
normally restoring the chunks should repair them (if they were on the tape backup in the first place). Bad chunks should not be backed up since they will be read + verified during that. Sans any bugs, this basically should work as expected, if you found that it does not, more details would be nice so that we can try to reproduce that and fix the bugs

By the way, the verification log from last week mentions which 8 corrupted chunks have been renamed to hex.0.bad, can I cherry pick those from tape?
no, not directly, but restoring the affected snapshots should be equivalent to that
 
Sorry for the long silence. Apart from the run time of the tasks, life happened (and besides, I was really looking forward to the new PVE and PBS versions, gave upgrading priority over politeness, and postponed posting until finishing the upgrades, which went easily and successfully, wohoo).

"All is well".

  • the second restore run from tape did not restore any chunks either
  • sequentially verifying my snapshots one by one did not find any irregularities in the original set (of 4 snapshots), which explains why the restore from tape did not correct any chunks
  • (unrelated to the tape story) the latest snapshot, which was created while the initial verification was running, did have corrupt chunks on re-verification

My next actions are running a new backup, and writing the snapshot to tape after (successful) verification.

I'll mark this post as 'solved, not reproducible'.

Thank you for your patient explanations!
 
in that case i'd probably check the disks/memory/cpu etc. because something seems to corrupt your chunks (and that is not "normal" behaviour)
 
