[solved, not reproducible] Tape backup restore behaviour

wbk
Active Member · Oct 27, 2019
Hi all,

TL;DR:
  • I had a snapshot that failed verification
  • Restore from tape did not correct any chunks
  • On re-verification it turned out that the verification had incorrectly flagged a number of chunks as failed
  • Lessons learned: use server-grade hardware, or exercise patience (i.e., don't run multiple I/O-intensive tasks in parallel on a single SAS HDD when combined with an underpowered CPU and non-ECC memory)
~~~~~~~~~~~~~~~~~~~~~~~~~~
Original post:
~~~~~~~~~~~~~~~~~~~~~~~~~~

I am restoring from tape. I think it is the whole media set, but I am not quite sure.

I ran the task from the GUI, by selecting the date and then clicking "Restore" :

[Screenshot: 1695939587668.png]

The restore-button opens a window to first select a media set; the single available snapshot is selected.

There are 5 backups of the container, only the first of which is written to tape.

I had expected the chunks to be read and synchronously written to PBS. Instead, the tape spins through at relatively high speed, but the task monitor only says "File n, restored 0 B, register nnnn chunks":

[Screenshot: 1695940051810.png]

Disk writes are only about 100 kB/s, with hardly any CPU involvement.

What should I expect here?

I only thought of checking the documentation on restoring after the process did not match my expectations. The documentation mentions options for retrieving only partial data from tape, but I think in my case it is the whole media set.

My expectation for this media set with only this single backup of one container:
  • Press restore
  • Select the media set to restore (all contents, which is only one item in this case)
  • See chunks being read from tape
  • See chunks being written to storage
  • Switch tapes when needed
  • ... repeat ...
  • Until the last chunk is read and written
  • Done
Seeing what is happening, and faintly remembering the actual writing to tape, I guess the chunks are written to a temporary location and only afterwards used to overwrite the existing first snapshot of the container in question.

At what data speed is the tape running now? Because the PBS tape driver runs separately from the default Linux tape driver, I am not sure whether I can use tape tools while PBS is operating the tape. The backup itself ran at 30-40 MB/s:

[Screenshot: 1695941347934.png]

I had the option of replacing the drive with an LTO4 model, so I'd expect at least comparable performance. The restore-task window (PBS 2.4) shows the number of chunks and the timestamp, but not the data rate. Comparing the screenshot above (about 110 seconds for 1000 chunks, giving roughly 35 MB/s) with the earlier screenshot, I'd expect at least comparable data rates.
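As a rough sanity check of that 35 MB/s figure, one can work backwards from the chunk count. This assumes PBS's average dynamic chunk size of about 4 MiB (my assumption here; actual chunk sizes vary per backup), so treat the result as an estimate only:

```python
# Rough throughput estimate from the task log: ~1000 chunks per tape
# file at roughly 110 s per file. Assumes an average chunk size of
# ~4 MiB (an assumption; PBS dynamic chunks vary in size).

chunks_per_file = 1000
avg_chunk_bytes = 4 * 1024 * 1024     # ~4 MiB average (assumption)
seconds_per_file = 110

rate = chunks_per_file * avg_chunk_bytes / seconds_per_file
print(f"{rate / 1e6:.0f} MB/s")       # prints "38 MB/s"
```

That lands in the same ballpark as the 30-40 MB/s seen during the backup, which is why a much slower restore stands out.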

Seeing there's a mismatch between my expectation and the actual process: what should I expect to see while restoring?

My main 'worry' now is that I have to sit and wait an indefinite time for the tapes to be fully scanned/indexed, switch tapes a couple of times, and then go through the tapes again for the actual restore.

The actual reason for the restore will get its own thread (looking at the clock, probably tomorrow ;-) ): my four earlier backups of the container, as well as the backup I just made, all fail verification (where those four used to pass). I hope this backup from tape will overwrite the first. If a deduplicated chunk in the first backup got corrupted, will it cause all subsequent backups to fail?
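To illustrate the deduplication question: in a content-addressed store, snapshots only reference chunks by digest, so a single corrupt shared chunk makes every snapshot that references it fail verification. A minimal sketch (illustrative only; real PBS chunks are ~4 MiB and the on-disk format differs):

```python
import hashlib

chunk_store = {}            # digest -> chunk bytes

def store_chunk(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    chunk_store.setdefault(digest, data)   # dedup: stored only once
    return digest

# Two backups of the same container share their chunks.
snapshot_1 = [store_chunk(b"config"), store_chunk(b"rootfs-block-0")]
snapshot_2 = [store_chunk(b"config"), store_chunk(b"rootfs-block-0")]

# Corrupt the shared chunk on disk ...
shared = snapshot_1[1]
chunk_store[shared] = b"bit-rotted data"

def verify(snapshot) -> bool:
    return all(hashlib.sha256(chunk_store[d]).hexdigest() == d
               for d in snapshot)

# ... and every snapshot referencing it now fails verification.
print(verify(snapshot_1), verify(snapshot_2))   # prints "False False"
```

So yes, in this model one bad deduplicated chunk is enough to fail every snapshot that references it.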
 
did you restore to a datastore where the backups still exist? if the chunks already exist on the datastore, there is no need to read them from tape or write them to disk.
i agree we could be more verbose here in the logs if that happens
 
Hi dcsapak,

Thanks again for helping me out!

did you restore to a datastore where the backups still exist? if the chunks already exist on the datastore, there is no need to read them from tape or write them to disk.

Yes, I did. The existing snapshot does not pass verification anymore; I hoped that by restoring from tape I would overwrite the broken chunk. ("all four snapshots fail verification", my other thread.)

Do I understand correctly that tape restore will only have any effect when writing to a datastore that does not contain the backup (or snapshot included) I have on that tape / media set?

Unfortunately, the restore finished with an error:

[Screenshot: 1695978188540.png]

The logged in user (root@pam) does not match the owner of the snapshot (verjaardag@pbs). Should I have checked this beforehand? Could PBS check this for me when initiating the restore, or are these details not available at that moment?

I am not sure whether this error occurred only when writing some metadata, or whether it tried to write other data after registering all chunks. The backup spans 3 tapes. Below is the last bit of the last tape, where you can see the number of registered chunks 'tapering off':

[Screenshot: 1696009241967.png]

I'd say the tapering off implies PBS restored whatever it intended to restore from this media set. What is it supposed to write in this case?

Sorry for opening up another can of worms, thank you for your patience!
 
Do I understand correctly that tape restore will only have any effect when writing to a datastore that does not contain the backup (or snapshot included) I have on that tape / media set?
it should restore chunks that are missing/broken on the target datastore, regardless if the snapshot still exists or not

The logged in user (root@pam) does not match the owner of the snapshot (verjaardag@pbs). Should I have checked this beforehand? Could PBS check this for me when initiating the restore, or are these details not available at that moment?
because tape restore can take a long time, checking this at the beginning still has the possibility that during restore such an error happens, and so we skip the initial check for that

I am not sure whether this error occurred only when writing some metadata, or that it tried to write other data after registering all chunks. The backup consists of 3 tapes. Below is the last bit of the last tape, where you see the number of registered chunks 'tapering off':
it happened during the restore of the snapshot indices (which are separate from the chunks) so the restore stopped at that point in time
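The skip-if-intact behaviour described above ("restore chunks that are missing/broken on the target datastore") can be sketched like this. Helper names and the simplified `.chunks/<prefix>/` layout are illustrative, not the actual PBS implementation:

```python
import hashlib
import os

def chunk_path(store: str, digest: str) -> str:
    # PBS-style layout: <datastore>/.chunks/<first 4 hex digits>/<digest>
    return os.path.join(store, ".chunks", digest[:4], digest)

def restore_chunk(store: str, digest: str, data: bytes) -> bool:
    """Write a chunk read from tape only if it is missing or corrupt
    on the target datastore. Returns True if it was actually written."""
    path = chunk_path(store, digest)
    if os.path.exists(path):
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() == digest:
                return False        # already present and intact: skip
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:     # missing or corrupt: (re)write
        f.write(data)
    return True
```

With every chunk already intact on disk, a full pass over the tape writes nothing, which would match the repeated "restored 0 B" lines in the task log.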
 
Hi Dominic,

Sorry for not getting back any sooner. Backups are important, but sometimes life pays no attention to that and throws urgent matters in the way ;-)

because tape restore can take a long time, checking this at the beginning still has the possibility that during restore such an error happens, and so we skip the initial check for that
I had to read that twice. Do I understand correctly that you mean this case: upon starting the restore, a check is (now not) made, which shows that the restore user matches the backup owner. A day later, the first restore cycle has passed, but in the meantime the owner of the backup has changed, creating a mismatch (no matter what the now-skipped initial check found).

I intended to run the restore once more, now through the `correct` login. I realize that I use one login for the GUI, while the client systems use their own logins to create the backups.

Which path is preferable: changing ownership of the backup (possibly preventing a client PVE system from building on that snapshot?), or using a client login for PBS?

I just noticed the `Owner` field in the `Target` tab of the `Restore Media-Set` popup. It has `Current Auth ID` as the default selection, and allows me to switch to a PVE system client ID when logged in to PBS as a user with sufficient rights.

I will initiate a tape restore with the Auth ID of the owner of the backup, i.e., the PBS user that is reserved for that PVE system.

While I'm listening to the steady acceleration and deceleration of the tape drive: what is it supposed to do? I was unable to deduce the steps from the documentation at https://pbs.proxmox.com/docs-2/tape-backup.html#restore-from-tape. Will it go through each of the tapes twice,
  • once to build an index (like it did last week),
  • then match the index with the snapshot to be restored (which hopefully succeeds, now that I chose the correct owner and I am not changing it on the datastore side),
  • and finally request the tapes that are needed to create / fill in the snapshot in the requested datastore?

I am not trying to restore only a single snapshot or specific namespace, as I only have a tape backup of the first (corrupt) snapshot of my backup. So I can restore the whole media set.

The process starts like this:
Code:
2023-10-09T12:51:34+02:00: Mediaset '8e8a53fb-7b10-4ca1-a224-b57b451d59b7'
2023-10-09T12:51:34+02:00: Pool: osba
2023-10-09T12:51:34+02:00: Datastore(s): bak_sdc
2023-10-09T12:51:34+02:00: Drive: hp_lto4_ext
2023-10-09T12:51:34+02:00: Required media list: osbas0;osbas2;osbas3
2023-10-09T12:51:34+02:00: Checking for media 'osbas0' in drive 'hp_lto4_ext'
2023-10-09T12:52:30+02:00: found media label osbas0 (7a83d40e-75f9-4db9-8b8a-afa5061b7b7c)
2023-10-09T12:52:30+02:00: File 2: chunk archive for datastore 'bak_sdc'
2023-10-09T12:53:46+02:00: restored 0 B (0 B/s)
2023-10-09T12:53:46+02:00: register 1981 chunks
2023-10-09T12:53:46+02:00: File 3: chunk archive for datastore 'bak_sdc'
2023-10-09T12:55:02+02:00: restored 0 B (0 B/s)

It has read 13 files at about 1 file per minute, each of them 1000-2000 chunks, all of them restoring 0 B. Since I restore to an existing snapshot that failed verification, should it recognize and correct a corrupt chunk while performing this task?

There are three tapes with fewer than 100 files per tape, so the first round of reading the set should take some 5 hours, but chances are I can only look at the results tomorrow night. I'll post back 'ASAP'!
 
ok, so there are generally 2 types of archives ("files") on the tape:

chunk archives and snapshot archives (both contain what you would assume, chunks and snapshots respectively)
where a chunk archive bundles chunks together into larger blocks and snapshot archives contain a single snapshot

during backup, for each snapshot, the chunks will be written first and then the snapshot, so when restoring
we can restore the chunk archives and when we encounter a snapshot archive we can also simply restore that, because all necessary chunks were restored before that
we hold a chunk index on disk on the pbs, so we know which chunks are where, and thus the bulk of the chunks are at the beginning of the tape
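This ordering means a single front-to-back pass over the tape is enough: by the time a snapshot archive goes past the head, every chunk it references has already been seen. A toy sketch of that invariant (simplified stand-in records, not the real archive format):

```python
# Simplified tape layout: chunk archives always precede the snapshot
# archive that references them, so one sequential pass suffices.
tape = [
    ("chunk_archive",    ["c1", "c2", "c3"]),
    ("chunk_archive",    ["c4"]),
    ("snapshot_archive", {"name": "ct/104/2023-01-27",
                          "refs": ["c1", "c2", "c3", "c4"]}),
]

restored_chunks = set()

for kind, payload in tape:               # read strictly front to back
    if kind == "chunk_archive":
        restored_chunks.update(payload)
    else:
        # every referenced chunk must already have passed the head
        missing = [c for c in payload["refs"] if c not in restored_chunks]
        assert not missing, f"dangling refs: {missing}"
        print(f"snapshot {payload['name']} restored")
```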

While I'm listening to the steady acceleration and deceleration of the tape drive
this is because it reads the chunks but has to verify whether they already exist on disk, so how fast it can restore depends on your storage speed

does that make it clearer?
 
Hi Dominic,

Thank you for your explanation and for replying to my question so quickly after I didn't have an update for a week!

we can restore the chunk archives and when we encounter a snapshot archive we can also simply restore that
Do I understand correctly that a snapshot archive is a bundling of (a subset of) chunks from the chunk archive? In that case it is in line with my expectation / understanding from the docs.

we hold a chunk index on disk on the pbs
I expect the same chunks to be part of regular (disk) backups, that is, as long as they remain in datastore on disk. Is that correct?

(acceleration) because it reads the chunks but has to verify if they exist on disk
It sounds more as if it reads a single loop of tape before slowing down, turning around, and reading the next loop. It's an LTO4 drive, so I think the demands on storage are not too heavy. The tapes for this dataset are even LTO3.

The restore finished while having dinner. This is the tail of the task output,


Code:
2023-10-09T18:48:51+02:00: File 75: chunk archive for datastore 'bak_sdc'
2023-10-09T18:49:59+02:00: restored 0 B (0 B/s)
2023-10-09T18:49:59+02:00: register 2903 chunks
2023-10-09T18:49:59+02:00: File 76: chunk archive for datastore 'bak_sdc'
2023-10-09T18:50:04+02:00: restored 0 B (0 B/s)
2023-10-09T18:50:04+02:00: register 82 chunks
2023-10-09T18:50:04+02:00: File 77: snapshot archive bak_sdc:ct/104/2023-01-27T21:41:19Z
2023-10-09T18:50:04+02:00: File 78: skip catalog '908bf295-ebc8-4d46-b7c8-58cde215dd93'
2023-10-09T18:50:19+02:00: detected EOT after 79 files
2023-10-09T18:50:19+02:00: Restore mediaset '8e8a53fb-7b10-4ca1-a224-b57b451d59b7' done
2023-10-09T18:50:19+02:00: TASK OK

Compared to the previous run (screenshot earlier in the thread): that time it ended with an error after file 77; this time it got to file 78 of tape 3, where it skipped the catalog. The media set seems to be restored.

Apart from parsing the task output for details, is there an overview of changes made to the backup on disk by the restore from tape?
Using ctrl-f on the downloaded task log, I got:
  • 243 occurrences of "chunk archive for datastore 'bak_sdc'"
  • 243 occurrences of "chunks"
  • 243 occurrences of "restored 0 B"

Apart from the happy ending, the results are frighteningly similar to the previous restore run for this backup.

I am running a new verify task on the on-disk backup (which failed verification last week). Judging from last week's log, it will run for around 12 hours. Am I wrong to suspect that the tape restore will not have had any impact on the corrupt chunks in the datastore, since only 0 B were restored?

Any chance the "digest store" itself got corrupted, with the chunks as such still being fine (and thus no reason to restore any chunk from tape)? The backup has passed verification previously, after the tape backup was made (so either that verification passed when it should not have, or the chunks on tape have had the same degradation as the chunks on disk).

By the way, the verification log from last week mentions the 8 corrupted chunks that were renamed to hex.0.bad; can I cherry-pick those from tape?

I am looking forward to your interpretation. Please let me know whether I overlooked something that could cause this behaviour or when I can supply information that shines a light on the situation :)
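For reference, renamed chunks like those can be listed directly on the datastore. A hypothetical helper (the datastore path is a placeholder, and while the `.chunks/<prefix>/` layout matches what PBS uses on disk, verify it against your own setup):

```python
import glob
import os

def list_bad_chunks(datastore: str):
    """List chunk files a verify task renamed to <digest>.<n>.bad
    inside a PBS-style datastore's .chunks directory."""
    pattern = os.path.join(datastore, ".chunks", "*", "*.bad")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))

if __name__ == "__main__":
    # Placeholder path -- substitute your own datastore mount point.
    for name in list_bad_chunks("/mnt/datastore/bak_sdc"):
        print(name)
```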
 
Do I understand correctly that a snapshot archive is a bundling of (a subset of) chunks from the chunk archive? In that case it is in line with my expectation / understanding from the docs.
a snapshot is a collection of indices, which are files referencing the chunks by their hash (iow. the id), so it's only metadata
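In other words, an index is just an ordered list of chunk digests, and restoring the data means concatenating the referenced chunks from the shared store. A toy illustration (not the real .fidx/.didx on-disk format):

```python
import hashlib

chunk_store = {}   # shared content-addressed store: digest -> bytes

def put(data: bytes) -> str:
    d = hashlib.sha256(data).hexdigest()
    chunk_store[d] = data
    return d

# A 12-byte "image" made of 3 chunk references; the repeated chunk
# is stored only once, the index is pure metadata.
index = [put(b"AAAA"), put(b"BBBB"), put(b"AAAA")]

# Restoring the image = concatenating the referenced chunks in order.
image = b"".join(chunk_store[d] for d in index)
print(image)       # prints b'AAAABBBBAAAA'
```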

I expect the same chunks to be part of regular (disk) backups, that is, as long as they remain in datastore on disk. Is that correct?
yes the chunks on disk are the same as on tape, we simply restructure the overall layout a bit (because of tape limitations, i.e. no fast random access)

It sounds more as if it reads a single loop of tape before slowing down, turning around, and reading the next loop. It's an LTO4 drive, so I think the demands on storage are not too heavy. The tapes for this dataset are even LTO3.
on a normal (whole media-set) restore, the tapes will only be read once, but as i said how fast that can be done depends on the disk speed (maybe also on cpu + memory)

Apart from parsing the task output for details, is there an overview of changes made to the backup on disk by the restore from tape?
not yet, but this sounds like a worthwhile improvement to have at the end of the task log (and probably not too hard to do), would you mind opening an enhancement request for that: https://bugzilla.proxmox.com

Apart from the happy ending, the results are frighteningly similar to the previous restore run for this backup.

I am running a new verify task on the on-disk backup (that last week failed verification). Looking back at the log from last week, it will run for around 12 hours. Am I wrong to suspect that the tape restore will not have had any impact on the corrupt chunks in the data store, since only 0 B have been restored?

Any chance the "digest store" itself got corrupted, with the chunks as such still being fine (and thus no reason to restore any chunk from tape)? The backup has passed verification previously, after the tape backup was made (so either that verification passed when it should not have, or the chunks on tape have had the same degradation as the chunks on disk).
normally restoring the chunks should repair them (if they were on the tape backup in the first place). Bad chunks should not be backed up since they will be read + verified during that. Sans any bugs, this basically should work as expected, if you found that it does not, more details would be nice so that we can try to reproduce that and fix the bugs

By the way, the verification log from last week mentions which 8 corrupted chunks have been renamed to hex.0.bad, can I cherry pick those from tape?
no, not directly, but restoring the affected snapshots should be equivalent to that
 
Sorry for the long silence. Apart from the run time of the tasks, life happened (and besides, I was really looking forward to the new PVE and PBS versions, gave upgrading priority over politeness, and postponed posting until finishing the upgrades, which went easily and successfully, wohoo).

"All is well".

  • the second restore run from tape did not restore any chunks either
  • sequentially verifying my snapshots one by one did not find any irregularities in the original set (of 4 snapshots), which explains why the restore from tape did not correct any chunks
  • (unrelated to the tape story) the latest snapshot, which was created while the initial verification was running, did have corrupt chunks on re-verification

My next actions are running a new backup, and writing the snapshot to tape after (successful) verification.

I'll mark this post as 'solved, not reproducible'.

Thank you for your patient explanations!
 
in that case i'd probably check the disks/memory/cpu etc. because something seems to corrupt your chunks (and that is not "normal" behaviour)
 
