Verification Job Failures -- What Happens?

Jan 21, 2022
Hi All,

Sorry if my meager googling and forum searching skills have let us all down. . . but would someone please answer:

What should we expect to happen if/when "bit rot" takes place and a verify job detects it on a backup/snapshot generated long ago? In other words, what is the by-design behavior of pbs when a verification job detects an issue?

We run our dataset on zfs raidz2, so ideally the routine scrub would correct it. But what about a situation where no parity is available?

Best Regards,

Brian
 
Bit rot is problematic because of deduplication. If a single chunk degrades and that chunk is used by all backups of a guest, then all backup snapshots will fail. As far as I know PBS only detects bit rot but won't try to fix it. So it's indeed not a bad idea to use ZFS with parity for the datastore and add bit rot protection at the filesystem level.
 
there are multiple ways to fix broken chunks though:
* make a backup again that contains the same chunk (though whether a backup contains the exact same chunk is just a matter of luck)
* you can sync back an affected snapshot from a good source, e.g. an offsite pbs or a tape backup
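
to picture why a single bad chunk can affect so many snapshots, here is a minimal toy sketch of a content-addressed store (purely illustrative; chunk size, names and structure are simplified and this is not how pbs stores data internally):

```python
import hashlib

# toy content-addressed store: chunks are keyed by their SHA-256 digest and
# snapshots are just lists of digests. this mirrors the idea, not pbs itself.
chunk_store = {}   # digest -> chunk bytes (each unique chunk stored once)
snapshots = {}     # snapshot name -> list of digests

def add_snapshot(name, data, chunk_size=4):
    digests = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)   # deduplication: reuse if already present
        digests.append(digest)
    snapshots[name] = digests

# two weekly backups of the same unchanged file share every chunk
add_snapshot("fileserver/2022-01-10", b"static file content")
add_snapshot("fileserver/2022-01-17", b"static file content")

# corrupt the single stored copy of one shared chunk ...
shared = snapshots["fileserver/2022-01-10"][0]
chunk_store[shared] = b"rot!"

# ... and every snapshot referencing that chunk now fails verification
for name, digests in snapshots.items():
    ok = all(hashlib.sha256(chunk_store[d]).hexdigest() == d for d in digests)
    print(name, "OK" if ok else "FAILED")
```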
 
Thank you for the replies. Since confidence in a backup solution is paramount, I hope nobody minds if I follow up with some further, more detailed questions. Thankfully, our pbs datastore is on raidz2 but for the purposes of this discussion, let's pretend that it isn't!

Okay, so. . . hypothetically. . .
  • We have a "fileserver" VM with a 10 TB drive. It is backed up to PBS every Monday.
  • The PBS datastore is not on ZFS nor does it have any form of parity.
  • There is a "verify new snapshots" job running nightly.
  • There is a "re-verify old snapshots" job set to run on the first of every month.
Let's say that "bit rot" strikes "fileserver" and wipes out a few blocks of a file at /local/homefolders/staff/bribribri/myfile.txt at some point last month. This is a file that was backed up during the very first snapshot taken of "fileserver" and has been untouched/unmodified since.

Question 1: The next "verify new snapshots" won't note anything wrong since the bit rot has struck blocks backed up on earlier snapshots, correct?

Question 2: The "re-verify snapshots" runs on the first of this month and eventually notes the bad blocks on "fileserver." Correct?

Question 3: Any snapshots "downstream" of that now corrupted "snapshot" are then affected? What happens to them? Will PBS allow restores from those snapshots "around" the affected/corrupt blocks or are all affected snapshots of "fileserver" rendered un-useable in total?

Question 4: The next nightly backup, will those bad blocks (having been flagged as bad) be automatically re-grabbed by pbs from "fileserver". . . and will all the snapshots then be re-validated and made available upon next verification pass?

Thank you for your time and patience,

Brian
 
ok, so pbs does not (and cannot) detect bit rot on the *source* side of a backup.
what happens when bit rot changes a file on your fileserver and you back up again is that the file simply looks different to the pbs client (it cannot know whether the change was deliberate or not), so it will back up the 'new' file
no bit rot will be detected by pbs in this case

pbs can only detect bit rot in its own chunks, meaning the data of an already existing backup changes on the datastore

for this reason (among others) it's important to have a proper backup & restore concept that includes restore tests
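
conceptually the datastore-side check is simple: re-hash each stored chunk and compare against the digest it was stored under. a rough sketch of that idea (assuming a hypothetical flat directory of chunk files named by their sha256 digest; this simplifies the real on-disk layout):

```python
import hashlib
from pathlib import Path

def verify_chunk_dir(chunk_dir):
    """Re-hash every chunk file and report any whose content no longer
    matches the digest it was stored under (i.e. bit rot on the datastore)."""
    bad = []
    for chunk_file in Path(chunk_dir).iterdir():
        if hashlib.sha256(chunk_file.read_bytes()).hexdigest() != chunk_file.name:
            bad.append(chunk_file.name)
    return bad

# e.g. verify_chunk_dir("/tmp/toy-datastore/chunks")  # hypothetical test directory
```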
 
Thank you Dominik! Understood!

That really just leaves "Question 3" in my above post. . .

Question 3: Any snapshots "downstream" of that now corrupted "snapshot" are then affected? What happens to them? Will PBS allow file-level restores from those snapshots "around" the affected/corrupt blocks or are all downstream affected snapshots of "fileserver" rendered un-useable in total?

Any insight here? I wonder how I'd go about accessing a snapshot and manually "flipping a few bits" in a lab environment to see for myself what happens in those situations?
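
(What I have in mind is something like the sketch below, run against a throwaway test datastore only. The path is hypothetical; the target would just be whichever chunk file I pick out of the datastore's chunk directory.)

```python
import random
from pathlib import Path

def flip_one_bit(path):
    """Flip a single random bit in a file -- only ever on a disposable
    test datastore, never on production data."""
    data = bytearray(Path(path).read_bytes())
    index = random.randrange(len(data))
    data[index] ^= 1 << random.randrange(8)
    Path(path).write_bytes(bytes(data))

# hypothetical example path inside a test datastore:
# flip_one_bit("/testdatastore/.chunks/0123/0123abcd...")
```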

Thanks!

--Brian
 
in your given scenario, from pbs' point of view none of the backups is 'corrupted', since the corruption happened on the source side where pbs cannot detect it. each backup will restore exactly as it was backed up (assuming no data corruption on pbs itself)
a short example:

source: file a with content 'aaa'
that is backed up with pbs and results in the backup snapshot 'host/foo/2023-01-01T00:00:01Z'

on the source the file corrupts and now contains 'aab'
that is again backed up with pbs and results in another backup snapshot 'host/foo/2023-01-01T00:01:01Z'

the backup 'host/foo/2023-01-01T00:00:01Z' contains the file with content 'aaa'
and the backup 'host/foo/2023-01-01T00:01:01Z' contains the file with content 'aab'
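
in toy form, the same example (illustrative only, not the actual client code): the changed content simply hashes to a different digest, so it becomes a separate chunk and both snapshots stay restorable exactly as written

```python
import hashlib

chunks = {}      # digest -> content
index = {}       # snapshot name -> list of digests

def backup(snapshot, data):
    d = hashlib.sha256(data).hexdigest()
    chunks.setdefault(d, data)        # 'aab' hashes differently, so it is stored as a new chunk
    index[snapshot] = [d]

backup("host/foo/2023-01-01T00:00:01Z", b"aaa")   # original file
backup("host/foo/2023-01-01T00:01:01Z", b"aab")   # silently corrupted on the source

# each snapshot restores exactly what was backed up at the time
for name, digests in index.items():
    print(name, b"".join(chunks[d] for d in digests))
```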

also maybe this documentation helps a bit more to understand the underlying technology: https://pbs.proxmox.com/docs/technical-overview.html
 
Hi Dominik,

I must humbly apologize. My questions have (in my mind!) been about bit rot/corruption of the PBS datastore itself rather than on the VM being backed up. I laid out all the context about the pbs datastore but then actually made my question unclear! In fact, I stated it downright wrong. . .

Okay, so. . . hypothetically. . .
  • We have a "fileserver" VM with a 10 TB drive. It is backed up to PBS every Monday.
  • The PBS datastore is not on ZFS nor does it have any form of parity.
  • There is a "verify new snapshots" job running nightly.
  • There is a "re-verify old snapshots" job set to run on the first of every month.
Let's say that "bit rot" strikes "fileserver" and wipes out a few blocks of a file at /local/homefolders/staff/bribribri/myfile.txt at some point last month. This is a file that was backed up during the very first snapshot taken of "fileserver" and has been untouched/unmodified since.

I meant to say:

Let's say that 'bit rot' strikes the initial snapshot of fileserver stored on pbs and wipes out a few blocks. . .

. . . and I hope my follow-up questions about how the failed verification jobs would appear make more sense with that sentence fixed!

Again, my humble apologies for that brain hiccup/typo at such a critical point in the question. I can imagine how frustrating it was to be reading that and thinking: "Why does this guy not get that verification jobs won't catch bit rot on the VM itself."

If you still have any patience left (again, sorry!), would it be possible to re-visit the questions asked with this fixed/adjusted context in mind?

Thank you very much for your time!

--Brian
 
ah ok yes, now the questions make more sense :)

in that case, let me answer your initial questions

Question 1: The next "verify new snapshots" won't note anything wrong since the bit rot has struck blocks backed up on earlier snapshots, correct?
it depends: if the new snapshot references the old corrupted chunk, then the 'verify new' job would actually catch that

Question 2: The "re-verify snapshots" runs on the first of this month and eventually notes the bad blocks on "fileserver." Correct?
yes, assuming the chunk wasn't already marked as bad in the meantime by e.g. the 'verify new' job

Question 3: Any snapshots "downstream" of that now corrupted "snapshot" are then affected? What happens to them? Will PBS allow restores from those snapshots "around" the affected/corrupt blocks or are all affected snapshots of "fileserver" rendered un-useable in total?
what do you mean by "downstream" exactly in this case? newer snapshots? snapshots on a different synced pbs?

Question 4: The next nightly backup, will those bad blocks (having been flagged as bad) be automatically re-grabbed by pbs from "fileserver". . . and will all the snapshots then be re-validated and made available upon next verification pass?
if the identical chunk is again contained in the backup, it will be reuploaded and automatically 'heal' the old corrupted snapshots

generally, corrupt chunk handling works like this (rough sketch below):
* if a verify/read/etc. operation detects a bad chunk, it gets marked as bad
* all snapshots referencing that bad chunk can no longer be read completely
* if a new backup contains the same chunk (i.e. the same checksum), it will overwrite the bad chunk
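
as a toy sketch of that healing behaviour (again purely illustrative; names and details are made up and this is not the actual server logic):

```python
import hashlib

chunks = {}   # digest -> content, or None once flagged as bad

def digest(data):
    return hashlib.sha256(data).hexdigest()

def verify(d):
    # a verify/read that notices a mismatch flags the chunk as bad
    if chunks[d] is not None and digest(chunks[d]) != d:
        chunks[d] = None

def upload(data):
    # a new backup re-uploads a chunk that is missing or flagged bad,
    # which 'heals' every snapshot referencing that digest
    d = digest(data)
    if chunks.get(d) is None:
        chunks[d] = data
    return d

d = upload(b"unchanged blocks of the fileserver disk")   # first backup
chunks[d] = b"bit rot happened here"                     # datastore corruption
verify(d)                                                # re-verify flags the chunk bad
upload(b"unchanged blocks of the fileserver disk")       # next backup contains the same chunk
assert chunks[d] == b"unchanged blocks of the fileserver disk"
```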
 
what do you mean by "downstream" exactly in this case? newer snapshots? snapshots on a different synced pbs?

I think the term I should be using is "referencing". . . So to translate my unfortunate/inappropriate use of "downstream". . . what I meant is that snapshots that are taken later and reference the initial snapshot are "downstream" of that initial snapshot.

But, I think you already answered my question by providing additional/helpful details in your other answers.

if the identical chunk is again contained in the backup, it will be reuploaded and automatically 'heal' the old corrupted snapshots
This is the very good news I was hoping to hear! If I'm understanding correctly, corrupted snapshots on the pbs server discovered by the verification job(s) do indeed "self-heal" (assuming the chunks needed to fix the snapshots are still available on the backed-up VM). Before confirming here, I wasn't sure whether PBS would merely alert you to the fact that the snapshots (perhaps going back years) are now bad, or actually "fix" them. This is good news and will inform our decisions about how to configure our pbs storage going forward (i.e., how much redundancy/parity is needed. . . though for many reasons, I think we'll always want some).

Once again, apologies for the miscommunication earlier. It was entirely my fault! And I truly appreciate your patience while we worked through it!

Best Regards,

Brian
 
I think the term I should be using is "referencing". . . So to translate my unfortunate/inappropriate use of "downstream". . . what I meant is that snapshots that are taken later and reference the initial snapshot are "downstream" of that initial snapshot.
just to clarify, the snapshots don't reference each other; they only ever reference a list of chunks, so each snapshot is completely independent (logically) while sharing the chunks
 
Bit rot is problematic because of deduplication. If a single chunk degrades and that chunk is used by all backups of a guest, then all backup snapshots will fail. As far as I know PBS only detects bit rot but won't try to fix it. So it's indeed not a bad idea to use ZFS with parity for the datastore and add bit rot protection at the filesystem level.

Actually any good RAID system will take care of bit rot during scrubs. I use ZFS whenever I can. Sometimes on servers I don't have that option without flashing the controller into IT mode.
 
Very good points in this thread. Ultimately, though, it's up to the backup admins to monitor these backup and verification jobs, daily if possible. I also configured my PBS servers to send me e-mail about the backup / verification jobs so I know what is going on.

The final piece is to do manual restores of a few VMs/CTs on a schedule to make sure they can be restored properly. PBS by design is great for what it is, but nothing is 100% perfect.

All of my servers are running on either RAID or ZFS, so the data is being checked at the file level, and I'm not really worried too much about it. Non-ECC memory in servers has a higher chance of a bit flip than data saved to disk. It's rare, but it does happen.
 
