Verification Job Failures -- What Happens?

Jan 21, 2022
Hi All,

Sorry if my meager googling and forum searching skills have let us all down. . . but would someone please answer:

What should we expect to happen if/when "bit rot" takes place and a verify job detects it on a backup/snapshot generated long ago? In other words, what is the by-design behavior of pbs when a verification job detects an issue?

We run our dataset on ZFS raidz2, so ideally the routine scrub would correct it. But what about a situation where no parity is available?

Best Regards,

Brian
 
Bit rot is problematic because of deduplication. If a single chunk degrades and that chunk is used by all backups of a guest, then all backup snapshots will fail. As far as I know PBS only detects bit rot but won't try to fix it. So it's indeed not a bad idea to use ZFS with parity for the datastore and get bit rot protection at the filesystem level.
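To see why a single bad chunk takes out many snapshots at once, here is a minimal conceptual sketch of content-addressed deduplication (just to illustrate the idea, this is of course not actual PBS code, even though PBS uses SHA-256 chunk digests in a similar spirit):

```python
# Conceptual sketch only -- not PBS code. Snapshots are modelled as lists of
# chunk digests, so one corrupted chunk fails verification for every snapshot
# that references it.
import hashlib

chunk_store = {}   # digest -> chunk data (roughly: the datastore's chunk directory)
snapshots = {}     # snapshot name -> list of digests (roughly: the index files)

def backup(name, chunks):
    digests = []
    for data in chunks:
        digest = hashlib.sha256(data).hexdigest()
        chunk_store.setdefault(digest, data)   # dedup: each chunk is stored only once
        digests.append(digest)
    snapshots[name] = digests

def verify(name):
    return all(hashlib.sha256(chunk_store[d]).hexdigest() == d
               for d in snapshots[name])

backup("vm/100/2022-01-03", [b"shared chunk", b"only in first backup"])
backup("vm/100/2022-01-10", [b"shared chunk", b"only in second backup"])

# simulate bit rot in the single shared chunk
shared = hashlib.sha256(b"shared chunk").hexdigest()
chunk_store[shared] = b"shared chUnk"

print(verify("vm/100/2022-01-03"), verify("vm/100/2022-01-10"))  # False False
```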
 
there are multiple ways to fix broken chunks though:
* make a backup again that contains the same chunk (though whether a backup contains the exact same chunk is just a matter of luck)
* you can sync back an affected snapshot from a good source, e.g. an offsite pbs or a tape backup
 
Thank you for the replies. Since confidence in a backup solution is paramount, I hope nobody minds if I follow up with some further, more detailed questions. Thankfully, our pbs datastore is on raidz2 but for the purposes of this discussion, let's pretend that it isn't!

Okay, so. . . hypothetically. . .
  • We have a "fileserver" VM with a 10 TB drive. It is backed up to PBS every Monday.
  • The PBS datastore is not on ZFS nor does it have any form of parity.
  • There is a "verify new snapshots" running nightly.
  • There is a "re-verify old snapshots" job set to run on the first of every month.
Let's say that "bit rot" strikes "fileserver" and wipes out a few blocks of a file at /local/homefolders/staff/bribribri/myfile.txt at some point last month. This is a file that was backed up during the very first snapshot taken of "fileserver" and has been untouched/unmodified since.

Question 1: The next "verify new snapshots" won't note anything wrong since the bit rot has struck blocks backed up on earlier snapshots, correct?

Question 2: The "re-verify snapshots" runs on the first of this month and eventually notes the bad blocks on "fileserver." Correct?

Question 3: Any snapshots "downstream" of that now corrupted "snapshot" are then affected? What happens to them? Will PBS allow restores from those snapshots "around" the affected/corrupt blocks or are all affected snapshots of "fileserver" rendered un-useable in total?

Question 4: The next nightly backup, will those bad blocks (having been flagged as bad) be automatically re-grabbed by pbs from "fileserver". . . and will all the snapshots then be re-validated and made available upon next verification pass?

Thank you for your time and patience,

Brian
 
ok, so pbs does not (and cannot) detect bit rot on the *source* side of a backup.
what happens when bit rot changes a file on your fileserver and you back up again is that the file simply looks different to the pbs client (it cannot know whether the change was deliberate or not), so it will back up the 'new' file
no bit rot will be detected by pbs in this case

pbs can only detect bit rot in its own chunks, i.e. when the data of an already existing backup changes

for this reason (among others) it's important to have a proper backup & restore concept that includes restore tests
 
Thank you Dominik! Understood!

That really just leaves "Question 3" in my above post. . .

Question 3: Any snapshots "downstream" of that now corrupted "snapshot" are then affected? What happens to them? Will PBS allow file-level restores from those snapshots "around" the affected/corrupt blocks or are all downstream affected snapshots of "fileserver" rendered un-useable in total?

Any insight here? I wonder how I'd go about accessing a snapshot and manually "flipping a few bits" in a lab environment to see for myself what happens in those situations?
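(Thinking out loud: on a throwaway test datastore, something as crude as the following would probably do. The /lab/datastore path is just a placeholder for a scratch datastore; as far as I understand, PBS keeps its chunks under the datastore's hidden .chunks directory, named by their digest. Never run this against real backups, obviously.)

```python
# Lab-only sketch: corrupt one random chunk in a *throwaway* PBS datastore so
# the next (re-)verify job flags it. /lab/datastore is a placeholder path.
import random
from pathlib import Path

chunk_files = [p for p in Path("/lab/datastore/.chunks").rglob("*") if p.is_file()]
victim = random.choice(chunk_files)

data = bytearray(victim.read_bytes())
data[len(data) // 2] ^= 0xFF          # flip all bits of one byte in the middle
victim.write_bytes(data)

print(f"corrupted {victim}; run a verify job and it should be marked as bad")
```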

Thanks!

--Brian
 
in your given scenario, from pbs' view none of the backups is 'corrupted', since that happened on the source side where pbs cannot detect it. each backup will restore exactly as it was backed up (assuming no data corruption on pbs itself)
a short example:

source: file a with content 'aaa'
that is backed up with pbs and results in the backup snapshot 'host/foo/2023-01-01T00:00:01Z'

on the source the file corrupts and now contains 'aab'
that is again backed up with pbs and results in another backup snapshot 'host/foo/2023-01-01T00:01:01Z'

the backup 'host/foo/2023-01-01T00:00:01Z' contains the file with content 'aaa'
and the backup 'host/foo/2023-01-01T00:01:01Z' contains the file with content 'aab'
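the same example as a very rough sketch (conceptual python, not actual pbs code), just to show that each snapshot keeps its own mapping from file to content digest:

```python
# Conceptual sketch of the example above -- not actual PBS code.
import hashlib

store = {}       # digest -> content
snapshots = {}   # snapshot name -> {file name: digest}

def backup(snapshot, files):
    snapshots[snapshot] = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)
        snapshots[snapshot][name] = digest

def restore(snapshot, name):
    return store[snapshots[snapshot][name]]

backup("host/foo/2023-01-01T00:00:01Z", {"a": b"aaa"})
# the file corrupts on the source and simply looks like a changed file on the next run
backup("host/foo/2023-01-01T00:01:01Z", {"a": b"aab"})

print(restore("host/foo/2023-01-01T00:00:01Z", "a"))  # b'aaa'
print(restore("host/foo/2023-01-01T00:01:01Z", "a"))  # b'aab'
```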

also maybe this documentation helps a bit more to understand the underlying technology: https://pbs.proxmox.com/docs/technical-overview.html
 
Hi Dominik,

I must humbly apologize. My questions have (in my mind!) been about bit rot/corruption of the PBS datastore itself rather than on the VM being backed up. I laid out all the context about the pbs datastore but then actually made my question unclear! In fact, I stated it downright wrong. . .

Okay, so. . . hypothetically. . .
  • We have a "fileserver" VM with a 10 TB drive. It is backed up to PBS every Monday.
  • The PBS datastore is not on ZFS nor does it have any form of parity.
  • There is a "verify new snapshots" running nightly.
  • There is a "re-verify old snapshots" job set to run on the first of every month.
Let's say that "bit rot" strikes "fileserver" and wipes out a few blocks of a file at /local/homefolders/staff/bribribri/myfile.txt at some point last month. This is a file that was backed up during the very first snapshot taken of "fileserver" and has been untouched/unmodified since.

I meant to say:

Let's say that 'bit rot' strikes the initial snapshot of fileserver stored on pbs and wipes out a few blocks. . .

. . . and I hope my questions then asking how the failed verification jobs would appear make more sense with that sentence fixed!

Again, my humble apologies for that brain hiccup/typo at such a critical point in the question. I can imagine how frustrating it was to be reading that and thinking: "Why does this guy not get that verification jobs won't catch bit rot on the VM itself."

If you still have any patience left (again, sorry!), would it be possible to re-visit the questions asked with this fixed/adjusted context in mind?

Thank you very much for your time!

--Brian
 
ah ok yes, now the questions make more sense :)

in that case, let me answer your initial questions

Question 1: The next "verify new snapshots" won't note anything wrong since the bit rot has struck blocks backed up on earlier snapshots, correct?
depends: if the new snapshot references the old corrupted chunk, then the 'verify new' job would actually see that

Question 2: The "re-verify snapshots" runs on the first of this month and eventually notes the bad blocks on "fileserver." Correct?
yes, if the chunk wasn't already marked as bad in the meantime, e.g. by 'verify new'

Question 3: Any snapshots "downstream" of that now corrupted "snapshot" are then affected? What happens to them? Will PBS allow restores from those snapshots "around" the affected/corrupt blocks or are all affected snapshots of "fileserver" rendered un-useable in total?
what do you mean with "downstream" exactly in this case? newer snapshots? snapshots on a different synced pbs?

Question 4: The next nightly backup, will those bad blocks (having been flagged as bad) be automatically re-grabbed by pbs from "fileserver". . . and will all the snapshots then be re-validated and made available upon next verification pass?
if the identical chunk is again contained in the backup, it will be reuploaded and automatically 'heal' the old corrupted snapshots

generally corrupt chunk handling is the following:
* if a verify/read/etc. operation detects a bad chunk, it gets marked as bad
* all snapshots referencing that bad chunk can no longer be read completely
* if a new backup contains the same chunk (i.e. the same checksum), it will overwrite the bad chunk
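a rough sketch of that cycle (conceptual only, not the actual implementation):

```python
# Conceptual sketch of the bad-chunk handling described above -- not actual PBS code.
import hashlib

chunk_store = {}    # digest -> chunk data
bad_chunks = set()  # digests that a verify/read operation has marked as bad

def verify_chunk(digest):
    if hashlib.sha256(chunk_store[digest]).hexdigest() != digest:
        bad_chunks.add(digest)       # mark as bad instead of silently dropping it
        return False
    return True

def restore_chunk(digest):
    if digest in bad_chunks:
        raise IOError(f"chunk {digest[:8]}... is bad, restore cannot complete")
    return chunk_store[digest]

def upload_chunk(data):
    digest = hashlib.sha256(data).hexdigest()
    if digest not in chunk_store or digest in bad_chunks:
        chunk_store[digest] = data   # a re-upload overwrites the bad copy ...
        bad_chunks.discard(digest)   # ... which 'heals' all snapshots referencing it
    return digest

digest = upload_chunk(b"some chunk")
chunk_store[digest] = b"some chUnk"  # simulate bit rot on disk
verify_chunk(digest)                 # -> False, chunk is now marked bad
upload_chunk(b"some chunk")          # a later backup contains the identical chunk
print(verify_chunk(digest))          # True -- the old snapshots are readable again
```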
 
what do you mean with "downstream" exactly in this case? newer snapshots? snapshots on a different synced pbs?

I think the term I should be using is "referencing". . . So to translate my unfortunate/inappropriate use of "downstream". . . what I meant is that snapshots that are taken later and reference the initial snapshot are "downstream" of that initial snapshot.

But, I think you already answered my question by providing additional/helpful details in your other answers.

if the identical chunk is again contained in the backup, it will be reuploaded and automatically 'heal' the old corrupted snapshots
This is the very good news I was hoping to hear! If I'm understanding correctly, corrupted snapshots on the pbs server discovered by the verification job(s) do indeed "self-heal" (assuming the chunks needed to fix the snapshots are still available on the backed up VM). Before confirming here, I wasn't sure if PBS would merely alert you to the fact that the snapshots (perhaps going back years) are now bad, or actually "fix" them. This is good news and will inform our decisions about how to configure our pbs storage going forward (i.e., how much redundancy/parity is needed. . . though for many reasons, I think we'll always want some).

Once again, apologies for the miscommunication earlier. It was entirely my fault! And I truly appreciate your patience while we worked through it!

Best Regards,

Brian
 
I think the term I should be using is "referencing". . . So to translate my unfortunate/inappropriate use of "downstream". . . what I meant is that snapshots that are taken later and reference the initial snapshot are "downstream" of that initial snapshot.
just to clarify: the snapshots don't reference each other, they only ever reference a list of chunks. each snapshot is therefore completely independent (logically) while sharing the chunks
 
Bit rot is problematic because of deduplication. If a single chunk degrades and that chunk is used by all backups of a guest, then all backup snapshots will fail. As far as I know PBS only detects bit rot but won't try to fix it. So it's indeed not a bad idea to use ZFS with parity for the datastore and get bit rot protection at the filesystem level.

Actually any good RAID system will take care of bit rot during scrubs. I use ZFS whenever I can. Sometimes on servers I don't have that option without flashing the controller into IT mode.
 
Very good points in this thread. That said, it's really up to the backup admins to monitor these backup and verification jobs, daily if possible. I also configured my PBS servers to send me e-mail about the backup/verification jobs so I know what is going on.

The final piece is to do manual restores of a few VMs/CTs on a schedule to make sure they can be restored properly. PBS is great by design for what it is, but nothing is 100% perfect.

All of my servers are running either RAID or ZFS, so the data is being checked at the file level, and I'm not too worried about it. Non-ECC memory in servers has a higher chance of a bit flip than data being saved to disk. It's rare, but it does happen.
 
