[SOLVED] Is verify task needed with ZFS-backed datastore ?

Altinea

Hello,
On PBS, when using a ZFS-backed datastore, should we really enable a verification job?

AFAIK, when there is some kind of redundancy (multiple copies, RAIDZ or mirroring), checksums are not disabled (why would you do that?) AND ZFS scrubbing runs on a regular basis (once a week, once a month?), ZFS is able to detect and recover from bit rot by itself.
With no redundancy (you like risk, huh?), bit rot can be detected but not corrected.

So the verify job seems to be of no use in this particular case, right?

This is probably the same with other checksumming filesystems (XFS, Btrfs?).

I forgot to mention: don't delete random chunks in your datastore, just let the GC do its job. But then again, who does that?

Thanks for your advice on this.
 
Hi,

yeah, if the storage level already has enough redundancy and does complete and full checks against bit rot (not only of metadata but of everything), then snapshot verification on the PBS level may be overkill.

ZFS scrubbing runs on a regular basis (once a week, once a month?)

ZFS scrubbing runs once a month by default, so a higher frequency like once a week may be more ideal in such a case.
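To illustrate, a scrub can be triggered or rescheduled like this (a sketch for Debian-based hosts; the pool name `rpool` and the cron schedule are assumptions):

```shell
# Run a scrub of the pool backing the datastore right away
# (pool name "rpool" is an example):
zpool scrub rpool

# Check scrub progress and the result of the last run:
zpool status rpool

# On Debian-based systems the stock monthly scrub comes from
# /etc/cron.d/zfsutils-linux; a weekly schedule could be set up
# instead via a custom cron entry, e.g. every Sunday at 02:00:
#   0 2 * * 0  root  /usr/sbin/zpool scrub rpool
```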

This is probably the same with other checksumming filesystems (XFS, Btrfs?).
XFS only checksums metadata, not all data, and is thus not equivalent to a verify job.

In general, IMO one can get a better safety net by:
* having an off-site archive mirror of the relevant PBS snapshots
* doing frequent end-to-end restore tests, to ensure that all relevant data and disks are actually backed up and that the restore process works and is familiar - which can be really valuable during outages, as one can get stressed more easily then.
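For the off-site mirror, one way is to let a second PBS pull from the primary via a remote plus a sync job; a rough CLI sketch (hostname, store names, token, and schedule are placeholders):

```shell
# On the off-site PBS: register the primary instance as a remote
# (host, auth-id, password, and fingerprint are placeholders):
proxmox-backup-manager remote create primary-pbs \
    --host pbs.example.com \
    --auth-id 'sync@pbs!offsite' \
    --password 'XXXXXX' \
    --fingerprint '64:d3:...:2e'

# Pull datastore "store1" from the remote into the local
# datastore "offsite" once a day:
proxmox-backup-manager sync-job create offsite-pull \
    --store offsite \
    --remote primary-pbs \
    --remote-store store1 \
    --schedule daily
```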
 
I forgot to mention: don't delete random chunks in your datastore, just let the GC do its job. But then again, who does that?

Nobody in their right mind would do this with intent, but a mistyped command, or multiple open SSH connections and typing something into the wrong console, can happen to any (maybe tired) "meatbag" - you only need to have bad luck once.

But yeah, verification won't really help against that either, while an off-site copy will.
 
Thanks for the advice.

That's another subject, but restores can't be done to another VMID, so testing them is a bit painful (read: complicated). Perhaps in a future version?

Best regards
 
Restoring to any VMID has worked just fine since pretty much forever.

Just start the restore from the storage's content panel, not from the VM's/CT's Backup panel, and you can select a new VMID.

You may want to disable the network links of such a test-restored guest, so that it doesn't come up and mess with production.
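Such a test restore can also be scripted from the PVE shell; a sketch (the PBS storage name, snapshot timestamp, target storage, and VMIDs are placeholders):

```shell
# Restore a backup of VM 100 into a fresh test VMID 9100,
# regenerating unique properties such as the MAC address
# (storage names and snapshot timestamp are placeholders):
qmrestore 'pbs-store:backup/vm/100/2024-01-07T02:00:00Z' 9100 \
    --unique 1 --storage local-zfs

# Keep the test guest off the production network by taking
# its NIC link down before booting it:
qm set 9100 --net0 'virtio,bridge=vmbr0,link_down=1'
qm start 9100
```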
 
Shame on me, you're absolutely right. I only ever tested from the VM's Backup panel, so I didn't know about this.

I don't think it's even needed to start the VM to test a restoration, but that's better, yeah.

This thread can now be closed, thanks again for sharing your thoughts.
 
The case where I used verification was when I needed to bootstrap a remote (cross-Atlantic) PBS, and the single-connection synchronization was just... plain molasses on a freezing cold winter's day...

The easiest fix was to spin up multiple rsync sessions, one per directory of the datastore, and then use the verification process to find the backups whose index files had been transferred but whose chunks had already been removed on the source. After those invalid backups were cleaned out, I had a decent set of backups to continue with synchronizing, pruning, and GC. The single-connection synchronization would have taken about 10x longer (high-latency links, and old CPUs that are slow at SSL, being the main contributors to the bad performance).
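For anyone attempting a similar seeded bootstrap: the chunk store is sharded into many subdirectories (`.chunks/0000` through `.chunks/ffff`), which makes it easy to fan out parallel rsync sessions; a rough sketch, with host and paths as placeholders:

```shell
# Source datastore path and rsync destination (placeholders):
SRC=/mnt/datastore/store1
DST=remote-pbs:/mnt/datastore/store1

# Copy the chunk subdirectories with 8 parallel rsync sessions,
# one .chunks/XXXX subdirectory per job:
ls "$SRC/.chunks" | xargs -P 8 -I{} \
    rsync -a "$SRC/.chunks/{}/" "$DST/.chunks/{}/"

# Then copy the snapshot metadata (index and owner files):
rsync -a --exclude '.chunks' "$SRC/" "$DST/"

# Afterwards, run a verify job on the destination to flag any
# snapshots whose chunks went missing on the source mid-transfer.
```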
 
