PBS Verification and ZFS Scrubbing and ....

arubenstein · Nov 19, 2023

I try to be pragmatic with how I approach solutions and make sure that I am not "over-solving problems" or "providing solutions for problems that don't exist" ...

The environment is a PVE 8.x cluster of 3 nodes, using Ceph (on SSDs) for storage. This works perfectly: large nodes (40 cores, 768 GB RAM, hundreds of VMs) with lots of SSD drives make for an excellent compute environment with good redundancy (Ceph with 3 copies).

Then, I have a PBS server running ZFS on large rotational drives (20 TB), plus some SSDs for a ZFS special device. Backups work well, very zippy. We keep 7 days, 4 weeks, and 3 months of retention.

My question is about whether PBS verification is needed on top of ZFS scrubbing. I should mention that the rotational drives are set up as RAID-Z2.

Wikipedia, about ZFS, says (https://en.wikipedia.org/wiki/ZFS#RAID_("RAID-Z")), among other things:

"In addition to handling whole-disk failures, RAID-Z can also detect and correct silent data corruption, offering "self-healing data": when reading a RAID-Z block, ZFS compares it against its checksum, and if the data disks did not return the right answer, ZFS reads the parity and then figures out which disk returned bad data. Then, it repairs the damaged data and returns good data to the requestor.[36]"
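To make the quoted mechanism concrete, here is a toy sketch of checksum-driven self-healing. It is not ZFS code: it assumes a single XOR parity column (closer to RAID-Z1 than to the RAID-Z2 used here), and the names `write_block` and `read_block` are made up for illustration.

```python
import hashlib
from functools import reduce


def xor_bytes(*blocks: bytes) -> bytes:
    """XOR equally sized byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def checksum(columns) -> str:
    """Checksum over the whole logical block (all data columns joined)."""
    return hashlib.sha256(b"".join(columns)).hexdigest()


def write_block(columns):
    """Return (data columns, parity column, block checksum)."""
    return list(columns), xor_bytes(*columns), checksum(columns)


def read_block(columns, parity, expected):
    """Return (good columns, index of the repaired disk or None)."""
    if checksum(columns) == expected:
        return list(columns), None  # data disks returned the right answer
    # Checksum mismatch: assume each disk in turn is the bad one,
    # rebuild its column from parity, and re-check the checksum.
    for i in range(len(columns)):
        others = [c for j, c in enumerate(columns) if j != i]
        candidate = list(columns)
        candidate[i] = xor_bytes(parity, *others)
        if checksum(candidate) == expected:
            return candidate, i
    raise IOError("block is damaged beyond what one parity column can repair")


data = [b"AAAA", b"BBBB", b"CCCC"]
cols, parity, expected = write_block(data)

damaged = [b"AAAA", b"BxBB", b"CCCC"]  # silent corruption on "disk" 1
healed, bad_disk = read_block(damaged, parity, expected)
```

In this sketch the checksum is what identifies which disk lied; parity alone could only say that something is inconsistent.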

The question: So, if at the file system level there is that level of bit-rot and/or sector failure and/or drive level protection, is it really needed to also then perform PBS level data verification?

Dunuin · Nov 20, 2023

arubenstein said:
The question: So, if at the file system level there is that level of bit-rot and/or sector failure and/or drive level protection, is it really needed to also then perform PBS level data verification?

Not needed for checking the data integrity of your chunk files. But a scrub will only check that those files did not get corrupted. It can't know if a whole file is missing for some reason. There are cases where backup snapshots won't work anymore because a virus scanner quarantined some chunk files, or because atime wasn't enabled and the GC deleted too many chunk files. To be protected against such cases you would still need to run verify jobs in PBS.
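A toy model of that difference, as a rough sketch only: the chunk store is a plain dict keyed by SHA-256 digest, and `scrub_like` / `verify_like` are hypothetical stand-ins, not PBS or ZFS functions. A real scrub works on block checksums inside the pool rather than on file names, but the blind spot is the same: it can only check data that is still there.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Toy chunk store: chunk name (its digest) -> stored bytes.
chunks = [b"chunk-a", b"chunk-b", b"chunk-c"]
store = {digest(c): c for c in chunks}

# Toy snapshot index: the ordered chunk digests that make up one backup.
snapshot_index = [digest(c) for c in chunks]


def scrub_like(store):
    """Check only what exists: return stored chunks whose content is corrupt."""
    return [name for name, data in store.items() if digest(data) != name]


def verify_like(index, store):
    """Walk the snapshot index: return (missing, corrupt) referenced chunks."""
    missing = [d for d in index if d not in store]
    corrupt = [d for d in index if d in store and digest(store[d]) != d]
    return missing, corrupt


# Simulate a chunk that was quarantined or wrongly garbage-collected.
lost = snapshot_index[1]
del store[lost]

scrub_result = scrub_like(store)                        # nothing to complain about
missing, corrupt = verify_like(snapshot_index, store)   # finds the hole
```

The scrub-style pass reports no problems because every remaining chunk is intact, while the verify-style pass flags the snapshot as broken because it starts from what the backup needs.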
