PBS integrity check: is the hash computed on PVE or on PBS?

Kurgan

I'm looking for some information I cannot find. When a backup is made to PBS, I understand that all the blocks have a hash that allows integrity checks to be done later, and they indeed are done on the PBS host multiple times (there is a schedule for that).

But the part I'm missing is this: is this hash calculated for the first time on the PVE host, while the blocks are being created and sent to the PBS host, or is it calculated for the first time on the PBS host, once the data has already been written to its storage?

There is quite a difference between these two approaches: the first protects against corruption that happens after the data leaves the PVE host, so in the network path (defective NICs, etc.) or on the PBS host (RAM failure, controller failure, etc.). The second protects only against data rot AFTER the backup has been made and saved to PBS storage, so it does not protect against network issues or memory issues on the PBS host.
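To make the difference concrete, here is a rough Python sketch of the two scenarios (the function names are mine, purely illustrative):

```python
import hashlib

def client_side_hash(data: bytes, send) -> None:
    # Scenario 1: the digest is computed before the data leaves the PVE host.
    # Corruption on a NIC, on the wire, or in PBS RAM/storage afterwards makes
    # the stored bytes disagree with this digest, so a later verify catches it.
    digest = hashlib.sha256(data).hexdigest()
    send(data, digest)

def server_side_hash(received: bytes) -> str:
    # Scenario 2: the digest is computed only on PBS, after the data arrived.
    # Corruption that happened in transit is already baked into this
    # "reference" checksum, so no later verify can detect it.
    return hashlib.sha256(received).hexdigest()
```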

I'm asking this because I'd like to understand where the points of failure are in the integrity check system.

Thanks
 
But the part I'm missing is this: is this hash calculated for the first time on the PVE host, while the blocks are being created and sent to the PBS host, or is it calculated for the first time on the PBS host, once the data has already been written to its storage?
It is (first) calculated on PVE.

Then this checksum is transmitted to PBS - only the checksum, no data.

If a chunk with that checksum already exists on PBS then some references are updated to reflect this new backup.

Only if that checksum is not already present on PBS is the data transferred too.
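Roughly sketched in Python (the class and method names here are hypothetical stand-ins, not the actual PBS API):

```python
import hashlib

class FakeServer:
    """Stand-in for PBS, for illustration only."""
    def __init__(self):
        self.store = {}   # digest -> chunk data
        self.refs = {}    # digest -> reference count

    def has_chunk(self, digest):
        return digest in self.store

    def add_reference(self, digest):
        self.refs[digest] = self.refs.get(digest, 0) + 1

    def upload_chunk(self, digest, data):
        self.store[digest] = data
        self.add_reference(digest)

def backup(chunks, server):
    for data in chunks:
        # The digest is computed on the client (PVE) before anything is sent.
        digest = hashlib.sha256(data).hexdigest()
        if server.has_chunk(digest):
            server.add_reference(digest)       # dedup hit: only the digest crossed the wire
        else:
            server.upload_chunk(digest, data)  # dedup miss: the data is transferred too

server = FakeServer()
backup([b"chunk-a", b"chunk-b", b"chunk-a"], server)
print(len(server.store), "chunks stored,", sum(server.refs.values()), "references")  # 2 / 3
```

This is also why incremental backups are so fast: for unchanged data, only digests travel over the network.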


A "verify" running on PBS would read the actual, physical data of a chunk from disk, (re-) calculate the checksum and compares it with the already known checksum from the moment in time the backup was done. Of course both checksum should be identical...


Staff, did I miss a step? There was some additional calculation...?
 
Thanks a lot for the explanation, it's very clear, and I like the fact that this protects most of the backup chain, hardware and software alike. It means that if the PVE host does not corrupt the data, then any further corruption, even at the moment the backup is first written to PBS, will not go undetected.
 
There are actually multiple checksums.

Chunks contain a CRC (verified on each read of the chunk) and have a digest (which the chunk is referenced by). Both are calculated on the client side. The CRC is verified by the server upon upload; the digest can only be verified for non-encrypted chunks, since calculating the digest requires the plaintext data, which the server doesn't have in the case of encrypted chunks. Verification will check both for plaintext chunks, and just the CRC for encrypted chunks. The client will verify both CRC and digest whenever it downloads and parses a chunk (such as during a restore).
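Schematically, the two layers look like this (a rough sketch, not the real chunk file format; compression is ignored and the encryption callback is a placeholder):

```python
import hashlib
import zlib

def make_chunk(plain: bytes, encrypt=None):
    # Digest: computed over the plaintext -- this is what the chunk is
    # referenced by, and what the server cannot recompute for encrypted chunks.
    digest = hashlib.sha256(plain).hexdigest()
    # Payload: what actually lands on disk (possibly encrypted).
    payload = encrypt(plain) if encrypt else plain
    # CRC: computed over the stored payload -- the server can always check it.
    crc = zlib.crc32(payload)
    return digest, crc, payload

def server_check(digest, crc, payload, encrypted: bool) -> bool:
    if zlib.crc32(payload) != crc:
        return False                 # the CRC check is possible for every chunk
    if not encrypted:
        return hashlib.sha256(payload).hexdigest() == digest
    return True                      # digest not recomputable without the plaintext
```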

Blobs contain a CRC (verified on each read of the blob) and are referenced by a digest.

Indices reference chunks by digest, and themselves also have a checksum. When creating a new backup snapshot, the index is constructed in parallel by the client and the server, and the respective checksums are compared when the index writer is closed, to ensure they match.
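A sketch of that idea (hypothetical class, not the actual implementation):

```python
import hashlib

class IndexWriter:
    """Client and server each run one of these over the same digest stream."""
    def __init__(self):
        self.csum = hashlib.sha256()

    def append(self, chunk_digest: str):
        self.csum.update(bytes.fromhex(chunk_digest))

    def close(self) -> str:
        return self.csum.hexdigest()

client, server = IndexWriter(), IndexWriter()
for d in (hashlib.sha256(c).hexdigest() for c in (b"a", b"b")):
    client.append(d)
    server.append(d)   # on the real server this happens as registrations arrive
assert client.close() == server.close()  # compared when the writer is closed
```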

the manifest ("index.json") references indices and blobs by their checksums, and itself optionally has a signature that protects most of its contents.