Hey there,
I'm wondering how Proxmox Backup Server deals with hash collisions?
Yes, I know, there's the technical documentation, but the way I read it, the documentation says "it doesn't", followed by some number crunching about winning the lottery several times.
However, if a 4 MB chunk is mapped to a 32-byte hash, then even with uniformly distributed hashes it follows from the pigeonhole principle that many different versions of that chunk must map to the same hash value; otherwise the chunk could only hold 32 bytes of information.
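To put a number on that counting argument, here is a quick back-of-the-envelope sketch. It assumes fixed-size chunks and reads "4 MB" as 4 MiB; the function name and constants are mine, not anything from Proxmox Backup Server.

```python
def log2_avg_preimages(chunk_bytes: int, hash_bytes: int) -> int:
    """Return log2 of the average number of distinct chunks per hash value."""
    chunk_bits = chunk_bytes * 8
    hash_bits = hash_bytes * 8
    # There are 2**chunk_bits possible chunks and 2**hash_bits possible hashes,
    # so on average 2**(chunk_bits - hash_bits) chunks share each hash value.
    return chunk_bits - hash_bits


CHUNK_BYTES = 4 * 1024 * 1024  # assumption: "4 MB" read as 4 MiB
HASH_BYTES = 32                # 32-byte (256-bit) hash

exponent = log2_avg_preimages(CHUNK_BYTES, HASH_BYTES)
print(f"about 2**{exponent} distinct chunks per hash value on average")
```

This only says that colliding chunks exist in principle. It says nothing about how likely it is to actually hit one.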
What's even worse is that unless you compare the actual data, verifying the backup wouldn't reveal the problem, because the affected data would map to the same hash value, indicating that everything is A-OK. So you'll only find out whether you won the data-loss lottery when you need to do a restore, and even then only once you run into the corrupted file.
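Here is a toy sketch of what I mean by comparing the data. It is a hypothetical in-memory chunk store, not how Proxmox Backup Server is implemented, and it uses a deliberately weak one-byte digest (`weak_digest`, my own helper) so that collisions are easy to trigger. With the comparison turned off, colliding chunks are silently dropped and a hash-only verify still passes.

```python
import hashlib


class CollisionError(Exception):
    """Raised when two different chunks produce the same digest."""


class ChunkStore:
    """Hypothetical deduplicating store keyed by chunk digest."""

    def __init__(self, digest_fn, compare_on_hit=True):
        self._digest_fn = digest_fn
        self._compare_on_hit = compare_on_hit
        self._chunks = {}

    def __len__(self):
        return len(self._chunks)

    def insert(self, data: bytes) -> bytes:
        digest = self._digest_fn(data)
        existing = self._chunks.get(digest)
        if existing is None:
            self._chunks[digest] = data
        elif self._compare_on_hit and existing != data:
            # Same digest, different bytes: a real collision.
            raise CollisionError(digest.hex())
        # Otherwise the chunk is treated as a duplicate and not stored again.
        return digest

    def verify(self) -> bool:
        # Hash-only verification: re-hash each stored chunk and compare to its key.
        return all(self._digest_fn(d) == k for k, d in self._chunks.items())


def weak_digest(data: bytes) -> bytes:
    # Assumption for the demo only: keep just the first byte of SHA-256,
    # so there are only 256 possible digests.
    return hashlib.sha256(data).digest()[:1]


chunks = [f"chunk-{i}".encode() for i in range(300)]

# Store that compares bytes on a digest hit: the collision is detected.
careful = ChunkStore(weak_digest, compare_on_hit=True)
detected = False
for chunk in chunks:
    try:
        careful.insert(chunk)
    except CollisionError:
        detected = True
        break

# Store that trusts the digest: chunks are lost, yet verify() reports success.
naive = ChunkStore(weak_digest, compare_on_hit=False)
for chunk in chunks:
    naive.insert(chunk)

print("collision detected by careful store:", detected)
print("naive store kept", len(naive), "of", len(chunks), "chunks; verify():", naive.verify())
```

The price of the careful variant is reading the existing chunk back on every deduplication hit, which is presumably why one might skip it.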
Yes, it's true, collisions aren't likely, but there is a lot of data to back up, and a lot of chunks get generated. And just like in real life, every once in a while someone wins the lottery, except this time it's not the grand prize you're getting.
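For reference, the "lots of chunks" effect can be estimated with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n chunks and a b-bit hash. The sketch below assumes an ideal, uniformly distributed 256-bit hash; the chunk counts are example values I picked, not figures from the documentation.

```python
import math


def collision_probability(num_chunks: int, hash_bits: int) -> float:
    """Birthday approximation for at least one collision among num_chunks hashes."""
    # Expected number of colliding pairs: n * (n - 1) / 2 pairs, each with
    # probability 2**-hash_bits of colliding.
    x = num_chunks * (num_chunks - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-x)


HASH_BITS = 256  # 32-byte hash

# Example chunk counts (assumptions): 2**40 chunks of 4 MiB is about 4 EiB of data.
for exponent in (30, 40, 50):
    p = collision_probability(2 ** exponent, HASH_BITS)
    print(f"2**{exponent} chunks -> p ~ {p:.3e}")
```

Even at 2**40 chunks the estimate comes out on the order of 10^-54 under these assumptions, so the odds are tiny, but my point stands that they are not zero and a hash-only check would not notice.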