Deduplication factor depending on what?

Adrigrou

Dec 15, 2023
Hi!

I have some specific questions.

I am testing out PBS for a specific workflow consisting of backing up a Debian-hosted NFS server with nearly 1 PB of data in total.

I want to know whether I will have to invest a little or a lot of money for that.

What I see is that I only achieve a deduplication factor of around 1.2 to 1.3 for about 1 TB of data (no incremental backups).

1) Will my deduplication factor increase as the volume of backed-up data increases? I am not talking about incremental backups here (i.e., is there a better probability for a chunk to already exist in multiple places?)

2) Is there some kind of mathematical rule for the dedup factor as a function of the volume of data backed up, assuming no CPU or RAM constraints are taken into account? (Again, for different sets of data, no incremental backups.)

3) Does the filesystem have an impact on my deduplication factor? I would say no, since PBS does not use the ZFS deduplication feature, but you never know.

4) What reasonable deduplication factor can I hope for with 1 PB of data in a similar file format, mostly .EXR?
 
Hi,
the deduplication factor depends mostly on the files that are being saved. I'm not especially familiar with the EXR format, but from what I have read, it consists mostly of image data that has already been losslessly compressed. Usually you would want the data to be deduplicated before it is compressed, but when dealing with predefined file formats, that's going to be difficult.

To answer your questions in order:
  1. Probably, but with those initial numbers, not by much.
  2. Roughly: the likelihood of your file type containing identical chunks, multiplied by the amount of stored data (see the sketch below this list).
  3. No, you won't get any benefit from deduplicating the data again. You may profit from ZFS compression, but that is enabled by default. You can look up the compression ratio by running zfs get compressratio.
  4. Again, no experience with EXR on my side.
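
To put a rough number on point 2: there is no exact formula, but a back-of-the-envelope model treats the dedup factor as logical data divided by the unique chunk data actually stored. The duplicate-chunk probability below is an assumption you would have to estimate for your own data, not something PBS reports.

Code:
# Back-of-the-envelope estimate of the deduplication factor.
# Assumption: p_dup is the fraction of chunks that are identical to a
# chunk already stored in the datastore (you have to estimate this for
# your own data; PBS does not report it directly).

def estimated_dedup_factor(logical_tb: float, p_dup: float) -> float:
    """Dedup factor = logical data / unique data actually stored."""
    unique_tb = logical_tb * (1.0 - p_dup)
    return logical_tb / unique_tb

# Example: if ~20% of chunks are duplicates, the factor is 1.25,
# which is in the 1.2-1.3 range observed above for 1 TB.
print(estimated_dedup_factor(1000.0, 0.20))  # -> 1.25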
 
Just to add: the deduplication advantage of PBS mostly comes from the second and following backups not using additional space when the chunks already exist in the datastore, not just from its ability to deduplicate data within a single backup.
Say you back up your 1 PB and it uses about 700 TB in the datastore. If the source data doesn't change much, a second backup will reuse most of the initial 700 TB and add only, say, 30 TB to the datastore.
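
As a rough illustration of that effect (the 700 TB and 30 TB figures are just the example numbers from above, not measurements), the overall dedup factor across several backup runs could be sketched like this:

Code:
# Rough illustration of cross-backup deduplication in PBS.
# All numbers are the example values from the post above, not measurements.

first_backup_tb = 700.0       # space used in the datastore by the first backup
new_data_per_run_tb = 30.0    # new chunks added by each following backup
logical_tb_per_run = 1000.0   # ~1 PB of source data per backup run
runs = 10                     # number of backup snapshots kept

stored_tb = first_backup_tb + (runs - 1) * new_data_per_run_tb
logical_tb = runs * logical_tb_per_run

print(f"stored:       {stored_tb:.0f} TB")              # 970 TB
print(f"logical:      {logical_tb:.0f} TB")             # 10000 TB
print(f"dedup factor: {logical_tb / stored_tb:.1f}")    # ~10.3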
 
I think a problem with the deduplication could be that block devices and big files are stored as 4 MiB chunks, so if even a single byte within such a chunk differs, the whole chunk cannot be deduplicated.
There is not the fine granularity of, for example, ZFS, which can deduplicate blocks at an 8K level.
So it's probably not great for deduplicating small repeating data such as the entries of a DB.
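
A tiny sketch of why that matters, assuming the simple case of fixed-size chunks addressed by their SHA-256 digest (which is how PBS identifies chunks): flipping a single byte changes the digest, so the whole 4 MiB chunk has to be stored again.

Code:
import hashlib
import os

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB fixed chunk size for block devices

# Build one 4 MiB chunk and a copy that differs in a single byte.
chunk = bytearray(os.urandom(CHUNK_SIZE))
modified = bytearray(chunk)
modified[123456] ^= 0x01  # flip one bit in one byte

# Chunks are identified by their digest; any difference means a new chunk.
print(hashlib.sha256(chunk).hexdigest())
print(hashlib.sha256(modified).hexdigest())
# The digests differ, so the modified chunk cannot be deduplicated against
# the original and another full 4 MiB is written to the datastore.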
 
