Deduplication factor depending on what?

Adrigrou

Dec 15, 2023
Hi!

I have some specific questions.

I am testing out PBS for a specific workflow consisting of backing up a Debian-hosted NFS server with nearly 1 PB of data in total.

I want to know whether I will have to invest a little or a lot of money for that.

What I see is that I only achieve a deduplication factor of around 1.2 to 1.3 for about 1 TB of data (no incremental backups).

1) Will my deduplication factor increase as the volume of backed-up data increases? I am not talking about incremental backups here (i.e., is there a better probability for a chunk to already exist in multiple places?)

2) Is there some kind of mathematical rule for the dedup factor as a function of the volume of data backed up, assuming no CPU or RAM constraints are taken into account? (Again, for different sets of data, no incremental backups.)

3) Does the filesystem have an impact on my deduplication factor? I would say no, since PBS does not use the ZFS deduplication feature, but you never know.

4) What reasonable deduplication factor can I hope for with 1 PB of data in a similar file format, mostly .EXR?
 
Hi,
the deduplication factor depends mostly on the files that are being saved. I'm not especially familiar with the EXR format, but from what I have read, it consists mostly of image data that has already been losslessly compressed. Usually you would want the data to be deduplicated before it is compressed, but when dealing with predefined file formats, that's going to be difficult.

To answer your questions in order:
  1. Probably, but with those initial numbers, not by much.
  2. Roughly: the likelihood of your file type containing identical chunks, multiplied by the amount of stored data (see the sketch below this list).
  3. No, you won't get any benefit from deduplicating the data again. You may profit from ZFS compression, but that is enabled by default. You can look up the compression ratio by running zfs get compressratio.
  4. Again, no experience with EXR on my side.
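
To put a rough number on point 2: there is no exact formula, but a back-of-the-envelope model treats the dedup factor as logical data divided by the unique chunk data actually stored. The duplicate-chunk probability below is an assumption you would have to estimate for your own data, not something PBS reports.

Code:
# Back-of-the-envelope estimate of the deduplication factor.
# Assumption: p_dup is the fraction of chunks that are identical to a
# chunk already stored in the datastore (you have to estimate this for
# your own data; PBS does not report it directly).

def estimated_dedup_factor(logical_tb: float, p_dup: float) -> float:
    """Dedup factor = logical data / unique data actually stored."""
    unique_tb = logical_tb * (1.0 - p_dup)
    return logical_tb / unique_tb

# Example: if ~20% of chunks are duplicates, the factor is 1.25,
# which is in the 1.2-1.3 range observed above for 1 TB.
print(estimated_dedup_factor(1000.0, 0.20))  # -> 1.25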
 
Just to add: the deduplication advantage of PBS mostly comes from the second and following backups not using additional space when the chunks already exist in the datastore, not just from its ability to deduplicate data within a single backup.
Say you back up your 1 PB and it uses about 700 TB in the datastore. If the source data doesn't change much, a second backup will reuse most of the initial 700 TB and add only, say, 30 TB to the datastore.
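
As a rough illustration of that effect (the 700 TB and 30 TB figures are just the example numbers from above, not measurements), the overall dedup factor across several backup runs could be sketched like this:

Code:
# Rough illustration of cross-backup deduplication in PBS.
# All numbers are the example values from the post above, not measurements.

first_backup_tb = 700.0       # space used in the datastore by the first backup
new_data_per_run_tb = 30.0    # new chunks added by each following backup
logical_tb_per_run = 1000.0   # ~1 PB of source data per backup run
runs = 10                     # number of backup snapshots kept

stored_tb = first_backup_tb + (runs - 1) * new_data_per_run_tb
logical_tb = runs * logical_tb_per_run

print(f"stored:       {stored_tb:.0f} TB")              # 970 TB
print(f"logical:      {logical_tb:.0f} TB")             # 10000 TB
print(f"dedup factor: {logical_tb / stored_tb:.1f}")    # ~10.3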
 
I think a problem with the deduplication could be that block devices and big files are stored as 4 MiB chunks, so if even a single byte within such a chunk differs, the whole chunk cannot be deduplicated.
There is not the fine granularity of, for example, ZFS, which can deduplicate blocks at an 8K level.
So it's probably not great for deduplicating small repeating data such as the entries of a DB.
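
A tiny sketch of why that matters, assuming the simple case of fixed-size chunks addressed by their SHA-256 digest (which is how PBS identifies chunks): flipping a single byte changes the digest, so the whole 4 MiB chunk has to be stored again.

Code:
import hashlib
import os

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB fixed chunk size for block devices

# Build one 4 MiB chunk and a copy that differs in a single byte.
chunk = bytearray(os.urandom(CHUNK_SIZE))
modified = bytearray(chunk)
modified[123456] ^= 0x01  # flip one bit in one byte

# Chunks are identified by their digest; any difference means a new chunk.
print(hashlib.sha256(chunk).hexdigest())
print(hashlib.sha256(modified).hexdigest())
# The digests differ, so the modified chunk cannot be deduplicated against
# the original and another full 4 MiB is written to the datastore.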
 
