Deduplication question

tsumaru720

May 1, 2016
Hello

I've set up a couple of new LXC containers recently, all using the same base OS and they have all been initialised with similar packages.

However, I noticed that during the first backup of my most recent container, the output seemed to suggest that it had to back up all of it and didn't reuse any existing chunks. How does the dedupe work in this case?

I'd have expected it to find some matching chunks from the existing containers of the same OS/version with similar packages. I know the dedupe works well for further snapshots of the same container, but I was confused by the initial backup state reported through PVE.
 
Hi,
the PBS client gets the index of the previous snapshot and does not re-upload chunks already present in that index file. This is the incremental nature of the PBS client.
However, if chunks are not present in the previous index file or if there is no previous index, these chunks will have to be uploaded to the server.

On the server side, however, these chunks can be deduplicated in the datastore, because an identical hash means identical content.
Please check the deduplication factor of your datastore in the PBS WebUI. This tells you how well data chunks could be re-referenced.
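
To illustrate the two levels described above, here is a minimal Python sketch. It is not PBS code: the `Datastore` class, the `backup` function and the tiny 4-byte chunk size are assumptions made up for illustration. It only models the idea that the client skips chunks listed in the previous index, while the server stores each distinct chunk once, keyed by its hash.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical, tiny chunk size so the example stays readable


class Datastore:
    """Toy content-addressed store: one entry per distinct chunk digest."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def upload(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        # An already known digest is not stored a second time.
        self.chunks.setdefault(digest, chunk)
        return digest


def backup(store, data: bytes, previous_index=None):
    """Return (new index, number of chunks sent over the wire)."""
    known = set(previous_index or [])
    index, sent = [], 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in known:
            store.upload(chunk)  # the client cannot skip this upload
            known.add(digest)
            sent += 1
        index.append(digest)
    return index, sent


store = Datastore()
ct_a = b"AAAABBBBCCCCDDDD"
ct_b = b"AAAABBBBCCCCEEEE"  # second container, mostly identical content

index_a, sent_a = backup(store, ct_a)  # first backup, no previous index
index_b, sent_b = backup(store, ct_b)  # new container, no previous index
index_a2, sent_a2 = backup(store, b"AAAABBBBCCCCFFFF", previous_index=index_a)
print(sent_a, sent_b, sent_a2, len(store.chunks))  # 4 4 1 6
```

In this model the second container uploads all four of its chunks, but the store grows by only one entry, which matches the "uploaded in full, deduplicated on write" behaviour described above.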
 
Oh OK, so even though the client reported that it had to back up the entire container, that's just the upload to the PBS server, and it was likely deduplicated as it was written? I just don't get visibility of how much dedupe happened because there's no previous index for this CT, I guess?
 
Yes, the client does not know about the chunks already in the datastore or about the deduplication happening there.
 
However, if chunks are not present in the previous index file or if there is no previous index, these chunks will have to be uploaded to the server.

Problem example:
Clone 100x100GB VMs (identical or near-identical)
Make a backup of all of them via PBS
It would keep uploading the same chunks over and over, even though they already exist in PBS.

Expected:
Creating backups wouldn't have to upload chunks over and over again when PBS already has them.

Possible solution:
restic or borg seem to solve this better? I believe they use a local cache of all chunk hashes or something like that. This way, 100 backups would be done in no time, as nothing has to be re-uploaded 100 times.

I'm well aware that this might not be the default/main usage of PBS, but I still suggest something like that :P
 
It would keep uploading the same chunks over and over, even though they already exist in PBS.
Did you actually verify this behavior?

If yes: please report over there: https://bugzilla.proxmox.com/enter_bug.cgi

In my understanding it will upload it only once, of course. But I did not verify that!

Btw: do you have a question? (There is none in your post...)

PS: resurrecting a three-year-old thread (without a specific reason) is usually not recommended; it is better to create a new one ;-)
 
And which underlying filesystem are you using?
Does it support the flags that Proxmox Backup Server uses and needs?
 
Did you actually verify this behavior?
Yes, I've verified it while creating backups via proxmox-backup-client (I'm backing up plain files, not LXC containers or VMs). I'm not sure whether LXC/VM backups behave the same way; I can only assume they do.


Btw: do you have a question? (There is none in your post...)
No, it's just a suggestion (which would benefit users who still have to use a separate/secondary backup solution (restic/borg) besides PBS).


And which underlying filesystem are you using?
Not sure whether the question is about a) the PBS server or b) the computer where I'm running proxmox-backup-client.
Either way, it's ZFS.


Does it support the flags
Not sure which flags you mean.
 
Just as a side note: I have several (single digit) PBS instances. All of them report a "Deduplication Factor" of 24 to 34. They do so because of my specific use case and scheduling. I did not verify that in detail, but it "feels" as if "storing each chunk only once" does actually work. (I did check the actual space used, for example.)
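
As a rough illustration of what such a factor expresses, here is a small Python sketch. It assumes the factor is the ratio of the data referenced by all snapshot indexes to the chunk data actually stored on disk; the function name and the numbers are made up for the example and are not taken from a real datastore.

```python
def deduplication_factor(referenced_bytes: int, on_disk_bytes: int) -> float:
    """Ratio of data referenced by all snapshots to data actually stored.

    Assumed definition for illustration; not copied from PBS.
    """
    if on_disk_bytes <= 0:
        raise ValueError("on_disk_bytes must be positive")
    return referenced_bytes / on_disk_bytes


GIB = 2**30
# Hypothetical datastore: 30 snapshots referencing 100 GiB each,
# with 120 GiB of distinct chunk data on disk.
factor = deduplication_factor(30 * 100 * GIB, 120 * GIB)
print(factor)  # 25.0
```

Under that assumption, a factor in the 24 to 34 range is what you would expect from many snapshots of slowly changing data.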

You may have found a bug. But you need to document it in a way that we (or better: the developers) can reproduce and verify. This includes the commands you executed and the log files / task logs.

Thank you in advance.
 
"storing each chunk only once" does actually work.
I agree it does. But my suggestion was about the slow initial backup, which sends chunks to PBS that are then not stored again because they already exist there.
I will try to give better steps once this becomes a slightly bigger issue for me or for others like the thread OP.
 
the behaviour is intentional and a security feature - each newly created snapshot is only allowed to re-use chunks of the previous snapshot, for other chunks it has to "prove" that it actually has that chunk by uploading it. PBS (as opposed to borg and restic) has a multi-user architecture, a datastore is shared by many users.

if this is an issue for you (e.g., your example with the cloned VMs), you can "preseed" the group on the PBS side with an initial snapshot (for example, with a sync job/pull task)
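
The rule described above can be sketched roughly as follows in Python. This is a toy model, not the PBS implementation: `SharedDatastore`, `write_snapshot`, `preseed` and the `("ref", digest)` / `("data", bytes)` item format are all hypothetical names invented for the example.

```python
import hashlib


class SharedDatastore:
    """Toy model: chunks are shared, but reuse is only allowed per group."""

    def __init__(self):
        self.chunks = {}      # digest -> bytes, shared by all groups
        self.last_index = {}  # group -> digests of its latest snapshot

    def write_snapshot(self, group, items):
        """items: ("ref", digest) to reuse a chunk, ("data", bytes) to upload."""
        allowed = self.last_index.get(group, set())
        index = []
        for kind, payload in items:
            if kind == "ref":
                # Reuse without upload only if the group's previous
                # snapshot already referenced this chunk.
                if payload not in allowed:
                    raise PermissionError("chunk must be uploaded as proof")
                digest = payload
            else:
                digest = hashlib.sha256(payload).hexdigest()
                self.chunks.setdefault(digest, payload)  # stored only once
            index.append(digest)
        self.last_index[group] = set(index)
        return index

    def preseed(self, group, index):
        """Stand-in for placing an initial snapshot into a group server-side."""
        if any(d not in self.chunks for d in index):
            raise KeyError("preseed index references unknown chunks")
        self.last_index[group] = set(index)


store = SharedDatastore()
base = b"base OS image chunk"
digest = hashlib.sha256(base).hexdigest()
store.write_snapshot("vm/100", [("data", base)])

try:
    store.write_snapshot("vm/101", [("ref", digest)])  # new group, no proof
    rejected = False
except PermissionError:
    rejected = True

store.preseed("vm/101", [digest])  # e.g. an admin-side copy of a snapshot
index_101 = store.write_snapshot("vm/101", [("ref", digest)])
print(rejected, len(store.chunks))  # True 1
```

The point of the model is that knowing a digest is not enough to get a chunk referenced in your own group; you either upload the data or inherit the reference from a previous snapshot in the same group.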
 
Thank you @fabian, very insightful reply.

a datastore is shared by many users.
Interesting. My understanding was that it's better to have separate datastores for different users, since that would put less CPU strain on the Proxmox server and make backups faster. For example, 10x1TB datastores instead of 1x10TB, so that I could create one datastore for Windows VMs, one for Linux VMs, one for photos, etc., where the data doesn't mix. Or is my assumption wrong, and one big monolithic datastore wouldn't slow down backups?

If my datastore idea is sound, then such a "proof" isn't needed? Maybe it could be made possible to disable the mechanism that requires me to "prove" that I really have those chunks.
What do you think?
 
backups will not be faster if you have multiple datastores. some tasks will of course finish quicker if their scope is reduced (e.g., a garbage collection or datastore-wide verification will be faster if a datastore is smaller. they might also run a bit faster on ten datastores in parallel if your current bottleneck is lack of concurrency, but that is not the case for most setups - and we are adding tuning knobs to allow more parallelism for those as well).
 
backups will not be faster if you have multiple datastores.
I come from a restic background. There, splitting one big monolithic repository (3TB, for example) into smaller ones (1TB + 100GB + 600GB, etc.) was always faster. Basically, the smaller the repository, the faster it backed up (sometimes as much as 20x faster), because it had to burn fewer CPU cycles or something like that during backup.

I will soon verify PBS behavior. Interesting
 
restic or borg seem to solve this better? I believe they use a local cache of all chunk hashes or something like that. This way, 100 backups would be done in no time, as nothing has to be re-uploaded 100 times.
This will take a toll on the client, which would have to keep all chunk hashes locally. There is always a trade-off somewhere in the pipeline: it is either CPU, disk space, or networking. The default PBS behaviour is the same as in, for example, Commvault, TSM and DP, three enterprise backup solutions I have worked with in the past.
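
To put a rough number on the client-side cost, here is a back-of-the-envelope sketch in Python. The 32-byte digest corresponds to SHA-256; the 4 MiB and 64 KiB chunk sizes and the 10 TiB datastore are assumptions chosen for illustration, and a real cache would need extra space for lookup structures and metadata on top of the raw digests.

```python
def hash_cache_bytes(data_bytes: int,
                     avg_chunk_bytes: int = 4 * 2**20,
                     digest_bytes: int = 32) -> int:
    """Raw size of a list of chunk digests covering data_bytes of data."""
    chunks = -(-data_bytes // avg_chunk_bytes)  # ceiling division
    return chunks * digest_bytes


TIB = 2**40
large_chunks = hash_cache_bytes(10 * TIB)              # 4 MiB chunks
small_chunks = hash_cache_bytes(10 * TIB, 64 * 2**10)  # 64 KiB chunks
print(large_chunks // 2**20, "MiB")  # 80 MiB
print(small_chunks // 2**30, "GiB")  # 5 GiB
```

So the raw digest list stays small for large chunks but grows quickly as chunks get smaller, and the client would also have to keep it in sync with everything other users write to the same datastore.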