Deduplication question

tsumaru720

Renowned Member
May 1, 2016
Hello

I've set up a couple of new LXC containers recently, all using the same base OS and they have all been initialised with similar packages.

I noticed, however, that during the first backup of my most recent container, the output seemed to suggest that it had to back up all of it and didn't reuse any existing chunks. How does the dedupe work in this case?

I'd have expected it to find some matching chunks from the existing containers of the same OS/version with similar packages. I know the dedupe works well for further snapshots of the same container, but I was confused by the initial backup state reported through PVE.
 
Hi,
the PBS client gets the index of the previous snapshot and does not re-upload chunks already present in that index file. This is the incremental nature of the PBS client.
However, if chunks are not present in the previous index file or if there is no previous index, these chunks will have to be uploaded to the server.

On the server side, however, these chunks can be de-duplicated in the datastore, since an identical hash means the content is identical as well.
Please check the deduplication factor of your datastore in the PBS WebUI. This tells you how well data chunks could be re-referenced.
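To illustrate the difference between what the client uploads and what the datastore actually stores, here is a minimal Python sketch. It is a toy model, not PBS code: the `Datastore` class, the `backup` function and the sample chunk contents are all made up for illustration, and SHA-256 stands in for the chunk hash.

```python
import hashlib


def digest(chunk: bytes) -> str:
    """Content hash used as the chunk's identity (SHA-256 in this toy model)."""
    return hashlib.sha256(chunk).hexdigest()


class Datastore:
    """Hypothetical content-addressed store: at most one copy per digest."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def insert(self, chunk: bytes) -> bool:
        """Store the chunk; return True only if it was not present before."""
        d = digest(chunk)
        if d in self.chunks:
            return False  # deduplicated on the server side
        self.chunks[d] = chunk
        return True


def backup(chunks: list[bytes], previous_index: set[str], store: Datastore):
    """Return (new index, chunks uploaded, chunks newly written to the store)."""
    index, uploaded, stored = [], 0, 0
    for chunk in chunks:
        d = digest(chunk)
        index.append(d)
        if d in previous_index:
            continue  # known from the previous snapshot: no upload needed
        uploaded += 1  # the client only sees this number
        if store.insert(chunk):
            stored += 1  # only the server knows this number
    return index, uploaded, stored


store = Datastore()
base_os = [b"kernel", b"libc", b"shell"]

# First container: nothing exists yet, everything is uploaded and stored.
idx1, up1, st1 = backup(base_os + [b"app-a"], set(), store)
# Second container, same base OS, but no previous index of its own.
idx2, up2, st2 = backup(base_os + [b"app-b"], set(), store)
# Second snapshot of the second container, unchanged data.
idx3, up3, st3 = backup(base_os + [b"app-b"], set(idx2), store)

print(up1, st1)  # 4 4
print(up2, st2)  # 4 1
print(up3, st3)  # 0 0
```

In this model the second container uploads all four chunks, which is what the task log shows, yet only one new chunk lands on disk. The savings show up in the datastore statistics rather than in the client output.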
 
Oh OK, so even though the client reported that it had to back up the entire container, that's just the upload to the PBS server, and it was likely deduplicated as it was written? I just don't get visibility of how much dedupe happened because there's no previous index for this CT, I guess?
 
Oh OK, so even though the client reported that it had to back up the entire container, that's just the upload to the PBS server, and it was likely deduplicated as it was written? I just don't get visibility of how much dedupe happened because there's no previous index for this CT, I guess?
Yes, the client must not know about the chunks in the datastore and deduplication happening there.
 
However, if chunks are not present in the previous index file or if there is no previous index, these chunks will have to be uploaded to the server.

Problem example:
Clone 100 x 100 GB VMs (identical or near-identical)
Make a backup of all of them via PBS
It would keep uploading the same chunks over and over even though they already exist in PBS.

Expected:
Creating backups shouldn't have to upload chunks over and over again when PBS already has them.

Possible solution:
Restic or borg seem to solve this better somehow? I believe they use a local cache of all chunk hashes or something like that. This way 100 backups would be done in no time, as nothing has to be re-uploaded 100 times.

I'm well aware that this might not be the default/main usage of PBS, but I still suggest something like that :P
 
It would keep uploading the same chunks over and over even though they already exist in PBS.
Did you actually verify this behavior?

If yes: please report over there: https://bugzilla.proxmox.com/enter_bug.cgi

In my understanding it will upload it only once, of course. But I did not verify that!

Btw: do you have a question? (There is none in your post...)

PS: resurrecting a three-year-old thread (without a specific reason) is usually not recommended; it is better to create a new one ;-)
 
And what is the underlying filesystem you are using?
Does it support the flags that Proxmox BS uses and needs?
 
Did you actually verify this behavior?
Yes, I verified it while creating backups via proxmox-backup-client (I'm backing up files, not LXC containers or VMs). I'm not sure whether LXC/VM backups behave the same way; I can only assume they do.


Btw: do you have a question? (There is none in your post...)
No, it's just a suggestion (one that would be beneficial to users who still have to use separate/secondary backup solutions such as restic or borg besides PBS).


And what is the underlying filesystem you are using?
Not sure if the question is about a) the PBS server or b) the computer where I'm using proxmox-backup-client.
Either way, it's ZFS.


Does it support the flags
Not sure which flags?
 
Just as a side note: I have a few (single-digit) PBS instances. All of them report a "Deduplication Factor" of 24 to 34. They do so because of my specific use case and scheduling. I did not verify that in detail, but it "feels" as if "storing each chunk only once" does actually work. (I did check the actual space used, for example.)

You may have found a bug. But you need to document it in a way that we (or better: the developers) can reproduce and verify it. This includes the commands you executed and log files / task logs.

Thank you in advance.
 
"storing each chunk only once" does actually work.
I agree it does. But my suggestion was about the slow initial backup (which sends chunks to PBS that then aren't stored, because they already exist there).
I will try to give better steps once this becomes a slightly bigger issue for me or for others like the thread OP.
 
the behaviour is intentional and a security feature - each newly created snapshot is only allowed to re-use chunks of the previous snapshot, for other chunks it has to "prove" that it actually has that chunk by uploading it. PBS (as opposed to borg and restic) has a multi-user architecture, a datastore is shared by many users.

if this is an issue for you (e.g., your example with the cloned VMs), you can "preseed" the group on the PBS side with an initial snapshot (for example, with a sync job/pull task)
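The rule described above can be sketched as a small Python model. This is only an illustration under assumptions, not how PBS is implemented: `ToyServer`, its methods and the group names are hypothetical, and `preseed` merely stands in for what a sync job would achieve (a snapshot already existing in the group before the first client backup).

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


class ToyServer:
    """Hypothetical model: shared chunk store, per-group snapshot history."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}          # shared by all groups/users
        self.groups: dict[str, list[list[str]]] = {}  # group -> snapshot indexes

    def previous_index(self, group: str) -> set[str]:
        """Digests the client may reference without uploading."""
        snapshots = self.groups.get(group, [])
        return set(snapshots[-1]) if snapshots else set()

    def backup(self, group: str, data: list[bytes]) -> int:
        """Create a snapshot; return how many chunks had to be uploaded."""
        reusable = self.previous_index(group)
        uploaded, index = 0, []
        for chunk in data:
            d = digest(chunk)
            if d not in reusable:
                # Not in this group's previous snapshot: the client has to
                # "prove" possession by uploading, even if another group
                # already stored the same chunk.
                uploaded += 1
                self.chunks.setdefault(d, chunk)
            index.append(d)
        self.groups.setdefault(group, []).append(index)
        return uploaded

    def preseed(self, group: str, source_group: str) -> None:
        """Stand-in for a sync job: give `group` an initial snapshot."""
        latest = self.groups[source_group][-1]
        self.groups.setdefault(group, []).append(list(latest))


server = ToyServer()
image = [b"boot", b"os", b"data"]

first = server.backup("vm/100", image)    # 3: nothing to reuse yet
clone = server.backup("vm/101", image)    # 3: clone has no snapshot of its own
server.preseed("vm/102", "vm/100")
seeded = server.backup("vm/102", image)   # 0: previous snapshot covers it all
```

In this model the clone without a preseeded snapshot uploads every chunk again even though the shared store ends up holding only three chunks, while the preseeded group uploads nothing.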