Deduplication question

tsumaru720 · May 24, 2023

Hello

I've set up a couple of new LXC containers recently, all using the same base OS and they have all been initialised with similar packages.

I noticed, however, that during the first backup of my most recent container the output seemed to suggest that it had to back up all of it and didn't reuse any existing chunks. How does the dedupe work in this case?

I'd have expected it to find some matching chunks from the existing containers with the same OS/version and similar packages. I know the dedupe works well for further snapshots of the same container, but I was confused by the initial backup state reported through PVE.

Chris · May 25, 2023

Hi,
the PBS client gets the index of the previous snapshot and does not re-upload chunks already present in that index file. This is the incremental nature of the PBS client.
However, if chunks are not present in the previous index file or if there is no previous index, these chunks will have to be uploaded to the server.
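
As a rough illustration of the client-side behaviour described above, the following Python sketch decides which chunks get uploaded by looking only at the previous snapshot's index. The names `digest` and `plan_upload` and the toy byte-string chunks are made up for this example and are not the PBS client's actual code; the only detail taken from PBS is that chunks are identified by a SHA-256 hash of their content.

```python
import hashlib


def digest(chunk: bytes) -> str:
    """Content hash that identifies a chunk."""
    return hashlib.sha256(chunk).hexdigest()


def plan_upload(chunks, previous_index):
    """Split chunks into (reused, uploaded) digest lists.

    Only the previous snapshot's index is consulted. The client has no
    view of what else is stored in the datastore (assumption of this sketch).
    """
    reused, uploaded = [], []
    for chunk in chunks:
        d = digest(chunk)
        if d in previous_index:
            reused.append(d)    # already referenced by the previous snapshot
        else:
            uploaded.append(d)  # must be sent to the server
    return reused, uploaded


# First backup of a new container: no previous index, so everything is uploaded,
# even if another container already contributed identical chunks.
first = plan_upload([b"base-os", b"packages", b"ct2-config"], set())
print(len(first[0]), len(first[1]))
```

With an empty previous index every chunk lands in the "uploaded" list, which matches the "had to back up all of it" output of a first backup.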

On the server side, however, these chunks can be de-duplicated in the datastore, since the hash, and therefore the content, is the same.
Please check the deduplication factor of your datastore in the PBS WebUI. This tells you how well data chunks could be re-referenced.
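
To make the idea of a deduplication factor concrete, here is a minimal Python sketch. It assumes the factor is roughly the ratio between the data referenced by all snapshot indexes and the unique chunk data actually stored; the function name `dedup_factor` is hypothetical and compression is ignored, so this is not how PBS itself reports the number.

```python
import hashlib


def dedup_factor(snapshots):
    """Ratio of referenced bytes to uniquely stored bytes.

    `snapshots` is a list of snapshots, each a list of chunk byte strings.
    Assumption: identical content means identical hash, stored only once.
    """
    referenced = 0
    unique = {}
    for chunks in snapshots:
        for chunk in chunks:
            referenced += len(chunk)
            unique[hashlib.sha256(chunk).digest()] = len(chunk)
    stored = sum(unique.values())
    return referenced / stored if stored else 1.0


# Three snapshots referencing the same two chunks: stored once, referenced three times.
print(dedup_factor([[b"aaaa", b"bbbb"]] * 3))
```

A high factor therefore means that many index entries point at the same stored chunks, regardless of how much the clients had to upload.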

tsumaru720 · May 25, 2023

Oh ok, so even though the client reported that it had to back up the entire container, that's just the upload to the PBS server, and it was likely deduplicated as it was written? I just don't get visibility of how much dedupe happened because there's no previous index for this CT, I guess?

Chris · May 25, 2023

tsumaru720 said:
Oh ok, so even though the client reported that it had to back up the entire container, that's just the upload to the PBS server, and it was likely deduplicated as it was written? I just don't get visibility of how much dedupe happened because there's no previous index for this CT, I guess?

Yes, the client does not know about the chunks in the datastore or the deduplication happening there.

windowsdesxtop · May 24, 2026

Chris said:
However, if chunks are not present in the previous index file or if there is no previous index, these chunks will have to be uploaded to the server.

Problem example:
Clone 100 VMs of 100 GB each (identical or nearly identical).
Back up all of them via PBS.
It would keep uploading the same chunks over and over, even though they already exist in PBS.

Expected:
Creating backups shouldn't require uploading chunks over and over again when PBS already has them.

Possible solution:
Restic or borg seem to solve this better. I believe they use a local cache of all chunk hashes or something like that. This way 100 backups would finish in no time, as nothing has to be uploaded 100 times.

I'm well aware that this might not be the default/main use of PBS, but I still suggest something like that.

UdoB · May 24, 2026

windowsdesxtop said:
It would keep uploading the same chunks over and over, even though they already exist in PBS.

Did you actually verify this behavior?

If yes: please report over there: https://bugzilla.proxmox.com/enter_bug.cgi

In my understanding it will upload it only once, of course. But I did not verify that!

Btw: do you have a question? (There is none in your post...)

PS: resurrecting a three-year-old thread (without a specific reason) is usually not recommended. It is better to create a new one ;-)

news · May 24, 2026

And which underlying filesystem are you using?
Does it support the flags that Proxmox Backup Server uses and needs?

windowsdesxtop · May 24, 2026

UdoB said:
Did you actually verify this behavior?

Yes, I verified it while creating backups via proxmox-backup-client (I'm backing up files, not LXC containers or VMs). I'm not sure whether LXC/VM backups behave the same way; I can only assume they do.

UdoB said:
Btw: do you have a question? (There is none in your post...)

No, it's just a suggestion (one that would benefit users who still have to use a separate/secondary backup solution such as restic or borg besides PBS).

news said:
And which underlying filesystem are you using?

I'm not sure whether the question is about a) PBS or b) the computer where I'm using proxmox-backup-client.
Either way, it's ZFS.

news said:
Does it support the flags

Not sure which flags you mean.

UdoB · May 24, 2026

Just as a side note: I have some (single digit) PBS instances. All of them report a "Deduplication Factor" of 24 to 34. They do so because of my specific use case and scheduling. I did not verify that in detail, but it "feels" as if "storing each chunk only once" does actually work. (I did check the actual space used, for example.)

You may have found a bug. But you need to document it in a way that we (or better: the developers) can reproduce and verify it. This includes the commands you executed and log files / task logs.

Thank you in advance.

windowsdesxtop · May 24, 2026

UdoB said:
"storing each chunk only once" does actually work.

I agree it does. But my suggestion was about the slow initial backup, which sends chunks to PBS that are then not stored again because they already exist there.
I will try to give better steps once this becomes a slightly bigger issue for me or for others like the thread OP.

fabian · May 26, 2026

the behaviour is intentional and a security feature - each newly created snapshot is only allowed to re-use chunks of the previous snapshot, for other chunks it has to "prove" that it actually has that chunk by uploading it. PBS (as opposed to borg and restic) has a multi-user architecture, a datastore is shared by many users.

if this is an issue for you (e.g., your example with the cloned VMs), you can "preseed" the group on the PBS side with an initial snapshot (for example, with a sync job/pull task)
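
The rule described above can be sketched as a toy server-side session in Python. This is only an illustration under stated assumptions: the class `BackupSession`, its methods, and the in-memory `dict` datastore are invented for the example and do not mirror the real PBS server code. It shows why a client may reference only chunks from the previous snapshot of its own group, why other chunks must be uploaded as "proof", and how a preseeded snapshot removes the upload.

```python
import hashlib


class BackupSession:
    """Toy model of one backup run against a shared datastore."""

    def __init__(self, datastore: dict, previous_index: set):
        self.datastore = datastore          # shared by all users: digest -> chunk
        self.known = set(previous_index)    # digests this client may reuse
        self.new_index = []                 # index of the snapshot being created
        self.bytes_uploaded = 0

    def reference(self, digest: str) -> None:
        """Reuse a chunk without uploading; allowed only for known digests."""
        if digest not in self.known:
            # Knowing a hash alone must not grant access to someone else's data.
            raise PermissionError("chunk not in previous snapshot, upload required")
        self.new_index.append(digest)

    def upload(self, chunk: bytes) -> str:
        """Upload proves possession; the datastore still stores the chunk once."""
        digest = hashlib.sha256(chunk).hexdigest()
        self.bytes_uploaded += len(chunk)
        self.datastore.setdefault(digest, chunk)  # deduplicated on write
        self.known.add(digest)
        self.new_index.append(digest)
        return digest
```

In this model a session started with an empty `previous_index` has to upload everything, while a session whose group was preseeded (so `previous_index` already contains the digests) can reference them and upload nothing.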

windowsdesxtop · May 26, 2026

Thank you @fabian, very insightful reply.

fabian said:
a datastore is shared by many users.

Interesting. As I understood it, it's better to have different datastores for different users, since this puts less CPU strain on the Proxmox server and makes backups faster (for example, 10x1TB datastores vs. 1x10TB). This way I can create one datastore for Windows VMs, one for Linux VMs, one for photos, etc., where the data doesn't mix. Or is my assumption wrong, and one big monolithic datastore wouldn't slow down backups?

If my datastore idea is good, then such a "proof" isn't needed? Maybe it would be possible to disable the "prove" mechanism that checks that I really have those chunks.
What do you think?

fabian · May 26, 2026

backups will not be faster if you have multiple datastores. some tasks will of course finish quicker if their scope is reduced (e.g., a garbage collection or datastore-wide verification will be faster if a datastore is smaller). they might also run a bit faster on ten datastores in parallel if your current bottleneck is lack of concurrency, but that is not the case for most setups, and we are adding tuning knobs to allow more parallelism for those as well.

windowsdesxtop · May 26, 2026

fabian said:
backups will not be faster if you have multiple datastores.

I come from a restic background. There, splitting one big monolith (3TB, for example) into smaller repositories (1TB + 100GB + 600GB, etc.) was always faster. Basically, the smaller the store, the faster it backed up (sometimes backups were something like 20x faster), because it had to burn fewer CPU cycles or something like that during backup.

I will verify the PBS behavior soon. Interesting.

Krakish · May 26, 2026

windowsdesxtop said:
However, if chunks are not present in the previous index file or if there is no previous index, these chunks will have to be uploaded to the server.

Problem example:
Clone 100 VMs of 100 GB each (identical or nearly identical).
Back up all of them via PBS.
It would keep uploading the same chunks over and over, even though they already exist in PBS.

Expected:
Creating backups shouldn't require uploading chunks over and over again when PBS already has them.

Possible solution:
Restic or borg seem to solve this better. I believe they use a local cache of all chunk hashes or something like that. This way 100 backups would finish in no time, as nothing has to be uploaded 100 times.

I'm well aware that this might not be the default/main use of PBS, but I still suggest something like that.

This will take a toll on the client, which has to keep all chunk hashes locally. There is always a trade-off somewhere in the pipeline: it is either CPU, disk space or networking. The default PBS behaviour is the same as in, for example, Commvault, TSM and DP, three enterprise backup solutions I have worked with in the past.
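
To put a rough number on the "all chunk hashes locally" part of that trade-off, here is a back-of-the-envelope calculation in Python for the 100 VM example from the thread. It assumes 4 MiB chunks, 32-byte SHA-256 digests, and treats "100 GB" as 100 GiB; the function name `hash_cache_estimate` is hypothetical. It counts raw digests only, so any per-entry bookkeeping and the cost of keeping such a cache in sync with the server are not included.

```python
CHUNK_SIZE = 4 * 1024**2   # assumed chunk size: 4 MiB
DIGEST_SIZE = 32           # bytes per SHA-256 digest


def hash_cache_estimate(vm_count, vm_size_bytes, chunk_size=CHUNK_SIZE):
    """Return (number of digests, raw bytes needed to hold them)."""
    chunks_per_vm = -(-vm_size_bytes // chunk_size)  # ceiling division
    total_digests = vm_count * chunks_per_vm
    return total_digests, total_digests * DIGEST_SIZE


digests, raw_bytes = hash_cache_estimate(100, 100 * 1024**3)
print(digests, raw_bytes)
```

Under these assumptions the result is 2,560,000 digests, or about 78 MiB of raw hash data, if every chunk of every VM is tracked separately.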

fabian · May 27, 2026

PBS also has a way to avoid the hashing overhead via metadata-based change detection (for file based backups, i.e. container and host). but it also requires a baseline snapshot to exist as source of information of the "previous" metadata to compare against.
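
The general idea of metadata-based change detection can be sketched in Python as follows. This is a simplified illustration, not the PBS implementation: it assumes that comparing file size and modification time against a baseline is enough to decide whether a file must be re-read, and the names `file_metadata`, `changed_files` and `demo` are invented for the example.

```python
import os
import tempfile


def file_metadata(path):
    """Cheap fingerprint from stat data, without reading the file content."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def changed_files(paths, previous_metadata):
    """Paths whose content must be re-read and re-chunked.

    `previous_metadata` plays the role of the baseline snapshot; files that
    are missing from it always count as changed.
    """
    return [p for p in paths if previous_metadata.get(p) != file_metadata(p)]


def demo():
    with tempfile.TemporaryDirectory() as d:
        a, b = os.path.join(d, "a.txt"), os.path.join(d, "b.txt")
        for path in (a, b):
            with open(path, "wb") as f:
                f.write(b"data")
        baseline = {p: file_metadata(p) for p in (a, b)}
        with open(b, "ab") as f:
            f.write(b"more")  # size changes, so b.txt differs from the baseline
        with_baseline = [os.path.basename(p) for p in changed_files([a, b], baseline)]
        without_baseline = [os.path.basename(p) for p in changed_files([a, b], {})]
        return with_baseline, without_baseline
```

Calling `demo()` shows both cases: with a baseline only the modified file needs hashing, while without one every file does, which is why a baseline snapshot is required.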
