[SOLVED] Deduplication?

devilkin · Oct 21, 2020

Hi,

I'm wondering how the dedup works: is it cross-datastore, or just within the backup itself (so between backups of the same source)?

Tia

fabian · Oct 21, 2020

the deduplication domain is a datastore. note that encryption creates another layer on top, so the same data encrypted with different keys will not be deduplicated even if stored in the same datastore (since the stored encrypted data is different)

devilkin · Oct 21, 2020

Yeah, that's logical. Thanks!

turnicus · Nov 3, 2020

Hello Fabian, just a quick clarification to grasp the full potential of deduplication. Can you tell if the following statements are true or false in an unencrypted common datastore:

data is deduplicated between snapshots of the same client (between "snapshot1 from client1" and "snapshot2 from client1" for example)
data is deduplicated between snapshots of multiple clients (between "snapshot1 from client1" and "snapshot1 from client2" for example)
data is deduplicated within the same snapshot (within "snapshot1 from client1" if the client has duplicate files on their drive for example)

?

Thanks again for your amazing product!

Cookiefamily · Nov 3, 2020

turnicus said:
data is deduplicated between snapshots of the same client (between "snapshot1 from client1" and "snapshot2 from client1" for example)

data is deduplicated between snapshots of multiple clients (between "snapshot1 from client1" and "snapshot1 from client2" for example)

data is deduplicated within the same snapshot (within "snapshot1 from client1" if the client has duplicate files on their drive for example)

yes, as long as it is on the same datastore
yes, as long as it is on the same datastore
yes, as long as it is on the same datastore
Basically: Everything on the same datastore is deduplicated with other snapshots on the datastore.

turnicus · Nov 3, 2020

Wow, that's what I thought, but I was not sure. That is awesome.

fabian · Nov 3, 2020

what @Cookiefamily said is basically correct. just be aware that the deduplication is not on the file level, but on the chunk level (for fidx/VM/block device backups, the chunk size is a static 4MB; for didx/CT/file-level/pxar backups, it's dynamically calculated within certain limits). so an identical file A in two containers can end up being represented as one or more chunks of its own and be deduplicated, or end up in a chunk together with file B in one container and with file C in another, and not be deduplicated. so deduplication will work more reliably with snapshots of one source (chances for the chunker to split the input stream into chunks at similar boundaries are higher).
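
The boundary effect described above can be illustrated with a toy sketch. This is not the PBS chunker: it assumes a tiny 4-byte chunk size, models only fixed-size chunking (the dynamic chunker is not modeled), and uses a plain dict as a hypothetical stand-in for a datastore keyed by SHA-256 digest. The names `fixed_chunks`, `backup` and `datastore` are made up for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # toy value for readability; the thread mentions 4MB for fixed-size chunks


def fixed_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a byte stream into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(stream: bytes, datastore: dict[bytes, bytes]) -> list[bytes]:
    """Store each chunk under its SHA-256 digest and return the list of digests.

    A chunk whose digest is already present is not stored again; that is the
    whole deduplication step in this sketch.
    """
    index = []
    for chunk in fixed_chunks(stream):
        digest = hashlib.sha256(chunk).digest()
        datastore.setdefault(digest, chunk)
        index.append(digest)
    return index


datastore: dict[bytes, bytes] = {}
file_a = b"AAAABBBBCCCC"  # the identical "file A" present in every source

# Two sources where file A starts at the same offset: its chunks are shared.
index1 = backup(file_a + b"1111", datastore)
index2 = backup(file_a + b"2222", datastore)
aligned_unique = len(datastore)

# A third source where file A is shifted by two bytes: every chunk now mixes
# file A with neighbouring data, so nothing matches the chunks stored before.
index3 = backup(b"zz" + file_a + b"33", datastore)
shifted_unique = len(datastore)

print(aligned_unique, shifted_unique)
```

In this toy run the first two streams reference 8 chunks but store only 5, while the shifted stream adds 4 new chunks even though it contains the same file A.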

Squ1sh · Nov 5, 2020

Hi! Does it use ZFS as the backend for deduplication?

tom · Nov 5, 2020

Squ1sh said:
Hi! Does it use ZFS as the backend for deduplication?

No.

Squ1sh · Nov 6, 2020

May I ask if there's a particular technical reason why you reinvented (reimplemented) that wheel and did not just use ZFS's dedup? I mean, PVE already uses ZFS so heavily, also for sending deltas of volumes to other servers. It would have seemed obvious to me to just use it and let it manage the dedup!?

fabian · Nov 6, 2020

because ZFS dedup works rather badly for almost all use cases (it needs to keep the dedup info in memory!), and we did not want to tie ourselves to one specific storage implementation.

tom · Nov 6, 2020

Many setups do not use ZFS.

PBS dedup does not need a special file system, so it's fully flexible.

Squ1sh · Nov 6, 2020

Hmm, sad. Now that you Proxmox guys got me into reading and exploring all this cool stuff about ZFS and its dedup feature.

> it needs to keep the dedup info in memory.
Yes, it needs to cache it in memory to be fast. But you can also store the dedup table on a special device (e.g. an SSD).

> Many setups do not use ZFS
Couldn't you just let PBS create one or more big files on any storage backend and create a zpool->zfs over it (all done on the fly by PBS)?

Cookiefamily · Nov 6, 2020

ZFS only works as long as you have access to a block device. If you are using NFS or other network file systems you are once again limited by this.

I don't really see a disadvantage with PBS doing deduplication itself, it keeps it a lot more flexible.

fabian · Nov 6, 2020

Squ1sh said:
Hmm, sad. Now that you Proxmox guys got me into reading and exploring all this cool stuff about ZFS and its dedup feature.

> it needs to keep the dedup info in memory.
Yes, it needs to cache it in memory to be fast. But you can also store the dedup table on a special device (e.g. an SSD).

that's still orders of magnitude slower than RAM, so it will still take a massive performance hit (and use RAM that you could have used for caching metadata or data). if you are not 100% sure it's a perfect fit for your hardware and workload combination, the recommendation is still to not enable ZFS deduplication anywhere.

Squ1sh said:
> Many setups do not use ZFS
Couldn't you just let PBS create one or more big files on any storage backend and create a zpool->zfs over it (all done on the fly by PBS)?

while ZFS does support using file-based vdevs, that is mainly a thing for testing purposes. deduplication would still be affected by all the memory and hardware requirements/drawbacks. it would also introduce another layer of indirection, without any of the flexibility that we get with our custom chunk store implementation.

Squ1sh · Nov 8, 2020

fabian said:
if you are not 100% sure it's a perfect fit for your hardware and workload combination, the recommendation is still to not enable ZFS deduplication anywhere.

Yes, I can agree with that after I experimented a bit with it... I just wonder what you'd do better in your implementation to not run into the same problems. Different block sizes / better block alignments?

Thx for the comprehensive answers!

fabian · Nov 9, 2020

Squ1sh said:
Yes, I can agree with that after I experimented a bit with it... I just wonder what you'd do better in your implementation to not run into the same problems. Different block sizes / better block alignments?

Thx for the comprehensive answers!

for one, we can already deduplicate client-side in a lot of cases. we don't need to keep the full dedup info in memory, but just those parts that are currently relevant for the operation that happens. also for us, dedup info is not 320 bytes per (usually small) block, but 32 bytes per (large) chunk, so even when we do cache stuff in memory, it's much more efficient.
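
The per-entry sizes quoted above (320 bytes per block versus 32 bytes per chunk) can be turned into a rough back-of-the-envelope comparison. The sketch below is only arithmetic under stated assumptions: the 10 TiB data size and the 8 KiB / 128 KiB block sizes are hypothetical inputs chosen for illustration, the 4 MiB chunk size is taken from the fixed-size case mentioned earlier in the thread, and all data is assumed to be unique (worst case for the index size). It says nothing about how much of the index either system actually keeps in RAM.

```python
KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40


def index_bytes(data_bytes: int, unit_bytes: int, entry_bytes: int) -> int:
    """Size of the dedup index if every block/chunk needs one entry."""
    units = -(-data_bytes // unit_bytes)  # ceiling division
    return units * entry_bytes


data = 10 * TIB  # hypothetical amount of unique data

# 320 bytes per block, for two assumed block sizes
zfs_8k = index_bytes(data, 8 * KIB, 320)
zfs_128k = index_bytes(data, 128 * KIB, 320)

# 32 bytes per chunk, assuming 4 MiB chunks
pbs_4m = index_bytes(data, 4 * MIB, 32)

print(f"320 B per 8 KiB block:   {zfs_8k / GIB:.0f} GiB")
print(f"320 B per 128 KiB block: {zfs_128k / GIB:.0f} GiB")
print(f"32 B per 4 MiB chunk:    {pbs_4m / MIB:.0f} MiB")
```

Under these assumptions the index comes to 400 GiB or 25 GiB for the block-based cases and 80 MiB for the chunk-based case, which is the order-of-magnitude gap the post is pointing at.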

Earthwalker · Nov 25, 2020

Is there any way to quickly recover a VM from backups in PBS?

With ZFS snapshots and replication working as a backup, I can simply copy the snapshot on the backup server to a new dataset name and create a new VM with it. This whole process can finish within 10 minutes.

With PBS, I can map a disk backup to a loopback device using proxmox-backup-client, mount this device and access the files, but I can't boot a VM from this. I can restore the backup to a new VM, but all data needs to be read and written, which will cost a lot of time. Suppose I have a VM with 20TB of data and the read and restore speed from backup is 100MB/s; then the restore process needs more than 58 hours, which definitely is not fast recovery.

Is there any way to improve VM recovery speed?

Thanks.
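
The 58-hour figure above can be checked with a short calculation. This is plain arithmetic, assuming a constant transfer rate and nothing else; the function name `restore_hours` is made up for illustration. The result depends on whether "TB" and "MB" are read as binary or decimal units.

```python
def restore_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to move size_bytes at a constant rate, in hours."""
    return size_bytes / rate_bytes_per_s / 3600


# Binary reading: 20 TiB at 100 MiB/s
binary_hours = restore_hours(20 * 2**40, 100 * 2**20)

# Decimal reading: 20 TB at 100 MB/s
decimal_hours = restore_hours(20 * 10**12, 100 * 10**6)

print(f"binary units:  {binary_hours:.1f} h")
print(f"decimal units: {decimal_hours:.1f} h")
```

With binary units the estimate is about 58.3 hours, matching the figure in the post; with decimal units it is about 55.6 hours.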

che · Nov 25, 2020

Earthwalker said:
Is there any way to quickly recover a VM from backups in PBS?

With ZFS snapshots and replication working as a backup, I can simply copy the snapshot on the backup server to a new dataset name and create a new VM with it. This whole process can finish within 10 minutes.

With PBS, I can map a disk backup to a loopback device using proxmox-backup-client, mount this device and access the files, but I can't boot a VM from this. I can restore the backup to a new VM, but all data needs to be read and written, which will cost a lot of time. Suppose I have a VM with 20TB of data and the read and restore speed from backup is 100MB/s; then the restore process needs more than 58 hours, which definitely is not fast recovery.

Is there any way to improve VM recovery speed?

Thanks.

Hi, please don't hijack threads; I would suggest this question should go in a thread of its own.
For your question, I think you are mixing two completely different use cases here. Of course an approach where you take a snapshot of the filesystem, continuously send the difference over the network and use that to recover is way faster than copying a possibly encrypted image/archive, backed up to a different machine, over the network.
But that is not the use case of a backup server; that is failover.
You are probably limited by hardware performance for the recovery, especially disk and network I/O?
