[SOLVED] Deduplication?

devilkin

Member
May 11, 2020
Hi,

I'm wondering how the dedup works - is it cross-datastore, or just for the backup itself (i.e. between backups of the same source)?

Tia
 
the deduplication domain is a datastore. note that encryption adds another layer on top, so the same data encrypted with different keys will not be deduplicated even if stored in the same datastore (since the stored, encrypted data differs).
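This effect is easy to illustrate. The sketch below is a toy model, not PBS's actual implementation (PBS uses AES-256-GCM; the `encrypt` helper here is a stand-in keyed XOR, purely for demonstration): deduplication keys on a digest of the *stored* data, so the same plaintext encrypted with two different keys produces two different digests and cannot be deduplicated.

```python
import hashlib

def encrypt(data: bytes, key: bytes) -> bytes:
    # toy stand-in cipher (NOT real crypto; PBS uses AES-256-GCM):
    # XOR the data with a keystream derived from the key
    stream = hashlib.sha256(key).digest() * (len(data) // 32 + 1)
    return bytes(a ^ b for a, b in zip(data, stream))

def chunk_digest(chunk: bytes) -> str:
    # dedup is keyed on a digest of the data as it is stored on disk
    return hashlib.sha256(chunk).hexdigest()

plaintext = b"identical guest data" * 1000

# unencrypted: both clients produce the same digest -> stored once
assert chunk_digest(plaintext) == chunk_digest(plaintext)

# encrypted with different keys: the stored bytes differ,
# so the digests differ -> no deduplication possible
c1 = encrypt(plaintext, b"key-of-client-1")
c2 = encrypt(plaintext, b"key-of-client-2")
assert chunk_digest(c1) != chunk_digest(c2)
```

Two clients sharing the same encryption key, on the other hand, would produce identical ciphertext chunks and still deduplicate against each other.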
 
Hello Fabian, just a quick clarification to grasp the full potential of deduplication. Can you tell me whether the following statements are true or false for an unencrypted, shared datastore:
  • data is deduplicated between snapshots of the same client (between "snapshot1 from client1" and "snapshot2 from client1" for example)
  • data is deduplicated between snapshots of multiple clients (between "snapshot1 from client1" and "snapshot1 from client2" for example)
  • data is deduplicated within a single snapshot (within "snapshot1 from client1" if the client has duplicate files on his drive for example)

Thanks again for your amazing product!
 
> data is deduplicated between snapshots of a same client (between "snapshot1 from client1" and "snapshot2 from client1" for example)
yes, as long as it is on the same datastore

> data is deduplicated between snapshots of multiple clients (between "snapshot1 from client1" and "snapshot1 from client2" for example)
yes, as long as it is on the same datastore

> data is deduplicated within a same snapshot (within "snapshot1 from client1" if the client has duplicate files in his drive for example)
yes, as long as it is on the same datastore

Basically: everything on the same datastore is deduplicated with the other snapshots on that datastore.
 
what @Cookiefamily said is basically correct. just be aware that the deduplication is not on the file level, but on the chunk level (for fidx/VM/block device backups, the chunk size is a static 4MB; for didx/CT/file level/pxar backups, it's dynamically calculated within certain limits). so an identical file A in two containers can end up being represented as one or more chunks and be deduplicated, or end up in a chunk together with file B in one container and file C in the other, and not be deduplicated. so deduplication will work more reliably with snapshots of one source (the chances of the chunker splitting the input stream into chunks at similar boundaries are higher).
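A rough sketch of how chunk-level dedup behaves: the toy chunker below is content-defined (cut where a hash of a trailing window matches a mask, or at a maximum size), which is the general technique behind PBS's dynamic chunking, but it is not the actual rolling-hash implementation, and the tiny `mask`/`min_size`/`max_size` parameters, the `pseudo` test-data helper, and the dict-based `store` are all invented for the demo. Two snapshots sharing a long unchanged prefix end up referencing many identical chunk digests, which the store only keeps once.

```python
import hashlib

def pseudo(seed: bytes, n: int) -> bytes:
    """Deterministic pseudo-random test data."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:n])

def chunks_dynamic(data: bytes, mask=0xFF, window=48, min_size=64, max_size=1024):
    """Toy content-defined chunker: cut where a hash of the trailing
    window hits the mask, or when the chunk reaches max_size.
    (PBS's real chunker uses a rolling hash and much larger limits;
    these tiny parameters are just for the demo.)"""
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue
        h = int.from_bytes(
            hashlib.sha256(data[max(start, i - window):i + 1]).digest()[:4], "big")
        if (h & mask) == 0 or size >= max_size:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

# content-addressed chunk store: identical chunks are stored only once
store = {}

def insert(chunk: bytes) -> str:
    digest = hashlib.sha256(chunk).hexdigest()
    store.setdefault(digest, chunk)   # deduplication happens here
    return digest

# two snapshots sharing a long common prefix (e.g. unchanged guest data)
shared = pseudo(b"unchanged-data", 4000)
s1 = shared + pseudo(b"tail-snapshot-1", 1000)
s2 = shared + pseudo(b"tail-snapshot-2", 1000)

idx1 = [insert(c) for c in chunks_dynamic(s1)]
idx2 = [insert(c) for c in chunks_dynamic(s2)]

# chunks covering the common prefix get identical digests, so the
# store holds fewer chunks than the two indexes reference in total
print(len(idx1) + len(idx2), "chunks referenced,", len(store), "stored")
```

Because cut points depend only on the content since the last boundary, identical leading data chunks identically in both snapshots; once the streams diverge, the remaining chunks differ and are stored separately. This is also why a file shifted inside a different container image may or may not land on the same chunk boundaries.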
 
May I ask if there's a particular technical reason why you reinvented (reimplemented :) ) that wheel and did not just use ZFS's dedup? I mean, PVE already uses ZFS so heavily, also for sending deltas of volumes to other servers. It would have seemed obvious to me to use it and let it manage the dedup!?
 
because ZFS dedup works rather badly for almost all use cases (it needs to keep the dedup info in memory!), and we did not want to tie ourselves to one specific storage implementation.
 
Many setups do not use ZFS.

PBS dedup does not need a special file system, so it's fully flexible.
 
Hmm, sad. And that after you Proxmox guys got me into reading and exploring all this cool stuff about ZFS, including its dedup feature.

> it needs to keep the dedup info in memory.
Yes, it needs to cache it in memory to be fast. But you can also store the dedup table on a special device (e.g. an SSD).

> Many setups do not use ZFS
Couldn't you just let PBS create one or more big files on any storage backend and create a zpool/ZFS on top of them (all done on the fly by PBS)?
 
ZFS only works as long as you have access to a block device. If you are using NFS or another network file system, you are once again limited by this.

I don't really see a disadvantage in PBS doing deduplication itself; it keeps things a lot more flexible.
 
> Hmm, sad. Now that you proxmox guys brought me into reading and exploring all this cool stuff about ZFS and also their dedup feature.
>
> Yes it needs to cache it in memory to be fast. But you can store the dedup table also on a special device (i.e. a ssd)

that's still orders of magnitude slower than RAM, so it will still take a massive performance hit (and use RAM that you could have used for caching metadata or data). if you are not 100% sure it's a perfect fit for your hardware and workload combination, the recommendation is still to not enable ZFS deduplication anywhere.

> Many setups to not use ZFS
>
> Could't you just let pbs create one or more big files on any storage backend and create a zpool->zfs over it (all done on the fly by pbs).

while ZFS does support using file-based vdevs, that is mainly a thing for testing purposes. deduplication would still be affected by all the memory and hardware requirements/drawbacks. it would also introduce another layer of indirection, without any of the flexibility that we get with our custom chunk store implementation.
 
> if you are not 100% sure it's a perfect fit for your hardware and workload combination, the recommendation is still to not enable ZFS deduplication anywhere.
Yes, I can agree with that after I experimented with it a bit... I just wonder what you do better in your implementation to not run into the same problems. Different block sizes / better block alignments?

Thx for the comprehensive answers!
 
> Yes, I can agree with that after I experimented with it a bit... I just wonder what you do better in your implementation to not run into the same problems. Different block sizes / better block alignments?
>
> Thx for the comprehensive answers!

for one, we can already deduplicate client-side in a lot of cases. we don't need to keep the full dedup info in memory, just the parts that are currently relevant for the operation at hand. also, for us the dedup info is not 320 bytes per (usually small) block, but 32 bytes per (large) chunk, so even when we do cache stuff in memory, it's much more efficient.
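The difference in index size is easy to quantify. A back-of-the-envelope sketch, using the ~320 bytes per block figure from this thread, ZFS's default 128 KiB recordsize as an assumption, and PBS's static 4 MiB chunk size for VM backups:

```python
TIB = 1024 ** 4
KIB, MIB = 1024, 1024 ** 2

data = 1 * TIB  # total backed-up data

# ZFS: ~320 bytes of dedup-table entry per block
# (assuming the default 128 KiB recordsize)
zfs_blocks = data // (128 * KIB)
zfs_table = zfs_blocks * 320

# PBS: a 32-byte SHA-256 digest per chunk
# (static 4 MiB chunks, as used for VM backups)
pbs_chunks = data // (4 * MIB)
pbs_index = pbs_chunks * 32

print(zfs_table // MIB)        # 2560 (MiB of dedup table)
print(pbs_index // MIB)        # 8 (MiB of digests)
print(zfs_table // pbs_index)  # 320 (times smaller)
```

So per TiB of data, the full digest set is on the order of megabytes rather than gigabytes, which is why it does not have to live in RAM permanently.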
 
Is there any way to fast recover VM from backups in PBS?

With ZFS snapshots and replication as the backup mechanism, I can simply copy the snapshot on the backup server to a new dataset name and create a new VM with it. This whole process can finish within 10 minutes.

With PBS, I can map a disk backup to a loopback device using proxmox-backup-client, mount this device and access the files, but I can't boot a VM from it. I can restore the backup to a new VM, but all data needs to be read and written, which costs a lot of time. Suppose I have a VM with 20TB of data and a restore read speed of 100MB/s; then the restore process needs more than 58 hours, which definitely is not fast recovery.

Is there any way to improve VM recovery speed?

Thanks.
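For reference, the 58-hour figure checks out if both quantities are read in binary units (20 TiB of data, 100 MiB/s of restore throughput, both assumptions on my part):

```python
TIB = 1024 ** 4
MIB = 1024 ** 2

data = 20 * TIB    # VM disk size from the example
speed = 100 * MIB  # assumed restore throughput

hours = data / speed / 3600
print(round(hours, 1))  # 58.3
```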
 
> Is there any way to fast recover VM from backups in PBS?
>
> With ZFS snapshot and replication work as backup, I can simply copy the snapshot on backup server to a new dataset name and create a new VM with it. This whole process can finish within 10 minutes.
>
> With PBS, I can map disk backup to a lookback device using proxmox-backup-client, mount this device and access the files, but can't boot VM from this. I can restore the backup to a new VM but all data need read and write which will cost a lot of time. Suppose I have a VM with 20TB of data, speed of read and restore from backup are 100MB/s, then the restore process needs more than 58 hours which definitely is not fast recovery.
>
> Is there any way to improve VM recovery speed?
>
> Thanks.
Hi, please don't hijack threads; I would suggest this question go into a thread of its own.
As for your question, I think you are mixing up two completely different use cases here. Of course an approach where you take a snapshot of the filesystem, continuously send the difference over the network and use that to recover is way faster than copying a possibly encrypted image/archive, backed up to a different machine, over the network.
But that is not the use case of a backup server; that is failover.
For the recovery, you are probably limited by hardware performance, especially disk and network I/O.
 
