Untar vzdump file for better deduplication

Newlife · May 11, 2019

Hello everyone,

I'm testing borgbackup to copy our backups to some off-site storage. With the normal tar file and following ones, the deduplication isnt as effective as it could be. When I extract the vzdump and inserting every file one by one its much more space efficient. My question is now, could I run into problems when trying to tar them back into an archive (with tar cvpf) to restore them in the proxmox ui?

Thanks in advance

Greetings

Newlife · May 12, 2019

I'll stick to just uploading the tar file, the other method is probably just too complicated

lhorace · May 12, 2019

Hello there,

I wonder if you can do deduplication at the block level instead, is that possible?

Cheers

Newlife · May 12, 2019

Hello lhorace,

unfortunately my off-site backup is a cloud solution so I'm not able to deduplicate at the block level.

Cheers

Pakkeay · May 12, 2019

Newlife said:
Hello lhorace,

unfortunately my off-site backup is a cloud solution so I'm not able to deduplicate at the block level.

Cheers

Are there any other options？

Newlife · May 12, 2019

You mean other off-site options? Not really. because that is as cheap as it can get. I still have the backups on a raid1 server so this is not the only backup. Was just curious how other people may use borgbackup (or other methods) to save on (cloud)storage space.

Cheers

LnxBil · May 13, 2019

Because you say tar, it suggest you're talking about containers. The logic used for qemu-based VMs is vma, which can also be "uncompressed" and then used for deduplication purposes. It works fine if you have a solution that is capable of doing so, but you always have to use the uncompressed backup.

I talked a lot about this already in other threads so here is a quick summary on what we do:
* ZFS on backup server
* Backup is written to backup server via NFS (PVE-way)
* Backup is then uncompressed and synced via rsync (bad), better own block sync program that only writes new blocks and uses then the CoW features of ZFS.
* ZFS snapshot
* delete old backup

This results in a CoW or VM-deduplication-on-used-block-storage setup that really only stores what has changed in between backups, but stores it as full backups. You can further increase the efficiency by using deduplication, but for VMs you need a deduplication that matches the block size of your VMs and all data has to be aligned properly. This means often 4K blocks on your storage and that is extremely inefficient for deduplication and compression.

Newlife · May 13, 2019

Thank you for your very informative post.

My bad on not saying what I use. I did mean Container backups yes. I mainly use containers and just one or two qemu machines on my proxmox hosts. On the host that runs ZFS, do you explicity use ECC-RAM ? The hosts that I have dont use ECC-RAM, so I'm kinda afraid to run ZFS on them.

LnxBil · May 14, 2019

Newlife said:
On the host that runs ZFS, do you explicity use ECC-RAM ? The hosts that I have dont use ECC-RAM, so I'm kinda afraid to run ZFS on them.

The remark about ECC-RAM does apply to every filesystem, not just ZFS - and you should always have a good backup strategy.
Every server I know does come with ECC-RAM, so yes. I also run PVE with ZFS on desktop hardware with ordinary RAM without any problems for years, but those systems are no 24/7.

guletz · May 14, 2019

And I do need to use zfs mostly on systems without ECC RAM in 24h/24 environment for many years without problems at all. I know this is not recommended, but I can not do else

orsiris · Mar 13, 2025

For whoever might find this in 2025:

I've used zfs 2.3.0 with fast_dedup to try to get vzdump generated data deduplicated.
So far so good, disabling zstd compression still leaves vzdump lzo compression on (cannot be disabled it seems).

After some trial and error with a couple of vma (non zstd compressed) files created by vzdump that comes with PVE 8.3:
zfs recordsize and rough deduplication percentage:
- 1M recordsize = 0%
- 128K recordsize = 1%
- 64K recordsize = 19%
- 32K recordsize = 65%
- 16k recordsize = 79%

So yes, data is deduplicable, but 16k recordsize offers like 3% compression ratio (zfs zstd), so still not ideal.

I wonder if I can just instruct vzdump to send non lzo compressed files.

fabian · Mar 14, 2025

VMA itself is not compressed.. but the content inside will be out of order, so the alignment between ZFS records and VMA extents must be there for deduplication to work (this is also the reason why just overwriting the vma file in place and relying on ZFS snapshots/CoW doesn't work).

VMA internally uses 64k-sized clusters, and 59 of those + a 512 byte header make up one extent. there's also a header up front that is variable size (containing the metadata of the backup and things like contained config file data). now you can probably guess why deduplication doesn't work very well with your approach

if you want to deduplicate, you probably want to do it on the original data (like PBS does -> use PBS!

)

orsiris · Mar 24, 2025

Thank you for your answer fabian.
As far as I can understand, this also means that piping vzdump output directly into your favorite backup tool will also prevent deduplication to work efficiently, due to random shifting of the header?

I actually want to use my own backup tool (open source) and render it compatible with Proxmox, so I'm doing some research here.
Is there any way (without writing a blockcopy of the current data) to export a quiesced raw or qcow2 disk image from proxmox, including the vm description ?

Splitting VMAs would require to know the content-length offset in order for deduplication to work properly with other tools, am I right ?
As much as I can think about this, it looks like an ecosystem that's not easy to work with.

fabian · Tuesday at 08:30

it's not just the header(s) that messes with alignment, but also the data itself that is out of order, so yeah, you'd have to use a tool that has just the right offsets for splitting the data to get an input stream that works well for deduplication purposes.

we are currently working on a backup API for external backup providers that should allow implementing both backup and restore without too much effort:

https://lore.proxmox.com/pve-devel/.../T/#m163633ab2ca4a124a9d8549a76df413f6a1d4bb4

if you want to play around with it, there are test packages for that purpose here:

http://download.proxmox.com/temp/backup-provider-api-v5/

SHA256SUMS:

Code:

b30ef35447310e4f92e6ed434fce902160a76b64b54432a43696b853e2f073ad  libpve-common-perl_8.2.9+backupproviderapiv5_all.deb
e2f3b9d2217bbb0bb75c94a73be1a9b76cda688028234fb529f4a25d1006a4ad  libpve-storage-perl_8.3.4+backupproviderapiv5_all.deb
efb4cb6c3928e20a5c297533e44da93f913b474f7dfb27274c822071890b8975  pve-container_5.2.5+backupproviderapiv5_all.deb
138324eafd889a5e90e69c9b5aedd9df3ed7db09be7305d979bdd5a306e1858f  pve-manager_8.3.5+backupproviderapiv5_all.deb
bdb871a0c57347607aec955f771cdb7ad24245cb1c4bf2517fbb78d734228ccd  pve-qemu-kvm_9.2.0-2+backupproviderapiv5_amd64.deb
12eb9034819364d0057b80d827515832589ee0848934b27be4039dbbc78cb060  qemu-server_8.3.9+backupproviderapiv5_amd64.deb
04ca13927bfd70800c1855022642e6db88b05b711b44877a120a84949d208c2e  qemu-server-dbgsym_8.3.9+backupproviderapiv5_amd64.deb

feedback would be highly appreciated!

orsiris · Saturday at 09:45

Fabian, any docs (even swagger like) for that API ?

Search

Search

Untar vzdump file for better deduplication

Newlife

Active Member

Newlife

Active Member

lhorace

Renowned Member

Newlife

Active Member

Pakkeay

New Member

Newlife

Active Member

LnxBil

Distinguished Member

Newlife

Active Member

LnxBil

Distinguished Member

guletz

Famous Member

orsiris

New Member

fabian

Proxmox Staff Member

orsiris

New Member

fabian

Proxmox Staff Member

orsiris

New Member

We value your privacy