Backups Real Size

Hi,
is there a way to know how much real space a backup is occupying on a datastore?
Right now I can only see the full disk size of the VM for each backup; the backups, in contrast, are incremental and, thanks to your nice job, do not take as much space as a full backup would... but without knowing how much real space each one takes, the feature becomes a problem when the datastore is running out of space and a manual decision has to be made.

Thanks a lot!
 
There are no incremental backups. All backups are full backups. They need so little space because of deduplication. Everything is stored as small (< 4MB) chunk files, and these get deduplicated so that no chunk needs to be stored twice. So it's hard to tell how much space a single guest's backup consumes, because different guest backups can share the same chunks. As far as I know there is no easy way to see how much space a backup consumes. The best thing is to look at the backup log to estimate the size:
Code:
2022-03-27T05:04:35+02:00: Size: 107374182400
2022-03-27T05:04:35+02:00: Chunk count: 25600
2022-03-27T05:04:35+02:00: Upload size: 5423235072 (5%)
2022-03-27T05:04:35+02:00: Duplicates: 24307+399 (96%)
2022-03-27T05:04:35+02:00: Compression: 16%
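If it helps to track these numbers over many runs, the values can be scraped from saved task logs; here is a minimal sketch, assuming the log lines follow exactly the format shown above (the log file path is hypothetical):

Code:
#!/usr/bin/python3
# Rough per-run upload estimate, parsed from a saved PBS backup task log
# (assuming lines formatted like the example above).
import re
import sys

logfile = sys.argv[1] if len(sys.argv) > 1 else "backup-task.log"  # hypothetical path

size = upload = None
with open(logfile) as f:
    for line in f:
        if m := re.search(r": Size: (\d+)", line):
            size = int(m.group(1))
        elif m := re.search(r": Upload size: (\d+)", line):
            upload = int(m.group(1))

if size and upload is not None:
    print(f"logical size: {size / 2**30:.1f} GiB")
    print(f"uploaded    : {upload / 2**30:.1f} GiB ({100 * upload / size:.1f}%)")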
 
I don't want to talk about things I haven't fully studied yet, but as far as I can see from the logs, your statements are not correct for PBS, only for Proxmox VE. While VE has no command that I am aware of to perform non-full backups, PBS works with incremental backups (dirty bitmaps) and transfers just the changed parts, similar to a snapshot. A full backup is performed only once, the first time (or every time, obviously, if retention states that only one backup has to be kept).
I guess that the deduplication happens after all of that.

Back to sizes: it should be relatively easy to calculate the real occupation just by counting the references to each chunk... in the end, that is what I would expect the GC to do when asked to check for reclaimable space.
 
Yes, PBS checks for existing blocks, and then it only copies the missing ones. But all backups are full backups, as in: the data is all there, only maybe referenced 2-100 times depending on your backups.
 
That is the definition of an incremental backup, in essence...
I hope this log line out of the backup job will make it clear that these are incremental backups:

Code:
INFO: using fast incremental mode (dirty-bitmap), 420.0 MiB dirty of 120.0 GiB total

Now I guess we must clarify what we mean by "incremental", since it seems to me there is some confusion. If we want to be precise, these are differential backups rather than incremental ones, since there is no need to traverse all previous backups down to the first one to reconstruct the data (here I am guessing, since I did not look at the internals of PBS, but I am pretty sure I am not far off based on observation of the timings and space growth).
While it is true that you can restore any of the backups as if they were full backups, that does not imply they were not acquired as incremental/differential ones.
I have been using hardlink-based differential backups on Linux for a long time, obtaining a full version of the entire FS/tree in separate folders, as if they were full backups... and yet sharing the same inodes for unchanged files (see rsync for reference).
Simply using du you can absolutely calculate the weight of the shared areas as well as of those that are unique to each individual backup; the trick is in the number of references (in my case hardlinks) to each file in each folder of the same tree.
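To illustrate that hardlink-based approach (this is about rsync-style backup trees, not how PBS stores chunks): the shared vs unique attribution can be derived from each inode's link count, much like du does. A minimal sketch, with a hypothetical backup tree path:

Code:
#!/usr/bin/python3
# Attribute disk usage in a tree of hardlinked backup folders: each inode is
# counted once, and an inode with st_nlink == 1 is unique to a single backup.
import os
import sys

root = sys.argv[1] if len(sys.argv) > 1 else "/backups/daily"  # hypothetical path

seen = set()            # inodes already counted
total = unique = 0
for dirpath, _dirs, files in os.walk(root):
    for name in files:
        st = os.lstat(os.path.join(dirpath, name))
        if st.st_ino in seen:
            continue    # another hardlink to an inode we already counted
        seen.add(st.st_ino)
        total += st.st_size
        if st.st_nlink == 1:
            unique += st.st_size  # referenced by exactly one backup folder

print(f"real usage  : {total / 2**30:.2f} GiB")
print(f"unique files: {unique / 2**30:.2f} GiB")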

I guess the difference in our interpretations is, in good part, between how the backup is performed vs how it is stored.

I hope we now agree that PBS, in conjunction with PVE, uses incremental backups and not full backups, and that it should be absolutely possible, without much effort, to provide the info I was asking for initially ;)
 
While it *could* be possible to calculate some value for *how much space a backup is using*, there are a few issues with that:

* It can be really expensive: since the chunks are about 4MiB in size for VMs and variable for CTs (pre-compression), there are many of them for a given backup (e.g. for a 100GiB VM there are ~25000 chunks), so stat-ing them all would take a while (depending on the storage).
* Even if we had that readily available, this size is quite misleading, since it does not really tell the user anything; deleting such a snapshot would not necessarily free up that amount of space (see the sketch after this list).
* The reverse, the 'unique' space used, is even harder to calculate, since we'd have to iterate over *all* indices (and there can be many) to find the unique chunks.
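To make the two notions concrete, here is a toy sketch (not PBS code; chunk names and sizes are made up): the space a snapshot *references* is the sum of all its chunks, while deleting it only frees the chunks that no other snapshot references:

Code:
#!/usr/bin/python3
# Toy illustration: "referenced" vs "unique" space of deduplicated snapshots.
# Chunk names and sizes below are made up for the example.

chunk_size = {"a": 4, "b": 4, "c": 4, "d": 2}    # MiB per chunk
snapshots = {
    "vm100-2022-03-26": {"a", "b", "c"},
    "vm100-2022-03-27": {"a", "b", "d"},         # only "d" was newly uploaded
}

for name, chunks in snapshots.items():
    others = set().union(*(c for n, c in snapshots.items() if n != name))
    referenced = sum(chunk_size[c] for c in chunks)
    unique = sum(chunk_size[c] for c in chunks - others)
    print(f"{name}: references {referenced} MiB, but deleting it frees only {unique} MiB")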

Did you have any other metric in mind?

For your initial question:

...but without knowing how much real space each one takes, the feature becomes a problem when the datastore is running out of space and a manual decision has to be made.
For that, we recommend monitoring the storage; we even include an estimate in the dashboard.
 
* Even if we had that readily available, this size is quite misleading, since it does not really tell the user anything; deleting such a snapshot would not necessarily free up that amount of space.
Hi, can you please expand on this point? Before proceeding down the wrong path, I guess I need to understand why that would not be the case. I understand the garbage collector probably has to run for the space to be "freed" but, after that, shouldn't we reach the expected space occupation?

I would suggest using the same approach as shared pointers in C++: a simple usage counter per chunk.
Even if that would cause some overhead in terms of space (negligible in fact, even with a large type: 64 bits per 4M chunk), adding a counter to each chunk would remove the need to scan the indexes whenever a chunk's count does not change from 1 to 2 or vice-versa ('single' vs 'shared', to give those states a name). Adding a third (or further) reference would not impact the space estimation of the other backups at all (since you brought up compression I won't say the space on the DS), because the state would remain 'shared', and the same applies when moving from N to N-1 with N > 2.

An additional reference from a chunk to the single backup referencing it (using the backup ID, I guess) would already solve the case of moving from 1 (single) to 2 (shared), since the space estimation could be adjusted arithmetically on the fly at almost no cost.

The inverse, from shared to single, would require a more complex mapping, since we would need to know which is the 'last remaining backup' for each chunk. The risk I see here is incurring a possibly large overhead if the number of backups sharing a reference grows large over time, but that could be addressed with multiple layers of references and ranges, and in any case limited with optional parameters. I know almost nothing about the actual architecture of PBS, so I can't tell from here how expensive such a lookup (chunk -> backup) is right now without a new/additional mapping.
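As a toy sketch of that bookkeeping (illustration only, not PBS code; all names are made up): a per-chunk counter plus, for chunks with exactly one reference, the owning backup, lets the 'unique' size of each backup be adjusted incrementally on the single <-> shared transitions:

Code:
#!/usr/bin/python3
# Toy reference-counting bookkeeping for deduplicated chunks (illustration only).
from collections import defaultdict

CHUNK = 4 * 1024 * 1024         # assume fixed 4 MiB chunks for the example

refcount = defaultdict(int)     # chunk digest -> number of referencing backups
sole_owner = {}                 # chunk digest -> backup id, kept only while refcount == 1
unique_size = defaultdict(int)  # backup id -> bytes referenced by that backup alone

def add_reference(backup, digest):
    refcount[digest] += 1
    if refcount[digest] == 1:        # new chunk: counts toward this backup's unique size
        sole_owner[digest] = backup
        unique_size[backup] += CHUNK
    elif refcount[digest] == 2:      # single -> shared: no backup owns it exclusively anymore
        unique_size[sole_owner.pop(digest)] -= CHUNK

def drop_reference(digest):
    refcount[digest] -= 1
    if refcount[digest] == 0:
        del refcount[digest]         # orphan chunk, reclaimable by GC
    elif refcount[digest] == 1:
        # shared -> single: finding the last remaining backup is the expensive part;
        # this is exactly the chunk -> backup mapping discussed above.
        pass

add_reference("vm100/2022-03-26", "deadbeef")
add_reference("vm100/2022-03-27", "deadbeef")    # chunk becomes shared
print(dict(unique_size))                         # -> {'vm100/2022-03-26': 0}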

Speaking of overall targets, I guess a generic sysadmin might wish to:
  • Understand how much space can be recovered by deleting what, when needed (unique space on the DS)
  • Understand/monitor the growth trend of the backups for each VM and for the overall DS, i.e. observe whether today's backup required an additional 10 GB or the usual 1 GB average of "whatever schedule you like for your recurrent backups". (I'm guessing the dirty bitmap size would do it, if stored along with the backup metadata.)
  • Off topic, but not too much :) Understand which VM is which! No kidding, please make PBS read the VM conf file and extract/display the VM name along with the ID!
I look forward to hearing your comments.
Thanks a lot
 
Hi, can you please expand on this point? Before proceeding down the wrong path, I guess I need to understand why that would not be the case. I understand the garbage collector probably has to run for the space to be "freed" but, after that, shouldn't we reach the expected space occupation?
Because I did not talk about the "unique" space but the simpler sum of all chunks referenced by a snapshot; this would also count the chunks used by other backups...

We did consider reference counting the chunks, but this has a different set of problems:
* We could have one big file with all chunks + refcounts: updating this would be very costly, as it would have to be done on every backup/prune/etc. for every chunk. Since that operation has to be locked, the different jobs would block each other.
* We could have an extra file for each chunk with the refcount, but that also has multiple problems:
- We double the number of files in the datastore. For big datastores the number of chunk files is already very high, and adding another per chunk would double it (e.g. a 100TiB datastore that now has 26 million chunks would jump to 52 million...). On slower storage (especially high-latency storage like HDDs) this makes such datastores unusable.
- Each update of a refcount needs to lock it, so different jobs will block each other here too.
- Pruning would take much longer, since it would now have to update the refcount of each chunk instead of simply deleting the index files.
- We still have no cheap way to get the unique space, since we would now have to iterate over all chunks, check whether each has a refcount of 1 and count it, so big backups still need to lock->open->read->close many files, which is costly.
- It is less robust: if I delete a snapshot manually in the datastore, the refcounts in the chunks are not updated.

Holding the count in memory is not really an option either, 1. because it would blow up the memory usage of the daemon quite significantly, and 2. because we must persist it to disk anyway (for restarts, etc.), and there can also be multiple daemons of different versions running (when a task is ongoing but an update triggers a reload; in that case the old daemon lives on until the task is finished).

Maybe I'm overlooking some way, but we do often discuss such things internally in detail.

  • Understand how much space can be recovered by deleting what, when needed (unique space on the DS)
I get it, but as I said, not easy ;)

  • Understand/monitor the growth trend of the backups for each VM and for the overall DS, i.e. observe whether today's backup required an additional 10 GB or the usual 1 GB average of "whatever schedule you like for your recurrent backups". (I'm guessing the dirty bitmap size would do it, if stored along with the backup metadata.)
That metric can be found in the overall datastore usage; also, the client task log should contain the amount of data uploaded to the server.

  • Off topic, but not too much :) Understand which VM is which! No kidding, please make PBS read the VM conf file and extract/display the VM name along with the ID!
Could be done with comments + vzdump hooks.
 
Good to know, I see you guys are taking the job seriously, so a big thanks.
I knew my comments were somewhat intuitive; the telephone was invented more than twice in less than 10 years by people not even talking to each other (possibly because there was no telephone yet) :p

Maybe I'm overlooking some way, but we do often discuss such things internally in detail.
I don't think so; your analysis shows concrete concerns which can't be solved with a magic wand (I wish I could be more helpful here). Sure, they can be mitigated to some extent, but that would require a deeper knowledge of the structures and strategies currently in place than I have (and might imply a complete shift in approach, not necessarily quick and simple). Btw, I will give the problem some thought, maybe something will pop into my mind.

I get it, but as I said, not easy ;)
Agreed! Anyway, as expensive as it may be, a manual "compute real space size" action could be implemented without much change, just a lot of free time ahead for the user to get an answer ;)

That metric can be found in the overall datastore usage; also, the client task log should contain the amount of data uploaded to the server.
I know, that's why I was suggesting to simply report the information in the Contents pane of the DS along with each backup. I might even be more chatty, in truth (backup total time, transferred data, dirty bitmap size (if used), mode (snapshot, stop, etc.), backup task name, whether the backup was run by the scheduler or manually, something else I don't have on the top of my mind...).
I would stick with the dirty bitmap size just because the change in available DS space might be affected by multiple backup jobs running in parallel.

Could be done with comments + vzdump hooks.
I get that, but I will wait for fabian to complete his task :p
 
Out of curiosity:
* Isn't the garbage collector essentially doing a reference count for all chunks and deleting the orphans?

And a consideration: why deduplicate data between different VMs? Or, more precisely, does it really achieve something after complicating things?
I was observing the behavior of PBS today and a simple question came to my mind: how easy is it to find blocks of 4 MB which are identical between VMs? Is data alignment, for example, a limitation making it almost impossible? I guess it also depends on how the data is stored on the source VM; fragmentation, for example, can mess up a lot of things in this regard...
My point is: while I see a huge impact of this strategy for the same VM backed up over time, with a certain amount of data that is never touched, how does it work when the VM disks are different (and still "sharing" a good amount of identical data like libraries, kernels and so on...)?

I said I was going to keep the topic in the background of my mind... just thinking...
 
Might be useful with VM clones.
 
Initially it might... but over time divergence is unavoidable, and in recent years the pace of deep changes in the OS at all levels makes this explanation possibly "not so true anymore"... I see it as similar to the limitation posed by templates with respect to "shared" updates: each VM must be updated independently and there is no process to reconcile them (their drives) afterwards, which would be a really great feature ;)

I guess it would be good to test in a development environment how much cross-VM deduplication is really worth, while considering, on the other hand, how feasible it would be to create a chunk store per VM, especially if such a solution could help with the original problem (or other outstanding issues): it could be something the admin might want to configure manually, while defaulting to the current behavior.
 
Out of curiosity:
* Isn't the garbage collector essentially doing a reference count for all chunks and deleting the orphans?
Not really; the GC works like this:
1. Iterate over all indices and set the atime of all referenced chunks.
2. Iterate over all chunks and remove all those with an atime older than 24h + 5min (because of relatime semantics) relative to the start of the GC, or to the start time of the oldest running backup (whichever is further in the past).

This way new backups/prunes/etc. can all run in parallel to the GC without breaking it.
(E.g. if we used an in-memory chunk list, no new backups could start during a garbage collection, because the GC would not know to keep their chunks.)
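A very rough sketch of that two-phase, atime-based approach (illustration only, not the actual PBS implementation; the directory layout is hypothetical):

Code:
#!/usr/bin/python3
# Toy two-phase, atime-based garbage collection for a chunk store (illustration only).
import os
import time

GRACE = 24 * 3600 + 5 * 60      # 24h + 5min, as described above

def collect_garbage(chunk_dir, referenced_digests):
    cutoff = time.time() - GRACE            # ignoring running backups for simplicity

    # Phase 1: bump the atime of every chunk referenced by some index file.
    for digest in referenced_digests:
        os.utime(os.path.join(chunk_dir, digest))

    # Phase 2: remove chunks whose atime is older than the cutoff (orphans).
    for name in os.listdir(chunk_dir):
        path = os.path.join(chunk_dir, name)
        if os.stat(path).st_atime < cutoff:
            os.remove(path)

# Hypothetical usage: collect_garbage("/datastore/.chunks", digests_from_all_indices)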

And a consideration: why deduplicate data between different VMs? Or, more precisely, does it really achieve something after complicating things?
For VMs it might help with cloned VMs (as already mentioned) or with *very* similar VMs (i.e. when you install the same packages on two clones of the same VM, it is not unlikely that the same blocks get written, since there is not much nondeterministic behaviour involved aside from different configs).
For CTs it absolutely helps, since the way we chunk them should be consistent for two directories that have the same content.

The other question is: why should we have a chunk store for each guest? By simply using the same one, we can gain quite some space savings, with only some minor/cosmetic downsides.
 
To have something useful, we would need, for each VM backup (as a whole) or VM snapshot:
- how many chunks are unique to that backup/snapshot
- how many chunks are shared with other VMs
- the summed-up size of those unique and shared chunks

With that information, we could get an insight into how much space each VM backup really takes and which backups/VMs compress/deduplicate well and which are space eaters because they accumulate too much additional, badly compressed or poorly deduplicated data with each backup (which could, for example, be moved to a different datastore / backup job with a different retention policy).
 
I started to spend some thoughts on this. Here is some Python code which extracts the chunk hashes of a VM snapshot from a fixed index (.fidx) file:

Code:
#!/usr/bin/python3
# Print the chunk digests referenced by a fixed index (.fidx) file.
import binascii
import os

filename = "drive-scsi0.img.fidx"
blocksize = 32                  # each chunk digest is 32 bytes

def is_eof(f):
    cur = f.tell()              # save current position
    f.seek(0, os.SEEK_END)
    end = f.tell()              # find the size of the file
    f.seek(cur, os.SEEK_SET)
    return cur == end

with open(filename, "rb") as f:
    f.seek(4096)                # skip the index header
    while not is_eof(f):
        block = f.read(blocksize)
        print(binascii.hexlify(block).decode())
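As a possible next step (my own sketch, not part of the script above): with one such digest list saved per guest, the unique/shared counts asked for earlier could be approximated, e.g. assuming fixed 4 MiB chunks (pre-compression) and hypothetical per-guest digest files:

Code:
#!/usr/bin/python3
# Combine per-guest chunk digest lists (one text file per guest, as produced by
# the script above) into unique/shared chunk counts. File names are hypothetical.
import sys

CHUNK = 4 * 1024 * 1024        # assume fixed 4 MiB chunks, ignoring compression

digest_files = sys.argv[1:]    # e.g.: vm100.digests vm101.digests
guests = {f: set(open(f).read().split()) for f in digest_files}

for name, digests in guests.items():
    others = set().union(*(d for n, d in guests.items() if n != name))
    unique = digests - others
    shared = digests & others
    print(f"{name}: {len(unique)} unique chunks (~{len(unique) * CHUNK / 2**30:.1f} GiB), "
          f"{len(shared)} chunks shared with other guests")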
 
