How to get the exact backup size in Proxmox Backup Server

Hi All,

We have a question: how can we find the exact backup size of a Proxmox VM? I tried to find the information in Proxmox VE, but the size it shows does not seem correct; it is bigger than the VM's disk size. On the other hand, Proxmox Backup Server does not seem to show this size either. Does anyone have an idea how to find out how much disk space a particular VM's backups use? Thank you.

Parker
 
It is not possible to get the 'exact' backup size for a backup in PBS, due to the deduplication.
The backup is split into chunks and deduplicated across the whole datastore (so other backups can reuse them), thus a chunk does not really belong to a single backup but possibly to many.
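To see that sharing for yourself, here is a minimal sketch (hypothetical datastore path and digest; it assumes the documented .fidx layout of a 4096-byte header followed by raw 32-byte digests) that counts how many snapshots of a VM reference one particular chunk:
Bash:
datastore=/path/to/datastore        # hypothetical datastore path
digest=0123abcd...                  # a full 64-character chunk digest of interest

# list every fixed index of VM 100 that references this digest, then count the hits
for f in "$datastore"/vm/100/*/*.img.fidx
do
  xxd -s 4096 -p -c 32 "$f" | grep -qx "$digest" && echo "$f"
done | wc -l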
 
As dominik says, due to deduplication it's a bit difficult. However, depending on what you need to know, I find it very useful to know how much data was backed up with each snapshot. I wish the amount of data transferred during a backup were easily available afterwards in both the PVE and PBS GUI.
If you run a backup, it ends with a line like "INFO: transferred 1.07 GiB in 5 seconds (218.4 MiB/s)". This number is what your backup transferred to PBS, and it gives a hint of the size added to the backup storage. The number is also available on the PBS for each snapshot, in the index.json.blob file you see under each snapshot. Again, this is not the size of that snapshot, since it depends on all the other snapshots of the VM, but it is a good hint depending on why you want to know the size of the backup.
 
Some information is better than none, no?
I would find 2 size fields very useful:
  • Backup Size - Size of the backup ignoring dedup: sum of all the chunks in use by the backup. This would be a more appropriate size to be returned to Proxmox VE, rather than a useless Virtual HD size definition.
  • Delta Size - Size of the delta at the time of the backup: new chunks written.
I would think both should be easily calculated during the backup, unless I'm missing something...
 
Backup Size - Size of the backup ignoring dedup: sum of all the chunks in use by the backup. This would be a more appropriate size to be returned to Proxmox VE, rather than a useless Virtual HD size definition.
I agree that the VHD size is not really useful, but this metric is hard to gather due to the deduplication. PVE / proxmox-backup-client would need to measure how much (non-zero) data was actually read and provide this metric to PBS, because PBS is unable(?) to calculate this on its own as the deduplicated pool is shared across backups.
And what about dirty-bitmaps? That'd ruin this metric instantly, because the backup client does not need to read everything.
Maybe some tricky math could still lead to a useful value, but still... it makes things even more complicated.

Delta Size - Size of the delta at the time of the backup: new chunks written.
Assuming there is an existing backup, this would be rather easy to calculate then.
But counting the amount of new chunks written by PBS would be another extra metric, because PBS only writes new chunks if the pool does not already have them stored - which means no other backup has already created that chunk.
That also makes this metric kinda useless, because other backups could also reference the same chunks in the future and therefore report way less "new chunks written" in their backup jobs, which makes the "new chunks written" value heavily misleading information imho.


I'd prefer something like the cumulated size of all chunks referenced by a backup as an additional "size" metric - this could provide a potentially more accurate number and, on top, PBS can calculate *and update* this value on its own and does not rely on valid input data from backup clients to maintain those metrics.
Something like the "referenced" value that ZFS provides.
[Screenshot illustrating ZFS's "referenced" property; image from https://docs.oracle.com/cd/E19253-01/819-5461/gazss/index.html]
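For reference, ZFS exposes these numbers as dataset properties; a quick way to compare them on a PVE host (the dataset name is just an example):
Bash:
# compare logical vs. actually referenced/used space of a dataset (dataset name is made up)
zfs get used,referenced,logicalused,logicalreferenced,compressratio rpool/data/vm-100-disk-0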

@dcsapak What do you think about this? Is something like that practicable?

On a side note:
When calculating the referenced size of the backup, it'd be interesting to have ZFS compression ignored (or additionally shown), because a high compression ratio could also make the backup look way smaller than it actually is.
For example, I have a recordsize of 4M and, using zstd, I get usual compression ratios between x2.0 and x3.0.
IIRC there was some way to read the uncompressed size of a file, ignoring ZFS' transparent compression and showing the original file size, and some way to read the file size that ZFS reports, which is smaller if the file is compressed.
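For a single file the two sizes can be compared roughly like this; du reports the blocks actually allocated (after compression), while --apparent-size reports the logical size (the path is just an example):
Bash:
du -h /tank/some/file                  # on-disk size, after ZFS' transparent compression
du -h --apparent-size /tank/some/file  # logical size, what ls -l also reports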

The same applies to PBS' built-in compression; that'd also change the displayed size.

I think a "referenced" size would be useful for guessing the required time to restore the VM.
If i have a 2TB VHD, will it take a minute to restore because it is 99% zeroes (which will restore extremely fast on compressing storage like ZFS or Ceph) or will it take like hours because there are actually tons of data inside?
 
I agree that the VHD size is not really useful, but this metric is hard to gather due to the deduplication.
"Backup Size - Size of the backup *ignoring dedup*: sum of all the chunks in use by the backup."
I'm not concerned with dedup. Like everyone says, it's difficult and expensive to calculate, so leave dedup as GC calculated stats for now. I want to know how much disk space the single backup would consume if it were the only backup that exists.

And what about dirty-bitmaps? That'd ruin this metric instantly, because the backup client does not need to read everything.
Good point. Assuming it can't be live-counted during backup, worst case scenario is that it is a relatively inexpensive filesystem metadata calculation of the chunks. If it's still too expensive during backup, leave it blank and have a refresh button on the size field. Or at the very least calculate the field during a Verify or GC.

I'd prefer something like the cumulated size of all chunks referenced by a backup as an additional "size" metric - this could provide a potentially more accurate number and, on top, PBS can calculate *and update* this value on its own and does not rely on valid input data from backup clients to maintain those metrics.
I'm pretty sure we're now talking about the same thing... Not sure what there is to update, except maybe the Backup Group size?


Re: Delta Size
Assuming there is an existing backup, this would be rather easy to calculate then.
Looks like it's already done! I just checked index.json.blob -- just slap the contents of "chunk_upload_stats" - "compressed_size" into a Delta Size column and call it a day.
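If you want to pull that number out on the PBS host itself, something along these lines may work. This is only a sketch: I'm assuming proxmox-backup-debug can decode the manifest blob to stdout via 'inspect file ... --decode -' (check the options on your version), and jq is used to find the stats wherever they sit in the JSON:
Bash:
snapshot=/path/to/datastore/vm/100/2021-09-19T06:00:02Z   # hypothetical snapshot path

# decode the manifest blob and print the chunk upload statistics of that backup run
proxmox-backup-debug inspect file "$snapshot/index.json.blob" --decode - \
  | jq '.. | .chunk_upload_stats? // empty'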

That also makes this metric kinda useless, because other backups could also reference the same chunks in the future and therefore report way less "new chunks written" in their backup jobs, which makes the "new chunks written" value heavily misleading information imho.
There's plenty of use cases where you know you're not dealing with dedup-able data. In general it's still very useful information:
  • Shows how much disk space was used for this particular backup run.
  • Over a few backups, shows a general growth/change rate for the VM at a glance, even considering far-away dedup data.
  • Helps find a particular backup when you know there were a lot of changes.
It's not misleading; just document what the number represents -- everyone knows about dedup and can take it into consideration.
 
Here is a script that worked to get me the information I was looking for. It shows the cumulative on-disk size of all chunks associated with a .fidx file. It doesn't support .didx files, so it's useless for containers, but I'm sure similar principles would apply for those should someone choose to do the work.

Bash:
datastore=/bulkpool/backups
backup=vm/100/2021-09-19T06:00:02Z

cd $datastore/$backup

# dump the digests of each fixed index (one 32-byte digest per 64-char hex line),
# skipping the 4096-byte header; xxd only accepts a single input file, hence the loop
for f in *.img.fidx
do
  for x in `xxd -s +4096 -p -c 32 $f`
  do
    ls -l $datastore/.chunks/${x:0:4}/$x | awk '{print $5}'
  done
done | paste -sd+ | bc

Essentially it gets all the chunk hashes from the .fidx, then "ls"es each of those hashes in the .chunks directory and adds up the sizes. It's awful and terribly inefficient, but it got me what I wanted to know. Hopefully it will also work for someone else.
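As a quick sanity check, the number of chunks a fixed index references can also be derived from its file size alone, assuming the documented layout (4096-byte header, then one 32-byte digest per 4 MiB chunk); the filename is just an example:
Bash:
f=drive-scsi0.img.fidx   # example index file inside a snapshot directory
echo $(( ( $(stat -c %s "$f") - 4096 ) / 32 ))          # number of chunks referenced
echo $(( ( ( $(stat -c %s "$f") - 4096 ) / 32 ) * 4 ))  # logical size in MiB (4 MiB per chunk)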
 
Here is a script that worked to get me the information I was looking for. It shows the cumulative on-disk size of all chunks associated with a .fidx file.
Just for your information: AFAICT the only remaining difference between the result of your script and the 'full size' we show in the webui would be the compression of the chunks, except that you currently count duplicate chunks multiple times.
You'd first have to pipe the chunks through 'sort -u' or something like that to count each chunk only once.
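Folding that suggestion into gerco's loop could look roughly like this (same example datastore path as above, with stat instead of ls to keep it short):
Bash:
datastore=/bulkpool/backups
cd $datastore/vm/100/2021-09-19T06:00:02Z

# collect all digests, de-duplicate them, then sum the on-disk chunk sizes
for f in *.img.fidx; do xxd -s 4096 -p -c 32 "$f"; done \
  | sort -u \
  | while read -r x; do stat -c %s "$datastore/.chunks/${x:0:4}/$x"; done \
  | paste -sd+ | bc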
 
Since I was restoring a backup over a WAN link and wanted to know how long it would take, the compressed size of the chunks is exactly what I needed. The UI just says "32GB" for the backup size, but the amount of data to transfer was only "6GB".

Good point about the duplicate chunks, that probably overcounted some empty chunks and maybe a few non-empty ones. It was close enough to be useful, though.
 
I've worked on gerco's example above to make this (hopefully) more useful across multiple backups, and specifically for a VM (it won't work with CTs yet).

It takes a single VM ID as its parameter, so for example:

./calcbackupsize.sh 111

The script grabs all the relevant .img.fidx files based on the ID of the VM, iterates through them, and builds an array of unsorted, unfiltered chunk IDs. These are then filtered using sort -u, and the resulting array is pushed into the ls -l command to get the file sizes, which are then formatted a little more nicely with numfmt --to iec --format "%8.4f".

The script can take minutes to run as it's quite an intensive and expensive task but might be useful for anyone wanting to get some more info on the size of a backup set every now and again.

I'd appreciate any feedback on how to optimise this. I'll take a look at some point on how to decipher the CT .didx files.

Code:
#!/bin/bash

datastore=/mnt/backup
backup=vm/$1
cd $datastore/$backup

#get all .img.fidx file paths for the chosen VM
readarray -d '' filearray < <(find ~+ -name '*.img.fidx' -print0)

#iterate through all relevant files and create the unsorted array
for (( i=0; i<${#filearray[@]}; i++ ));
do
        for x in `xxd -s +4096 -p -c 32 ${filearray[$i]}`
        do
                chunkarrayunsorted+=($x)
        done
done

echo "unsorted count: " ${#chunkarrayunsorted[@]}

#sort the chunk array
readarray -td '' chunkarraysorted < <(printf '%s\0' "${chunkarrayunsorted[@]}" | sort -zu)

echo "sorted count: " ${#chunkarraysorted[@]}
arrayLen=${#chunkarraysorted[@]}

#get chunk sizes and calculate total
for (( i=0; i<${arrayLen}; i++ ));
do
        ls -l $datastore/.chunks/${chunkarraysorted[$i]:0:4}/${chunkarraysorted[$i]} | awk '{print $5}'
done | paste -sd+ | bc | numfmt --to iec --format "%8.4f"
 
This link explains the file structures: https://pbs.proxmox.com/docs/proxmox-backup.pdf#be. According to that documentation, the .fidx file has a fixed digest size (which presumably makes it simpler to parse with xxd), while the .didx files do not: an offset is stored alongside each digest, so there are extra bytes to skip around every digest. At least that's the way I understand it.
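If I read that section correctly, each .didx entry after the 4096-byte header is a 64-bit end offset followed by a 32-byte digest, i.e. 40 bytes per entry, so skipping 4096+8 bytes aligns every 40-byte xxd row on a digest. A rough sketch (archive name is just an example):
Bash:
# print the unique chunk digests referenced by one dynamic index
xxd -s 4104 -p -c 40 root.pxar.didx | cut -c1-64 | sort -u | head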
 
I've tweaked the code somewhat to try to ascertain the sizes of CTs too, using xxd as in gerco's original example. I'm not sure I've got the logic quite right for CTs when calculating the chunk sizes from the .didx files, though, so perhaps someone would like to try this out and let me know whether I'm getting the right data from the .didx files?

Code:
#!/bin/bash

datastore=/mnt/backup
id=$1

#check to see if its a VM or a CT
if [[ -d "$datastore/ct/$1" ]]
then
        type=ct
elif [[ -d "$datastore/vm/$1" ]]
then
        type=vm
else
        echo "canot find VM/CT"
        exit
fi


path=$datastore/$type/$id
cd $path

echo "Starting to build an array of all chunks for "$type $id". Please note this may take some time..."

#if its a VM then do the following
if [ $type = 'vm' ]
then
        #get all .img.fidx file paths for the chosen VM
        readarray -d '' filearray < <(find ~+ -name '*.img.fidx' -print0)
        if [ ${#filearray[@]} -gt 0 ]
        then
                #iterate through all relevant files and create the unsorted array
                for (( i=0; i<${#filearray[@]}; i++ ));
                do
                        for x in `xxd -s +4096 -p -c 32 ${filearray[$i]}`
                        do
                                chunkarrayunsorted+=($x)
                        done
                done
        fi
fi

#if its a CT then do the following
if [ $type = 'ct' ]
then
        #get all the .didx file paths for the chosen CT
        readarray -d '' filearray < <(find ~+ -name '*.didx' -print0)

        if [ ${#filearray[@]} -gt 0 ]
        then
                #iterate through all relevant files and create the unsorted array
                #(each row read from the .didx is 40 bytes; skipping 4096+8 bytes aligns
                # the first 64 hex characters of every row on a chunk digest)
                for (( i=0; i<${#filearray[@]}; i++ ));
                do
                        for x in `xxd -s +4104 -p -c 40 ${filearray[$i]}`
                        do
                                chunkarrayunsorted+=(${x:0:64})
                        done
                done
        fi
fi


if [ ${#filearray[@]} -gt 0 ]
then
        #sort the chunks array to remove duplicates
        readarray -td '' chunkarraysorted < <(printf '%s\0' "${chunkarrayunsorted[@]}" | sort -zu)

        let unsortedcount=${#chunkarrayunsorted[@]}
        let sortedcount=${#chunkarraysorted[@]}
        percent=`echo "scale=2 ; (1 - ($sortedcount / $unsortedcount))*100" | bc`
        echo "unsorted chunks: "$unsortedcount
        echo "sorted chunks:   "$sortedcount
        echo "Ignoring chunks: "$percent"%"
        echo "Now iterating through chunks to calculate the total size for "$type $id". This may take some time..."

        #get chunk sizes and calculate total
        for (( i=0; i<${#chunkarraysorted[@]}; i++ ));
        do
                ls -l $datastore/.chunks/${chunkarraysorted[$i]:0:4}/${chunkarraysorted[$i]} | awk '{print $5}'
        done | paste -sd+ | bc | numfmt --to iec --format "%8.2f"
else
        echo "no files found to process"
fi

Simply save the file and make it executable, then to run it as follows...

./calcbackupsize.sh [ID of CT or VM]
 
I could reduce the calculation time by nearly 50% (before: 2min 7s, after: 1min 17sec) by changing the line
Bash:
ls -l $datastore/.chunks/${chunkarraysorted[$i]:0:4}/${chunkarraysorted[$i]} | awk '{print $5}'
to
Bash:
stat -c "%s" $datastore/.chunks/${chunkarraysorted[$i]:0:4}/${chunkarraysorted[$i]}

stat -c "%s" file outputs just the file size, so the output is much shorter and does not need to be filtered by awk.

Still, it is horribly slow...
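A good part of the remaining time is probably spent forking one stat process per chunk; batching the paths through xargs so that stat is called once per batch should shave off quite a bit more. Roughly (using the same variables as in the script above):
Bash:
#get chunk sizes and calculate total, batching the stat calls via xargs
for (( i=0; i<${#chunkarraysorted[@]}; i++ ));
do
        echo "$datastore/.chunks/${chunkarraysorted[$i]:0:4}/${chunkarraysorted[$i]}"
done | xargs stat -c "%s" | paste -sd+ | bc | numfmt --to iec --format "%8.2f"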
 
@krikey thx for your work. I tried your code and it works nicely for smaller CTs/VMs. But for larger VMs it's waaay too slow (on my machine).

I came up with a different approach for VMs.

a) VMs are backed up in fixed-sized chunks of 4 MiB. (According to https://pbs.proxmox.com/docs/technical-overview.html#fixed-sized-chunks they are "typically" 4 MiB, but I could not find when it would be a different size). They are then (usually) compressed and maybe encrypted.

b) Backups are incremental and chunks are de-duplicated.

Knowing this, I determined the number of unique chunks for each .img.fidx. This, multiplied by 4, gives me the size of that backup before compression but after de-duplication.

I also sorted the filearray and determined for each file the number of new chunks. A chunk is new if it has not appeared in any file before. This is basically the additional size a backup causes.

Here is an example output:

Bash:
# python3 ~/estiname-size.py 212
-----------------------------------------------------------------------------------------------
| filename                                           | chunks              | size             |
-----------------------------------------------------------------------------------------------
| 2023-03-16T16:53:09Z/drive-scsi0.img.fidx          |       227909 chunks |       911636 MiB |
| 2023-03-16T16:53:09Z/drive-scsi1.img.fidx          |        53461 chunks |       213844 MiB |
| 2023-03-24T23:50:43Z/drive-scsi0.img.fidx          |        48362 chunks |       193448 MiB |
| 2023-03-24T23:50:43Z/drive-scsi1.img.fidx          |            1 chunks |            4 MiB |
| 2023-03-25T23:47:06Z/drive-scsi0.img.fidx          |        52822 chunks |       211288 MiB |
| 2023-03-25T23:47:06Z/drive-scsi1.img.fidx          |            1 chunks |            4 MiB |
| 2023-03-26T22:47:04Z/drive-scsi0.img.fidx          |         1708 chunks |         6832 MiB |
| 2023-03-26T22:47:04Z/drive-scsi1.img.fidx          |            1 chunks |            4 MiB |
-----------------------------------------------------------------------------------------------


This VM has two drives. There were no changes on scsi1 after the initial backup. scsi0 has an uncompressed size of ~900 GB, therefore the first backup has this size. The next two backups each add around 200 GB of changed data, and the last backup just added ~7 GB.

The advantage is, my script is fast and provides a good upper-limit. The downside is that these are the values before compression. To determine the size after compression, one would have to check the size of each chunk as @krikey does, which is slow.

To get a better estimation, you could have a look at the compression ratio of your VM storage in PVE and assume that your backup has about the same compression.
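For a ZFS-backed VM disk, that ratio can be read directly from the dataset (the dataset name below is made up); dividing the uncompressed estimate by it gives a rough compressed size:
Bash:
# read the compression ratio of the source dataset
zfs get -H -o value compressratio rpool/data/vm-212-disk-0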

This is just a quick&dirty script. If you like it, please feel free to improve it.

Here is my script:
Python:
import os
import sys

datastore = "/mnt/datastore/localHDD/vm/"

vmid = ""
if len(sys.argv) > 1:
    vmid = sys.argv[1]

vm = os.path.join(datastore, vmid)

# Get all .img.fidx file paths for the chosen VM
filearray = []
for root, dirs, files in os.walk(vm, topdown=False):
    for name in files:
        if name.endswith(".img.fidx"):
            filearray.append(os.path.join(root, name))

# sort to obtain filepath sort
# if arg empty then do date sort
if vmid:
    filearray.sort()
else:
    filearray = sorted(filearray, key=lambda x: os.path.relpath(x, vm).split('/',1)[1])

if len(filearray) > 0:
    print("-" * 95)
    print(f"{'| filename'.ljust(52)} | {'chunks'.ljust(19)} | {'size'.ljust(16)} |")
    print("-" * 95)
    chunkarray = set()
    for filepath in filearray:
        with open(filepath, "rb") as f:
            f.seek(4096)
            data = f.read()
            hex_data = data.hex()
            file_chunks = []
            new_unique_chunks = set()
            for i in range(0, len(hex_data), 64):
                file_chunk = hex_data[i:i+64]
                if file_chunk not in chunkarray :
                    new_unique_chunks.add(file_chunk)
                chunkarray.add(file_chunk)
            if new_unique_chunks:
                new_chunks = len(new_unique_chunks)
                filename = os.path.relpath(filepath, vm)  # Remove the datastore prefix from the file path
                print(f"| {filename.ljust(50)} | {new_chunks:>12} chunks | {new_chunks * 4:>12} MiB |")
    print("-" * 95)
 
I've tweaked the code somewhat to try to ascertain the sizes of CTs too, using xxd as in gerco's original example. I'm not sure I've got the logic quite right for CTs when calculating the chunk sizes from the .didx files, though, so perhaps someone would like to try this out and let me know whether I'm getting the right data from the .didx files?
Your script has to be modified to work with namespaces; I've modified it to take them into account.

You need to pass two parameters to the script: ./calcbackupsize.sh [ID of CT or VM] [namespace]

e.g.: ./calcbackupsize.sh 100 production

Bash:
#!/bin/bash

datastore=/mnt/backup

id=$1
namespace=$2

#check to see if its a VM or a CT
if [[ -d "$datastore/ns/$2/ct/$1" ]]
then
        type=ct
elif [[ -d "$datastore/ns/$2/vm/$1" ]]
then
        type=vm
else
        echo "cannot find VM/CT"
        exit
fi


path=$datastore/ns/$namespace/$type/$id
cd $path

echo "Starting to build an array of all chunks for "$type $id". Please note this may take some time..."

#if its a VM then do the following
if [ $type = 'vm' ]
then
        #get all .img.fidx file paths for the chosen VM
        readarray -d '' filearray < <(find ~+ -name '*.img.fidx' -print0)
        if [ ${#filearray[@]} -gt 0 ]
        then
                #iterate through all relevant files and create the unsorted array
                for (( i=0; i<${#filearray[@]}; i++ ));
                do
                        for x in `xxd -s +4096 -p -c 32 ${filearray[$i]}`
                        do
                                chunkarrayunsorted+=($x)
                        done
                done
        fi
fi

#if its a CT then do the following
if [ $type = 'ct' ]
then
        #get all the .didx file paths for the chosen CT
        readarray -d '' filearray < <(find ~+ -name '*.didx' -print0)

        if [ ${#filearray[@]} -gt 0 ]
        then
                #iterate through all relevant files and create the unsorted array
                #(each row read from the .didx is 40 bytes; skipping 4096+8 bytes aligns
                # the first 64 hex characters of every row on a chunk digest)
                for (( i=0; i<${#filearray[@]}; i++ ));
                do
                        for x in `xxd -s +4104 -p -c 40 ${filearray[$i]}`
                        do
                                chunkarrayunsorted+=(${x:0:64})
                        done
                done
        fi
fi


if [ ${#filearray[@]} -gt 0 ]
then
        #sort the chunks array to remove duplicates
        readarray -td '' chunkarraysorted < <(printf '%s\0' "${chunkarrayunsorted[@]}" | sort -zu)

        let unsortedcount=${#chunkarrayunsorted[@]}
        let sortedcount=${#chunkarraysorted[@]}
        percent=`echo "scale=2 ; (1 - ($sortedcount / $unsortedcount))*100" | bc`
        echo "unsorted chunks: "$unsortedcount
        echo "sorted chunks:   "$sortedcount
        echo "Ignoring chunks: "$percent"%"
        echo "Now iterating through chunks to calculate the total size for "$type $id". This may take some time..."

        #get chunk sizes and calculate total
        for (( i=0; i<${#chunkarraysorted[@]}; i++ ));
        do
                ls -l $datastore/.chunks/${chunkarraysorted[$i]:0:4}/${chunkarraysorted[$i]} | awk '{print $5}'
        done | paste -sd+ | bc | numfmt --to iec --format "%8.2f"
else
        echo "no files found to process"
fi
 
@masgo @markus.b thanks for making the scripts! They're fantastic!

> a) VMs are backed up in fixed-sized chunks of 4 MiB. (According to https://pbs.proxmox.com/docs/technical-overview.html#fixed-sized-chunks
> they are "typically" 4 MiB, but I could not find when it would be a different size). They are then (usually) compressed and maybe encrypted.

Yes, but to find the real space eaters in our backups, wouldn't it be helpful if we also calculated and printed the real on-disk size instead of the raw/uncompressed chunk size?
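That should be doable by feeding a list of the unique digests (e.g. written to a file, here hypothetically digests.txt, by one of the scripts above) into du, which reports the actual on-disk usage of the compressed chunk files; note that xargs may split very long lists into several du invocations, each with its own total line:
Bash:
datastore=/mnt/datastore/localHDD   # hypothetical datastore path

# turn each digest into its chunk path and sum the actual on-disk usage
while read -r d; do echo "$datastore/.chunks/${d:0:4}/$d"; done < digests.txt \
  | xargs du -ch | tail -n1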
 
