hey, can your script also handle namespaces?

Yes, I added just that, will submit when tested.

I modified the Python script to handle namespaces: https://github.com/BerndHanisch/pve/blob/patch-1/pbs/estiname-size.py

I rewrote parts of the script (some standard library functions were shadowed, e.g. all) and added more features to it. For example, I added the option to pass vmids as comma-separated values as well as ranges, and I added functions to summarize and order the list by highest consumption.
It can be found here: https://gist.github.com/IamLunchbox/9002b1feb2ca501856b5661c3fe84315
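
For reference, a minimal sketch of how such a vmid argument might be parsed; the function name and exact behavior here are my own illustration, not the code from the gist:

# Hypothetical helper, for illustration only (not taken from the gist):
# turn a spec like "100,105-110,200" into a set of vmids.
def parse_vmids(spec):
    vmids = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            vmids.update(range(int(start), int(end) + 1))
        elif part:
            vmids.add(int(part))
    return vmids

# parse_vmids("100,105-107") -> {100, 105, 106, 107}
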
@RolandK: The scripts already check whether chunks are used several times within a given vmid and only count individually referenced chunks. I think checking whether a specific chunk is used by only one VM is probably even more complex.
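
To make that counting logic concrete, here is a simplified sketch (my own illustration, not the actual script code): every chunk digest referenced by a vmid's index files goes into a set, so a chunk referenced by several snapshots of the same vmid is only counted once.

# Simplified illustration of the per-vmid counting, not the script's real code.
# digests_by_vmid maps each vmid to all chunk digests found in its index files,
# possibly repeated across snapshots.
def estimate_usage(digests_by_vmid):
    estimate = {}
    for vmid, digests in digests_by_vmid.items():
        unique = set(digests)  # duplicates within the same vmid collapse here
        estimate[vmid] = len(unique) * 4 * 1024 * 1024  # chunks * 4 MiB, in bytes
    return estimate
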
Doesn't work on nested namespaces, as described here:

/mnt/datastore/internal/
└── ns
    ├── customer1
    │   └── ns
    │       ├── customer1-hw1
    │       ├── customer1-hw2
    │       └── customer1-hw3
    ├── customer2
    │   └── ns
    │       ├── customer2-hw1
    │       ├── customer2-hw2
    │       └── customer2-hw3
    ├── customer3
    │   └── ns
    │       ├── proxmox4
    │       ├── proxmox6
    │       ├── proxmox7
    │       └── proxmox8

Did you try to hardcode your datastore path to the nested path, e.g. /mnt/datastore/internal/ns/customer? In that case it should work, I think, because the script looks for $datastore/ns/$namespace.
You could then loop with bash through all the customers: for customer in /mnt/datastore/internal/ns/*; do python3 [...] $customer; done, since the script does not support this recursive lookup.
If you want to nest loops even further, you could additionally go for: for customer in /mnt/datastore/internal/ns/*; do for namespace in "$customer"/ns/*; do python3 [...] -n "$(basename "$namespace")" "$customer"; done; done
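
If you would rather stay in Python than wrap the script in shell loops, something along these lines could enumerate the nested namespaces first (a sketch only; it assumes the $datastore/ns/<customer>/ns/<namespace> layout shown in the tree above, and the print is just a stand-in for the actual script invocation):

import os

DATASTORE = "/mnt/datastore/internal"  # assumed datastore root, adjust to yours

# Walk $datastore/ns/<customer>/ns/<namespace>, mirroring the nested bash loop.
ns_root = os.path.join(DATASTORE, "ns")
for customer in sorted(os.listdir(ns_root)):
    nested = os.path.join(ns_root, customer, "ns")
    if not os.path.isdir(nested):
        continue
    for namespace in sorted(os.listdir(nested)):
        # stand-in for calling the estimate script with -n <namespace> <customer path>
        print(f"would estimate namespace {namespace} in {os.path.join(ns_root, customer)}")
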
Thanks for the info. I do loops for my customers and do something nasty:

ESTIMATED DISK USAGE OCCUPIED ON PROXMOX-BACKUP (number of chunks * 4MB) FOR NAMESPACE customer1
customer1 / customer1-hw1 692.465 GB
-----------
TOTAL:
----------
692.465 GB

find -type f -name '*.fidx' -exec proxmox-backup-debug inspect file {} --decode - \; | tr -c -d "[:alnum:]\n" > customer1.chunks
then
:sort u in vim
and 
cat customer1.chunks | egrep '[0-9a-f]{64}' | sed -r 's/(.{4})(.{60})/\/mnt\/datastore\/internal\/.chunks\/\1\/\1\2/' | tr "\n" "\0" | du --files0-from - -schm

255.320 GB    total
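
For anyone puzzled by the sed expression: a chunk digest is 64 hex characters, and PBS stores the chunk file in a subdirectory named after its first four characters, which is exactly what the substitution reconstructs. A rough Python equivalent of that step (illustration only, using the datastore path from this thread):

import os

CHUNK_DIR = "/mnt/datastore/internal/.chunks"  # chunk store of the datastore above

def chunk_path(digest):
    # .chunks/<first 4 hex chars>/<full 64-char digest>, as the sed builds it
    return os.path.join(CHUNK_DIR, digest[:4], digest)

def total_size(digests):
    # summing the real on-disk sizes of the unique referenced chunks gives the
    # actual usage (the 255 GB figure) instead of chunk count * 4 MB
    return sum(os.path.getsize(chunk_path(d)) for d in set(digests))
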
root@proxmox-backup:~/petr# python3 histo.py /mnt/datastore/internal/.chunks/

File Size Histogram:
0 KB - 999 KB          : 1434266 files
1000 KB - 1999 KB : 556446 files
2000 KB - 2999 KB : 96721 files
3000 KB - 3999 KB : 60269 files
4000 KB - 4999 KB : 34895 files
5000 KB - 5999 KB : 0 files

import os
import sys
import math
# Define bucket size in bytes (500 KB)
BUCKET_SIZE = 500 * 1024
def get_file_sizes(directory):
    file_sizes = []
    for foldername, subfolders, filenames in os.walk(directory):
        for filename in filenames:
            filepath = os.path.join(foldername, filename)
            try:
                file_size = os.path.getsize(filepath)
                file_sizes.append(file_size)
            except OSError as e:
                print(f"Error retrieving size for {filepath}: {e}")
    return file_sizes
def print_histogram(file_sizes, bucket_size):
    if not file_sizes:
        print("No files to display in histogram.")
        return
    # Create buckets
    max_size = max(file_sizes)
    num_buckets = math.ceil(max_size / bucket_size)
    histogram = [0] * (num_buckets + 1)
    # Populate buckets
    for size in file_sizes:
        bucket_index = size // bucket_size
        histogram[bucket_index] += 1
    # Print histogram
    print("File Size Histogram:")
    for i, count in enumerate(histogram):
        lower_bound = i * bucket_size
        upper_bound = (i + 1) * bucket_size - 1
        print(f"{lower_bound // 1024} KB - {upper_bound // 1024} KB : {count} files")
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python script.py <directory_path>")
        sys.exit(1)
    directory = sys.argv[1]
    if not os.path.isdir(directory):
        print(f"The path {directory} is not a valid directory.")
        sys.exit(1)
    file_sizes = get_file_sizes(directory)
    print_histogram(file_sizes, BUCKET_SIZE)

Have at it, ppl.
PBSEstimator → much faster, but works with fixed (theoretical) chunk sizes.