How to get the exact backup size in Proxmox Backup Server

What about adding a sum/total of each VM's backup size, and also printing how many chunks/MB are unique to a VM, for easier identification of the real "space eaters"?
 
Yeah, in both Proxmox VE and Proxmox Backup Server we really need the ability to see higher-level usage statistics. Some examples:
  1. Total bytes-on-disk used by each Backup Task across an entire Proxmox VE cluster (one backup task configuration typically backs up guests across multiple Proxmox VE nodes, so tallying this information up per task across all nodes is valuable), plus a way to view the aggregate history of a Backup Task instead of having to go per Proxmox VE node. The current per-node design becomes a scaling problem with many Proxmox VE nodes.
  2. Similar info to #1, but from the Proxmox Backup Server webGUI/CLI perspective. Being able to check total stats, like total disk usage, with some per-task drill-down, would really help determine how "big" each backup task is in totality (before dedup? unsure). Useful for things like Sync task planning.
  3. These aggregate logs could also include transfer and other performance metrics to help identify any "unexpectedly slow" Proxmox VE nodes, again to help with scaling (do you really want to do that manually for hundreds or thousands of nodes in a Proxmox VE cluster? I know I don't!).
The backup ecosystem is really great, but this level of reporting is lacking, especially for determining backup failures at scale per Proxmox VE object. If I have hundreds of Proxmox VE nodes in a single cluster, that is hundreds of Backup Tasks I need to review manually _per Backup Task execution_, which is extremely inefficient. Sometimes drilling down manually like that is worthwhile, but in other scenarios it isn't, so IMO having both is the way to go, and both Proxmox VE and Proxmox Backup Server should flesh these aspects out a lot more.
 
I rewrote parts of the script (some standard library functions were shadowed, e.g. all) and added more features to it. For example, I added the option to pass VMIDs as comma-separated values as well as ranges, and I added functions to summarize the list and order it by highest consumption.

It can be found here: https://gist.github.com/IamLunchbox/9002b1feb2ca501856b5661c3fe84315
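
For illustration, the range parsing could look something like this (a minimal sketch; the actual option names in the gist may differ):

Python:
def parse_vmids(spec: str) -> list[int]:
    # accepts comma-separated values and ranges, e.g. "100,105-107,200"
    vmids: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            vmids.update(range(int(start), int(end) + 1))
        else:
            vmids.add(int(part))
    return sorted(vmids)

print(parse_vmids("100,105-107,200"))  # [100, 105, 106, 107, 200]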

@RolandK : The script already checks whether chunks are referenced several times within a given VMID and only counts individual referenced chunks. Checking whether a specific chunk is used by one VM alone is probably even more complex; a sketch of the idea follows below.
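
For illustration, here is the set arithmetic such a check would involve (a minimal sketch with a hypothetical chunks_per_vmid mapping; the real script would first have to collect the chunk digests per VMID from the index files):

Python:
from collections import Counter

def unique_chunks(chunks_per_vmid: dict[str, set[str]]) -> dict[str, set[str]]:
    # count how many VMs reference each chunk digest
    refcount = Counter(
        digest for chunks in chunks_per_vmid.values() for digest in chunks
    )
    # a chunk is "unique" to a VM if no other VM references it
    return {
        vmid: {d for d in chunks if refcount[d] == 1}
        for vmid, chunks in chunks_per_vmid.items()
    }

# hypothetical example: "b" and "c" are shared, "a" and "d" are unique
example = {"100": {"a", "b", "c"}, "101": {"b", "c", "d"}}
print(unique_chunks(example))  # {'100': {'a'}, '101': {'d'}}
# under the script's model, the unique space per VM would be len(set) * 4 MiB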
 
Doesn't work on nested namespaces, as described here:
https://forum.proxmox.com/threads/h...y-backups-from-namespaces.154202/#post-702157

Code:
/mnt/datastore/internal/
└── ns
    ├── customer1
    │   └── ns
    │       ├── customer1-hw1
    │       ├── customer1-hw2
    │       └── customer1-hw3
    ├── customer2
    │   └── ns
    │       ├── customer2-hw1
    │       ├── customer2-hw2
    │       └── customer2-hw3
    ├── customer3
    │   └── ns
    │       ├── proxmox4
    │       ├── proxmox6
    │       ├── proxmox7
    │       └── proxmox8
 
Did you try to hardcode your datastore path to the nested path, e.g. /mnt/datastore/internal/ns/customer1? In that case it should work, I think, because the script looks for $datastore/ns/$namespace.

You could then loop with bash through all the customers, since the script does not support this recursive lookup:
Code:
for customer in /mnt/datastore/internal/ns/*; do python3 [...] "$customer"; done

If you want to nest loops even further, you could additionally go for:
Code:
for customer in /mnt/datastore/internal/ns/*; do for namespace in "$customer"/ns/*; do python3 [...] -n "$(basename "$namespace")" "$customer"; done; done
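
Alternatively, instead of nesting shell loops, a small helper could enumerate all nested namespaces first (a sketch, assuming the on-disk layout shown above, where every namespace level lives in an ns/ subdirectory):

Python:
from pathlib import Path
import sys

def find_namespaces(root: Path, prefix: str = "") -> list[str]:
    # recursively collect namespace paths like "customer1/customer1-hw1"
    namespaces = []
    ns_dir = root / "ns"
    if not ns_dir.is_dir():
        return namespaces
    for child in sorted(ns_dir.iterdir()):
        if child.is_dir():
            name = f"{prefix}/{child.name}" if prefix else child.name
            namespaces.append(name)
            namespaces.extend(find_namespaces(child, name))
    return namespaces

if __name__ == "__main__":
    # e.g. python3 find_ns.py /mnt/datastore/internal
    for ns in find_namespaces(Path(sys.argv[1])):
        print(ns)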
 
Thanks for the info. I did loops over my customers and did something nasty :)
For example, for one customer:

Code:
ESTIMATED DISK USAGE OCCUPIED ON PROXMOX-BACKUP (number of chunks * 4MB) FOR NAMESPACE customer1
customer1 / customer1-hw1 692.465 GB
-----------
TOTAL:
----------
692.465 GB

But du of all these chunks gives a different total:
Code:
find -type f -name '*.fidx' -exec proxmox-backup-debug inspect file {} --decode - \; | tr -c -d "[:alnum:]\n" > customer1.chunks
then
Code:
:sort u
in vim, and
Code:
cat customer1.chunks | egrep [0-9a-f]{64} | sed -r 's/(.{4})(.{60})/\/mnt\/datastore\/internal\/.chunks\/\1\/\1\2/' | tr "\n" "\0" | du --files0-from - -schm
and this gives me:
Code:
255.320 GB    total
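
For what it's worth, the whole find/vim/sed/du round trip can be done in a single pass (a sketch, assuming the file holds 64-hex-character digests and chunks live under .chunks/<first 4 hex>/<digest>, as in your sed expression):

Python:
import os
import re
import sys

CHUNK_RE = re.compile(r"[0-9a-f]{64}")

def real_usage(digest_file: str, datastore: str) -> int:
    # deduplicate the digests (this replaces the ":sort u" step in vim)
    digests = set()
    with open(digest_file) as f:
        for line in f:
            match = CHUNK_RE.search(line)
            if match:
                digests.add(match.group(0))
    # sum the actual on-disk size of each unique chunk
    total = 0
    for d in digests:
        path = os.path.join(datastore, ".chunks", d[:4], d)
        try:
            total += os.path.getsize(path)
        except OSError as e:
            print(f"missing chunk {d}: {e}", file=sys.stderr)
    return total

if __name__ == "__main__":
    # e.g. python3 real_usage.py customer1.chunks /mnt/datastore/internal
    size = real_usage(sys.argv[1], sys.argv[2])
    print(f"{size / 1000**3:.3f} GB total")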

But I tried to do some investigation. 4M is probably the maximum chunk size; lots of chunks are smaller, and the average chunk file is probably around 1M.

Code:
root@proxmox-backup:~/petr# python3 histo.py /mnt/datastore/internal/.chunks/
File Size Histogram:
0 KB - 999 KB : 1434266 files
1000 KB - 1999 KB : 556446 files
2000 KB - 2999 KB : 96721 files
3000 KB - 3999 KB : 60269 files
4000 KB - 4999 KB : 34895 files
5000 KB - 5999 KB : 0 files

histo.py is
Python:
import os
import sys
import math

# Define bucket size in bytes (1000 KB, to match the histogram above)
BUCKET_SIZE = 1000 * 1024

def get_file_sizes(directory):
    file_sizes = []
    for foldername, subfolders, filenames in os.walk(directory):
        for filename in filenames:
            filepath = os.path.join(foldername, filename)
            try:
                file_size = os.path.getsize(filepath)
                file_sizes.append(file_size)
            except OSError as e:
                print(f"Error retrieving size for {filepath}: {e}")
    return file_sizes

def print_histogram(file_sizes, bucket_size):
    if not file_sizes:
        print("No files to display in histogram.")
        return

    # Create buckets
    max_size = max(file_sizes)
    num_buckets = math.ceil(max_size / bucket_size)
    histogram = [0] * (num_buckets + 1)

    # Populate buckets
    for size in file_sizes:
        bucket_index = size // bucket_size
        histogram[bucket_index] += 1

    # Print histogram
    print("File Size Histogram:")
    for i, count in enumerate(histogram):
        lower_bound = i * bucket_size
        upper_bound = (i + 1) * bucket_size - 1
        print(f"{lower_bound // 1024} KB - {upper_bound // 1024} KB : {count} files")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python script.py <directory_path>")
        sys.exit(1)

    directory = sys.argv[1]
    if not os.path.isdir(directory):
        print(f"The path {directory} is not a valid directory.")
        sys.exit(1)

    file_sizes = get_file_sizes(directory)
    print_histogram(file_sizes, BUCKET_SIZE)
 
Yes, this is a known limitation of the script; there are probably approaches to make this estimation more accurate.

The easiest, but probably not the fastest, way would be a switch that checks how big each encountered chunk actually is (it was mentioned elsewhere that this might be very resource-intensive). A faster version could take only a sample (1%?), build a size distribution from it, and use its average, as sketched below.
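
The sampling variant could look roughly like this (a sketch; chunk_paths stands in for the chunk paths the script already derives, and the 1% ratio is just the guess from above):

Python:
import os
import random

def estimate_total_size(chunk_paths: list[str], sample_ratio: float = 0.01) -> int:
    # stat only a random sample of the referenced chunks instead of
    # assuming a fixed 4 MiB per chunk
    sample_size = max(1, int(len(chunk_paths) * sample_ratio))
    sample = random.sample(chunk_paths, sample_size)
    average = sum(os.path.getsize(p) for p in sample) / sample_size
    # extrapolate: average sampled chunk size times total chunk count
    return int(average * len(chunk_paths))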

How long did your script take to go through your 2 million chunks? And how big is the datastore, and on what kinds of disks?
 
