Retention Policy Design on PBS

gun.k

New Member
Nov 1, 2024
We are using PBS and plan to adjust our backup retention policy to balance long-term restore capability with efficient storage usage.
Daily Backup Job (Incremental)
  • Schedule: Daily
  • Compression: ZSTD
  • Retention: Keep-Daily = 90 days
Weekly Backup Job (Full)
  • Schedule: Every Saturday
  • Compression: ZSTD
  • Retention: Keep-Weekly = 12
Monthly Backup Job (Full)
  • Schedule: 1st of each month
  • Compression: ZSTD
  • Retention: Keep-Monthly = 3
Currently, our backup usage is about 4.77 TB of 46.05 TB, with a deduplication factor of around 40x, and the active VM data is approximately 14.91 TB in size.

We would like to ask:
  1. Does this retention plan look sustainable and efficient in the long run, based on our current usage and deduplication rate?
  2. Can we force full backups for the weekly and monthly jobs?
  3. What best practices would you recommend for a setup like ours?
Thank you
 
1. Usage Estimates

This confused me at first, but right now the sustainability metric is located not on your datastore instance, but in the parent Datastore section summary:

proxmox-pbs-storage.png
One of the best ways to determine how accurate that estimate is, is simply to give it more time: operate your normal workload for a few weeks, see how it looks, and compare again a few weeks later. If your workload shifts, check in and see how the numbers change.
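
In the meantime, here's a back-of-the-envelope sanity check in Python using only the figures from the original post (usage, dedup factor, active data size):

```python
# Figures from the original post.
used_tb = 4.77          # actual datastore usage
dedup_factor = 40       # reported deduplication factor
active_tb = 14.91       # live VM data

# Logical (pre-dedup) size referenced by all retained backups:
logical_tb = used_tb * dedup_factor

# Expressed as "full backup equivalents" of the active data set:
full_equivalents = logical_tb / active_tb

print(f"logical data referenced: {logical_tb:.1f} TB")
print(f"~= {full_equivalents:.1f} full copies of the active data")
```

So the retained backups reference roughly 190 TB of logical data (about 13 full copies of the active 14.91 TB) while consuming under 5 TB on disk - which is exactly why "every backup is full" stays affordable on PBS.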

2. PBS Backup Schedules

PBS offers a rather different way of doing backups than what you might be familiar with from other solutions, especially ones that operate on a single OS or were designed in the era of HDDs:

  • EVERY backup is a FULL backup
  • backups are created at the BLOCK-level, in chunks
    (not at the file level as they appear in the explorer)
  • the backup begins on the PVE host:
    • disk block chunk => compression => encryption => pbs
    • file-block-chunk mapping => index => pbs (I think this is also compressed and encrypted)
  • file indexes are interpreted by the PVE host, which is how you get file-level restore
  • live full disk restore happens much like webtorrent - blocks can be restored just-in-time
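
The block-level chunking above can be pictured with a toy content-addressed store (a deliberate simplification - real PBS uses ~4 MiB chunks and its own on-disk format, and the tiny chunk size here is just for illustration):

```python
import hashlib

CHUNK_SIZE = 4  # toy size; PBS uses ~4 MiB fixed-size chunks for VM disks

def chunk_and_index(data: bytes, store: dict) -> list:
    """Split data into chunks, store each by content hash, return the index."""
    index = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # identical chunks are stored once
        index.append(digest)
    return index

store = {}
idx1 = chunk_and_index(b"AAAABBBBCCCC", store)  # first "backup"
idx2 = chunk_and_index(b"AAAAXXXXCCCC", store)  # second: one block changed
# Both backups are "full" (each has a complete index), yet only the
# unique chunks exist on disk.
print(len(store))  # 4 unique chunks: AAAA, BBBB, CCCC, XXXX
```

Both indexes describe a complete disk image, but the second backup only added one new chunk - that's the mechanism behind the 40x dedup factor in the numbers above.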

The pruning / garbage collection process walks every backup index and marks each block that is in use by ANY valid backup (non-expired, non-deleted). It then collects all unused blocks for deletion and, after a delay of at least 24h 5m, the unused blocks are permanently removed.
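
That mark-and-sweep behavior can be sketched like this (simplified; the real GC tracks chunk access times on disk rather than a `created` dict, which is an invention of this sketch):

```python
import time

def garbage_collect(store: dict, live_indexes: list, created: dict,
                    grace_seconds: float = 24 * 3600 + 5 * 60):
    """Two-phase GC sketch: mark every chunk referenced by any valid
    backup index, then sweep unreferenced chunks older than the grace period."""
    marked = set()
    for index in live_indexes:                 # phase 1: mark
        marked.update(index)
    now = time.time()
    for digest in list(store):                 # phase 2: sweep
        if digest not in marked and now - created[digest] > grace_seconds:
            del store[digest]

# Three chunks: two written long ago, one just now.
store = {"a1": b"...", "b2": b"...", "c3": b"..."}
created = {"a1": 0.0, "b2": 0.0, "c3": time.time()}
garbage_collect(store, live_indexes=[["a1"]], created=created)
print(sorted(store))  # "a1" kept (referenced), "c3" kept (grace period), "b2" gone
```

The grace period is why space doesn't free up the instant a backup expires: an unreferenced chunk survives at least one more day before the sweep removes it.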

There's a backup cache that's kept as long as the VM is running, and it is rebuilt on the first backup after a reboot. For 14 TB of data that can take a while, but once it's indexed, only changed blocks are synced - so it's incredibly network efficient.
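
The "only changed blocks are synced" part works like a dirty-block bitmap (a toy model - QEMU's actual dirty-bitmap tracking is considerably more involved):

```python
def incremental_upload(disk: list, dirty: list) -> list:
    """Return indexes of blocks that must be sent; clear the bitmap as we go."""
    sent = []
    for i, is_dirty in enumerate(dirty):
        if is_dirty:
            sent.append(i)       # only changed blocks cross the network
            dirty[i] = False     # mark clean after a successful backup
    return sent

disk = [b"blk0", b"blk1", b"blk2", b"blk3"]
dirty = [False, True, False, True]   # guest wrote blocks 1 and 3 since last backup
print(incremental_upload(disk, dirty))  # [1, 3]
```

That's also why the first backup after a reboot is slow: with no bitmap, every block has to be read and hashed once before the cheap incremental behavior resumes.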

One of the common recommendations is to ditch the idea of "daily" and run the backups every hour.

3. General Case vs Your Case

All of the above is general advice.

What's your workload like?

Do you have estimates in terms of the size and number of files being created / deleted / modified within a day or week or month?

If not, try the "wait and see" approach with the estimator in the summary - let it calculate for you based on the real data.

Also, you can check what you're envisioning against how PBS is interpreting your inputs with the pruning simulator:
https://pbs.proxmox.com/docs/prune-simulator/
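
For a quick offline check of the same idea, here's a rough Python sketch of the keep-* selection (newest backup per day/week/month bucket). It's an approximation of how I understand the rules - the simulator above is authoritative:

```python
from datetime import date, timedelta

def prune(snapshots, keep_daily, keep_weekly, keep_monthly):
    """Approximate PBS keep-* rules: for each rule, keep the newest
    snapshot in each of the N newest day/week/month buckets."""
    snaps = sorted(snapshots, reverse=True)          # newest first
    kept = set()
    rules = [(keep_daily,   lambda d: d),            # bucket by calendar day
             (keep_weekly,  lambda d: tuple(d.isocalendar()[:2])),  # ISO week
             (keep_monthly, lambda d: (d.year, d.month))]           # month
    for keep, bucket_of in rules:
        seen = []
        for s in snaps:
            b = bucket_of(s)
            if b not in seen:
                if len(seen) >= keep:
                    break
                seen.append(b)
                kept.add(s)                          # newest in this bucket
    return kept

# One backup per day for a year, ending 2024-12-31:
days = [date(2024, 12, 31) - timedelta(n) for n in range(365)]
print(len(prune(days, 90, 12, 3)))
```

Interesting side effect with your numbers: with a backup every day, all 12 weekly picks (~84 days back) and all 3 monthly picks fall inside the 90-day daily window, so in this model they add no restore points beyond the 90 dailies - worth confirming in the simulator before relying on it.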

Then if you need to adjust backup frequency to hit your targets, you can get a calendar representation to make it easier to understand at-a-glance:
pbs-pruning-simulator.png
And if you're ever in an air-gapped environment (or just want another way to remember how to get to it), it's also part of the built-in documentation, available from the Prune help:

pbs-add-prune.png
pbs-prune-sim.png
 