keep last X vs hourly vs xxx

Hello.
I'm trying to understand a sync process that I have and the amount of data storage it requires.

Here's my environment.
production - 2 servers, PVE and PVE/PBS
DR site - a single PVE/PBS server

The production server does backups every 2 hours, and a sync job replicates them over to the PVE/PBS server on the same schedule. The thinking is that I have a 2-hour window if something goes wrong on the production box and I need to restore something. Wouldn't the better option be to replicate every hour with a 2-hour backup schedule, to shorten the window?
The retention level on the production side (prune) is configured to keep last 5, daily 8, weekly/monthly 12, and yearly 2, with weekly garbage collection.

The DR side does a pull every 2 hours from what is on the production side.
The DR side prune job is set to last 8, daily 30, weekly/monthly 12, and yearly 5.

I have 17 servers containing approx. 2.1 TB of data. My current backup storage on the DR side is approx. 8 TB and climbing.
Manually doing the math based on the number of servers times snapshots, I would end up with approx. 1,244 backups/snapshots on the DR side. The same math means that every year I would add 2.1 TB for my yearly backups, plus a 'floating storage' space of 11 TB. Does this sound correct? I didn't think the data sprawl would be this big, especially the amount of 'floating' data required each year until the previous monthly/weekly/etc. backups drop off. This seems to indicate that I need at least 25 TB of space on the DR side.
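
As a rough sanity check on the snapshot count (a sketch only, not the exact PBS prune logic; in practice one snapshot can satisfy several keep categories at once, so the real number is lower), the worst case is simply the sum of the keep values per guest:
Code:
# Worst-case retained snapshots under the DR-side keep settings (illustrative only).
dr_keep = {"last": 8, "daily": 30, "weekly": 12, "monthly": 12, "yearly": 5}
servers = 17

per_guest = sum(dr_keep.values())   # 67 snapshots per guest at most
total = per_guest * servers         # 1139 snapshots across all guests
print(per_guest, total)             # same ballpark as the ~1,244 figure above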

What am I missing? Also, even though I have over 1 year of data (backups/snapshots) on my system and my prune/GC jobs have run, I'm still showing more files than I should. For instance: 1 server has 18 months of backups, and it should only have 12 + 1 (yearly). The jobs complete without issues, but my data never gets reduced.
 
Did you already try the Prune simulator at https://pbs.proxmox.com/docs/prune-simulator/ ?
I found it really helpful for understanding the implications of specific prune settings.
Can you please post a log of your prune and GC jobs? Maybe they give an insight into what's going wrong.
Which storage media are you using and how do you connect it to the PBS?
 
@markf1301 The amount of data used on PBS depends heavily on 1) how much is duplicated, since PBS deduplicates, and 2) how much data changes in between backups. A VM that is off will take basically no extra space for each backup. A VM that changes all of its data every day will be a full backup every day.
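
As an illustration of how those two factors drive the datastore size (the numbers below are made-up placeholders, not measurements from this environment), a back-of-the-envelope estimate is the unique base data plus the daily changed chunk data multiplied by how far back the retained snapshots reach:
Code:
# Back-of-the-envelope datastore size estimate (illustrative assumptions only).
base_unique_tb  = 2.1    # unique data across all guests after deduplication
daily_change_tb = 0.05   # new/changed chunk data written per day (unknown in this thread)
retention_days  = 365    # roughly how far back the oldest retained snapshots reach

estimate_tb = base_unique_tb + daily_change_tb * retention_days
print(round(estimate_tb, 1))   # roughly 20 TB with these made-up numbers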

> The dr side prune job is set to last-8, daily-30, weekly/monthly-12, and 5 years.

So it sounds like your issue is that PBS should be pruning out months 13-18, and isn't? (Though it should also keep the last one from 2023.)
 
Here's the latest prune/GC jobs. I removed some of the prune info, but you can see the start, schedule, and completion.
The GC says that it still has 1.6 TB to be removed, but I haven't seen that data actually removed from the system. After running both of these jobs today, I still have approx. 7.7 TB of datastore usage. When would the 1.6 TB actually get removed, if not after the job completes? Yes, I did refresh the data.

Correct, SteveITs - I shouldn't have more than 63 backups/snapshots on the system for all my servers, plus 1 additional set of 17 for my servers on a yearly basis.

My # of snapshots did drop from 687 to 513, but not my 'free space' on the datastore.
 


The prune marks the chunks deletable; GC removes them: https://pbs.proxmox.com/docs/backup-client.html#garbage-collection

In your prune log, the "protected" backups will stay forever. 209 and 603 don't have backups older than the two protected ones? Therefore nothing to prune there...?
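
Conceptually (a toy mark-and-sweep sketch of the prune-then-GC behaviour described above; the function and data structures here are made up for illustration, not the actual PBS code), a chunk only disappears once no remaining snapshot references it and its last access is older than the cutoff:
Code:
from datetime import timedelta

# chunks maps digest -> last access time; referenced is the set of digests
# still used by any remaining snapshot index after pruning.
def garbage_collect(chunks, referenced, now, cutoff=timedelta(hours=24, minutes=5)):
    removed = pending = 0
    for digest, atime in list(chunks.items()):
        if digest in referenced:
            chunks[digest] = now      # phase 1: still in use -> keep, refresh access time
        elif now - atime > cutoff:
            del chunks[digest]        # phase 2: unused and old enough -> actually deleted
            removed += 1
        else:
            pending += 1              # unused but too recent -> shows up as "pending removals"
    return removed, pending
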
True. That said, the last report from my GC shows the following:
2025-05-29T13:07:02-04:00: Removed chunks: 3744
2025-05-29T13:07:02-04:00: Pending removals: 1.395 TiB (in 1287856 chunks)
2025-05-29T13:07:02-04:00: Original data usage: 61.633 TiB
2025-05-29T13:07:02-04:00: On-Disk usage: 5.93 TiB (9.62%)
2025-05-29T13:07:02-04:00: On-Disk chunks: 4261065
2025-05-29T13:07:02-04:00: Deduplication factor: 10.39
2025-05-29T13:07:02-04:00: Average chunk size: 1.459 MiB
2025-05-29T13:07:02-04:00: TASK OK

When are the 'pending removals' supposed to happen, if not after the job, or several hours later?
 
Chunks are removed if they are not referenced in a backup snapshot for more than 24 hours and five minutes. If I recall correctly, there is now an option to change this, but I'm not sure what caveats changing it might have.
 
Here's the latest prune/GC jobs. I removed some of the prune info, but you can see the start, schedule, and completion.

I actually wanted to see whether the prune job even removed anything; you removed the part I was most interested in ;)

From the GC log it seems that it removed 5.91 GiB, and 1.395 TiB of unused chunks will be removed at a later time, since their snapshots were pruned less than the access time cutoff ago (from your log):
2025-05-29T11:54:57-04:00: starting garbage collection on store corpbackups
2025-05-29T11:54:58-04:00: Access time update check successful, proceeding with GC.
2025-05-29T11:54:58-04:00: Using access time cutoff 1d 5m, minimum access time is 2025-05-28T15:49:57Z
2025-05-29T11:54:58-04:00: Start GC phase1 (mark used chunks)
2025-05-29T12:01:09-04:00: marked 1% (6 of 516 index files)
...(removed lines)...
2025-05-29T13:00:57-04:00: marked 99% (511 of 516 index files)
2025-05-29T13:01:21-04:00: marked 100% (516 of 516 index files)
2025-05-29T13:01:21-04:00: Start GC phase2 (sweep unused chunks)
2025-05-29T13:01:40-04:00: processed 1% (55740 chunks)
2025-05-29T13:01:54-04:00: processed 2% (111054 chunks)
...(removed lines)...
2025-05-29T13:06:53-04:00: processed 97% (5385889 chunks)
2025-05-29T13:06:56-04:00: processed 98% (5441458 chunks)
2025-05-29T13:06:59-04:00: processed 99% (5497213 chunks)
2025-05-29T13:07:02-04:00: Removed garbage: 5.91 GiB
2025-05-29T13:07:02-04:00: Removed chunks: 3744
2025-05-29T13:07:02-04:00: Pending removals: 1.395 TiB (in 1287856 chunks)
2025-05-29T13:07:02-04:00: Original data usage: 61.633 TiB
2025-05-29T13:07:02-04:00: On-Disk usage: 5.93 TiB (9.62%)
2025-05-29T13:07:02-04:00: On-Disk chunks: 4261065
2025-05-29T13:07:02-04:00: Deduplication factor: 10.39
2025-05-29T13:07:02-04:00: Average chunk size: 1.459 MiB
2025-05-29T13:07:02-04:00: TASK OK

So to me this looks like everything works as designed ;) I would expect that a garbage collection job launched on 2025-05-30 after 1 PM will remove most if not all of the pending removals. Please report how it works out.
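
As a rough worked example of that timing (a simplification; it assumes the newest of those pending chunks was last accessed no later than when the 2025-05-29 run finished marking, around 13:01 local time):
Code:
from datetime import datetime, timedelta, timezone

cutoff = timedelta(hours=24, minutes=5)            # default access time cutoff
tz = timezone(timedelta(hours=-4))

# Latest plausible access time of the chunks reported as "pending removals" above.
last_touched = datetime(2025, 5, 29, 13, 1, tzinfo=tz)

earliest_sweep = last_touched + cutoff
print(earliest_sweep.isoformat())   # 2025-05-30T13:06:00-04:00 -> a GC started after ~1 PM should sweep them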
 
Chunks are removed if they are not referenced in a backup snapshot for more than 24 hours and five minutes. If I recall correctly, there is now an option to change this, but I'm not sure what caveats changing it might have.
I would think that most people are doing backups more frequently than 24 hours and 5 minutes apart, so having another mechanism, or the ability to change that, would help keep disk space under control. Not sure where to find that option, but it would be nice. A couple of other thoughts: since there isn't a way to 'recover' those deleted backups from the GC job that I'm aware of, and since you pruned them 'manually' (by creating that process, or doing it by hand), that space should be freed up after the job completes. That seems more reasonable to me, sort of like 'deleting' files without having a 'recycle' or recovery bin to restore them from.
 
More frequent backups would each be smaller.

The 24 hours is after a prune job marks them as deletable, unrelated to any backup job. I get what you're saying though, not really sure why it waits a day.

I suppose one could run multiple prune and/or GC jobs per day.
 
An update for everyone.
I checked the storage space this a.m. and no space has been reclaimed from running the GC job yesterday. It hasn't been a full 24h+5m as referenced above, so I'll check in a few hours when the (approx.) 24-hour cycle has come and gone.

I located another thread with a user having the same concerns; there was a reference by Fabian that it might be 'dependent' on chunks containing similar data from other snapshots. Not sure I understand that theory, because I (and probably others as well) who have many (40+) snapshots from the same or similar servers will have the same 'chunks' of data in them.

The user did state that rebooting the server seemed to fix his issue and clear up the space.
Great - two theories now. Any bets on which might come to fruition?

Reference link to the other post: https://forum.proxmox.com/threads/solved-garbage-collection-doesnt-delete-anything.151269/
 
True. That said, the last report from my GC shows the following:
2025-05-29T13:07:02-04:00: Removed chunks: 3744
2025-05-29T13:07:02-04:00: Pending removals: 1.395 TiB (in 1287856 chunks)
2025-05-29T13:07:02-04:00: Original data usage: 61.633 TiB
2025-05-29T13:07:02-04:00: On-Disk usage: 5.93 TiB (9.62%)
2025-05-29T13:07:02-04:00: On-Disk chunks: 4261065
2025-05-29T13:07:02-04:00: Deduplication factor: 10.39
2025-05-29T13:07:02-04:00: Average chunk size: 1.459 MiB
2025-05-29T13:07:02-04:00: TASK OK

When are the 'pending removals' supposed to happen, if not after the job, or several hours later?
Correct. I'm more concerned about the 1.x TB of space that was 'collected' during the GC job but hasn't been removed, or reclaimed as additional storage, at this time.
 
More frequent backups would each be smaller.

The 24 hours is after a prune job marks them as deletable, unrelated to any backup job. I get what you're saying though, not really sure why it waits a day.

I suppose one could run multiple prune and/or GC jobs per day.
I'm going to have to do that, in hopes of reclaiming space. I thought I'd seen something in either a forum or the documentation about running the prune/GC jobs too close to each other being an issue - something to do with keeping track of the chunks, or perhaps that was on older versions. My current process was (until today) to run the prune jobs daily, after backups, and then run the GC job once a week. I've since changed that so the GC jobs run daily as well, albeit a few hours after the prune job.
 
The access time cutoff, after which unused chunks will be cleaned up, can be configured in the datastore tuning parameters [0] since PBS version 3.4.0; the reasoning for the default 24h 5m is explained here [1]. You can also set this via the WebUI in the datastore's Options tab; the minimum is 1 minute.

However, the direct atime updates performed by PBS are honored by any sane filesystem implementation; therefore the cutoff can also be lowered when the filesystem is mounted with the relatime option. We added an atime safety check to the start of the garbage collection run to verify that, so please do keep the atime safety check enabled (default), especially if you want to reduce the cutoff. The check will use a marker to detect if the atime updates are handled as expected by the filesystem the datastore resides on.

[0] https://pbs.proxmox.com/docs/storage.html#tuning
[1] https://pbs.proxmox.com/docs/maintenance.html#gc-background
 
Not 'similar' but the same... if the same chunk is used across 50 servers, then deleting all the backups of server #37 won't remove that chunk. Sounds like that poster had a problem though, if a reboot let the deletion actually happen.
FYI - the reboot didn't help with the space reclamation.
Can you expand on the chunk thing a bit? Potentially most servers, in everyone's environment, will have 70% of the same data across their servers, with a 30% 'change rate' of data. If this is a good analogy, then I should only be backing up the 'changed' blocks, or 30% of the data, across the servers. Correct?
Whenever I look at my backups/snapshots I don't see, in that case, 30% of a 60 GB file for my daily backups, but rather a 60 GB file for each day. That's what's not making sense to me.

I've looked at my backup jobs and can't even determine how much data is actually changing on the servers during that process, and the same is true with the PBS pull jobs. If there were a way to determine how much data is changing daily, either globally or individually on each server, then I might be able to better predict some storage requirements.
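
To illustrate the distinction (a toy sketch, not how PBS actually stores data): each snapshot's index references every chunk of the disk, so the snapshot is listed at its full 60 GB logical size, while only chunks whose digest isn't already in the store consume new space:
Code:
import hashlib

store = set()   # digests of chunks already on disk (the deduplicated chunk store)

def backup(disk_chunks):
    """Toy model: returns (logical size, newly written bytes) for one snapshot."""
    index, new_bytes = [], 0
    for chunk in disk_chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        index.append(digest)          # the snapshot references every chunk -> full logical size
        if digest not in store:
            store.add(digest)         # only previously unseen chunks cost disk space
            new_bytes += len(chunk)
    return sum(len(c) for c in disk_chunks), new_bytes

day1 = [b"A" * 4, b"B" * 4, b"C" * 4]   # pretend 4-byte "chunks"
day2 = [b"A" * 4, b"B" * 4, b"X" * 4]   # only one chunk changed since yesterday
print(backup(day1))                     # (12, 12) -> first backup writes everything
print(backup(day2))                     # (12, 4)  -> same logical size, only the changed chunk written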
 
The access time cutoff, after which unused chunks will be cleaned up, can be configured in the datastore tuning parameters [0] since PBS version 3.4.0; the reasoning for the default 24h 5m is explained here [1]. You can also set this via the WebUI in the datastore's Options tab; the minimum is 1 minute.

However, the direct atime updates performed by PBS are honored by any sane filesystem implementation; therefore the cutoff can also be lowered when the filesystem is mounted with the relatime option. We added an atime safety check to the start of the garbage collection run to verify that, so please do keep the atime safety check enabled (default), especially if you want to reduce the cutoff. The check will use a marker to detect if the atime updates are handled as expected by the filesystem the datastore resides on.

[0] https://pbs.proxmox.com/docs/storage.html#tuning
[1] https://pbs.proxmox.com/docs/maintenance.html#gc-background
Chris -
So if my job started at 11:54 and finished at 13:06, almost 2 hours later, then you are suggesting that today at approx. 13:15 the 1.395 TB of space should be removed from the system and reallocated as available space. Is this correct?


2025-05-29T11:54:57-04:00: starting garbage collection on store corpbackups
2025-05-29T11:54:58-04:00: Access time update check successful, proceeding with GC.
2025-05-29T11:54:58-04:00: Using access time cutoff 1d 5m, minimum access time is 2025-05-28T15:49:57Z
2025-05-29T11:54:58-04:00: Start GC phase1 (mark used chunks)
2025-05-29T12:01:09-04:00: marked 1% (6 of 516 index files)

and then finished, at this time -
2025-05-29T13:06:59-04:00: processed 99% (5497213 chunks)
2025-05-29T13:07:02-04:00: Removed garbage: 5.91 GiB
2025-05-29T13:07:02-04:00: Removed chunks: 3744
2025-05-29T13:07:02-04:00: Pending removals: 1.395 TiB (in 1287856 chunks)
2025-05-29T13:07:02-04:00: Original data usage: 61.633 TiB
2025-05-29T13:07:02-04:00: On-Disk usage: 5.93 TiB (9.62%)
2025-05-29T13:07:02-04:00: On-Disk chunks: 4261065
2025-05-29T13:07:02-04:00: Deduplication factor: 10.39
2025-05-29T13:07:02-04:00: Average chunk size: 1.459 MiB
2025-05-29T13:07:02-04:00: TASK OK
 
Update - it's now 13:36, which should definitely be more than the 24h+5m rule mentioned above (and also shown in the GC log). Current status hasn't changed.
I'll try to be more patient, but I am currently at 81% of capacity on my storage.

 
rather a 60 GB file for each day
I had to wrap my head around it a bit, too. All backups in PVE are full backups. With PBS, only the "new chunks" are saved to disk, since the existing ones already exist. So the backup of a 60 GB disk is shown as 60 GB even if most of it is zeroes. That's the deduplication at work. Our dedupe factor I think was 110-ish until we added several VMs last week, so now it's around 70.

2 hours for GC seems like a lot. On our (admittedly brand new) server it takes under a minute. Do you have the new GC cache set high enough? See this thread for instance.
how much data is changing
In the task log for the backup, on the server/node, you should see that:
Code:
...
INFO: started backup task '24e52e79-1e76-49f4-8b2d-f1765e55c3b8'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (9.3 GiB of 400.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 9.3 GiB dirty of 400.0 GiB total
INFO:  16% (1.5 GiB of 9.3 GiB) in 3s, read: 517.3 MiB/s, write: 488.0 MiB/s
INFO:  35% (3.4 GiB of 9.3 GiB) in 6s, read: 629.3 MiB/s, write: 514.7 MiB/s
INFO:  54% (5.1 GiB of 9.3 GiB) in 9s, read: 590.7 MiB/s, write: 582.7 MiB/s
INFO:  76% (7.2 GiB of 9.3 GiB) in 12s, read: 706.7 MiB/s, write: 508.0 MiB/s
INFO:  89% (8.3 GiB of 9.3 GiB) in 15s, read: 401.3 MiB/s, write: 388.0 MiB/s
INFO: 100% (9.3 GiB of 9.3 GiB) in 18s, read: 340.0 MiB/s, write: 312.0 MiB/s
INFO: Waiting for server to finish backup validation...
INFO: backup is sparse: 192.00 MiB (2%) total zero data
INFO: backup was done incrementally, reused 391.82 GiB (97%)
INFO: transferred 9.33 GiB in 21 seconds (455.0 MiB/s)
INFO: adding notes to backup
INFO: Finished Backup of VM 112 (00:00:21)
...
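
If you want to turn those log lines into a daily change figure, a rough approach (hypothetical helper below, assuming you save the task logs to text files; it only counts guests that actually use the dirty bitmap) is to sum the "dirty" amounts reported for each guest:
Code:
import re

# Sums the "X GiB dirty" figures from saved PVE backup task logs.
# File names below are placeholders for wherever you store the logs.
pattern = re.compile(r"using fast incremental mode \(dirty-bitmap\), ([\d.]+) GiB dirty")

def daily_change_gib(log_files):
    total = 0.0
    for path in log_files:
        with open(path) as f:
            for line in f:
                m = pattern.search(line)
                if m:
                    total += float(m.group(1))
    return total

print(daily_change_gib(["vm112-backup.log", "vm113-backup.log"]))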
 