Strategy: Fastest way to clear and refill a daily external datastore

Chris&Patte

Renowned Member
Sep 3, 2013
Hello,
I would like to discuss the best/fastest way to update the content of an external datastore.

Environment:
We use a PBS with a big datastore which stores all our backups.
We also have a bunch of external hard drives, one for each day of the week. Every day we plug one in, free its space (prune & GC) and then sync a part of the data to it. After that we remove that external datastore / hard drive and put it into a fireproof safe.
This way we have 7 disks, one for each day of the week, each containing the latest snapshots of that day, stored away from the PBS.

The external datastores are filled close to maximum (more than 90%), so I cannot just sync the new content onto them and then delete (prune & GC) the content from the week before; I have to clear them (prune & GC) completely before starting to sync.
This way both the GC and the sync take really long, as the sync cannot use the deduplication mechanism / cannot reuse the chunks that already exist from the previous week's backups.

I now wonder if there is a better way to do this? Partially pruning & GCing, then syncing, then pruning & GCing again multiple times is also not possible because of the 24h that need to pass between them. In fact I only have 24h per external datastore, as I change them every day. So when I insert an external datastore I can prune & GC it (the last time this was done was a week before) and then sync it. If the free space is not sufficient I could do the same the next day, but by then I need to replace that drive with the next one for the next day.

Another question is: if there is no solution for reusing those chunks from the week before, what is the fastest way to clear that external datastore at all, so that I can at least save some time (and mechanical wear on the drives) by not pruning & GCing the whole content of that external drive.
 
Hi,

How big are the deltas typically? If you are within that approx. 10% headroom you mention, then syncing the latest contents with transfer-last set to 1 should work. It would allow you to reuse the already known chunks and drastically reduce the transfer times.

Best would probably be (a rough script sketch of these steps follows below):
  1. run garbage collection once to free up the chunks already pending from the previous (last week's, therefore more than 24h ago) garbage collection run
  2. sync with transfer-last set to 1
  3. prune older snapshots (so unused chunks can be cleared)
  4. run garbage collection again to mark the used chunks, so the unused ones can be cleared the next time you attach the disk and run garbage collection again in step 1
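If it helps, here are those four steps as a rough wrapper-script sketch. All remote/datastore/job names in it are just placeholders, and the exact CLI sub-commands and options (`pull --transfer-last`, `prune-job run`) are written from memory here, so please double check them against `man proxmox-backup-manager` on your version; the same actions are available via the GUI.

```python
#!/usr/bin/env python3
"""Sketch of the weekly cycle on one external datastore.

All names (remote, datastores, jobs) are placeholders for illustration only.
"""
import subprocess

REMOTE = "main-pbs"              # hypothetical remote entry pointing at the big PBS
REMOTE_STORE = "backups"         # hypothetical source datastore on that remote
EXTERNAL_STORE = "offsite-mon"   # hypothetical external datastore for this weekday


def run(*cmd: str) -> None:
    """Run one CLI command and abort the cycle if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 1. GC first: actually frees the chunks that last week's run could only mark as unused.
run("proxmox-backup-manager", "garbage-collection", "start", EXTERNAL_STORE)

# 2. Pull only the newest snapshot per group. The --transfer-last option is assumed
#    here to be available on the pull command (it mirrors the sync job's
#    "Transfer Last" setting); please verify this against your PBS version.
run("proxmox-backup-manager", "pull", REMOTE, REMOTE_STORE, EXTERNAL_STORE,
    "--transfer-last", "1")

# 3. Prune the older snapshots on the external datastore so their chunks become
#    unreferenced; shown here via a prune job assumed to be configured for this
#    datastore (doing it from the GUI works just as well).
run("proxmox-backup-manager", "prune-job", "run", "prune-offsite-mon")

# 4. GC again: marks the chunks that are still referenced, so that step 1 can
#    delete the rest when this disk comes back next week (24h age limit permitting).
run("proxmox-backup-manager", "garbage-collection", "start", EXTERNAL_STORE)
```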

Does running the garbage collection before and after the sync maybe already help, as it then cleans up the chunks for the snapshots that were pruned a week before? Also note: there is work in progress to reduce the 24h cleanup window and to speed up phase one of garbage collection, see [0] and [1].

[0] https://lore.proxmox.com/pbs-devel/20250306145252.565270-1-c.ebner@proxmox.com/T/
[1] https://lore.proxmox.com/pbs-devel/20250310111634.162156-1-c.ebner@proxmox.com/T/
 
Thank you, Chris, for your contribution to my problem.

I had hoped that there is a way to prune & GC the chunks that are no longer needed before the new data gets synced at all.
I am thinking about a special function where you
- could select the data to be synced on the source datastore,
- check the chunks on the target datastore against the "to be synced" index files/snapshots, then
- delete (GC) all chunks on the target datastore that are not needed any more BEFORE the sync has happened, and
- then sync the additional chunks that are missing to the target datastore.

That functionality would need logic that uses the index files of the "to be synced" data on the source datastore and checks them against the existing chunks on the target datastore, thereby updating the timestamps of the chunks that are still needed.
A GC run would then delete all unwanted chunks (thereby "removing" all snapshots on the target datastore, but without removing all of their chunks) and free up space for the missing chunks to be synced to the datastore, roughly as sketched below.
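Roughly what I imagine, as a sketch only (this is of course not existing functionality; the chunk store layout `.chunks/<first 4 hex characters>/<digest>` is the standard one, everything else, including how the digest set is obtained, is made up):

```python
#!/usr/bin/env python3
"""Sketch of the idea only - this is NOT existing PBS functionality.

Given the set of chunk digests referenced by the snapshots that are about to be
synced (which would have to be extracted from the source datastore's index files;
that part is not shown), keep/touch exactly those chunks in the target's chunk
store and delete everything else BEFORE the sync runs.
"""
import os
import pathlib


def preclean_chunk_store(chunk_dir: str, needed_digests: set[str]) -> None:
    """Walk <datastore>/.chunks/<4-hex-prefix>/<digest> and drop unneeded chunks."""
    kept = removed = 0
    for chunk in pathlib.Path(chunk_dir).glob("????/" + "?" * 64):
        if chunk.name in needed_digests:
            os.utime(chunk)       # "touch": this chunk will be reused by the sync
            kept += 1
        else:
            chunk.unlink()        # free the space for the chunks the sync will bring
            removed += 1
    print(f"kept {kept} chunks, removed {removed} chunks")


# Hypothetical call; needed_digests would come from the "to be synced" index files
# on the source datastore:
# preclean_chunk_store("/mnt/external/offsite-mon/.chunks", needed_digests)
```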

The problem with this approach is of course that in the middle of the process the datastore is in a "broken" state, with no functional snapshot but full of chunks that are not referenced by any index file/snapshot on the target datastore itself, only by the "to be synced" snapshots/index files on the source datastore.

That approach / functionality would therefore need a new datastore status, and a way to mark that target datastore so that it can never be mistaken for a datastore in a "correct/normal" state (in case the procedure gets interrupted, an error happens, etc.).

But OK, it's surely too much work for such an edge case. I could also just buy a bigger external drive :-)
 
No, I do not think that such a "touch chunks which might get reused in the future" mechanism will get implemented, unless there is some additional, stronger use case for it. A chunk is only considered in-use as long as it is referenced by an index file of that same datastore; remotes do not count for that.
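To illustrate the point, the marking phase of garbage collection conceptually does something like the following sketch (greatly simplified, not the actual implementation; the index parsing is left to the caller): only index files found inside the datastore's own directory tree are walked.

```python
import os
import pathlib
from typing import Callable, Iterable


def mark_in_use_chunks(
    datastore_root: str,
    digests_in: Callable[[pathlib.Path], Iterable[str]],  # index parser, supplied by caller
) -> None:
    """Simplified GC phase 1: touch every chunk referenced by a *local* index file."""
    root = pathlib.Path(datastore_root)
    # Only .fidx/.didx index files inside this datastore's own tree are walked;
    # index files on the source PBS (the remote) never keep a chunk alive here.
    for index in list(root.rglob("*.fidx")) + list(root.rglob("*.didx")):
        for digest in digests_in(index):
            chunk = root / ".chunks" / digest[:4] / digest
            if chunk.exists():
                os.utime(chunk)  # updating the timestamp marks the chunk as in use
```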
 