PBS Deletes Valid Backup Chunks During GC – Unreliable Restores on Large-Scale Setup (70 Nodes, 500+ VMs)

masood96
Jun 26, 2025
Hi everyone,

I’m facing a critical issue with my Proxmox Backup Server (PBS) setup, and I haven’t found a clear solution so far.

Setup Details:
  • PBS version: 3.4.1
  • Storage: Single ZFS pool with 35 HDDs (104TB usable space)
  • Backups: 500+ backups from over 500 VMs across a 70+ node PVE cluster
  • Backup Frequency: Daily (mandatory)
  • Pruning: Keep last 1, weekly 1, and monthly 1 (total 3 per VM)
Problem:
PBS Garbage Collection (GC) is aggressively deleting chunks that are still referenced by existing backups. I usually discover this only when trying to restore a backup, and it fails due to missing chunks. This has made 100% of backups unreliable – I cannot trust any of them for restores.

Additional issues:
  • GC takes 5+ days to complete.
  • Verification jobs take 2+ weeks, so they’re practically useless in a daily backup setup.
  • Disk health and ZFS pool status are healthy.
  • I can’t turn off GC because the server runs out of space within a month.
Question:
  • Is this a known limitation of PBS with large-scale environments?
  • How can I ensure that GC does not delete valid chunks?
  • Is there a reliable strategy to make backups restorable in this kind of high-load environment?
I would really appreciate any guidance or solutions from those who have handled large PBS deployments or faced similar issues. Thank you!
 
Hi,

Setup Details:
  • PBS version: 3.4.1
  • Storage: Single ZFS pool with 35 HDDs (104TB usable space)
you should definitely consider adding a fast, redundant disk setup as a metadata special device, as recommended in https://pbs.proxmox.com/docs/installation.html#recommended-server-system-requirements. That will not only reduce your garbage collection runtime, but will also help other operations that only access file metadata. Adding it now will only affect newly written data, but it will help in the long run.
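Adding a mirrored special device to an existing pool is a single zpool command. A minimal sketch, assuming a pool named tank and two spare SSDs (the pool name and device paths are placeholders for your setup):

  # Placeholder pool name "tank" and device paths - replace with your own.
  # Use at least a mirror: losing the special device means losing the whole pool.
  zpool add tank special mirror /dev/disk/by-id/nvme-SSD1 /dev/disk/by-id/nvme-SSD2
  # Confirm the special vdev is listed afterwards:
  zpool status tank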

  • Backups: 500+ backups from over 500 VMs across a 70+ node PVE cluster
  • Backup Frequency: Daily (mandatory)
  • Pruning: Keep last 1, weekly 1, and monthly 1 (total 3 per VM)
Problem:
PBS Garbage Collection (GC) is aggressively deleting chunks that are still referenced by existing backups. I usually discover this only when trying to restore a backup, and it fails due to missing chunks. This has made 100% of backups unreliable – I cannot trust any of them for restores.
This is a known issue that was fixed with version 3.4.1: on setups with long-running garbage collection tasks, frequent backups, and aggressive pruning, chunks might not have been marked as in use, see https://git.proxmox.com/?p=proxmox-backup.git;a=commit;h=cb9814e33112f2e4847083a1b742e3126952064b

Since you are already on version 3.4.1, I guess that in your particular case garbage collection no longer actually removes referenced chunks, but newly created backups using fast incremental mode keep re-referencing chunks that are already missing. To break this chain, you will have to either stop the VMs before the next backup run (a full stop, not just a VM reboot) or verify at least the last backup snapshot in each group. If verification fails, the next backup run will be a full one, re-uploading all chunks (which might even heal some of the previous backups if the chunk data is unchanged).

Additional issues:
  • GC takes 5+ days to complete.
As mentioned above, the best option is to set up an additional metadata special device for your ZFS pool. Also, since PBS 3.4 there is a chunk cache that avoids multiple atime updates on the same chunk file during phase 1 of garbage collection. You might want to increase the gc-cache-capacity value to its maximum in the datastore's tuning options if you have enough system memory headroom (see https://pbs.proxmox.com/docs/storage.html#tuning). Also make sure the atime safety check is enabled (which is the default).
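A minimal sketch of setting this from the shell, assuming a datastore named store1 (a placeholder) and using the tuning option names mentioned above; please check the exact spelling and the allowed value range against the linked documentation before applying:

  # Hypothetical datastore name "store1"; the cache value is illustrative,
  # check the documented maximum and your memory headroom first.
  proxmox-backup-manager datastore update store1 \
      --tuning 'gc-cache-capacity=8388608,gc-atime-safety-check=true'
  # The resulting tuning string can be checked in /etc/proxmox-backup/datastore.cfg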

  • Verification jobs take 2+ weeks, so they’re practically useless in a daily backup setup.
  • Disk health and ZFS pool status are healthy.
  • I can’t turn off GC because the server runs out of space within a month.
It might be better to run garbage collection more frequently once you have a special device set up, so that you clear unused chunks and regain space more often.
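The GC schedule itself is part of the datastore configuration and accepts systemd calendar event syntax; for example, to run it daily on the same hypothetical datastore:

  # Hypothetical datastore "store1": schedule garbage collection once per day.
  proxmox-backup-manager datastore update store1 --gc-schedule daily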

Question:
  • Is this a known limitation of PBS with large-scale environments?
No, what you observed is most likely the bug described above.

  • How can I ensure that GC does not delete valid chunks?
As mentioned above, this is fixed with 3.4.1, but you must make sure that the next backup of each VM is a full one, not a fast incremental one. A full backup is performed if the previous backup snapshot failed verification or if the VM was powered off between backup runs.

  • Is there a reliable strategy to make backups restorable in this kind of high-load environment?
You must verify the snapshots in order to detect corruption, or do restore tests to see which backup snapshots are corrupt. A new full backup might also heal some of the previously corrupt snapshots by re-uploading referenced chunks that had been deleted.
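One way to keep verification manageable is a scheduled verify job that skips snapshots which already have a valid verification result, so mostly the newly created snapshots get checked each run. A sketch assuming the hypothetical datastore store1; the job id and option names below are the verification job settings as I recall them, so double-check them against the docs/GUI before relying on this:

  # Hypothetical job id "verify-new": skip already verified snapshots and
  # re-verify them only after 30 days.
  proxmox-backup-manager verify-job create verify-new \
      --store store1 --schedule daily \
      --ignore-verified true --outdated-after 30
  proxmox-backup-manager verify-job list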

I would really appreciate any guidance or solutions from those who have handled large PBS deployments or faced similar issues. Thank you!
Hope this helps to get you going again!

Edit: Stopping the VM is not enough; the last snapshot of each backup group has to be verified.
 