PBS Deletes Valid Backup Chunks During GC – Unreliable Restores on Large-Scale Setup (70 Nodes, 500+ VMs)

masood96
Jun 26, 2025
Hi everyone,

I’m facing a critical issue with my Proxmox Backup Server (PBS) setup, and I haven’t found a clear solution so far.

Setup Details:
  • PBS version: 3.4.1
  • Storage: Single ZFS pool with 35 HDDs (104TB usable space)
  • Backups: 500+ backups from over 500 VMs across a 70+ node PVE cluster
  • Backup Frequency: Daily (mandatory)
  • Pruning: Keep last 1, weekly 1, and monthly 1 (total 3 per VM)
Problem:
PBS Garbage Collection (GC) is aggressively deleting chunks that are still referenced by existing backups. I usually discover this only when trying to restore a backup, and it fails due to missing chunks. This has made 100% of backups unreliable – I cannot trust any of them for restores.

Additional issues:
  • GC takes 5+ days to complete.
  • Verification jobs take 2+ weeks, so they’re practically useless in a daily backup setup.
  • Disk health and ZFS pool status are healthy.
  • I can’t turn off GC because the server runs out of space within a month.
Question:
  • Is this a known limitation of PBS with large-scale environments?
  • How can I ensure that GC does not delete valid chunks?
  • Is there a reliable strategy to make backups restorable in this kind of high-load environment?
I would really appreciate any guidance or solutions from those who have handled large PBS deployments or faced similar issues. Thank you!
 
Hi,

Setup Details:
  • PBS version: 3.4.1
  • Storage: Single ZFS pool with 35 HDDs (104TB usable space)
you should definitely consider adding a fast, redundant disk setup as a metadata special device, as recommended in https://pbs.proxmox.com/docs/installation.html#recommended-server-system-requirements. That will not only reduce your garbage collection runtime, but will also help other operations that only access file metadata. Adding it now will only affect newly written data, but it will help in the long run.
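Adding a mirrored special device to an existing pool is a single zpool command. A minimal sketch, assuming a pool named tank and two spare SSDs (the pool name and device paths are placeholders for your setup):

  # Placeholder pool name "tank" and device paths - replace with your own.
  # Use at least a mirror: losing the special device means losing the whole pool.
  zpool add tank special mirror /dev/disk/by-id/nvme-SSD1 /dev/disk/by-id/nvme-SSD2
  # Confirm the special vdev is listed afterwards:
  zpool status tank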

  • Backups: 500+ backups from over 500 VMs across a 70+ node PVE cluster
  • Backup Frequency: Daily (mandatory)
  • Pruning: Keep last 1, weekly 1, and monthly 1 (total 3 per VM)
Problem:
PBS Garbage Collection (GC) is aggressively deleting chunks that are still referenced by existing backups. I usually discover this only when trying to restore a backup, and it fails due to missing chunks. This has made 100% of backups unreliable – I cannot trust any of them for restores.
This is a known issue that was fixed with version 3.4.1: on setups with long-running garbage collection tasks, frequent backups, and aggressive pruning, chunks might not have been marked as in use, see https://git.proxmox.com/?p=proxmox-backup.git;a=commit;h=cb9814e33112f2e4847083a1b742e3126952064b

Since you are already on version 3.4.1, I guess that in your particular case garbage collection no longer actually removes referenced chunks, but newly created backups using fast incremental mode keep re-referencing chunks that are already missing. To break this chain, you will have to either stop the VMs before the next backup run (a full stop, not just a VM reboot) or verify at least the last backup snapshot in each group. If verification fails, the next backup run will be a full one, re-uploading all chunks (which might even heal some of the previous backups if the chunk data is unchanged).

Additional issues:
  • GC takes 5+ days to complete.
As mentioned above, the best option is to set up an additional metadata special device for your ZFS pool. Also, since PBS 3.4 there is a chunk cache that avoids multiple atime updates on the same chunk file during phase 1 of garbage collection. You might want to increase the gc-cache-capacity value to its maximum in the datastore's tuning options if you have enough system memory headroom (see https://pbs.proxmox.com/docs/storage.html#tuning). Also make sure the atime safety check is enabled (which is the default).
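A minimal sketch of setting this from the shell, assuming a datastore named store1 (a placeholder) and using the tuning option names mentioned above; please check the exact spelling and the allowed value range against the linked documentation before applying:

  # Hypothetical datastore name "store1"; the cache value is illustrative,
  # check the documented maximum and your memory headroom first.
  proxmox-backup-manager datastore update store1 \
      --tuning 'gc-cache-capacity=8388608,gc-atime-safety-check=true'
  # The resulting tuning string can be checked in /etc/proxmox-backup/datastore.cfg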

  • Verification jobs take 2+ weeks, so they’re practically useless in a daily backup setup.
  • Disk health and ZFS pool status are healthy.
  • I can’t turn off GC because the server runs out of space within a month.
It might be better to run garbage collection more frequently once you have a special device set up, so that you clear unused chunks and regain space more often.
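The GC schedule itself is part of the datastore configuration and accepts systemd calendar event syntax; for example, to run it daily on the same hypothetical datastore:

  # Hypothetical datastore "store1": schedule garbage collection once per day.
  proxmox-backup-manager datastore update store1 --gc-schedule daily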

Question:
  • Is this a known limitation of PBS with large-scale environments?
No, what you observed is most likely the bug described above.

  • How can I ensure that GC does not delete valid chunks?
As mentioned above, this is fixed with 3.4.1, but you must make sure that the next backup of each VM is a full one, not a fast incremental one. A full backup is performed if the previous backup snapshot failed verification or if the VM was powered off between backup runs.

  • Is there a reliable strategy to make backups restorable in this kind of high-load environment?
You must verify the snapshots in order to detect corruption, or do restore tests to see which backup snapshots are corrupt. A new full backup might also heal some of the previously corrupt snapshots by re-uploading referenced chunks that had been deleted.
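One way to keep verification manageable is a scheduled verify job that skips snapshots which already have a valid verification result, so mostly the newly created snapshots get checked each run. A sketch assuming the hypothetical datastore store1; the job id and option names below are the verification job settings as I recall them, so double-check them against the docs/GUI before relying on this:

  # Hypothetical job id "verify-new": skip already verified snapshots and
  # re-verify them only after 30 days.
  proxmox-backup-manager verify-job create verify-new \
      --store store1 --schedule daily \
      --ignore-verified true --outdated-after 30
  proxmox-backup-manager verify-job list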

I would really appreciate any guidance or solutions from those who have handled large PBS deployments or faced similar issues. Thank you!
Hope this helps to get you going again!

Edit: Stopping the VM is not enough; the last snapshot of each backup group has to be verified.
 