Backup jobs hang on 4.0.18 S3 storage

Thanks for the backtraces, they do indeed point to a deadlock in garbage collection. Can you verify that all other tasks run without issues if you deactivate garbage collection for now? I will get back to you once the critical code path has been identified.
 
Thanks for the backtraces, they do indeed point to a deadlock in garbage collection. Can you verify that all other tasks run without issues if you deactivate garbage collection for now? I will get back to you once the critical code path has been identified.
As soon as I stop the GC job (the only way to stop it is by restarting proxmox-backup-proxy), backups start working normally again. I've disabled the GC job for now. I'm assuming everything will continue to work normally until GC runs again, since things only start failing when the GC job kicks off on schedule.
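For anyone else running into this, my workaround looks roughly like the sketch below. The --delete form for dropping the gc-schedule key is an assumption on my part (I mostly used the GUI), and <mystore> is just a placeholder:

Code:
# restart the proxy to kill the running GC task (the only way I found to stop it)
systemctl restart proxmox-backup-proxy
# drop the GC schedule so it doesn't start again on its own
# (assumes 'datastore update' accepts --delete for config keys; the GUI works as well)
proxmox-backup-manager datastore update <mystore> --delete gc-schedule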

Thanks for looking into it.
 
This should be fixed since proxmox-backup-server 4.0.20-1.
 
This should be fixed since proxmox-backup-server 4.0.20-1.
Ok great, one more question. I still haven't been able to go through and delete all the '.bad' chunks from the initial bug. My datastore is extremely large, and enumerating every file to search for .bad chunks is not proving easy.
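For reference, this is roughly how I've been trying to enumerate them, assuming the chunks are reachable through a local .chunks directory (the path below is just an example; for the S3 bucket itself one would have to list objects with the S3 tooling instead):

Code:
# list everything verify has renamed with a .bad suffix and count the hits
find /mnt/datastore/<mystore>/.chunks -type f -name '*.bad' > /root/bad-chunks.txt
wc -l /root/bad-chunks.txt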

- If no backup, nor any 'incremental' backup for that same VM, has ever been 'verified' or restored, is there any chance that it has false '.bad' chunks? I don't use the verify feature, so I'm thinking of just deleting the backups for any VM group that has had a failed 'verify' job in the past.

Basically what I'm asking is: can I just delete all VM groups that have a failed verification job, then run 'verify' on all the remaining ones, and be confident that any that fail verification this time are truly corrupt and not false positives?
 
I've sent a tentative bugfix for the deadlock to the developer mailing list [0]. Thanks again for providing the backtraces!

[0] https://lore.proxmox.com/pbs-devel/20251104175208.872621-1-c.ebner@proxmox.com/T/
I have now updated this PBS node to 4.1.0. Since my last post (around November 4, about 3 weeks ago) I had GC disabled on this datastore. Today I re-enabled it and started GC manually.

Environment:
- Proxmox Backup Server 4.1.0
- Debian 12
- Datastore size: ~17 TB used
- /run is tmpfs, size=3G

New behavior after the upgrade:

1) When I first re-enabled GC, /run hit 100% inodes and GC/logins failed with ENOSPC:

df -i /run:
Code:
tmpfs  4103116 4103115        1 100% /run

log excerpts:
Code:
mkstemp "/run/proxmox-backup/active-operations/dw-pbs-ny.tmp_XXXXXX" failed: ENOSPC: No space left on device

I worked around this by stopping PBS, deleting /run/proxmox-backup, and then raising the inode limit (nr_inodes) on /run:

Code:
systemctl stop proxmox-backup proxmox-backup-proxy
mount -o remount,nosuid,nodev,size=3G,nr_inodes=8388608 /run
# later increased again to
mount -o remount,nosuid,nodev,size=3G,nr_inodes=16777216 /run
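To confirm the remount actually took effect I've been checking the mount options and inode usage like this (nothing PBS-specific, just standard tooling):

Code:
# show the tmpfs options and current inode usage on /run
findmnt -o TARGET,FSTYPE,OPTIONS /run
df -i /run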
2) After that I started GC again. The task log shows:

Code:
  2025-11-26T17:33:34-05:00: Start GC phase2 (sweep unused chunks)
GC has now been stuck in phase2 for almost 2 hours, and inode usage on /run keeps climbing. Some df -i /run samples during this run:

Code:
# right after cleanup:
tmpfs 16777216      876 16776340   1% /run
# later:
tmpfs 16777216  5979644 10797572  36% /run
# most recent:
tmpfs 16777216 11696774  5080442  70% /run
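For reference, I'm collecting these samples with a simple watch loop like the one below; /run/proxmox-backup is where the lock files seem to accumulate, as far as I can tell:

Code:
# sample inode usage and the number of lock files every 5 minutes
watch -n 300 "df -i /run; find /run/proxmox-backup -type f | wc -l"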

So the inode count under /run/proxmox-backup keeps climbing more or less indefinitely while that GC task sits in phase2. The only reason things still work is that I keep bumping nr_inodes on /run; backups themselves are no longer blocked like before, but GC appears stuck and is extremely inode-hungry.

My questions:
- For a datastore of ~17 TB that has skipped GC for ~3 weeks, is it expected that a single GC run in phase2 would allocate 10M+ inodes under /run?
- After GC finishes, should inode usage on /run drop back down significantly, or does this pattern suggest a leak / regression in GC in PBS 4.1.0?

Happy to provide additional logs or task output if needed.
 
My questions:
- For a datastore of ~17 TB that has skipped GC for ~3 weeks, is it expected that a single GC run in phase2 would allocate 10M+ inodes under /run?
- After GC finishes, should inode usage on /run drop back down significantly, or does this pattern suggest a leak / regression in GC in PBS 4.1.0?

If you reboot your PBS system, you'll see a new tmpfs mount point for /run/proxmox-backup that has no inode limit. You have a rather big datastore, so yes, a lot of lock files for S3 are expected. The good thing is that they take up barely any resources (other than inodes), because they are stored on tmpfs and are all empty.

10M does seem excessive though, unless your chunks are all very small on disk and the 17 TB is on-disk usage rather than logical usage. E.g., if your average chunk is 1 MB on disk, then 17M lock files would be possible for 17 TB of physical data, but usually chunks are more in the ±2 MB range.
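Rough back-of-the-envelope numbers, assuming roughly one lock file per on-disk chunk:

Code:
# ~17 TB of chunk data, one lock file per chunk (assumption)
echo $(( 17 * 10**12 / (1 * 10**6) ))   # ~1 MB average chunk -> 17000000 lock files
echo $(( 17 * 10**12 / (2 * 10**6) ))   # ~2 MB average chunk -> 8500000 lock files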
 
Ok great, one more question. I still haven't been able to go through and delete all the '.bad' chunks from the initial bug. My datastore is extremely large, and enumerating every file to search for .bad chunks is not proving easy.
Bad chunks will also be cleaned up over time by garbage collection, once either the correct chunk has been re-uploaded or all snapshots whose index files reference that chunk have been removed.

- If no backup, nor any 'incremental' backup for that same VM, has ever been 'verified' or restored, is there any chance that it has false '.bad' chunks? I don't use the verify feature, so I'm thinking of just deleting the backups for any VM group that has had a failed 'verify' job in the past.
You have to be careful, as incremental backups can reuse chunks that are present in the index files of the last backup snapshot of the group. Hence the recommendation to verify at least the last backup snapshot of each group. If that snapshot verifies, the following snapshot can safely reuse its chunks; if it fails verification, however, the next backup to that group will not reuse chunks and will upload them to the PBS again. In that case you might even "heal" other snapshots referencing the re-uploaded chunks if they were missing or bad.

Basically what I'm asking is: can I just delete all VM groups that have a failed verification job, then run 'verify' on all the remaining ones, and be confident that any that fail verification this time are truly corrupt and not false positives?
As stated above, I would recommend verifying the last backup snapshot in each group, followed by a backup run for that group. That way you make sure that new snapshots will not reuse older chunks that are marked as bad. Only then does it make sense to verify the older snapshots and remove those that fail verification.
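If you prefer the CLI over the GUI for kicking off verification, something along these lines should work. Note that this starts a full datastore verify, which may be heavy for a datastore of your size; as far as I'm aware, per-group verification is easiest via the GUI, and the exact CLI options may differ between versions:

Code:
# start a verification task for the whole datastore (<mystore> is a placeholder)
proxmox-backup-manager verify <mystore>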