I've sent a tentative bugfix for the deadlock to the developer mailing list [0]. Thanks again for providing the backtraces!
[0] https://lore.proxmox.com/pbs-devel/20251104175208.872621-1-c.ebner@proxmox.com/T/
I have now updated this PBS node to 4.1.0. GC has been disabled on this datastore since my last post (around November 4, about three weeks ago). Today I re-enabled it and started GC manually.
Environment:
- Proxmox Backup Server 4.1.0
- Debian 12
- Datastore size: ~17 TB used
- /run is tmpfs, size=3G
New behavior after the upgrade:
1) When I first re-enabled GC, /run hit 100% inode usage and both GC and logins failed with ENOSPC:
df -i /run:
Code:
tmpfs 4103116 4103115 1 100% /run
log excerpts:
Code:
mkstemp "/run/proxmox-backup/active-operations/dw-pbs-ny.tmp_XXXXXX" failed: ENOSPC: No space left on device
I worked around this by stopping PBS, deleting /run/proxmox-backup, and then raising the inode limit (nr_inodes) on /run:
Code:
systemctl stop proxmox-backup proxmox-backup-proxy
mount -o remount,nosuid,nodev,size=3G,nr_inodes=8388608 /run
# later increased again to
mount -o remount,nosuid,nodev,size=3G,nr_inodes=16777216 /run
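To double-check that the remount actually took effect and to record a baseline, something like this should be enough (findmnt is part of util-linux):
Code:
# confirm the new size/nr_inodes options are active on /run
findmnt -no OPTIONS /run
# baseline inode usage right after the cleanup
df -i /run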
2) After that I started GC again. The task log shows:
Code:
2025-11-26T17:33:34-05:00: Start GC phase2 (sweep unused chunks)
GC has now been stuck in phase2 for almost 2 hours, and inode usage on /run keeps climbing. Some df -i /run samples during this run:
Code:
# right after cleanup:
tmpfs 16777216 876 16776340 1% /run
# later:
tmpfs 16777216 5979644 10797572 36% /run
# most recent:
tmpfs 16777216 11696774 5080442 70% /run
So the inode count on /run/proxmox-backup keeps climbing more or less without bound while that GC task sits in phase2. The only reason things still work is that I keep bumping nr_inodes on /run; backups themselves are no longer blocked like before, but GC appears stuck and extremely inode-hungry.
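If a time series would help the investigation, I can leave a small monitoring loop like the one below running alongside the GC task to correlate inode growth with phase2. This is just a sketch; the one-minute interval and the log path (/root/gc-inode-watch.log) are arbitrary choices on my side:
Code:
# log /run inode usage and the entry count under /run/proxmox-backup once a minute
while true; do
    {
        date -Is
        df -i /run | tail -n 1
        find /run/proxmox-backup -xdev 2>/dev/null | wc -l
    } >> /root/gc-inode-watch.log
    sleep 60
done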
My questions:
- For a datastore of ~17 TB that has skipped GC for ~3 weeks, is it expected that a single GC run in phase2 would allocate 10M+ inodes under /run?
- After GC finishes, should inode usage on /run drop back down significantly, or does this pattern suggest a leak / regression in GC in PBS 4.1.0?
Happy to provide additional logs or task output if needed.