S3 chunk lock files accumulate indefinitely in tmpfs causing memory exhaustion

synexa

Jun 4, 2026
Hi everyone,

We have been running PBS with an S3 backend for some time and recently started experiencing memory exhaustion issues that required manual server restarts to recover. After some investigation we would like to share what we have found so far and get some feedback from the community and developers.

Version: Proxmox Backup Server 4.2

Environment description:
  • PBS with sync jobs (pull) running in a VM with 32 GB of RAM
  • S3 backend with a 600 GB local cache on XFS, on enterprise NVMe disks (software RAID 1)
  • Datastore with 355 VMs/CTs and a retention policy of 5 daily, 4 weekly and 2 monthly copies
  • Several sync jobs configured to avoid or minimize overlap. Syncs are done per namespace from and to a single datastore

Problem:
PBS creates empty lock files under /run/proxmox-backup/locks/<datastore>/.chunks/ during any operation that accesses chunks (sync jobs, garbage collection, and so on). These files are never deleted, so they accumulate indefinitely.
With the data volume described above, millions of files accumulate in that directory after several days of operation. In our case the count had exceeded 23 million by the time we analyzed the situation in detail.

Although the files are empty and take up negligible space (they live in tmpfs, so nothing is written to disk), the kernel keeps a shmem_inode_cache object in the slab for each of them. This memory is accounted as SUnreclaim, meaning the kernel cannot reclaim it under memory pressure while the files exist, and with 23 million files it ties up approximately 21-22 GB of RAM.

The result is that within a few days the system runs out of available memory, the OOM killer starts killing PBS processes, tasks get stuck in an unknown state, and the server requires a manual restart to recover normal operation.

Evidence:
Memory state with 23 million lock files accumulated:

Code:
MemTotal:       32861888 kB
MemFree:         2771876 kB
MemAvailable:    9348084 kB
Slab:           29376704 kB
SReclaimable:    6707936 kB
SUnreclaim:     22668768 kB


Top Slab objects (via slabtop):

Code:
23340620   shmem_inode_cache   24515328K (~23.4 GB)
23354709   dentry               4448824K (~4.2 GB)
23358524   lsm_inode_cache      2031176K (~1.9 GB)
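
As a rough cross-check of the figures above, the per-object cost can be derived by dividing each cache size by its object count. This is a minimal sketch, assuming that every lock file pins exactly one object in each of the three caches (the near-identical object counts suggest this, but it is an assumption) and that the slabtop sizes are in KiB. The function names are ours, not part of any PBS tooling.

```python
# Figures copied from the slabtop output above: (object count, cache size in KiB).
SLAB_CACHES = {
    "shmem_inode_cache": (23_340_620, 24_515_328),
    "dentry": (23_354_709, 4_448_824),
    "lsm_inode_cache": (23_358_524, 2_031_176),
}


def bytes_per_object(count, size_kib):
    """Average slab bytes per object, including allocator slack."""
    return size_kib * 1024 / count


def per_file_overhead(caches=SLAB_CACHES):
    """Estimated slab bytes pinned per lock file (one object per cache assumed)."""
    return sum(bytes_per_object(count, size) for count, size in caches.values())


def projected_gib(num_files, per_file_bytes):
    """Slab memory in GiB that num_files lock files would pin."""
    return num_files * per_file_bytes / 1024**3


if __name__ == "__main__":
    for name, (count, size_kib) in SLAB_CACHES.items():
        print(f"{name:20s} {bytes_per_object(count, size_kib):8.1f} B/object")
    overhead = per_file_overhead()
    print(f"per lock file        {overhead:8.1f} B")
    print(f"10 million files     {projected_gib(10_000_000, overhead):8.1f} GiB")
```

By this estimate each empty lock file costs on the order of 1.3 KiB of slab memory, so the memory footprint grows linearly with the number of distinct chunks touched.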

After restarting the server with the lock files directory empty:

Code:
MemFree:         2771876 kB  →  26 GB
MemAvailable:    9348084 kB  →  30 GB
Slab:           29376704 kB  →  261 MB
SUnreclaim:     22668768 kB  →  103 MB


Confirmed behavior:
  • Lock files accumulate continuously during any operation that accesses chunks
  • Lock files are never deleted regardless of whether the operation completes successfully or not
  • drop_caches does not free this memory as it is SUnreclaim
  • The only confirmed remedy so far is a full server restart, as restarting the proxmox-backup-proxy service does not empty the directory. Over time, file accumulation is inevitable and the server will eventually run out of memory again
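
For anyone who wants to check whether they are affected, the growth can be tracked by counting the lock files and reading SUnreclaim from /proc/meminfo side by side. The following is a minimal sketch, not an official tool. The helper names are hypothetical, and the layout below the .chunks directory is not assumed (the walk simply descends into any subdirectories it finds).

```python
import os
import sys


def count_lock_files(root):
    """Count non-directory entries below root, streaming one directory at a time."""
    total = 0
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    total += 1
    return total


def parse_meminfo(text):
    """Return {field: value} for /proc/meminfo-style text (values as printed, in kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def report(lock_root, meminfo_path="/proc/meminfo"):
    """Snapshot of the lock file count next to the relevant memory counters."""
    with open(meminfo_path) as fh:
        mem = parse_meminfo(fh.read())
    return {
        "lock_files": count_lock_files(lock_root),
        "SUnreclaim_kB": mem.get("SUnreclaim"),
        "MemAvailable_kB": mem.get("MemAvailable"),
    }


# Only runs when the script is given exactly one argument: the lock directory.
if __name__ == "__main__" and len(sys.argv) == 2:
    print(report(sys.argv[1]))
```

Run it with the path /run/proxmox-backup/locks/<datastore>/.chunks/ as the only argument, for example from a periodic job, and compare the numbers over time. Note that walking tens of millions of entries is itself slow.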

Is this the expected behavior for chunk lock files with an S3 backend? If so, is there any mechanism to clean them up periodically without having to restart the server?

Thanks in advance for any feedback or guidance on this issue.
 
Hi,
> Is this the expected behavior for chunk lock files with an S3 backend? If so, is there any mechanism to clean them up periodically without having to restart the server?
The lock files are intentionally not cleaned up, in order to avoid races, since the locking relies on flock() internally. But for them to consume this much memory is definitely not intended and we need to find a solution, so please open an issue for this at https://bugzilla.proxmox.com, referencing this thread.

As a workaround for the time being you could do one of the following:
  • Set the datastore to maintenance mode offline and mount a filesystem backed by a physical disk on top of /run/proxmox-backup/locks/<datastore>/.chunks/, making sure that files and directories on that path have the correct backup:backup ownership.
  • Periodically set the datastore to maintenance mode offline. Once it is offline, run rm -rf /run/proxmox-backup/locks/<datastore>/.chunks/* and then bring the datastore back online.
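
The deletion step of the second workaround could be scripted roughly as follows. This is only a sketch under stated assumptions: it does not switch maintenance mode itself, so the datastore must already be offline before it is called, and the function name and the dry-run default are our own choices. It empties the directory but keeps the directory itself, so its ownership and permissions stay untouched.

```python
import os
import shutil


def clear_chunk_locks(chunks_dir, dry_run=True):
    """Remove every entry inside chunks_dir, keeping the directory itself.

    Returns the number of top-level entries removed (or, with dry_run=True,
    the number that would be removed). The caller is responsible for having
    set the datastore to maintenance mode offline first; this is NOT checked.
    """
    # Guard against a mistyped path: only operate on a directory named ".chunks".
    if os.path.basename(os.path.normpath(chunks_dir)) != ".chunks":
        raise ValueError(f"refusing to clean {chunks_dir!r}: not a .chunks directory")

    removed = 0
    with os.scandir(chunks_dir) as entries:
        for entry in entries:
            removed += 1
            if dry_run:
                continue
            if entry.is_dir(follow_symlinks=False):
                shutil.rmtree(entry.path)
            else:
                os.unlink(entry.path)
    return removed
```

Calling clear_chunk_locks(path) first reports how many entries would go, and clear_chunk_locks(path, dry_run=False) actually deletes them. Unlike the shell glob in rm -rf .../*, this also removes entries whose names start with a dot.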
 
Hi,

Thank you for your help and quick response. We have submitted the bug to the tracker as requested. In the meantime, we have implemented the suggested workaround by mounting a dedicated filesystem on top of /run/proxmox-backup/locks/<datastore>/.chunks/. We will monitor how the jobs evolve with this solution.


Thanks again for your support.