Hi everyone,
We have been running PBS with an S3 backend for some time and recently started experiencing memory exhaustion issues that required manual server restarts to recover. After some investigation we would like to share what we have found so far and get some feedback from the community and developers.
Version: Proxmox Backup Server 4.2
Environment description:
- PBS with sync jobs (pull) running in a VM with 32 GB of RAM
- S3 backend with a 600 GB local cache on XFS, on enterprise NVMe disks (software RAID 1)
- Datastore with 355 VMs/CTs and a retention policy of 5 daily, 4 weekly and 2 monthly copies
- Several sync jobs configured to avoid or minimize overlap. Syncs are done per namespace from and to a single datastore
Problem:
PBS creates empty lock files under /run/proxmox-backup/locks/<datastore>/.chunks/ during any operation that accesses chunks (sync jobs, garbage collection, etc.). These files are never deleted, so they accumulate indefinitely.
With the described data volume, millions of files accumulate in that directory after several days of operation. In our case there were more than 23 million files by the time we analyzed the situation in detail.
Although the files are empty and take up negligible space (they are stored in tmpfs), the kernel keeps a shmem_inode_cache object in the slab for each of them. This memory is accounted as SUnreclaim, which the kernel does not release under memory pressure; in our experience only a restart frees it. With 23 million files it amounts to approximately 21-22 GB of RAM.
The result is that within a few days the system runs out of available memory, the OOM killer starts killing PBS processes, tasks get stuck in an unknown state, and the server requires a manual restart to recover normal operation.
Evidence:
Memory state with 23 million lock files accumulated:
Code:
MemTotal: 32861888 kB
MemFree: 2771876 kB
MemAvailable: 9348084 kB
Slab: 29376704 kB
SReclaimable: 6707936 kB
SUnreclaim: 22668768 kB
Top Slab objects (via slabtop):
Code:
23340620 shmem_inode_cache 24515328K (~23.4 GB)
23354709 dentry 4448824K (~4.2 GB)
23358524 lsm_inode_cache 2031176K (~1.9 GB)
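As a rough cross-check, the slabtop figures above can be turned into an average cost per object. This is only a sketch: it assumes the size column is in KiB and that each lock file accounts for about one object in each of the three caches, which the near-identical object counts suggest but do not prove. The function names are ours, not part of PBS or any kernel tool.

```python
# Figures copied from the slabtop output above: (object count, cache size in KiB).
SLAB_CACHES = {
    "shmem_inode_cache": (23340620, 24515328),
    "dentry": (23354709, 4448824),
    "lsm_inode_cache": (23358524, 2031176),
}


def bytes_per_object(count: int, size_kib: int) -> float:
    """Average slab bytes per object, allocator overhead included."""
    return size_kib * 1024 / count


def per_file_cost(caches: dict) -> float:
    """Sum of the per-object averages, assuming one object per cache per file."""
    return sum(bytes_per_object(count, size) for count, size in caches.values())


if __name__ == "__main__":
    for name, (count, size_kib) in SLAB_CACHES.items():
        print(f"{name}: {bytes_per_object(count, size_kib):.0f} bytes/object")
    print(f"combined: {per_file_cost(SLAB_CACHES):.0f} bytes per lock file")
```

With these numbers each empty lock file costs a little over 1 KiB in shmem_inode_cache alone, so memory use grows roughly linearly with the number of files.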
The same values after restarting the server, with the lock file directory empty (before → after):
Code:
MemFree: 2771876 kB → 26 GB
MemAvailable: 9348084 kB → 30 GB
Slab: 29376704 kB → 261 MB
SUnreclaim: 22668768 kB → 103 MB
Confirmed behavior:
- Lock files accumulate continuously during any operation that accesses chunks
- Lock files are never deleted regardless of whether the operation completes successfully or not
- Writing to /proc/sys/vm/drop_caches does not free this memory, as it is accounted as SUnreclaim
- The only remedy confirmed so far is a full server restart; restarting the proxmox-backup-proxy service does not empty the directory. After a restart the files accumulate again, so the server will eventually run out of memory once more
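For anyone who wants to check whether they are affected, here is a minimal monitoring sketch that counts the lock files and reads SUnreclaim from /proc/meminfo. It only observes; it does not delete anything, since we do not know whether removing lock files on a running server is safe. The helper names are ours, and the <datastore> placeholder in LOCK_DIR has to be replaced with the real datastore name.

```python
import os

# Assumption: replace <datastore> with the real datastore name before running.
LOCK_DIR = "/run/proxmox-backup/locks/<datastore>/.chunks"
MEMINFO = "/proc/meminfo"


def count_files(root: str) -> int:
    """Count regular files below root without building a full list in memory."""
    total = 0
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        total += 1
        except FileNotFoundError:
            continue  # path is missing or vanished while scanning
    return total


def meminfo_kib(text: str, field: str) -> int:
    """Return the value of one /proc/meminfo field, in KiB as reported."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0])
    raise KeyError(field)


if __name__ == "__main__":
    print("lock files:", count_files(LOCK_DIR))
    if os.path.exists(MEMINFO):
        with open(MEMINFO) as fh:
            print("SUnreclaim KiB:", meminfo_kib(fh.read(), "SUnreclaim"))
```

Running this periodically (for example from cron) and logging both values should show whether the file count and SUnreclaim grow together on other setups as they do on ours.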
Is this the expected behavior for chunk lock files with an S3 backend? If so, is there any mechanism to clean them up periodically without having to restart the server?
Thanks in advance for any feedback or guidance on this issue.