Garbage collect job fails with "EMFILE: Too many open files"

tbahn

Aug 12, 2024
All of a sudden the garbage collect job started to fail with:
TASK ERROR: update atime failed for chunk/file "/mnt/proxmox-backup/.chunks/69f8/69f82c53064d2eb7795d401a63177ebda5bba778f02db397235e6508a2fee0ed" - EMFILE: Too many open files
This happens consistently at the same chunk, after 85% of the index files have been marked (212 of 249).

Upgraded all packages, rebooted, retried.

Increased the number of open files in /etc/security/limits.conf step by step from 2^16-1 (65,535), doubling each time, up to 2^20-1 (1,048,575); after each change I rebooted the PBS and retried the job.
Code:
* soft nofile 1048575
* hard nofile 1048575

Prune and verify jobs succeed.

stat /mnt/proxmox-backup/.chunks/69f8/69f82c53064d2eb7795d401a63177ebda5bba778f02db397235e6508a2fee0ed works.
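The limit actually in effect for the running PBS daemon can be read from /proc. A quick check (a sketch, assuming the default process name proxmox-backup-proxy and a single instance of it):

Bash:
# Open-file limit of the running PBS proxy process
grep 'open files' /proc/$(pidof proxmox-backup-proxy)/limits

# Limit of the current shell session, for comparison
ulimit -n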
 
Hello and welcome to the Proxmox Community,

What is the output of the following commands:

Code:
sysctl -a | grep inotify

sysctl -a | grep file-max

And the now active value:

Code:
cat /proc/sys/fs/file-max
 
Hi fireon,

thank you for the warm welcome.

Bash:
2024-08-12T20:06:01+02:00: starting garbage collection on store Unraid-Backup
2024-08-12T20:06:01+02:00: Start GC phase1 (mark used chunks)
2024-08-12T20:06:01+02:00: TASK ERROR: unexpected error on datastore traversal: Too many open files (os error 24) - "/mnt/proxmox-backup/vm"

sysctl -a | grep inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 30025
user.max_inotify_instances = 128
user.max_inotify_watches = 30025

sysctl -a | grep file-max
fs.file-max = 9223372036854775807

cat /proc/sys/fs/file-max
9223372036854775807

And directly after a server restart:
Bash:
sysctl -a | grep inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 30025
user.max_inotify_instances = 128
user.max_inotify_watches = 30025

sysctl -a | grep file-max
fs.file-max = 9223372036854775807

cat /proc/sys/fs/file-max
9223372036854775807
 
Tonight the garbage collection job completed successfully?! I don't know why, but I hope this problem simply disappears as suddenly as it came.

To be on the safe side, increase the max user watches and instances. I have already configured this as the default on my servers.

Code:
nano /etc/sysctl.d/custom.conf

Code:
fs.inotify.max_user_watches=5242880
fs.inotify.max_user_instances=1024
fs.inotify.max_queued_events=8388608
user.max_inotify_instances=1024
user.max_inotify_watches=5242880

Set manually so that you do not have to reboot:

Code:
sysctl -w fs.inotify.max_user_watches=5242880
sysctl -w fs.inotify.max_user_instances=1024
sysctl -w fs.inotify.max_queued_events=8388608
sysctl -w user.max_inotify_instances=1024
sysctl -w user.max_inotify_watches=5242880
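Alternatively, sysctl can load the values straight from the new file without a reboot, which avoids repeating each key by hand:

Code:
# Apply the settings from the drop-in file
sysctl -p /etc/sysctl.d/custom.conf

# Or reload every configured sysctl file
sysctl --system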
 
Hello Fireon,

thank you for the pointer to the inotify API.

For those reading this thread: Have a look at https://man7.org/linux/man-pages/man7/inotify.7.html
"The inotify API provides a mechanism for monitoring filesystem events. Inotify can be used to monitor individual files, or to monitor directories."

The section "/proc interfaces" describes the settings Fireon provided:

/proc/sys/fs/inotify/max_queued_events
... is used ... to set an upper limit on the number of events that can be queued to the corresponding inotify instance. ...

/proc/sys/fs/inotify/max_user_instances
This specifies an upper limit on the number of inotify instances that can be created per real user ID.

/proc/sys/fs/inotify/max_user_watches
This specifies an upper limit on the number of watches that can be created per real user ID.

By experiment, I found that setting the fs.inotify.* values also sets the corresponding user.* values to the same value.
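For anyone who wants to reproduce that observation, the check is essentially (a minimal sketch):

Bash:
# Set the fs.inotify value, then read back the corresponding user.* value
sysctl -w fs.inotify.max_user_watches=5242880
sysctl user.max_inotify_watches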

Thanks, Fireon
 
It didn't help. :(

Today the daily garbage collector task failed again with:
Error: unexpected error on datastore traversal: Too many open files (os error 24) - "/mnt/proxmox-backup"

The prune job executed afterwards failed, too:
Pruning failed: EMFILE: Too many open files

Manually starting the garbage collector task fails immediately:
2024-08-15T11:37:11+02:00: starting garbage collection on store Unraid-Backup
2024-08-15T11:37:11+02:00: Start GC phase1 (mark used chunks)
2024-08-15T11:37:13+02:00: TASK ERROR: update atime failed for chunk/file "/mnt/proxmox-backup/.chunks/baf9/baf9db3ad92d8800646788131c75ac5c694aad72f6ff0aeb5cb0163ed81ac526" - EMFILE: Too many open files

lsof | wc -l (list open files, count the rows of the output) returns 2535, which is quite a small number compared to the limits set.
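As an aside, lsof counts entries system-wide (including memory-mapped files and per-thread duplicates), while the EMFILE limit the GC task runs into is per process. A more direct check, assuming the default process name proxmox-backup-proxy, would be something like:

Bash:
# File descriptors currently held by the PBS proxy process
ls /proc/$(pidof proxmox-backup-proxy)/fd | wc -l

# Watch the count while a GC task is running
watch -n1 'ls /proc/$(pidof proxmox-backup-proxy)/fd | wc -l'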
 
It worked for some days, but today there was again the "Too many open files" error.

I can't see any pattern behind which days work and which don't.
 
I am currently experiencing the same issue as you. Did you ever manage to fix it permanently?
 
Hello,


I'm experiencing recurring "Too many open files" (EMFILE, os error 24) errors with Proxmox Backup Server (PBS) version 3.4.1, running on a dedicated Dell R720 bare-metal server (filesystem: XFS).

Error examples:

  • During backup verification:
    can't verify chunk, load failed - store 'backup' [...] - Too many open files (os error 24)
  • During backup jobs:
    POST /fixed_chunk: 400 Bad Request: inserting chunk [...] failed: EMFILE: Too many open files
Repository stats:
  • Number of namespaces (find /backup/ns/ -type d | wc -l): 6024
  • Number of chunk files (find /backup/.chunks/ -type f | wc -l): 36966951

Additional observations:

  • The errors occur both during garbage collection and regular backup jobs.
  • There is no clear pattern; the issue appears sporadically.
  • Disk space, CPU, and RAM usage are within normal limits.

Troubleshooting steps taken:

  • Verified PBS version and updates.
  • Checked disk space (no issues).
  • Monitored system resource usage.

Questions for the community:

  1. Does this scale of stored data (nearly 37 million chunk files) require special tuning?
  2. What are the best practices for configuring PBS for large-scale repositories?
  3. Are there recommended kernel or system limits (e.g., ulimit, fs.file-max) for this scenario?
Any advice on diagnosing or resolving this would be greatly appreciated!
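One avenue worth checking for limits: the PBS daemons are started by systemd, which does not read /etc/security/limits.conf, so their file descriptor limit has to be raised per service via a drop-in. A minimal sketch, assuming the default service name proxmox-backup-proxy.service (the number is illustrative, not a recommendation):

Code:
# Hypothetical drop-in for the PBS proxy service
mkdir -p /etc/systemd/system/proxmox-backup-proxy.service.d
cat > /etc/systemd/system/proxmox-backup-proxy.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=1048576
EOF
systemctl daemon-reload
systemctl restart proxmox-backup-proxy.service

Afterwards, the effective limit can be confirmed with grep 'open files' /proc/$(pidof proxmox-backup-proxy)/limits.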