Garbage collect job fails with "EMFILE: Too many open files"

tbahn · New Member · Aug 12, 2024 · Kiel, Germany
All of a sudden the garbage collect job started to fail with:
TASK ERROR: update atime failed for chunk/file "/mnt/proxmox-backup/.chunks/69f8/69f82c53064d2eb7795d401a63177ebda5bba778f02db397235e6508a2fee0ed" - EMFILE: Too many open files
This happens consistently at the same chunk, after 85% of the index files have been marked (212 of 249).

Upgraded all packages, rebooted, retried.

I increased the open-file limit in /etc/security/limits.conf step by step, doubling each time, from 2^16-1 (65,535) up to 2^20-1 (1,048,575); after each change I rebooted the PBS and retried the job.
* soft nofile 1048575
* hard nofile 1048575
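
Side note: /etc/security/limits.conf is applied by pam_limits at login and usually does not reach services started by systemd, so the PBS daemon may still be running with its default limit. A minimal sketch for checking and, if needed, raising the daemon's own limit, assuming the GC task runs inside the proxmox-backup-proxy service:

Code:
# Which limit is the running daemon actually using?
grep "open files" /proc/$(pidof -s proxmox-backup-proxy)/limits

# Raise the limit for the service itself via a systemd drop-in
mkdir -p /etc/systemd/system/proxmox-backup-proxy.service.d
cat > /etc/systemd/system/proxmox-backup-proxy.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=1048576
EOF
systemctl daemon-reload
systemctl restart proxmox-backup-proxy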

Prune and verify jobs succeed.

stat /mnt/proxmox-backup/.chunks/69f8/69f82c53064d2eb7795d401a63177ebda5bba778f02db397235e6508a2fee0ed works.
 
Hello and welcome to the Proxmox Community,

What is the output of the following commands:

Code:
sysctl -a | grep inotify

sysctl -a | grep file-max

And the currently active value:

Code:
cat /proc/sys/fs/file-max
 
Hi fireon,

thank you for the warm welcome.

Bash:
2024-08-12T20:06:01+02:00: starting garbage collection on store Unraid-Backup
2024-08-12T20:06:01+02:00: Start GC phase1 (mark used chunks)
2024-08-12T20:06:01+02:00: TASK ERROR: unexpected error on datastore traversal: Too many open files (os error 24) - "/mnt/proxmox-backup/vm"

sysctl -a | grep inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 30025
user.max_inotify_instances = 128
user.max_inotify_watches = 30025

sysctl -a | grep file-max
fs.file-max = 9223372036854775807

cat /proc/sys/fs/file-max
9223372036854775807

And directly after a server restart:
Bash:
sysctl -a | grep inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 30025
user.max_inotify_instances = 128
user.max_inotify_watches = 30025

sysctl -a | grep file-max
fs.file-max = 9223372036854775807

cat /proc/sys/fs/file-max
9223372036854775807
 
Tonight the garbage collection job completed successfully?! I don't know why, but I hope that this problem simply disappears as suddenly as it came.

To be on the safe side, increase the max user watches and instances. I have already configured this as the default on my servers.

Code:
nano /etc/sysctl.d/custom.conf

Code:
fs.inotify.max_user_watches=5242880
fs.inotify.max_user_instances=1024
fs.inotify.max_queued_events=8388608
user.max_inotify_instances=1024
user.max_inotify_watches=5242880

Set the values manually so that you do not have to reboot:

Code:
sysctl -w fs.inotify.max_user_watches=5242880
sysctl -w fs.inotify.max_user_instances=1024
sysctl -w fs.inotify.max_queued_events=8388608
sysctl -w user.max_inotify_instances=1024
sysctl -w user.max_inotify_watches=5242880
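
To confirm the new values are active (and to re-apply everything under /etc/sysctl.d/ in one go), you can run:

Code:
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches fs.inotify.max_queued_events
sysctl --system    # re-reads all files in /etc/sysctl.d/, including custom.conf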
 
Hello Fireon,

thank you for the pointer to the inotify API.

For those reading this thread: Have a look at https://man7.org/linux/man-pages/man7/inotify.7.html
"The inotify API provides a mechanism for monitoring filesystem events. Inotify can be used to monitor individual files, or to monitor directories."

The section "/proc interfaces" describes the settings Fireon provided:

/proc/sys/fs/inotify/max_queued_events
... is used ... to set an upper limit on the number of events that can be queued to the corresponding inotify instance. ...

/proc/sys/fs/inotify/max_user_instances
This specifies an upper limit on the number of inotify instances that can be created per real user ID.

/proc/sys/fs/inotify/max_user_watches
This specifies an upper limit on the number of watches that can be created per real user ID.

By experimenting, I found that setting the fs.inotify.* values also sets the corresponding user.* values to the same value.
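
Since both limits apply per real user ID, it can also help to check how much of them is actually in use before raising them further. A rough sketch (pidof -s proxmox-backup-proxy assumes that is the process in question):

Code:
# inotify instances per process: every fd pointing at anon_inode:inotify is one instance
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -rn | head

# watches held by the PBS daemon: one "inotify wd:" line per watch in fdinfo
grep -c '^inotify' /proc/$(pidof -s proxmox-backup-proxy)/fdinfo/* 2>/dev/null | grep -v ':0$'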

Thanks, Fireon
 
It didn't help. :(

Today the daily garbage collector task failed again with:
Error: unexpected error on datastore traversal: Too many open files (os error 24) - "/mnt/proxmox-backup"

The prune job executed afterwards failed, too:
Pruning failed: EMFILE: Too many open files

Manually starting the garbage collector task fails immediately:
2024-08-15T11:37:11+02:00: starting garbage collection on store Unraid-Backup
2024-08-15T11:37:11+02:00: Start GC phase1 (mark used chunks)
2024-08-15T11:37:13+02:00: TASK ERROR: update atime failed for chunk/file "/mnt/proxmox-backup/.chunks/baf9/baf9db3ad92d8800646788131c75ac5c694aad72f6ff0aeb5cb0163ed81ac526" - EMFILE: Too many open files

lsof | wc -l (list open files and count the lines of output) returns 2535, which is quite small compared to the limits I set.
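
One caveat: EMFILE is the per-process limit (RLIMIT_NOFILE), while lsof | wc -l counts entries system-wide, so a small total does not rule out a single process sitting at its own limit. A sketch for looking at just the PBS daemon, assuming proxmox-backup-proxy is the process hitting the error:

Code:
# file descriptors currently open by the daemon
ls /proc/$(pidof -s proxmox-backup-proxy)/fd | wc -l

# or, with lsof, restricted to that one process
lsof -p $(pidof -s proxmox-backup-proxy) | wc -l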
 
It worked for a few days, but today the "Too many open files" error appeared again.

I don't see any pattern behind which days work and which don't.
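
Since the failures come and go, it might be worth logging the daemon's descriptor count over time to see whether it slowly climbs towards the limit. A hypothetical cron entry (file name and log path are only examples, and pidof -s proxmox-backup-proxy assumes that is the process running out of descriptors):

Code:
# /etc/cron.d/pbs-fd-count -- record the fd count every 5 minutes
*/5 * * * * root echo "$(date -Is) $(ls /proc/$(pidof -s proxmox-backup-proxy)/fd 2>/dev/null | wc -l)" >> /var/log/pbs-fd-count.log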
 
