proxmox-backup-api exits Too many open files (os error 24)

speatzle_ · Apr 8, 2021

Hello,

We have a recurring issue where one of our Proxmox Backup Servers, which is under heavy load, stops working about once a month. The PBS has failed differently every time, but it was always due to an error somewhere stating "Too many open files".

This time the proxmox-backup service just exited and did not restart automatically (systemctl status proxmox-backup said dead / inactive). This caused all logins to fail, and we had to log in via SSH. The issue was easily fixed this time by running "systemctl start proxmox-backup".

This is the relevant section from the syslog:

Code:

Apr  5 01:14:48 srv-k0m1rh proxmox-backup-api[818]: server error: error accepting connection: Too many open files (os error 24)
Apr  5 01:14:48 srv-k0m1rh proxmox-backup-api[818]: SET SHUTDOWN MODE
Apr  5 01:14:48 srv-k0m1rh proxmox-backup-api[818]: daemon shutting down...
Apr  5 01:14:48 srv-k0m1rh proxmox-backup-api[818]: daemon shut down...
Apr  5 01:14:48 srv-k0m1rh proxmox-backup-api[818]: server shutting down, waiting for active workers to complete
Apr  5 01:14:48 srv-k0m1rh proxmox-backup-api[818]: done - exit server

After that the log gets spammed with failed login requests, as the backend isn't running:

Code:

Apr  5 01:14:48 srv-k0m1rh proxmox-backup-proxy[903]: POST /api2/json/access/ticket: 400 Bad Request: [client [::ffff:10.91.102.101]:35546] connection error: Connection reset by peer (os error 104)

Ulimit:

Code:

$ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63804
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 63804
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Versions (we know these aren't the latest but we can only update monthly):

Code:

$proxmox-backup-manager versions --verbose
proxmox-backup             1.0-4        running kernel: 5.4.103-1-pve
proxmox-backup-server      1.0.13-1     running version: 1.0.9
pve-kernel-5.4             6.3-8
pve-kernel-helper          6.3-8
pve-kernel-5.4.103-1-pve   5.4.103-1
pve-kernel-5.4.101-1-pve   5.4.101-1
pve-kernel-5.4.78-2-pve    5.4.78-2
pve-kernel-5.4.65-1-pve    5.4.65-1
ifupdown2                  3.0.0-1+pve3
libjs-extjs                6.0.1-10
proxmox-backup-docs        1.0.13-1
proxmox-backup-client      1.0.13-1
proxmox-mini-journalreader 1.1-1
proxmox-widget-toolkit     2.4-9
pve-xtermjs                4.7.0-3
smartmontools              7.2-pve2
zfsutils-linux             2.0.4-pve1

Is this a known issue?
Can we simply increase the file limit from 1024 to prevent this from happening? (Should this then also be adjusted on the installation ISO?)
Shouldn't systemd always restart the services?
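
One thing worth checking alongside this: `ulimit -a` reports the limits of the shell it is run in, which are not necessarily the limits of a service started by systemd. On Linux, the limits that actually apply to a running process can be read from /proc/<pid>/limits. A minimal Python sketch, assuming you pass the PID of the running daemon (the function names are made up for illustration):

```python
from pathlib import Path


def parse_nofile_limits(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Values are returned as strings because a limit can be 'unlimited'.
    Returns None if the row is missing.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns are: soft limit, hard limit, units.
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None


def nofile_limits(pid):
    """Read the open-file limits the kernel applies to a running process."""
    return parse_nofile_limits(Path(f"/proc/{pid}/limits").read_text())
```

Calling `nofile_limits(pid)` for the proxmox-backup-api process would show whether the 1024 seen in the shell is also the soft limit the daemon runs with.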

dylanw · Apr 8, 2021

You should probably look at the output of something like for i in $(pgrep proxmox-backup); do lsof -np $i; done, when this happens to understand what's actually using all the open files, in case this is a bug. I wouldn't recommend increasing the file limit until that is known.

speatzle_ said:
Shouldn't systemd always restart the services?

To my knowledge, it does try to restart failed services (where the option is configured). However, it only retries so many times before eventually giving up.

speatzle_ · Apr 8, 2021

dylanw said:
You should probably look at the output of something like for i in $(pgrep proxmox-backup); do lsof -np $i; done, when this happens to understand what's actually using all the open files, in case this is a bug. I wouldn't recommend increasing the file limit until that is known.

To my knowledge, it does try to restart failed services (where the option is configured). However, it only retries so many times before eventually giving up.

OK, I'll check up on the process from time to time using that command.

Systemd did not try to restart the service (the logs and systemctl status would have shown that). The unit file is configured to restart only on failure, not always, so my guess is that it stopped with exit code 0. Maybe this should be changed to always restart, or is there any reason why the backend might want to stop itself without being restarted?
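
The exit-code guess can be checked directly: `systemctl show` prints unit properties as KEY=VALUE lines, including the configured Restart= policy and the exit status of the main process. A rough Python sketch (the helper names are hypothetical, and the unit name proxmox-backup.service is assumed from the systemctl commands above):

```python
import subprocess


def parse_show_output(text):
    """Turn the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            props[key] = value
    return props


def service_exit_info(unit="proxmox-backup.service"):
    """Query the restart policy and last exit details of a unit (assumed name)."""
    result = subprocess.run(
        ["systemctl", "show", unit,
         "--property=Restart,ExecMainCode,ExecMainStatus,Result"],
        capture_output=True, text=True, check=True,
    )
    return parse_show_output(result.stdout)
```

If `service_exit_info()` reports `Restart` as `on-failure` together with an `ExecMainStatus` of `0`, that would be consistent with a clean exit that systemd does not treat as a failure.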

Edit:
I have added a cron job to monitor and report on the file usage; hopefully this will be enough to figure out what's going on.
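
A monitoring job along these lines could look roughly like the following. This is only a sketch and not the script referred to above: it counts the entries under /proc/<pid>/fd for every process whose name starts with proxmox-backup. The function names and the `proc_root` parameter are invented for the example.

```python
import os
from pathlib import Path


def find_pids(name_prefix="proxmox-backup", proc_root="/proc"):
    """Return PIDs whose process name (/proc/<pid>/comm) starts with name_prefix."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # process exited in the meantime or is not readable
        if comm.startswith(name_prefix):
            pids.append(int(entry.name))
    return sorted(pids)


def count_open_fds(pid, proc_root="/proc"):
    """Count open file descriptors of a process (needs root or the same user)."""
    return len(os.listdir(Path(proc_root) / str(pid) / "fd"))


def open_fd_report(name_prefix="proxmox-backup", proc_root="/proc"):
    """Map each matching PID to its current number of open descriptors."""
    return {
        pid: count_open_fds(pid, proc_root)
        for pid in find_pids(name_prefix, proc_root)
    }
```

Run periodically as root and logged with a timestamp, the result of `open_fd_report()` gives a simple time series that can be compared against the 1024 limit. Short spikes can fall between two samples, so the sampling interval matters.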

dylanw · Apr 9, 2021

A patch has been submitted [1], which likely addresses the issue you're having here.

speatzle_ said:
Systemd did not try to restart the service (the logs and systemctl status would have shown that). The unit file is configured to restart only on failure, not always, so my guess is that it stopped with exit code 0.

And you are correct. At this point in time, the way in which proxmox-backup-api fails causes a clean stop of the service, so systemd does not attempt to restart it. With this, the discussion about whether this service should be restarted in this case will likely open up too.

[1] https://lists.proxmox.com/pipermail/pbs-devel/2021-April/002664.html

speatzle_ · Apr 9, 2021

dylanw said:
A patch has been submitted [1], which likely addresses the issue you're having here.

Thanks, this is very much appreciated.

As that patch didn't fix the underlying issue of too many open files, I will continue to monitor the file usage to see if I can find any leaks. Tonight the backup server had around 400 files open during peak usage, which is still far off the 1024 limit.

dylanw · Apr 9, 2021

Ah, sorry to hear! But keep me posted if you manage to find anything anyway, and we'll see what's to blame.

speatzle_ · Apr 22, 2021

dylanw said:
But keep me posted if you manage to find anything anyway, and we'll see what's to blame.

After tweaking my script some more, it has finally produced some results. It seems that once or twice every night, during high load, lots of files are opened for a short period of time. The biggest one yet was a jump from around 350 open files to around 670 for a second.

I do have the output of the lsof -np <pid> command at the time of these incidents.
The majority of the new open "files" during these spikes look like this (2021-04-21_00:11:40):

proxmox-b 903 backup  344u     IPv6          159356220      0t0       TCP <pbs ipv4 address>:8007-><remote ipv4 address>:34972 (CLOSE_WAIT)
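
For context, CLOSE_WAIT means the remote side has already closed the connection while the local process still holds its socket open, so each such line is a descriptor that has not been released yet. To see how the states are distributed in a saved lsof dump, something like this Python sketch could be used (the function name is made up; it only assumes that TCP lines end with the state in parentheses, as in the line above):

```python
import re
from collections import Counter

# Matches lsof lines for TCP sockets and captures the trailing "(STATE)".
_TCP_STATE = re.compile(r"\bTCP\b.*\((\w+)\)\s*$")


def tcp_state_counts(lsof_output):
    """Count TCP connection states in the text output of `lsof -np <pid>`."""
    counts = Counter()
    for line in lsof_output.splitlines():
        match = _TCP_STATE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Comparing the counts from dumps taken before, during and after a spike would show whether the extra descriptors are almost entirely CLOSE_WAIT sockets.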

I can share the full command output from before, during and after these spikes privately, as it contains lots of internal IP addresses and hostnames.

speatzle_ · Apr 28, 2021

I just noticed a spike of over 1000 open files; thankfully we updated, or it would have crashed again...

Is there anywhere I can upload the command output privately?

tom · Apr 28, 2021

speatzle_ said:
Is there anywhere I can upload the command output privately?

If you need direct and private support, go for a support subscription.

speatzle_ · Apr 28, 2021

tom said:
If you need direct and private support, go for a support subscription.

I would like to, but our customer won't / can't get licenses for these servers right now. We will get licenses for these servers as soon as a recent incident has blown over and their new Proxmox solution has proven itself.

For now I have consistently renamed all IP addresses and hostnames.

Thank you for your support and these awesome products.
