Too many open files

Mikepop (Well-Known Member, joined Feb 6, 2018)
Hello, I've seen this error on only one of the cluster's five nodes:


Dec 23 07:20:08 int104 pveproxy[22699]: 22699: unable to read '/etc/pve/nodes/int105/pve-ssl.pem' - Too many open files
Dec 23 07:20:08 int104 pveproxy[22699]: failed to accept connection: Too many open files
Dec 23 07:20:08 int104 pveproxy[3823]: ipcc_send_rec[1] failed: Too many open files
Dec 23 07:20:08 int104 pveproxy[3823]: ipcc_send_rec[2] failed: Too many open files
Dec 23 07:20:08 int104 pveproxy[3823]: ipcc_send_rec[3] failed: Too many open files


pveversion --verbose
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-4
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-10
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-6
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-19
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2

We have these sysctl values, but as far as I can see these are now the Proxmox defaults in Proxmox 6:
root@int104:~# sysctl fs.inotify
fs.inotify.max_queued_events = 8388608
fs.inotify.max_user_instances = 65536
fs.inotify.max_user_watches = 4194304
root@int104:~# ulimit -n
1024


We do not use LXC containers, only KVM, so I guess the problem is related to a VM running on that node, but I'm not sure which values would be appropriate in this case or how to fix the issue.
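One point worth checking: `ulimit -n` in an interactive shell (1024 above) only reports that shell's own limit. "Too many open files" is the EMFILE error, which refers to the per-process descriptor limit (RLIMIT_NOFILE) of the process that hit it, here the pveproxy workers started by systemd; the `fs.inotify.*` sysctls limit inotify instances and watches, not ordinary file descriptors. The sketch below is a hypothetical helper, `fd_usage` (not a Proxmox tool), that reads only `/proc` and prints each process's open-descriptor count next to its soft limit.

```sh
#!/bin/sh
# Hypothetical helper (not part of Proxmox VE): for each PID, print
#   <pid> <open descriptors>/<soft "Max open files" limit>
# Reads only /proc. Run as root: the pveproxy workers run as www-data,
# and another user's /proc/<pid>/fd cannot be listed without privileges.
fd_usage() (
    for pid in "$@"; do
        [ -d "/proc/$pid/fd" ] || continue     # PID gone or not visible
        # Arithmetic expansion strips any padding that wc might add.
        count=$(( $(ls -1 "/proc/$pid/fd" 2>/dev/null | wc -l) ))
        # Line format: "Max open files   <soft>   <hard>   files"
        limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
        printf '%s %s/%s\n' "$pid" "$count" "$limit"
    done
)
```

Usage on a node: `fd_usage $(pgrep pveproxy)`. A count sitting close to the limit on int104, but not on the other nodes, would suggest the workers are exhausting their own limit rather than a system-wide one.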

Regards
 
* Check which files/sockets pveproxy has open:
`lsof -np <pid>` (replace <pid> with the pids of all pveproxy processes)

* Check whether there are many connections to this node:
`netstat -anp`
(check for pveproxy in the process column)
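If `netstat` is not at hand, established connections to pveproxy's port 8006 can also be counted straight from the kernel's socket tables. A minimal sketch, assuming the standard `/proc/net/tcp` layout (local address as `HEXIP:HEXPORT`, state `01` meaning ESTABLISHED; 8006 is `1F46` in hex). The name `count_pveproxy_conns` is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: count ESTABLISHED TCP sockets whose local port is
# 8006 (hex 1F46). With no arguments it reads /proc/net/tcp and
# /proc/net/tcp6 (if present); otherwise it reads the files given.
count_pveproxy_conns() (
    if [ "$#" -eq 0 ]; then
        for f in /proc/net/tcp /proc/net/tcp6; do
            if [ -r "$f" ]; then set -- "$@" "$f"; fi
        done
    fi
    if [ "$#" -eq 0 ]; then
        echo 0
    else
        awk '
            FNR == 1 { next }                  # skip each table header
            {
                n = split($2, addr, ":")       # $2 = local_address
                if (addr[n] == "1F46" && $4 == "01") count++
            }
            END { print count + 0 }
        ' "$@"
    fi
)
```

Running it on each node and comparing the numbers shows whether int104 simply receives more client connections than the others.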

Also, consider upgrading to the latest version (6.3 was released a bit over a month ago).

I hope this helps!
 
Hello Stoiko, thanks for your reply.

root@int104:~# lsof -np 32632
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
pveproxy 32632 www-data 12r REG 0,57 3786 79 /etc/pve/nodes/int101/pveproxy-ssl.pem
pveproxy 32632 www-data 13r REG 0,57 1696 15 /etc/pve/nodes/int104/pve-ssl.pem
pveproxy 32632 www-data 14r REG 0,57 1696 80 /etc/pve/nodes/int105/pve-ssl.pem
pveproxy 32632 www-data 15r REG 0,26 22016 25523 /usr/share/perl5/Net/LDAP/Constant.pm
pveproxy 32632 www-data 16r REG 0,57 1696 82 /etc/pve/nodes/int102/pve-ssl.pem
pveproxy 32632 www-data 17r REG 0,57 1696 83 /etc/pve/nodes/int103/pve-ssl.pem
pveproxy 32632 www-data 18r REG 0,57 3786 79 /etc/pve/nodes/int101/pveproxy-ssl.pem
pveproxy 32632 www-data 19r REG 0,57 3786 79 /etc/pve/nodes/int101/pveproxy-ssl.pem
pveproxy 32632 www-data 20r REG 0,57 3786 79 /etc/pve/nodes/int101/pveproxy-ssl.pem
pveproxy 32632 www-data 21r REG 0,57 1696 15 /etc/pve/nodes/int104/pve-ssl.pem
pveproxy 32632 www-data 22r REG 0,57 3786 79 /etc/pve/nodes/int101/pveproxy-ssl.pem
pveproxy 32632 www-data 23r REG 0,57 1696 15 /etc/pve/nodes/int104/pve-ssl.pem
pveproxy 32632 www-data 24r REG 0,57 3786 79 /etc/pve/nodes/int101/pveproxy-ssl.pem
pveproxy 32632 www-data 25r REG 0,57 1696 15 /etc/pve/nodes/int104/pve-ssl.pem
pveproxy 32632 www-data 26r REG 0,57 1696 80 /etc/pve/nodes/int105/pve-ssl.pem
pveproxy 32632 www-data 27r REG 0,57 1696 82 /etc/pve/nodes/int102/pve-ssl.pem
pveproxy 32632 www-data 28r REG 0,57 1696 83 /etc/pve/nodes/int103/pve-ssl.pem
pveproxy 32632 www-data 29r REG 0,57 3786 79 /etc/pve/nodes/int101/pveproxy-ssl.pem
pveproxy 32632 www-data 30r REG 0,57 1696 15 /etc/pve/nodes/int104/pve-ssl.pem
pveproxy 32632 www-data 31r REG 0,57 1696 80 /etc/pve/nodes/int105/pve-ssl.pem
pveproxy 32632 www-data 32r REG 0,57 1696 82 /etc/pve/nodes/int102/pve-ssl.pem
pveproxy 32632 www-data 33r REG 0,57 1696 83 /etc/pve/nodes/int103/pve-ssl.pem
pveproxy 32632 www-data 34r REG 0,57 1696 83 /etc/pve/nodes/int103/pve-ssl.pem


and many more lines like these, plus a similar list for another PID, so I guess these are the open files.
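With this many lines, repeated files are easier to spot when they are counted. A sketch, assuming lsof's machine-readable field output (`-Fn`, where name records start with `n`); `summarize_open_files` is a hypothetical name.

```sh
#!/bin/sh
# Hypothetical helper: read `lsof -Fn` output on stdin and print how many
# times each file/socket name is open, most frequent first.
summarize_open_files() {
    # Keep only name records, dropping the leading "n" tag, then count.
    sed -n 's/^n//p' | sort | uniq -c | sort -rn
}

# Example (as root, on the affected node):
#   for pid in $(pgrep pveproxy); do lsof -n -p "$pid" -Fn; done |
#       summarize_open_files | head -n 20
```

If the same few `pve-ssl.pem`/`pveproxy-ssl.pem` paths dominate with very high counts, that points more towards descriptors not being closed than towards a large number of distinct files.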



root@int104:~# netstat -anp|grep pveproxy
tcp 0 0 0.0.0.0:8006 0.0.0.0:* LISTEN 2706/pveproxy
unix 3 [ ] STREAM CONNECTED 2933984 32632/pveproxy work
unix 2 [ ] DGRAM 55 2706/pveproxy
unix 3 [ ] STREAM CONNECTED 3174983 4679/pveproxy worke
unix 3 [ ] STREAM CONNECTED 2864312 8517/pveproxy worke

I'll try upgrading to 6.3 to see whether that solves the problem.

Regards
 
and many more lines like these, plus a similar list for another PID, so I guess these are the open files.
Could you paste the complete output of `for i in $(pgrep pveproxy); do lsof -np $i; done`? This should list all open files of all pveproxy processes.
That would help to get a complete picture of whether you simply have many nodes or whether there's a file-descriptor leak somewhere.
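To tell "many nodes" apart from a leak, one can also watch the descriptor totals over time: with a fixed set of nodes the count would be expected to level off, while a leak would make it keep growing between samples. A minimal sketch using only `/proc`; `sample_fd_totals` is a hypothetical helper. pveproxy worker PIDs change over time (the logs in this thread show many different ones), so for longer runs the PID list may need refreshing with `pgrep`.

```sh
#!/bin/sh
# Hypothetical helper: print "<HH:MM:SS> <total open fds>" summed over the
# given PIDs, SAMPLES times, sleeping INTERVAL seconds between samples.
# Usage: sample_fd_totals SAMPLES INTERVAL PID...
sample_fd_totals() (
    samples=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$samples" ]; do
        total=0
        for pid in "$@"; do
            [ -d "/proc/$pid/fd" ] || continue    # PID has exited
            total=$(( total + $(ls -1 "/proc/$pid/fd" 2>/dev/null | wc -l) ))
        done
        printf '%s %d\n' "$(date +%H:%M:%S)" "$total"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then sleep "$interval"; fi
    done
)
```

For example, `sample_fd_totals 12 300 $(pgrep pveproxy)` (run as root) takes one sample every five minutes for an hour.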

Is there any particular pattern in your access to pveproxy (e.g. a monitoring system, or some other automated system sending requests to the node where you see the problem, int104)?

But if possible, first try to see whether simply upgrading makes the problem disappear :)

Thanks!
 
Hello Stoiko, this is the output of that command. This is the only node in the cluster that does not have Ceph FS/disks locally; it connects to the RBD volume. I've checked the other nodes and cannot see this error there.
We use a Grafana dashboard with the Prometheus module, but it gets its metrics from int101, not from int104.

I'll report back in a couple of days on whether the update solves it.

Regards
 
Sadly, upgrading to 6.3 and Octopus did not solve the issue.

Dec 28 15:20:19 int104 pveproxy[22714]: 22714: unable to read '/etc/pve/nodes/int105/pve-ssl.pem' - Too many open files
Dec 28 15:20:19 int104 pveproxy[22714]: 22714: unable to read '/etc/pve/nodes/int101/pveproxy-ssl.pem' - Too many open files
Dec 28 15:20:19 int104 pveproxy[22714]: 22714: unable to read '/etc/pve/nodes/int103/pve-ssl.pem' - Too many open files
Dec 28 15:20:19 int104 pveproxy[22714]: 22714: unable to read '/etc/pve/nodes/int102/pve-ssl.pem' - Too many open files
Dec 28 15:20:19 int104 pveproxy[22714]: failed to accept connection: Too many open files

Regards
 
The number of open .pem files is indeed quite high; however, I have not been able to reproduce the issue locally so far...

Could you check the journal of the node where the issue is happening (`journalctl -b`) for any messages related to errors, network timeouts or other potential problems?
Also try comparing the pveproxy logs on the nodes:
/var/log/pveproxy/access.log
Is there anything specific about the requests you see on int104 compared to the other nodes in your cluster?
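For a quick comparison between nodes, the access log can be reduced to request counts per endpoint. A sketch, assuming the log format seen in this thread, where the request line is the double-quoted field (e.g. `"GET /api2/json/nodes HTTP/1.1"`); `summarize_access_log` is a hypothetical name.

```sh
#!/bin/sh
# Hypothetical helper: count requests per "METHOD path" in pveproxy access
# logs (files given as arguments, or stdin), most frequent first.
summarize_access_log() {
    awk -F'"' 'NF >= 3 {
        split($2, req, " ")        # $2 is the request, e.g. GET /api2/json/nodes HTTP/1.1
        path = req[2]
        sub(/[?].*/, "", path)     # drop query strings such as ?type=vm
        print req[1], path
    }' "$@" | sort | uniq -c | sort -rn
}
```

Running `summarize_access_log /var/log/pveproxy/access.log | head -n 20` on each node makes differences in traffic easy to compare; for instance, a client that requests a new ticket (`POST /api2/json/access/ticket`) before every few calls would stand out near the top.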

FWIW, I currently doubt that the issue is related to Ceph.
 
Hello Stoiko, I see errors like these:

Dec 29 01:35:18 int104 pveproxy[28573]: got inotify poll request in wrong process - disabling inotify
Dec 29 01:55:11 int104 pveproxy[22973]: failed to accept connection: Too many open files
Dec 29 01:55:15 int104 pveproxy[11203]: got inotify poll request in wrong process - disabling inotify
Dec 29 02:03:22 int104 pvestatd[2629]: VM 115 qmp command failed - VM 115 qmp command 'query-proxmox-support' failed - got timeout
Dec 29 02:14:42 int104 pvestatd[2629]: VM 119 qmp command failed - VM 119 qmp command 'query-proxmox-support' failed - got timeout
Dec 29 02:14:47 int104 pve-ha-lrm[25159]: VM 119 qmp command failed - VM 119 qmp command 'query-status' failed - unable to connect to VM 119 qmp socket - timeout after 31 retries
Dec 29 02:25:12 int104 pveproxy[16731]: failed to accept connection: Too many open files
Dec 29 02:25:15 int104 pveproxy[17789]: got inotify poll request in wrong process - disabling inotify
Dec 29 02:39:37 int104 pve-ha-lrm[20019]: VM 138 qmp command failed - VM 138 qmp command 'query-status' failed - got timeout
Dec 29 02:51:27 int104 pve-ha-lrm[15930]: VM 148 qmp command failed - VM 148 qmp command 'query-status' failed - got timeout
Dec 29 02:55:16 int104 pveproxy[28574]: failed to accept connection: Too many open files
Dec 29 02:55:19 int104 pveproxy[24760]: got inotify poll request in wrong process - disabling inotify
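In this excerpt, the three "failed to accept connection" errors are almost exactly 30 minutes apart (01:55:11, 02:25:12, 02:55:16). To check whether such an interval holds over a longer period, the gaps between these errors can be computed from the journal. A sketch, assuming journalctl's default short output (`Mon DD HH:MM:SS host ...`) and at most one midnight between consecutive errors; `error_intervals` is a hypothetical name.

```sh
#!/bin/sh
# Hypothetical helper: from syslog-style lines (journalctl's default
# "short" format), print the time of each "failed to accept connection"
# error and the number of seconds since the previous one.
error_intervals() {
    awk '/failed to accept connection: Too many open files/ {
        split($3, t, ":")                      # $3 is HH:MM:SS
        s = t[1] * 3600 + t[2] * 60 + t[3]     # seconds since midnight
        if (seen) {
            d = s - prev
            if (d < 0) d += 86400              # crossed midnight once
            print $3, d
        }
        prev = s
        seen = 1
    }' "$@"
}
```

Usage: `journalctl -b -u pveproxy | error_intervals`. For the lines above it prints `02:25:12 1801` and `02:55:16 1804`; a steady interval over days could hint at something periodic driving the requests.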

I see a lot of requests from our WHMCS. We've pointed it at another node to see whether that makes any difference and whether the error shows up there:

- - [29/12/2020:18:05:20 +0100] "POST /api2/json/access/ticket HTTP/1.1" 200 1308
- root@pam [29/12/2020:18:05:20 +0100] "GET /api2/json/nodes/int101/qemu/143/config HTTP/1.1" 200 450
- root@pam [29/12/2020:18:05:20 +0100] "GET /api2/json/nodes HTTP/1.1" 200 1659
- - [29/12/2020:18:05:20 +0100] "POST /api2/json/access/ticket HTTP/1.1" 200 1308
- root@pam [29/12/2020:18:05:20 +0100] "GET /api2/json/nodes HTTP/1.1" 200 1659
- root@pam [29/12/2020:18:05:20 +0100] "GET /api2/json/nodes/int101/qemu/143/status/current HTTP/1.1" 200 1870
- root@pam [29/12/2020:18:05:20 +0100] "GET /api2/json/cluster/resources?type=vm HTTP/1.1" 200 26294
- - [29/12/2020:18:05:20 +0100] "POST /api2/json/access/ticket HTTP/1.1" 200 1308
- root@pam [29/12/2020:18:05:20 +0100] "GET /api2/json/nodes HTTP/1.1" 200 1659
- root@pam [29/12/2020:18:05:20 +0100] "GET /api2/json/nodes/int101/qemu/143/config HTTP/1.1" 200 450
- - [29/12/2020:18:05:20 +0100] "POST /api2/json/access/ticket HTTP/1.1" 200 1308
- root@pam [29/12/2020:18:05:20 +0100] "GET /api2/json/nodes HTTP/1.1" 200 1659
- root@pam [29/12/2020:18:05:20 +0100] "GET /api2/json/nodes/int101/qemu/143/rrddata?timeframe=week&cf=MAX HTTP/1.1" 200 15872

Regards
 
And indeed, we now see the error on the new node:

Dec 29 19:20:13 int105 pveproxy[1807796]: failed to accept connection: Too many open files
Dec 29 19:20:17 int105 pveproxy[1883137]: got inotify poll request in wrong process - disabling inotify

Any advice on how to solve it? Should we increase the sysctl values even more?

Regards
 
I see a lot of requests from our WHMCS. We've pointed it at another node to see whether that makes any difference and whether the error shows up there:
OK, that seems to be what initially triggers the error; however, it still should not happen.

Dec 29 02:03:22 int104 pvestatd[2629]: VM 115 qmp command failed - VM 115 qmp command 'query-proxmox-support' failed - got timeout
This sounds like there may be a version mismatch between some of the installed PVE packages:
* Could you upgrade a node to the latest available version, reboot it, and see whether the issue happens there as well?
Otherwise it's difficult to rule out that the issue is simply a glitch caused by the version mismatch.
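To compare package versions between nodes without reading two long lists by eye, saved `pveversion -v` outputs can be diffed per package. A sketch, assuming each relevant line has the form `package: version`; `diff_versions` is a hypothetical helper.

```sh
#!/bin/sh
# Hypothetical helper: compare two saved `pveversion -v` outputs and print
# every package whose version differs or that is present in only one file.
# Output: "<package>: <version in file 1> | <version in file 2>"
diff_versions() {
    awk '
        {
            i = index($0, ": ")
            if (i == 0) next                  # not a "package: version" line
            key = substr($0, 1, i - 1)
            val = substr($0, i + 2)
        }
        FNR == NR { left[key] = val; next }   # first file
        { right[key] = val }                  # second file
        END {
            for (k in left) {
                if (!(k in right)) print k ": " left[k] " | (missing)"
                else if (left[k] != right[k]) print k ": " left[k] " | " right[k]
            }
            for (k in right)
                if (!(k in left)) print k ": (missing) | " right[k]
        }
    ' "$1" "$2" | sort
}
```

For example, save `pveversion -v` from two nodes to `int101.txt` and `int105.txt` and run `diff_versions int101.txt int105.txt`. For the int101 and int105 outputs posted later in this thread, only installed kernel packages differ.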

Thanks!
 
Hello Stoiko, I did so on all nodes a few days ago. I only see a couple of Python packages available for upgrade, the same on all nodes.

Regards
 
Could you please post the output of:
Code:
apt update
apt full-upgrade
pveversion -v
(in code tags)
 
Sure, here it is:

Code:
root@int101:/var/log# apt update
Get:1 http://security.debian.org buster/updates InRelease [65.4 kB]
Get:2 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Hit:3 http://deb.debian.org/debian buster InRelease                                           
Get:4 http://ftp.es.debian.org/debian buster InRelease [121 kB]                               
Hit:5 http://download.proxmox.com/debian/pve buster InRelease
Hit:6 http://download.proxmox.com/debian/ceph-octopus buster InRelease
Fetched 252 kB in 1s (247 kB/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
2 packages can be upgraded. Run 'apt list --upgradable' to see them.
root@int101:/var/log# apt full-upgrade
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  python-apt-common python3-apt
2 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 281 kB of archives.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] Y
Abort.
root@int101:/var/log# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Regards
 
On another hypervisor (int105):
Code:
root@int105:~# apt update
Hit:1 http://deb.debian.org/debian buster InRelease
Get:2 http://ftp.es.debian.org/debian buster InRelease [121 kB]                                                             
Hit:3 http://security.debian.org buster/updates InRelease                                                                   
Hit:4 http://security.debian.org/debian-security buster/updates InRelease                                                   
Hit:5 http://download.proxmox.com/debian/pve buster InRelease                                                               
Hit:6 http://download.proxmox.com/debian/ceph-octopus buster InRelease
Fetched 121 kB in 1s (176 kB/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
2 packages can be upgraded. Run 'apt list --upgradable' to see them.
root@int105:~# apt full-upgrade
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  python-apt-common python3-apt
2 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 281 kB of archives.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://security.debian.org buster/updates/main amd64 python-apt-common all 1.8.4.3 [96.3 kB]
Get:2 http://security.debian.org buster/updates/main amd64 python3-apt amd64 1.8.4.3 [185 kB]
Fetched 281 kB in 0s (26.2 MB/s)
Reading changelogs... Done
(Reading database ... 109798 files and directories currently installed.)
Preparing to unpack .../python-apt-common_1.8.4.3_all.deb ...
Unpacking python-apt-common (1.8.4.3) over (1.8.4.2) ...
Preparing to unpack .../python3-apt_1.8.4.3_amd64.deb ...
Unpacking python3-apt (1.8.4.3) over (1.8.4.2) ...
Setting up python-apt-common (1.8.4.3) ...
Setting up python3-apt (1.8.4.3) ...
root@int105:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Regards
 
Hello Stoiko, I guess my reply got buried over the holidays, so I'm just bumping it.
That's pretty much exactly what happened; thanks for the reminder!

At a quick glance, I guess there is a file-descriptor leak in proxied requests (those forwarded from one node to another to gather data), such as:

- root@pam [29/12/2020:18:05:20 +0100] "GET /api2/json/nodes/int101/qemu/143/status/current HTTP/1.1" 200 1870
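To see how many requests a node receives of this proxied kind, i.e. `/api2/json/nodes/<node>/...` calls addressed to a different node, the access log can be tallied per target node. A sketch, assuming the log format shown above and that the local node name is passed in; `count_proxied_requests` is a hypothetical name.

```sh
#!/bin/sh
# Hypothetical helper: $1 = name of the node whose log is read; remaining
# arguments = access log files (stdin if none). Prints "<node> <count>"
# for every other node that requests were addressed to.
count_proxied_requests() (
    self=$1
    shift
    awk -F'"' -v self="$self" '
        NF >= 3 {
            split($2, req, " ")                    # METHOD PATH PROTOCOL
            if (match(req[2], "^/api2/json/nodes/[^/?]+")) {
                # 17 = length("/api2/json/nodes/")
                node = substr(req[2], 18, RLENGTH - 17)
                if (node != self) n[node]++
            }
        }
        END { for (k in n) print k, n[k] }
    ' "$@" | sort
)
```

On int104, for example: `count_proxied_requests int104 /var/log/pveproxy/access.log`. A large count for int101 would match the WHMCS traffic shown earlier.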

I'll try to reproduce this here with a small test cluster.

If possible, please open a bug report in our Bugzilla (https://bugzilla.proxmox.com), referencing this thread.
 
