pvestatd leak creates 3000+ processes, consumes all RAM and swap, and halts/reboots the machine over a 4-hour cycle. Then repeats.

blebo

Member
Apr 22, 2018
Perth, Australia
This started with the later PVE 6.1 updates (I think) and has definitely persisted throughout PVE 6.2 (including 6.2-12), i.e. since about March 2020. Prior to this the system was rock-solid for about 1.5 years.

Normal operation is ~10-12 GiB of RAM used (of 32 GiB total). See the attached picture for the cycle.

Recently I noticed that if I stop the pvestatd service via the GUI before the leak hits swap, and it stops successfully, then the leaked memory is freed instantly. However, starting pvestatd again just restarts the 4-hour cycle from the beginning.
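
For reference, the same workaround can be done from the shell; a minimal sketch, assuming only the standard pvestatd systemd unit (this resets the cycle, it does not fix the leak):

Code:
# stopping the statistics daemon kills the leaked workers and frees their memory
systemctl stop pvestatd
# starting it again begins the ~4 hour leak cycle from scratch
systemctl start pvestatd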

I've monitored the pvestatd process count using the following command (in a cron bash script every 5 minutes), and I routinely see 3000+ pvestatd processes:
ps -eo pid,ppid,cmd,%mem,%cpu,vsz,rss --sort=-%mem | grep pvestatd | wc -l

The ZFS ARC (RAM cache) appears to behave correctly, progressively reducing its footprint as the pvestatd leak approaches 100% RAM; monitored using the following command:
awk '/^size/ { print $1 " " $3 / 1048576 }' < /proc/spl/kstat/zfs/arcstats

Package versions:

Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph-fuse: 12.2.13-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 0.8.21-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1


Any assistance is much appreciated. Let me know if more information is required.

Cheers
 

Attachments

  • Screen Shot 2020-10-01 at 5.52.50 pm.png (52.9 KB)

fabian

Proxmox Staff Member
Jan 7, 2016
You could check what those processes are doing (or rather, what those processes are stuck on). My guess is some hanging storage?
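
A possible way to do that (a sketch with standard tools, not an official procedure) is to look at the kernel wait channel of the workers and, for one of the stuck ones, its kernel stack:

Code:
# what is each pvestatd worker blocked on in the kernel (wait channel)?
ps -o pid,stat,wchan:32,args -C pvestatd | head -20
# for one suspicious PID (12345 is a placeholder), dump its kernel stack as root
cat /proc/12345/stack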
 

blebo

Member
Apr 22, 2018
Perth, Australia
Looking into it a bit further, I am seeing the following as typical in syslog:

Code:
Oct 04 16:00:47 pm01 pvestatd[28666]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[18976]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[19126]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[19964]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[19443]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[21277]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: node status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[29396]: status update error: Too many open files
Oct 04 16:00:47 pm01 pvestatd[18790]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: qemu status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 112 -p' failed: open3: pipe(GLOB(0x55f17f4c32f0), IO::File=GLOB(0x55f17f4b6700)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[23438]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 114 -p' failed: open3: pipe(GLOB(0x55f17f4b6fe8), IO::File=GLOB(0x55f17f4cfc48)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 116 -p' failed: open3: pipe(GLOB(0x55f17ee98fd8), IO::File=GLOB(0x55f17eecf970)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 107 -p' failed: open3: pipe(GLOB(0x55f17eeafd60), IO::File=GLOB(0x55f17eee3750)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[24990]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: lxc status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17f4cacb0), IO::File=GLOB(0x55f17ee9ea68)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17eea3578), IO::File=GLOB(0x55f17eefc050)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: could not activate storage 'zfs-vm', zfs error: open3: pipe(GLOB(0x55f17eea3578), IO::File=GLOB(0x55f17eefc050)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[21546]: ipcc_send_rec[4] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[29871]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30277]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[30277]: zfs error: open3: pipe(GLOB(0x55f17cd7eae0), IO::File=GLOB(0x55f17eee3570)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17cd7eae0), IO::File=GLOB(0x55f17eee3570)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17f4c8690)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[30277]: zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17ee98438)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[28837]: ipcc_send_rec[2] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: could not activate storage 'zfs-lxc', zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17f4c8690)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[30277]: could not activate storage 'zfs-lxc', zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17ee98438)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[24325]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30277]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30278]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[30279]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[19780]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[22968]: ipcc_send_rec[4] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[30277]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30280]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[3517]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30279]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30278]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30282]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[3517]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30280]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30277]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30279]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30278]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
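
A quick way to tell whether only the per-process limit is being hit, or whether the system-wide file descriptor table is also filling up (a sketch using standard /proc interfaces):

Code:
# allocated / free / maximum file handles system-wide
cat /proc/sys/fs/file-nr
# total number of descriptors currently held by all pvestatd processes
for p in $(pgrep pvestatd); do ls /proc/$p/fd 2>/dev/null; done | wc -l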
 

blebo

Member
Apr 22, 2018
Perth, Australia
Following my update above, is it possible that the default open-files limit for a clean install (i.e. ulimit -n) changed somewhere between PVE 5.x and PVE 6.2?

Output of ulimit -a:
Code:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127563
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127563
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
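
The limit that matters here is the one applied to the running pvestatd service, not the interactive shell's ulimit. A sketch of how to check both the systemd setting and the live daemon (pvestatd is the standard unit name; the PID lookup is only illustrative):

Code:
# open-files limit configured for the pvestatd systemd unit
systemctl show pvestatd --property=LimitNOFILE
# limit actually in effect for the running daemon
grep "Max open files" /proc/$(systemctl show pvestatd --property=MainPID --value)/limits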


Then running the following script shows that each pvestatd process is sitting at (or just under) the 1024 open-files limit:
Code:
SEARCH="pvestatd"
for i in $(ps -C "${SEARCH}" -o pid | grep -v PID)
do
     echo "PID # ${i} open files count : $(ls -l /proc/${i}/fd | wc -l)"
done


Output (typical):
Code:
...
PID # 31846 open files count : 1023
PID # 31847 open files count : 1023
PID # 31848 open files count : 1023
PID # 31849 open files count : 1023
PID # 31850 open files count : 1023
PID # 32049 open files count : 1023
PID # 32050 open files count : 1023
PID # 32051 open files count : 1023
PID # 32052 open files count : 1023
PID # 32053 open files count : 1023
PID # 32054 open files count : 1023
PID # 32055 open files count : 1023
PID # 32275 open files count : 1023
PID # 32276 open files count : 1023
PID # 32277 open files count : 1023
PID # 32278 open files count : 1023
PID # 32279 open files count : 1023
PID # 32280 open files count : 1023
PID # 32281 open files count : 1023
PID # 32379 open files count : 1023
PID # 32380 open files count : 1023
PID # 32381 open files count : 1023
PID # 32382 open files count : 1023
PID # 32383 open files count : 1023
PID # 32384 open files count : 1023
PID # 32385 open files count : 1023
PID # 32573 open files count : 1023
PID # 32574 open files count : 1023
PID # 32575 open files count : 1023
...
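
To see what those descriptors actually point at, the fd symlinks of one worker can be aggregated by target; a sketch (31846 is just one PID from the output above):

Code:
# group one worker's open descriptors by target (pipe, socket, regular file, ...)
for fd in /proc/31846/fd/*; do readlink "$fd"; done | sort | uniq -c | sort -rn | head -20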
 

fabian

Proxmox Staff Member
Jan 7, 2016
I think that is just a symptom... please check where the earlier processes (the ones with the lowest PIDs) are hanging (e.g. which file they are reading from, which children they have forked and are waiting for, ...).
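
A sketch of how that could be checked (not an official procedure): list the oldest workers with their wait state, then walk the fork hierarchy of the main daemon:

Code:
# oldest pvestatd workers first (lowest PIDs), with state, age and wait channel
ps -o pid,ppid,stat,etime,wchan:32,args -C pvestatd --sort=pid | head -20
# fork hierarchy under the main daemon (3517 is the main PID from the syslog above)
pstree -ap 3517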
 

ChristianW

New Member
Feb 13, 2020
Hi,
is there any update on this? We are facing the same issue, except the cycle of occurrence is about seven days. It probably began with a dist-upgrade 14 days ago.
 

blebo

Member
Apr 22, 2018
Perth, Australia
Definitely NOT solved; however, I've been able to stabilise memory back to the typical usage pattern by stopping an LXC container running Grafana/InfluxDB (Ubuntu 18.04), which was monitoring the same PVE host. Something in pvestatd must be having trouble with that type of resource usage profile.
 

ChristianW

New Member
Feb 13, 2020
Hm, we have a somewhat similar setup:
A VM running OPNsense with a WireGuard tunnel to Grafana/Graphite.
Proxmox pushes metric data through this tunnel.

In our case we have a Debian base with Proxmox installed on top.
Our trouble started with the last dist-upgrade, which we did on 9 October 2020.
 

ChristianW

New Member
Feb 13, 2020
Last week we changed from UDP to TCP in status.cfg. For the last cycle we had no crash. We will keep observing, and maybe give UDP with the new mtu option for status.cfg a try.
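
For anyone trying the same change: it is made in /etc/pve/status.cfg. A minimal sketch for a Graphite target, assuming the graphite plugin's proto option and an illustrative server address; the exact section header syntax may differ between PVE releases, so check the "External Metric Server" section of the docs for your version:

Code:
graphite:
        server 10.0.0.10
        port 2003
        path proxmox
        proto tcp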
 
