pvestatd leak creates 3000+ processes, consumes all RAM and swap, and halts/reboots the machine over a 4-hour cycle. Then repeats.

blebo

Member
Apr 22, 2018
Perth, Australia
This started with the later PVE 6.1 updates (I think) and has definitely persisted throughout PVE 6.2 (including 6.2-12), i.e. since about March 2020. Prior to this the system was rock-solid for about 1.5 years.

Normal operation is ~10-12 GiB of RAM used (of 32 GiB total). See the attached picture for the cycle.

Recently I noticed that if I stop the pvestatd service via the GUI before the leak hits swap, and it stops successfully, then the leaked memory is freed instantly. However, starting pvestatd again just restarts the 4-hour cycle from the beginning.
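
For reference, the same workaround can be done from the shell; a minimal sketch, assuming only the standard pvestatd systemd unit (this resets the cycle, it does not fix the leak):

Code:
# stopping the statistics daemon kills the leaked workers and frees their memory
systemctl stop pvestatd
# starting it again begins the ~4 hour leak cycle from scratch
systemctl start pvestatd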

I've monitored the pvestatd process count using the following command (in a cron bash script every 5 minutes), and I routinely see 3000+ pvestatd processes:
ps -eo pid,ppid,cmd,%mem,%cpu,vsz,rss --sort=-%mem | grep pvestatd | wc -l

The ZFS ARC (RAM cache) appears to behave correctly, progressively reducing its footprint as the pvestatd leak approaches 100% RAM; monitored using the following command:
awk '/^size/ { print $1 " " $3 / 1048576 }' < /proc/spl/kstat/zfs/arcstats

Package versions:

Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph-fuse: 12.2.13-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 0.8.21-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1


Any assistance is much appreciated. Let me know if more information is required.

Cheers
 

Attachments

  • Screen Shot 2020-10-01 at 5.52.50 pm.png (52.9 KB)

fabian

Proxmox Staff Member
Jan 7, 2016
You could check what those processes are doing (or rather, what those processes are stuck on). My guess is some hanging storage?
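
A possible way to do that (a sketch with standard tools, not an official procedure) is to look at the kernel wait channel of the workers and, for one of the stuck ones, its kernel stack:

Code:
# what is each pvestatd worker blocked on in the kernel (wait channel)?
ps -o pid,stat,wchan:32,args -C pvestatd | head -20
# for one suspicious PID (12345 is a placeholder), dump its kernel stack as root
cat /proc/12345/stack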
 

blebo

Member
Apr 22, 2018
Perth, Australia
Looking into it a bit further, I am seeing the following as typical in syslog:

Code:
Oct 04 16:00:47 pm01 pvestatd[28666]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[18976]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[19126]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[19964]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[19443]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[21277]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: node status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[29396]: status update error: Too many open files
Oct 04 16:00:47 pm01 pvestatd[18790]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: qemu status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 112 -p' failed: open3: pipe(GLOB(0x55f17f4c32f0), IO::File=GLOB(0x55f17f4b6700)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[23438]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 114 -p' failed: open3: pipe(GLOB(0x55f17f4b6fe8), IO::File=GLOB(0x55f17f4cfc48)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 116 -p' failed: open3: pipe(GLOB(0x55f17ee98fd8), IO::File=GLOB(0x55f17eecf970)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 107 -p' failed: open3: pipe(GLOB(0x55f17eeafd60), IO::File=GLOB(0x55f17eee3750)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[24990]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: lxc status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17f4cacb0), IO::File=GLOB(0x55f17ee9ea68)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17eea3578), IO::File=GLOB(0x55f17eefc050)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: could not activate storage 'zfs-vm', zfs error: open3: pipe(GLOB(0x55f17eea3578), IO::File=GLOB(0x55f17eefc050)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[21546]: ipcc_send_rec[4] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[29871]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30277]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[30277]: zfs error: open3: pipe(GLOB(0x55f17cd7eae0), IO::File=GLOB(0x55f17eee3570)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17cd7eae0), IO::File=GLOB(0x55f17eee3570)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17f4c8690)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[30277]: zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17ee98438)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[28837]: ipcc_send_rec[2] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: could not activate storage 'zfs-lxc', zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17f4c8690)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[30277]: could not activate storage 'zfs-lxc', zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17ee98438)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[24325]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30277]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30278]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[30279]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[19780]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[22968]: ipcc_send_rec[4] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[30277]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30280]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[3517]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30279]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30278]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30282]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[3517]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30280]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30277]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30279]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30278]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
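
A quick way to tell whether only the per-process limit is being hit, or whether the system-wide file descriptor table is also filling up (a sketch using standard /proc interfaces):

Code:
# allocated / free / maximum file handles system-wide
cat /proc/sys/fs/file-nr
# total number of descriptors currently held by all pvestatd processes
for p in $(pgrep pvestatd); do ls /proc/$p/fd 2>/dev/null; done | wc -l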
 

blebo

Member
Apr 22, 2018
Perth, Australia
Following my update above, is it possible that the default open-files limit for a clean install (i.e. ulimit -n) changed somewhere between PVE 5.x and PVE 6.2?

Output of ulimit -a:
Code:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127563
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127563
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
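
The limit that matters here is the one applied to the running pvestatd service, not the interactive shell's ulimit. A sketch of how to check both the systemd setting and the live daemon (pvestatd is the standard unit name; the PID lookup is only illustrative):

Code:
# open-files limit configured for the pvestatd systemd unit
systemctl show pvestatd --property=LimitNOFILE
# limit actually in effect for the running daemon
grep "Max open files" /proc/$(systemctl show pvestatd --property=MainPID --value)/limits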


Then running the following script shows that each pvestatd process is sitting at (or just under) the 1024 open-files limit:
Code:
SEARCH="pvestatd"
for i in $(ps -C "${SEARCH}" -o pid | grep -v PID)
do
     echo "PID # ${i} open files count : $(ls -l /proc/${i}/fd | wc -l)"
done


Output (typical):
Code:
...
PID # 31846 open files count : 1023
PID # 31847 open files count : 1023
PID # 31848 open files count : 1023
PID # 31849 open files count : 1023
PID # 31850 open files count : 1023
PID # 32049 open files count : 1023
PID # 32050 open files count : 1023
PID # 32051 open files count : 1023
PID # 32052 open files count : 1023
PID # 32053 open files count : 1023
PID # 32054 open files count : 1023
PID # 32055 open files count : 1023
PID # 32275 open files count : 1023
PID # 32276 open files count : 1023
PID # 32277 open files count : 1023
PID # 32278 open files count : 1023
PID # 32279 open files count : 1023
PID # 32280 open files count : 1023
PID # 32281 open files count : 1023
PID # 32379 open files count : 1023
PID # 32380 open files count : 1023
PID # 32381 open files count : 1023
PID # 32382 open files count : 1023
PID # 32383 open files count : 1023
PID # 32384 open files count : 1023
PID # 32385 open files count : 1023
PID # 32573 open files count : 1023
PID # 32574 open files count : 1023
PID # 32575 open files count : 1023
...
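
To see what those descriptors actually point at, the fd symlinks of one worker can be aggregated by target; a sketch (31846 is just one PID from the output above):

Code:
# group one worker's open descriptors by target (pipe, socket, regular file, ...)
for fd in /proc/31846/fd/*; do readlink "$fd"; done | sort | uniq -c | sort -rn | head -20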
 

fabian

Proxmox Staff Member
Jan 7, 2016
I think that is just a symptom... please check where the earlier processes (the ones with the lowest PIDs) are hanging (e.g. which file they are reading from, which children they have forked and are waiting for, ...).
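
A sketch of how that could be checked (not an official procedure): list the oldest workers with their wait state, then walk the fork hierarchy of the main daemon:

Code:
# oldest pvestatd workers first (lowest PIDs), with state, age and wait channel
ps -o pid,ppid,stat,etime,wchan:32,args -C pvestatd --sort=pid | head -20
# fork hierarchy under the main daemon (3517 is the main PID from the syslog above)
pstree -ap 3517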
 

ChristianW

New Member
Feb 13, 2020
Hi,
is there any update on this? We are facing the same issue, except the cycle of occurrence is about seven days. It probably began with a dist-upgrade 14 days ago.
 

blebo

Member
Apr 22, 2018
Perth, Australia
Definitely NOT solved; however, I've been able to stabilise memory back to the typical usage pattern by stopping an LXC container running Grafana/InfluxDB (Ubuntu 18.04), which was monitoring the same PVE host. Something in pvestatd must be having trouble with that type of resource usage profile.
 

ChristianW

New Member
Feb 13, 2020
Hm, we have a somewhat similar setup:
A VM running OPNsense with a WireGuard tunnel to Grafana/Graphite.
Proxmox pushes metric data through this tunnel.

In our case we have a Debian base with Proxmox installed on top.
Our trouble started with the last dist-upgrade, which we did on 9 October 2020.
 

ChristianW

New Member
Feb 13, 2020
Last week we changed from UDP to TCP in status.cfg. For the last cycle we had no crash. We will keep observing, and maybe give UDP with the new mtu option for status.cfg a try.
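
For anyone trying the same change: it is made in /etc/pve/status.cfg. A minimal sketch for a Graphite target, assuming the graphite plugin's proto option and an illustrative server address; the exact section header syntax may differ between PVE releases, so check the "External Metric Server" section of the docs for your version:

Code:
graphite:
        server 10.0.0.10
        port 2003
        path proxmox
        proto tcp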
 
