pvestatd leak creates 3000+ processes, consumes all RAM and swap, and halts/reboots the machine over a 4-hour cycle. Then repeats.

This started with the later PVE 6.1 updates (I think) and has definitely been present throughout PVE 6.2 (including 6.2-12), i.e. since about March 2020. Prior to this the system was rock-solid for about 1.5 years.

Normal operation is ~10-12 GB usage (of 32 GB total). See the attached picture for the cycle.

Recently I noticed that if I stop the pvestatd service via the GUI before the leak hits swap, and it stops successfully, the leaked memory is cleared instantly. However, starting pvestatd again just restarts the 4-hour cycle from the beginning.

I've been monitoring the pvestatd process count using the following command (from a cron bash script every 5 minutes), where I routinely see 3000+ pvestatd processes:
ps -eo pid,ppid,cmd,%mem,%cpu,vsz,rss --sort=-%mem | grep pvestatd | wc -l
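
For reference, the cron wrapper is nothing fancier than something like this (a rough sketch; the script and log paths are just examples):
Code:
#!/bin/bash
# /etc/cron.d entry (example): */5 * * * * root /usr/local/bin/pvestatd-count.sh
# Logs the pvestatd process count every 5 minutes.
# The "[p]vestatd" pattern stops grep from counting itself.
COUNT=$(ps -eo pid,ppid,cmd,%mem,%cpu,vsz,rss --sort=-%mem | grep "[p]vestatd" | wc -l)
echo "$(date '+%F %T') pvestatd processes: ${COUNT}" >> /var/log/pvestatd-count.log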

The ZFS ARC appears to behave correctly and progressively reduces its footprint as the pvestatd leak approaches 100% RAM, monitored using the following command:
awk '/^size/ { print $1 " " $3 / 1048576 }' < /proc/spl/kstat/zfs/arcstats

Package versions:

Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph-fuse: 12.2.13-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 0.8.21-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1


Any assistance is much appreciated. Let me know if more information is required.

Cheers
 

Attachments

  • Screen Shot 2020-10-01 at 5.52.50 pm.png (52.9 KB)
You could check what those processes are doing (or rather, what those processes are stuck on). My guess is some hanging storage?
 
Looking into it a bit further, I'm typically seeing the following in syslog:

Code:
Oct 04 16:00:47 pm01 pvestatd[28666]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[18976]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[19126]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[19964]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[19443]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[21277]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: node status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[29396]: status update error: Too many open files
Oct 04 16:00:47 pm01 pvestatd[18790]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: qemu status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 112 -p' failed: open3: pipe(GLOB(0x55f17f4c32f0), IO::File=GLOB(0x55f17f4b6700)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[23438]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 114 -p' failed: open3: pipe(GLOB(0x55f17f4b6fe8), IO::File=GLOB(0x55f17f4cfc48)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 116 -p' failed: open3: pipe(GLOB(0x55f17ee98fd8), IO::File=GLOB(0x55f17eecf970)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: command 'lxc-info -n 107 -p' failed: open3: pipe(GLOB(0x55f17eeafd60), IO::File=GLOB(0x55f17eee3750)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[24990]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: lxc status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17f4cacb0), IO::File=GLOB(0x55f17ee9ea68)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17eea3578), IO::File=GLOB(0x55f17eefc050)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: could not activate storage 'zfs-vm', zfs error: open3: pipe(GLOB(0x55f17eea3578), IO::File=GLOB(0x55f17eefc050)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[21546]: ipcc_send_rec[4] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[29871]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30277]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[30277]: zfs error: open3: pipe(GLOB(0x55f17cd7eae0), IO::File=GLOB(0x55f17eee3570)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17cd7eae0), IO::File=GLOB(0x55f17eee3570)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[3517]: zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17f4c8690)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[30277]: zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17ee98438)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[28837]: ipcc_send_rec[2] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: could not activate storage 'zfs-lxc', zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17f4c8690)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[30277]: could not activate storage 'zfs-lxc', zfs error: open3: pipe(GLOB(0x55f17eed0e98), IO::File=GLOB(0x55f17ee98438)) failed: Too many open files at /usr/share/perl5/PVE/Tools.pm line 444.
Oct 04 16:00:47 pm01 pvestatd[24325]: ipcc_send_rec[1] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[3517]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30277]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30278]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[30279]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[19780]: ipcc_send_rec[3] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[22968]: ipcc_send_rec[4] failed: Too many open files
Oct 04 16:00:47 pm01 pvestatd[30277]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30280]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[3517]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30279]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30278]: Can't call method "reader" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 975.
Oct 04 16:00:47 pm01 pvestatd[30282]: Can't call method "writer" on an undefined value at /usr/share/perl5/PVE/Tools.pm line 954.
Oct 04 16:00:47 pm01 pvestatd[3517]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30280]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30277]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30279]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
Oct 04 16:00:47 pm01 pvestatd[30278]: storage status update error: Unrecognised protocol udp at /usr/share/perl5/PVE/Status/InfluxDB.pm line 92.
 
Following my update above, is it possible that somewhere between PVE 5.x and PVE 6.2 the default open-files limit (i.e. ulimit -n) was increased for clean installs?

Output of ulimit -a:
Code:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127563
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127563
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
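
For reference, the limit that actually applies to the running daemon (which can differ from an interactive shell's ulimit -n) can be read from procfs, and if anyone wants to experiment, a per-service systemd override could raise it. This is only a rough sketch; raising the ceiling would not fix whatever is leaking the descriptors:
Code:
# Effective limits of the running pvestatd main process
grep 'open files' /proc/$(systemctl show -p MainPID --value pvestatd)/limits

# Sketch of a per-service override (the value is just an example)
mkdir -p /etc/systemd/system/pvestatd.service.d
cat > /etc/systemd/system/pvestatd.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
systemctl daemon-reload
systemctl restart pvestatd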


Then running the following script shows that each pvestatd process is at (or very near) the 1024 open-files limit:
Code:
SEARCH="pvestatd"
# For every pvestatd PID, count the entries under /proc/<pid>/fd
# (note: "ls -l" also prints a "total" header line, so the count is +1).
for i in $(ps -C "${SEARCH}" -o pid | grep -v PID)
do
    echo "PID # ${i} open files count : $(ls -l /proc/${i}/fd | wc -l)"
done


Output (typical):
Code:
...
PID # 31846 open files count : 1023
PID # 31847 open files count : 1023
PID # 31848 open files count : 1023
PID # 31849 open files count : 1023
PID # 31850 open files count : 1023
PID # 32049 open files count : 1023
PID # 32050 open files count : 1023
PID # 32051 open files count : 1023
PID # 32052 open files count : 1023
PID # 32053 open files count : 1023
PID # 32054 open files count : 1023
PID # 32055 open files count : 1023
PID # 32275 open files count : 1023
PID # 32276 open files count : 1023
PID # 32277 open files count : 1023
PID # 32278 open files count : 1023
PID # 32279 open files count : 1023
PID # 32280 open files count : 1023
PID # 32281 open files count : 1023
PID # 32379 open files count : 1023
PID # 32380 open files count : 1023
PID # 32381 open files count : 1023
PID # 32382 open files count : 1023
PID # 32383 open files count : 1023
PID # 32384 open files count : 1023
PID # 32385 open files count : 1023
PID # 32573 open files count : 1023
PID # 32574 open files count : 1023
PID # 32575 open files count : 1023
...
 
I think that is just a symptom. Please check where the earlier processes (those with the lowest PIDs) are hanging (e.g., which file they are reading from, which children they have forked and are waiting for, ...).
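
For example, something along these lines (a rough sketch; substitute one of the oldest pvestatd PIDs from your ps output, and note that /proc/<pid>/stack needs root):
Code:
PID=31846   # placeholder - use one of the oldest pvestatd workers

# kernel function the process is currently sleeping in
cat /proc/$PID/wchan; echo

# full kernel stack of the blocked task (root only)
cat /proc/$PID/stack

# which files / sockets / pipes it is holding open
ls -l /proc/$PID/fd

# children it has forked and may be waiting on
ps --ppid $PID -o pid,stat,wchan:25,cmd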
 
Hi,
is there any update on this? We are facing the same issue, except our cycle of occurrence is about seven days. It probably began with a dist-upgrade 14 days ago.
 
Definitely NOT solved, however I've been able to stabilise memory to a more typical usage pattern by stopping an LXC container running Grafana/InfluxDB (Ubuntu 18.04), which was monitoring the same PVE host. Something in pvestatd must be having trouble with that kind of resource usage profile.
 
Hm, we have a somewhat similar setup:
a VM running OPNsense with a WireGuard tunnel to Grafana/Graphite.
Proxmox pushes metric data through this tunnel.

In our case it is a Debian base with Proxmox installed on top.
Our trouble started with the last dist-upgrade, which we did on 9 October 2020.
 
Last week we changed from UDP to TCP in status.cfg. For the last round we had no crash. We'll keep observing, and maybe we'll give UDP with the new MTU feature for status.cfg a try.
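
For anyone wanting to try the same, the graphite section of /etc/pve/status.cfg looks roughly like this in our case (server and port are placeholders, and the exact option names should be checked against the "External Metric Server" section of the PVE docs for your version):
Code:
graphite:
        server 192.0.2.10
        port 2003
        path proxmox
        proto tcp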
 
For anyone Googling around: I think I ran into this bug on Proxmox VE 6.4-13. It turns out I had a Graphite metric server configured using UDP, except the server was offline. The system became nearly unresponsive, with lots of out-of-memory messages on the console. When I finally got logged in, I noticed that there were thousands of pvestatd processes. After killing all of those processes, things returned to normal. After rebooting and keeping an eye on pvestatd, I noticed that RAM for that process slowly increased, roughly every time it reported in `journalctl` that the Graphite server was unreachable.

I've since removed the Graphite metric server (via the GUI) and pvestatd seems to be stable; we'll see if it is still stable in a few days. If you don't see a follow-up post, that's probably what worked for me. Figured people would want to know that the problem still exists in 6.4-13. I may not enable metrics collection again (the need isn't strong) unless I feel like trying it again (or trying another piece of software).

I will also probably upgrade to 7.x in the next couple of weeks. Again, it's unlikely I'll re-enable metrics collection, but if I do, I'll report back if I have the problem again.

TY to everyone in this thread. It helped me figure out what was going on.
 
