PVE6 slab cache grows until VMs start to crash

Bash:
root@vspve4999:~# systemctl stop check_mk.socket

I will wait a few hours and then re-enable it...
 
Do I see correctly that stopping the monitoring mitigates the problem?
Or is this just an effect of the graphs not being drawn while the system was not monitored, with the issue present whether the cmk agent runs or not?

In the first case you could narrow down the problem further by:
* disabling all cmk-plugins (move the plugindir away):
** if the slab does not grow further - it's in one of the plugins
** if it grows further - chances are it's (only) in one of the commands run by the check_mk_agent script
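
A minimal sketch of the "move the plugindir away" step, in case it helps. It assumes the agent plugins live in /usr/lib/check_mk_agent/plugins (the usual default, but please verify on your install); the function names are made up for illustration.

```shell
#!/bin/sh
# Assumption: default plugin location of the Linux agent; override if yours differs.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/lib/check_mk_agent/plugins}"

# Move the plugin directory aside and leave an empty one in its place,
# so the agent still runs but executes no plugins.
disable_plugins() {
    mv "$PLUGIN_DIR" "$PLUGIN_DIR.disabled" && mkdir "$PLUGIN_DIR"
}

# Undo disable_plugins: drop the empty directory and move the original back.
restore_plugins() {
    rmdir "$PLUGIN_DIR" && mv "$PLUGIN_DIR.disabled" "$PLUGIN_DIR"
}
```

Run disable_plugins as root, watch the slab for a few hours, then call restore_plugins. If the slab stays flat while the plugins are disabled, re-enabling them one at a time should point to the culprit.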

Thanks!
 
As far as I can see, the slab has stopped growing.

What I did (at around 17:40): configured the agent to run 'old style' via xinetd.
It seems to be stable... please wait a couple of hours, I will give an update.
1570122345852.png
 
amazing:
1570129051067.png
I did nothing other than switch the cmk agent from the systemd socket to xinetd.
 
commands executed now on every pve6 node:
Bash:
systemctl stop check_mk.socket
systemctl stop system-check_mk.slice
systemctl disable check_mk.socket
systemctl restart xinetd.service

will give an update tomorrow...
 
Switching check_mk_agent to xinetd seems to help.
Here is a graph from another node:
1570166344912.png

for testing purposes, i will switch back to socket on the virtual node to see if the slab grows there again ...
 
@zeha:

Could you please check whether 'system-check_mk.slice' is running, and report its memory limit and current usage?

Code:
systemctl status system-check_mk.slice
systemctl show system-check_mk.slice -p MemoryLimit
systemctl show system-check_mk.slice -p MemoryCurrent
 
I can confirm that the slab starts growing again immediately after reactivating check_mk.socket.
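
For anyone who wants a number rather than a graph, here is a rough sketch that samples the Slab line of /proc/meminfo twice and prints the difference. The function name and the default 60-second interval are just illustrative choices, not something from the cmk or Proxmox tooling.

```shell
#!/bin/sh
# Print how much the kernel slab grew (in kB) over an interval in seconds.
# $1: interval in seconds (default 60)
# $2: meminfo-style file to read (default /proc/meminfo)
slab_growth_kb() {
    interval=${1:-60}
    meminfo=${2:-/proc/meminfo}
    # Match only the "Slab:" line, not SReclaimable/SUnreclaim.
    before=$(awk '/^Slab:/ {print $2}' "$meminfo")
    sleep "$interval"
    after=$(awk '/^Slab:/ {print $2}' "$meminfo")
    echo $((after - before))
}
```

Calling slab_growth_kb 600 once with check_mk.socket active and once with it stopped gives a simple before/after comparison. Some fluctuation is normal, so only a steady increase over several samples is meaningful.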
 
Can report:

Code:
17:32 ch@vn01:~ % systemctl show system-check_mk.slice -p MemoryLimit
MemoryLimit=infinity
17:32 ch@vn01:~ % systemctl show system-check_mk.slice -p MemoryCurrent
MemoryCurrent=562638848
17:32 ch@vn03:~ % systemctl show system-check_mk.slice -p MemoryLimit
MemoryLimit=infinity
17:32 ch@vn03:~ % systemctl show system-check_mk.slice -p MemoryCurrent
MemoryCurrent=1619193856
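
MemoryCurrent is reported in bytes. A small, hypothetical helper (the name is mine) to turn a line like the ones above into whole MiB:

```shell
#!/bin/sh
# Convert a "MemoryCurrent=<bytes>" line, as printed by
# `systemctl show <unit> -p MemoryCurrent`, to whole MiB (rounded down).
# Note: this sketch does not handle non-numeric values such as "[not set]".
mem_current_mib() {
    bytes=${1#MemoryCurrent=}
    echo $((bytes / 1048576))
}

# Example with live data (requires systemd):
# mem_current_mib "$(systemctl show system-check_mk.slice -p MemoryCurrent)"
```

With the values above this gives roughly 536 MiB on vn01 and 1544 MiB on vn03 charged to the slice.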

This might be related: https://bugzilla.redhat.com/show_bug.cgi?id=1507149
I've seen some possibly related kernel fixes elsewhere, but can't find them right now.
 
BTW, even better:

systemctl stop system-check_mk.slice results in the memory getting freed:

Screenshot 2019-10-04 at 22.45.35.png

No useless reboots <3
 
@zeha:
you may want to try:

Bash:
systemctl stop check_mk.socket
systemctl disable check_mk.socket
sed -i 's/KillMode=process/Type=forking/' /etc/systemd/system/check_mk@.service
systemctl daemon-reload
systemctl enable check_mk.socket
systemctl start check_mk.socket

please report back...
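
If you would rather see what the sed call changes before touching the unit file, here is a small sketch that prints a diff without modifying anything. The function name is made up; the default path is the one from the commands above.

```shell
#!/bin/sh
# Show what the KillMode -> Type substitution would change in a unit file,
# as a unified diff, without editing the file itself.
# $1: unit file (defaults to the path used in the commands above)
# Exit status is diff's: 0 if nothing would change, 1 if there are differences.
preview_unit_change() {
    unit=${1:-/etc/systemd/system/check_mk@.service}
    sed 's/KillMode=process/Type=forking/' "$unit" | diff -u "$unit" -
}
```

An empty output means the unit has no KillMode=process line, in which case the sed -i call would be a no-op on that host.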
 
For now I can report this:
On one site I've switched the misbehaving machine from check_mk.socket to xinetd (because the rest of the fleet there is set up like that), and the problem is gone.

I'll try Type=forking soon.
 
The guys from cmk told me that the fix will be merged into the latest stable tree (1.6) in an upcoming release.
Anyway, this looks to me like a memory leak in systemd.

Thanks to the Proxmox staff for helping to sort this out.
 
Wow, if anyone ever needed evidence that the Proxmox support community is active and helpful, this is it.

I have been tracking this but waiting until resolution to ask a noobie question: Is this issue present in all Proxmox 6 ISO installations or did zeha install an extra monitoring package that ended up causing this issue?

If it affects all Proxmox 6 ISO installations, should we all do as instructed in entry #54 above?

If it is due to an extra package installed on top of Proxmox 6, is check_mk_agent the monitoring package we should avoid installing?
 
The issue seems to lie in the interaction between systemd (the init system on many Linux distributions) and CheckMK (a.k.a. cmk), an independent monitoring system, which triggers the bug.

Default installations (without any additional software installed) of PVE 6.0 do not trigger this problem.

I hope this helps!
 
Thank you for the quick and clear response
 
