PVE6 slab cache grows until VMs start to crash

kohly · Oct 3, 2019

Bash:

root@vspve4999:~# systemctl stop check_mk.socket

will wait a few hours and reenable it then...

kohly · Oct 3, 2019

wow, look at this:

Stoiko Ivanov · Oct 3, 2019

do I see correctly that stopping the monitoring mitigates the problem?!
or is this just an effect of the graphs not being drawn while the system was not monitored, and the issue is present whether cmk-agent runs or not?

In the first case you could narrow down the problem further by:
* disabling all cmk-plugins (move the plugindir away):
** if the slab does not grow further - it's in one of the plugins
** if it grows further - chances are it's (only) in one of the commands run by the check_mk_agent script
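
A minimal sketch of how the test above could be run, not something posted in the thread. The plugin path /usr/lib/check_mk_agent/plugins is an assumption (a common default for the agent) and should be checked on the node first; the function names are made up for illustration.

```shell
#!/bin/sh
# Print the "Slab:" value (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo; another path can be passed as $1.
slab_kb() {
    awk '$1 == "Slab:" { print $2 }' "${1:-/proc/meminfo}"
}

# ASSUMPTION: default plugin location of the agent; verify before use.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/lib/check_mk_agent/plugins}"

# Move the plugin directory aside so the agent runs without any plugins.
# Undo with: mv "$PLUGIN_DIR.disabled" "$PLUGIN_DIR"
disable_plugins() {
    mv "$PLUGIN_DIR" "$PLUGIN_DIR.disabled"
}

# Print one timestamped slab reading, suitable for appending to a log.
log_slab() {
    printf '%s Slab=%s kB\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(slab_kb "$@")"
}
```

Calling disable_plugins once and then log_slab every few minutes (from cron or a loop) gives a series of readings to compare against the graph: a flat series points at a plugin, a rising one at the agent script itself.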

Thanks!

kohly · Oct 3, 2019

as far as i can see, the slab has stopped growing.

what i did (at around 17:40): configured the agent to be run 'old style' by xinetd.
seems to be stable... please wait a couple of hours, i will give an update.

kohly · Oct 3, 2019

amazing:

i did nothing other than switch the cmk agent from socket to xinetd.

kohly · Oct 3, 2019

commands executed now on every pve6 node:

Bash:

systemctl stop check_mk.socket
systemctl stop system-check_mk.slice
systemctl disable check_mk.socket
systemctl restart xinetd.service

will give an update tomorrow...

kohly · Oct 4, 2019

switching check_mk_agent to xinetd seems to help.
here is a graph from another node:

for testing purposes, i will switch back to socket on the virtual node to see if the slab grows there again ...

kohly · Oct 4, 2019

@zeha:

may you please verify a running 'system-check_mk.slice' and its memory limit/current?

Code:

systemctl status system-check_mk.slice
systemctl show system-check_mk.slice -p MemoryLimit
systemctl show system-check_mk.slice -p MemoryCurrent

kohly · Oct 4, 2019

i can verify that the slab grows immediately after reactivating the check_mk.socket

kohly · Oct 4, 2019

got an answer from cmk:

Ronny Bruska commented on FEED-4380 (check_mk.socket memory leak):

Hello Kohly,

thanks. We already have a change for this in master.

[https://checkmk.de/check_mk-werks.php?werk_id=10070]

It will also be pushed to 1.6 shortly.

Best regards,

Ronny

zeha · Oct 4, 2019

Also see: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=940021

zeha · Oct 4, 2019

kohly said:
@zeha:

may you please verify a running 'system-check_mk.slice' and its memory limit/current?

Code:

systemctl status system-check_mk.slice
systemctl show system-check_mk.slice -p MemoryLimit
systemctl show system-check_mk.slice -p MemoryCurrent

Can report:

Code:

17:32 ch@vn01:~ % systemctl show system-check_mk.slice -p MemoryLimit
MemoryLimit=infinity
17:32 ch@vn01:~ % systemctl show system-check_mk.slice -p MemoryCurrent
MemoryCurrent=562638848
17:32 ch@vn03:~ % systemctl show system-check_mk.slice -p MemoryLimit
MemoryLimit=infinity
17:32 ch@vn03:~ % systemctl show system-check_mk.slice -p MemoryCurrent
MemoryCurrent=1619193856
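
MemoryCurrent is reported in bytes, which is hard to compare at a glance. A small sketch for reading it in MiB; the helper names are hypothetical, and it assumes memory accounting is active for the unit (otherwise systemd reports a non-numeric value, which the helper rejects).

```shell
#!/bin/sh
# Convert a byte count to whole MiB (integer division, rounds down).
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Print MemoryCurrent of the unit named in $1 in MiB.
unit_mem_mib() {
    line=$(systemctl show "$1" -p MemoryCurrent)   # e.g. MemoryCurrent=562638848
    value=${line#*=}                               # strip the "MemoryCurrent=" prefix
    case $value in
        ''|*[!0-9]*)
            echo "no numeric MemoryCurrent for $1: $value" >&2
            return 1 ;;
        *)
            bytes_to_mib "$value" ;;
    esac
}
```

For example, `unit_mem_mib system-check_mk.slice`. The two values reported above work out to 536 MiB on vn01 and 1544 MiB on vn03.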

This might be related: https://bugzilla.redhat.com/show_bug.cgi?id=1507149
I've seen some possibly related kernel fixes elsewhere, but can't find them right now.

zeha · Oct 4, 2019

BTW, even better:

systemctl stop system-check_mk.slice results in the memory getting freed:

No useless reboots <3

kohly · Oct 5, 2019

@zeha:
you might like to try:

Bash:

systemctl stop check_mk.socket
systemctl disable check_mk.socket
sed -i 's/KillMode=process/Type=forking/' /etc/systemd/system/check_mk@.service
systemctl daemon-reload
systemctl enable check_mk.socket
systemctl start check_mk.socket

please report back...
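
A slightly more defensive variant of the sed step, as a sketch only: it keeps a backup of the unit file and refuses to touch it if the KillMode=process line is not there. The function name is made up, the pattern is anchored to the whole line (the original command is not), and `sed -i` assumes GNU sed as shipped with Debian/PVE.

```shell
#!/bin/sh
# Replace the line "KillMode=process" with "Type=forking" in a unit file.
# $1: path to the unit file, e.g. /etc/systemd/system/check_mk@.service
# A copy of the original is kept next to it as <file>.bak.
patch_unit() {
    unit=$1
    if ! grep -q '^KillMode=process$' "$unit"; then
        echo "KillMode=process not found in $unit, nothing changed" >&2
        return 1
    fi
    cp -p "$unit" "$unit.bak"
    sed -i 's/^KillMode=process$/Type=forking/' "$unit"
}
```

It would take the place of the sed line only; the stop/disable before it and the daemon-reload/enable/start after it stay as they are. To roll back, copy the .bak file over the unit and run systemctl daemon-reload again.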

zeha · Oct 5, 2019

For now I can report this:
On one site I've switched the misbehaving machine from check_mk.socket to xinetd (because the rest of the fleet there is set up like that), and the problem is gone.

I'll try Type=forking soon.

zeha · Oct 8, 2019

zeha said:
I'll try Type=forking soon.

That also appears to solve the problem. Bit of a meh "solution" though.

kohly · Oct 8, 2019

the guys from cmk told me that the fix will be merged into the latest stable tree (1.6) in an upcoming release.
anyway: this looks to me like a memory leak in systemd.

thanx to the proxmox staff for help sorting this out.

Ramblin · Oct 9, 2019

Wow, if anyone ever needed evidence that the Proxmox support community is active and helpful, this is it.

I have been tracking this but waited until it was resolved to ask a newbie question: Is this issue present in all Proxmox 6 ISO installations, or did zeha install an extra monitoring package that ended up causing this issue?

If it affects all Proxmox 6 ISO installations, should we all do as instructed in entry #54 above?

If it is due to an extra package installation on Proxmox 6, is the package we should avoid installing the monitoring package check_mk_agent ?

Stoiko Ivanov · Oct 9, 2019

The issue seems to be between systemd (the init system on many Linux distributions) and CheckMK (a.k.a. cmk) - an independent monitoring system, which triggers the bug.

Default installations (without any additional software installed) of PVE 6.0 do not trigger this problem.

I hope this helps!

Ramblin · Oct 9, 2019

Stoiko Ivanov said:
The issue seems to be between systemd (the init system on many Linux distributions) and CheckMK (a.k.a. cmk) - an independent monitoring system, which triggers the bug.

Default installations (without any additional software installed) of PVE 6.0 do not trigger this problem.

I hope this helps!

Thank you for the quick and clear response
