PVE6 slab cache grows until VMs start to crash

Small update:
on the virtual PVE node the corosync process was finally killed by the OOM killer, so I had to reboot:

Bash:
grep oom-kill /var/log/messages | tail -n 1
Oct  1 09:50:53 vspve4999 kernel: [419571.923228] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/corosync.service,task=corosync,pid=1286,uid=0

Overview of the growing slab usage over one week:
1569922478095.png
The corresponding uptime:
1569922833617.png

You may find all performance data collected from this node over the past week in the attached PDF.
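(For reference, a slab curve like the one above can also be sampled with a simple loop, independent of the monitoring system; the interval and log path below are arbitrary examples.)

Bash:
# append a timestamped snapshot of the slab counters to a log once a minute
while true; do
    echo "$(date '+%F %T') $(grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo | tr -s ' \n' ' ')" >> /var/log/slab-usage.log
    sleep 60
done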
 
So the virtual PVE node was running the mainline kernel over the weekend?
That would indicate that the issue is probably in the upstream kernel and not introduced by Ubuntu's or our patch set.

Any chance you could try running the node with an older PVE kernel (e.g. pve-kernel-5.0.12-1-pve)?
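Roughly along these lines; the kernel can either be selected manually in the GRUB menu at boot or pre-selected with grub-reboot (the entry title below is only a guess, take the real one from /boot/grub/grub.cfg):

Bash:
# install the older kernel alongside the currently running one
apt install pve-kernel-5.0.12-1-pve

# list the matching GRUB entries, then pre-select one for the next boot only
grep "menuentry '" /boot/grub/grub.cfg | grep 5.0.12-1-pve
grub-reboot "Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 5.0.12-1-pve"
reboot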

Thanks for reporting back!
 
Yes, the node ran with the mainline kernel over the weekend.

The latest step today was to revert to the latest PVE kernel (5.0.21-6), which is a bit newer than the previously used one (5.0.21-2).
tl;dr: does the changelog of the new kernel suggest that it addresses this issue?
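(For reference, the changelog shipped with an installed kernel package can be checked locally like this; the exact package name below is an assumption based on the version above.)

Bash:
# list the installed pve-kernel packages to get the exact package name
dpkg -l 'pve-kernel-*' | grep ^ii
# read the packaged changelog (adjust the name to what dpkg -l printed)
zless /usr/share/doc/pve-kernel-5.0.21-6-pve/changelog.Debian.gz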

Oh, and I could not attach the PDF; it is too large.
You may find it here:
http://share.kohly.de/index.php/s/6sRi365ory7tEfA
 
the kernel "PVE 5.0.21-6" did not change anything regarding the usage of slab.
i just have installed the pve-kernel-5.0.12-1-pve as suggested and rebooted the virtual pve node.
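(To verify the booted kernel and get a slab baseline to compare against, something like this is enough:)

Bash:
# verify which kernel is actually running after the reboot
uname -r
# note the current slab usage as a baseline for the next days
grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo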
 
Hmm, one other user who experiences this reported that the issue was initially not present after booting 5.0.21-6 (which made me hope it might be fixed in that version).

Thanks for reporting back!
 
Small update:
The graph with 5.0.21-6:
1570009474804.png
The graph with 5.0.12-1:
1570009567035.png
 

Thanks for your research!

Yes, we are using cmk for monitoring, and yes, we use IPMI to monitor the hardware.

Bash:
root@vspve4999:~# aptitude search ipmi | grep ^i
i  freeipmi - GNU implementation of the IPMI protocol
i A freeipmi-bmc-watchdog - GNU implementation of the IPMI protocol - BMC watchdog
i A freeipmi-common - GNU implementation of the IPMI protocol - common files
i A freeipmi-ipmidetect - GNU IPMI - IPMI node detection tool
i  freeipmi-tools - GNU implementation of the IPMI protocol - tools
i  ipmitool - utility for IPMI control with kernel driver or LAN interface (daemon)
i A libfreeipmi17 - GNU IPMI - libraries
i A libipmiconsole2 - GNU IPMI - Serial-over-Lan library
i A libipmidetect0 - GNU IPMI - IPMI node detection library
root@vspve4999:~# aptitude remove freeipmi freeipmi-bmc-watchdog freeipmi-common freeipmi-ipmidetect freeipmi-tools ipmitool libfreeipmi17 libipmiconsole2 libipmidetect0
<snip>
<snap>
root@vspve4999:~# reboot; exit

So we will see...
 
Can you try unloading all the ipmi_ kmods? (`lsmod | grep ipmi_` -> `rmmod ...`)
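Roughly like this, if any of them show up (the exact set of loaded ipmi_ modules can differ):

Bash:
# see which IPMI modules are currently loaded
lsmod | grep '^ipmi_'
# unload them; modprobe -r also takes care of now-unused dependencies
modprobe -r ipmi_si ipmi_devintf ipmi_ssif ipmi_msghandler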

Result over here:

Screenshot 2019-10-02 at 22.11.06.png


But as I just said to someone else, I suspect we're chasing more than one kernel memory leak...
 
This is a great idea, but this node is a virtual machine,
so there are no IPMI modules loaded at all...

Bash:
root@vspve4999:~# lsmod | grep ipmi_
root@vspve4999:~#
 
Ah. So then you had ipmitool installed but it could never do anything, right?
Yeah so we are indeed chasing more than one leak there...
 
Can you run execsnoop for a few minutes and post what it reports?
My other wild guess is stopping pve-firewall.service.
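I.e. something along the lines of:

Bash:
# stop the firewall service for the test; it can simply be started again afterwards
systemctl stop pve-firewall.service
systemctl status pve-firewall.service --no-pager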
 
Sorry, I don't know 'execsnoop', please explain what to do...
I stopped the pve-firewall.service just now.
 
Sorry, I don't know 'execsnoop', please explain what to do...

The packages to install would be `bpfcc-tools` and `pve-headers`; then you'll have execsnoop-bpfcc.
You can just run it for a few minutes (preferably as `execsnoop-bpfcc -tx`). It will show which programs get executed.
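For example, capturing a few minutes into a file (timeout and the log path are just one way to do it):

Bash:
apt install bpfcc-tools pve-headers
# trace all exec()s (including failed ones, with timestamps) for five minutes
timeout 300 execsnoop-bpfcc -tx > /tmp/execsnoop.log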
 
Hm, apart from the cmk-agent invocations and some regular timers from PVE, I don't see too much going on on that host...
Would it be possible to disable the cmk-agent service/socket on the host and see whether there's any effect? (you can continue monitoring with slabtop)
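As a rough sketch; the exact unit names depend on how the agent was set up, so check with systemctl first:

Bash:
# find and stop/disable the check_mk agent units on this node
systemctl list-units --all | grep -i 'check.mk'
systemctl disable --now check-mk-agent.socket    # adjust to the real unit name
# keep watching the slab caches in the meantime
slabtop -o -s c | head -n 20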

Thanks!