PVE6 slab cache grows until VMs start to crash

Small update:
on the virtual PVE node the corosync process was finally killed by the OOM killer, so I had to reboot:

Bash:
grep oom-kill /var/log/messages | tail -n 1
Oct  1 09:50:53 vspve4999 kernel: [419571.923228] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/corosync.service,task=corosync,pid=1286,uid=0

Overview of the growing slab usage over one week:
1569922478095.png
The corresponding uptime:
1569922833617.png

You may find all performance data collected from this node over the past week in the attached PDF.
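(For reference, a slab curve like the one above can also be sampled with a simple loop, independent of the monitoring system; the interval and log path below are arbitrary examples.)

Bash:
# append a timestamped snapshot of the slab counters to a log once a minute
while true; do
    echo "$(date '+%F %T') $(grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo | tr -s ' \n' ' ')" >> /var/log/slab-usage.log
    sleep 60
done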
 
So the virtual PVE node was running the mainline kernel over the weekend?
That would indicate that the issue is probably in the upstream kernel and not introduced by Ubuntu's or our patch set.

Any chance you could try running the node with an older PVE kernel (e.g. pve-kernel-5.0.12-1-pve)?
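Roughly along these lines; the kernel can either be selected manually in the GRUB menu at boot or pre-selected with grub-reboot (the entry title below is only a guess, take the real one from /boot/grub/grub.cfg):

Bash:
# install the older kernel alongside the currently running one
apt install pve-kernel-5.0.12-1-pve

# list the matching GRUB entries, then pre-select one for the next boot only
grep "menuentry '" /boot/grub/grub.cfg | grep 5.0.12-1-pve
grub-reboot "Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 5.0.12-1-pve"
reboot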

Thanks for reporting back!
 
Yes, the node ran with the mainline kernel over the weekend.

The latest step today was to revert to the latest PVE kernel (5.0.21-6), which is a bit newer than the previously used one (5.0.21-2).
tl;dr: does the changelog of the new kernel suggest that it addresses this issue?
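(For reference, the changelog shipped with an installed kernel package can be checked locally like this; the exact package name below is an assumption based on the version above.)

Bash:
# list the installed pve-kernel packages to get the exact package name
dpkg -l 'pve-kernel-*' | grep ^ii
# read the packaged changelog (adjust the name to what dpkg -l printed)
zless /usr/share/doc/pve-kernel-5.0.21-6-pve/changelog.Debian.gz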

Oh, and I could not attach the PDF; it is too large.
You may find it here:
http://share.kohly.de/index.php/s/6sRi365ory7tEfA
 
the kernel "PVE 5.0.21-6" did not change anything regarding the usage of slab.
i just have installed the pve-kernel-5.0.12-1-pve as suggested and rebooted the virtual pve node.
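(To verify the booted kernel and get a slab baseline to compare against, something like this is enough:)

Bash:
# verify which kernel is actually running after the reboot
uname -r
# note the current slab usage as a baseline for the next days
grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo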
 
Hmm, one other user who experiences this reported that the issue was initially not present after booting 5.0.21-6 (which made me hope it might be fixed in that version).

Thanks for reporting back!
 
Small update:
The graph with 5.0.21-6:
1570009474804.png
The graph with 5.0.12-1:
1570009567035.png
 

Thanks for your research!

Yes, we are using cmk for monitoring, and yes, we use IPMI to monitor the hardware.

Bash:
root@vspve4999:~# aptitude search ipmi | grep ^i
i  freeipmi - GNU implementation of the IPMI protocol
i A freeipmi-bmc-watchdog - GNU implementation of the IPMI protocol - BMC watchdog
i A freeipmi-common - GNU implementation of the IPMI protocol - common files
i A freeipmi-ipmidetect - GNU IPMI - IPMI node detection tool
i  freeipmi-tools - GNU implementation of the IPMI protocol - tools
i  ipmitool - utility for IPMI control with kernel driver or LAN interface (daemon)
i A libfreeipmi17 - GNU IPMI - libraries
i A libipmiconsole2 - GNU IPMI - Serial-over-Lan library
i A libipmidetect0 - GNU IPMI - IPMI node detection library
root@vspve4999:~# aptitude remove freeipmi freeipmi-bmc-watchdog freeipmi-common freeipmi-ipmidetect freeipmi-tools ipmitool libfreeipmi17 libipmiconsole2 libipmidetect0
<snip>
<snap>
root@vspve4999:~# reboot; exit

So we will see...
 
Can you try unloading all the ipmi_ kmods? (`lsmod | grep ipmi_` -> `rmmod ...`)
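Roughly like this, if any of them show up (the exact set of loaded ipmi_ modules can differ):

Bash:
# see which IPMI modules are currently loaded
lsmod | grep '^ipmi_'
# unload them; modprobe -r also takes care of now-unused dependencies
modprobe -r ipmi_si ipmi_devintf ipmi_ssif ipmi_msghandler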

Result over here:

Screenshot 2019-10-02 at 22.11.06.png


But as I just said to someone else, I suspect we're chasing more than one kernel memory leak...
 
This is a great idea, but this node is a virtual machine,
so there are no IPMI modules loaded at all...

Bash:
root@vspve4999:~# lsmod | grep ipmi_
root@vspve4999:~#
 
Ah. So then you had ipmitool installed but it could never do anything, right?
Yeah so we are indeed chasing more than one leak there...
 
Can you run execsnoop for a few minutes and post what it reports?
My other wild guess is stopping pve-firewall.service.
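I.e. something along the lines of:

Bash:
# stop the firewall service for the test; it can simply be started again afterwards
systemctl stop pve-firewall.service
systemctl status pve-firewall.service --no-pager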
 
Sorry, I don't know 'execsnoop', please explain what to do...
I stopped the pve-firewall.service just now.
 
Sorry, I don't know 'execsnoop', please explain what to do...

The packages to install would be `bpfcc-tools` and `pve-headers`; then you'll have execsnoop-bpfcc.
You can just run it for a few minutes (preferably as `execsnoop-bpfcc -tx`). It will show which programs get executed.
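For example, capturing a few minutes into a file (timeout and the log path are just one way to do it):

Bash:
apt install bpfcc-tools pve-headers
# trace all exec()s (including failed ones, with timestamps) for five minutes
timeout 300 execsnoop-bpfcc -tx > /tmp/execsnoop.log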
 
Hm, apart from the cmk-agent invocations and some regular timers from PVE, I don't see too much going on on that host...
Would it be possible to disable the cmk-agent service/socket on the host and see whether there's any effect? (you can continue monitoring with slabtop)
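As a rough sketch; the exact unit names depend on how the agent was set up, so check with systemctl first:

Bash:
# find and stop/disable the check_mk agent units on this node
systemctl list-units --all | grep -i 'check.mk'
systemctl disable --now check-mk-agent.socket    # adjust to the real unit name
# keep watching the slab caches in the meantime
slabtop -o -s c | head -n 20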

Thanks!