PVE6 slab cache grows until VMs start to crash

small update:
on the virtual PVE node the corosync process was finally killed by the OOM killer, so I had to reboot:

Bash:
grep oom-kill /var/log/messages | tail -n 1
Oct  1 09:50:53 vspve4999 kernel: [419571.923228] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/corosync.service,task=corosync,pid=1286,uid=0

overview of the growing slab over a week:
[image: 1569922478095.png]
corresponding uptime:
[image: 1569922833617.png]
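
(for reference, the raw numbers behind such a graph can be read straight from the kernel; a minimal sketch using standard tools:)

Bash:
# slab totals as reported by the kernel (values in kB)
grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo
# individual caches, largest first
slabtop -o -s c | head -n 15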

you may find all performance data collected from this node over a week in the attached pdf.
 
The virtual pve-node was running with the mainline-kernel over the weekend?
That would indicate that the issue is probably in the upstream kernel - and not introduced by Ubuntu's or our patchset.

any chance you could try running the node with an older PVE-kernel (e.g. pve-kernel-5.0.12-1-pve)?
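
(a minimal sketch of the downgrade, assuming the package is still available from the repository:)

Bash:
apt install pve-kernel-5.0.12-1-pve
# pick the 5.0.12-1-pve entry under "Advanced options" in the
# GRUB boot menu on the next boot
reboot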

Thanks for reporting back!
 
yes, the node ran with the mainline kernel over the weekend.

the latest step today was to revert to the latest pve-kernel (5.0.21-6), which is a bit newer than the previous pve-kernel (5.0.21-2).
tl;dr: does the changelog of the new kernel suggest this issue would be addressed?

oh, and I could not attach the pdf... it was too large.
you may find it here:
http://share.kohly.de/index.php/s/6sRi365ory7tEfA
 
the kernel "PVE 5.0.21-6" did not change anything regarding the usage of slab.
i just have installed the pve-kernel-5.0.12-1-pve as suggested and rebooted the virtual pve node.
 
hmm... one other user who experiences this reported that the issue was not present after booting 5.0.21-6 initially (which made me hope it might be fixed in that version).

Thanks for reporting back!
 
small update:
the graph with 5.0.21-6:
[image: 1570009474804.png]
the graph with 5.0.12-1:
[image: 1570009567035.png]
 

Attachments

  • 1570009545612.png (20.6 KB)
thanks for your research!

yes, we are using cmk for monitoring, and yes, we use ipmi to monitor the hardware.

Bash:
root@vspve4999:~# aptitude search ipmi | grep ^i
i  freeipmi - GNU implementation of the IPMI protocol
i A freeipmi-bmc-watchdog - GNU implementation of the IPMI protocol - BMC watchdog
i A freeipmi-common - GNU implementation of the IPMI protocol - common files
i A freeipmi-ipmidetect - GNU IPMI - IPMI node detection tool
i  freeipmi-tools - GNU implementation of the IPMI protocol - tools
i  ipmitool - utility for IPMI control with kernel driver or LAN interface (daemon)
i A libfreeipmi17 - GNU IPMI - libraries
i A libipmiconsole2 - GNU IPMI - Serial-over-Lan library
i A libipmidetect0 - GNU IPMI - IPMI node detection library
root@vspve4999:~# aptitude remove freeipmi freeipmi-bmc-watchdog freeipmi-common freeipmi-ipmidetect freeipmi-tools ipmitool libfreeipmi17 libipmiconsole2 libipmidetect0
<snip>
<snap>
root@vspve4999:~# reboot; exit

so we will see...
 
Can you try unloading all the ipmi_ kmods? (lsmod | grep ipmi_ -> rmmod ...)
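
A minimal sketch of that (rmmod refuses modules that are still in use, so the core ipmi_msghandler goes last):

Bash:
lsmod | grep ipmi_
# unload dependents first, ipmi_msghandler last
for m in ipmi_ssif ipmi_si ipmi_devintf ipmi_watchdog ipmi_msghandler; do
    rmmod "$m" 2>/dev/null
done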

Result over here:

[image: Screenshot 2019-10-02 at 22.11.06.png]


But as just said to someone else, I suspect we're chasing more than one kernel memleak ...
 
this is a great idea, but this node is a virtual machine, so there are no ipmi modules loaded at all...

Bash:
root@vspve4999:~# lsmod | grep ipmi_
root@vspve4999:~#
 
Ah. So then you had ipmitool installed but it could never do anything, right?
Yeah so we are indeed chasing more than one leak there...
 
Can you run execsnoop for a few minutes and post what it reports?
My other wild guess is stopping pve-firewall.service.
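
Stopping the firewall service for the test would look like this (a sketch; pve-firewall.service is the standard unit name on PVE):

Bash:
systemctl stop pve-firewall.service
systemctl status pve-firewall.service   # confirm it is inactive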
 
sorry, I don't know 'execsnoop', please explain what to do...
I stopped pve-firewall.service just now.
 
sorry, I don't know 'execsnoop', please explain what to do...

The packages to install would be `bpfcc-tools` and `pve-headers`; then you'll have execsnoop-bpfcc.
You can just run it for a few minutes (preferably as `execsnoop-bpfcc -tx`); it will show which programs get executed.
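
Roughly like this (a sketch; using `timeout` is just one way to bound the capture):

Bash:
apt install bpfcc-tools pve-headers
# -t adds timestamps, -x also shows failed exec()s; capture ~5 minutes
timeout 300 execsnoop-bpfcc -tx > execsnoop.out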
 
hopefully about five minutes are enough; please tell me if you need more.
 

Attachments

  • execsnoop-vslnx4999.out.txt (73.4 KB)
hm - apart from the cmk-agent invocations and some regular timers from PVE I don't see too much going on on that host ...
Would it be possible to disable the cmk-agent service/socket on the host and see whether there's any effect? (you can continue monitoring with slabtop)
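
A sketch of that (the exact unit names vary between Checkmk agent versions, so check first):

Bash:
# find the actual unit names - they differ between agent installs
systemctl list-units --all 'check*mk*'
# ASSUMPTION: a socket-activated agent named check_mk.socket
systemctl disable --now check_mk.socket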

Thanks!
 
