PVE6 slab cache grows until VMs start to crash

aflott

New Member
Sep 23, 2019
Hello,
We currently have two PVE6 setups that suffer from the same problem: the SLAB cache grows steadily until the VMs start to crash.
Both were installed fresh from the PVE6 ISO. The first is a single node, the second setup is a three-node cluster.
On the single node, VM storage is plain local Linux LVM; the cluster uses local ZFS. The SLAB cache grows constantly on both setups and only a reboot fixes the issue. Dropping the kernel caches via echo 1/2/3 > /proc/sys/vm/drop_caches has no effect on the SLAB cache, while the ZFS ARC gets purged as expected.
Checking the running processes shows no oddities either, so it is likely a kernel issue. The current kernel is 5.0.21-1-pve and we try to follow updates as quickly as possible.
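For reference, one way to watch the numbers before and after dropping the caches (a minimal sketch, using the standard /proc/meminfo fields and, on the ZFS nodes, the arcstats kstat):

# grep -E 'Slab|SReclaim|SUnreclaim' /proc/meminfo
# sync; echo 3 > /proc/sys/vm/drop_caches
# grep -E 'Slab|SReclaim|SUnreclaim' /proc/meminfo
# awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats

On our setups the Slab value barely moves, while the ARC size (printed in bytes by the last command) shrinks as expected.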

Has anyone experienced the same issue? Are there any known leaks or issues with the 5.x kernel that could explain this behavior?

Kind regards
-Alexander
 
Hello spirit,
As requested, here is the output of /proc/slabinfo before and after dropping the caches with:

# echo 1 > /proc/sys/vm/drop_caches
# echo 2 > /proc/sys/vm/drop_caches
# echo 3 > /proc/sys/vm/drop_caches

Just to be sure.
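In case anyone wants to reproduce the comparison, capturing a snapshot around the three echo commands above and diffing is enough (just a sketch, the file names are arbitrary):

# cat /proc/slabinfo > slabinfo-before.txt
# echo 1 > /proc/sys/vm/drop_caches; echo 2 > /proc/sys/vm/drop_caches; echo 3 > /proc/sys/vm/drop_caches
# cat /proc/slabinfo > slabinfo-after.txt
# diff slabinfo-before.txt slabinfo-after.txt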

Kind regards
-Alexander
 

Attachments

That's strange that inode_cache and dentry are not reduced after drop_caches ...?

If I count active_objs * size:

dentry: 2759 MB
inode_cache: 4864 MB

Also, the number of objects is quite huge. (Do you use only VMs, or also CTs?)
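For reference, the totals above are active_objs * objsize from /proc/slabinfo; a quick way to compute them per cache (just a sketch, and objsize does not include per-slab overhead, so real usage can be a bit higher):

# awk 'NR>2 {printf "%-24s %8.0f MB\n", $1, $2*$4/1048576}' /proc/slabinfo | sort -k2 -rn | head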
 
We don't run PVE containers. The only thing we do is back up all VMs every night to an NFS share via Proxmox's backup mechanism.
 
Current kernel is 5.0.21-1

could you try to reproduce the issue with:
* 5.0.21-2 (currently available on pvetest - http://download.proxmox.com/debian/...64/pve-kernel-5.0.21-2-pve_5.0.21-4_amd64.deb)
* an older 5.0 kernel (e.g. pve-kernel-5.0.12-1-pve)

On systems not using ZFS and LXC containers it might also be worth trying to boot the Ubuntu Mainline Kernel corresponding to 5.0.21 - https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.0.21/
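For the mainline build it should be enough to grab the generic amd64 image and modules .debs from that page and install them with dpkg (a rough sketch, the exact file names differ per build):

# dpkg -i linux-image-unsigned-5.0.21-*-generic_*_amd64.deb linux-modules-5.0.21-*-generic_*_amd64.deb
# reboot
# uname -r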


Since there are more people experiencing this - what hardware are you running?
Any other specifics to your setup?

Thanks!
 
Hi,
First: thank you for your support!
Second: as a colleague of Alexander, I'd like to provide some additional information:

By the time Alexander told you the nodes were running kernel 5.0.21-1, I had already updated them to 5.0.21-2.
So the problem is reproducible on 5.0.21-2.

At the moment I've updated one of the nodes to the 5.0.21-4 test kernel mentioned above.
Since this is a VM with only 8 GB of RAM, we will see whether this helps within a very short time, perhaps a few hours from now.

The hardware we use is:
  • one DELL VTRX (with two nodes in it)
  • one SUN FIRE X4270
  • one Qemu VM (on the X4270 Server)
  • one SUN FIRE X4170
The two DELL nodes, the X4270 and the Qemu VM form a four-node cluster.
All of them except the VM have a local ZFS volume for storing the VM disks.
ZFS replication is not in use; backups are stored on an NFS share.

The X4170 is a standalone node.
This node has no ZFS, only local LVM storage for the disks.
No backup is done here.

All nodes show the same behavior.

Please let us know if we can provide any further information to figure this out.

Kind Regards
Christian
 
Update:
The VM with kernel 5.0.21-4 shows exactly the same behavior as before:
slab-201909261305.png

I will now (try to) update to the Ubuntu Mainline Kernel as suggested.

Kind Regards
Christian
 
Thanks for testing!

Hardware sounds diverse enough (and it happening on the qemu-vm also makes it sound less likely to be related to a particular piece of hardware - could you provide the config of the VM just in case?).

Thanks for trying the mainline kernel - if it does not happen there, we can narrow it down further by trying older pve-kernels.
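For the config, the output of this on the PVE host is enough:

# qm config <vmid>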
 
Hopefully I understood correctly; here is the configuration of the VM:
agent: 1
balloon: 0
bootdisk: virtio0
cores: 2
ide2: none,media=cdrom
memory: 8192
name: vspve4999.ise.int
net0: virtio=22:1F:1D:AC:EB:CC,bridge=vmbr0,firewall=1
net1: virtio=96:44:AF:0C:E1:AC,bridge=vmbr1,firewall=1,tag=2
numa: 0
ostype: l26
protection: 1
scsihw: virtio-scsi-pci
smbios1: uuid=9cb39e0f-db86-4f3d-9c7e-afda45046515
sockets: 1
virtio0: pspve4004-data:vm-4999-disk-0,format=raw,size=16G
vmgenid: d3f5c82d-7d84-490b-95dd-24b9a6f112f5
 
I have installed slabratetop-bpfcc now, but this tool is completely new to me.
Which command would you prefer for capturing the right amount of data? (e.g. slabratetop-bpfcc -C -r 50 5 10)
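If I read the tool's help right, that call would keep the screen output (-C, don't clear), show up to 50 rows (-r 50) and take ten samples at 5-second intervals, so redirecting it to a file should give a usable capture (just my understanding, the file name is arbitrary):

# slabratetop-bpfcc -C -r 50 5 10 > slabratetop.out.txt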
 
Small update:
It seems like the slab is growing a little more slowly now (5.0.21-050021-generic #201906040731).
The peaks are: first, the installation of the mainline kernel; second, the installation of the headers.
slab-201909261524.png
 
slabratetop-bpfcc -C -r 50 5 10
sounds good (I'm not too familiar with the tool either, but hopefully we'll see a pattern in the allocations).

If possible, please let the mainline kernel run a bit longer and continue to monitor the slab growth.
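A simple loop like this would be enough to log the growth over time (just a sketch, the log file and interval are arbitrary):

# while true; do echo "$(date -Is) $(grep Slab: /proc/meminfo)" >> /root/slab-growth.log; sleep 300; done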

Is the graph from the KVM guest?
Do guests run on the hosts where the slab grows?

Thanks!
 
Please find attached the file slabratetop.out.txt.
Do not hesitate to request further data with other parameters...

I will keep the VM running and monitor the slab growth as long as needed, or until it crashes due to OOM.

Yes, the graph is from this KVM-guest PVE node.

Yes, all physical nodes have (more or less) guests on them, while the PVE VM is the only KVM guest on the X4270 node.

You're welcome!
 

Attachments

Small update:
Four-hour graph of the slab from the virtual PVE:
1569559307962.png

One-week graph:
1569559277976.png
 
I'm curious to know what this "names_cache" slab at the top of your stats is.

I'm seeing it on my Proxmox 6 test nodes, but not at all on my Proxmox 5 nodes.

I can confirm that the names_cache slab exists only on the PVE6 nodes.
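For anyone who wants to check their own nodes, a quick look is enough:

# grep names_cache /proc/slabinfo

This returns a line on the PVE6 nodes and nothing on PVE5.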
 
