PVE 5.1 - Memory leak

A new kernel is available on pvetest (pve-kernel-4.13.8-3-pve with version 4.13.8-29); an installation example follows the change list:
  • cherry-pick KVM fix for old CPUs
  • cherry-pick / backport IB fixes
  • cherry-pick vhost perf regression and mem-leak fix
  • cherry-pick final Windows BSOD fix
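
For anyone who wants to try it straight away, installation on a PVE 5.x (Debian Stretch) host looks roughly like this; the pvetest repository line below is the usual one for Stretch, but double-check it against your own setup before enabling it:
Code:
  # /etc/apt/sources.list.d/pvetest.list (for testing only; remove or comment out afterwards):
  deb http://download.proxmox.com/debian/pve stretch pvetest

  apt-get update
  apt-get install pve-kernel-4.13.8-3-pve
  # reboot so the host actually runs the new kernel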
 
Many thanks, the patches appear to have worked for us!

We last rebooted the host on Sunday night, so the memory utilisation graph doesn't show a massive dip, but it still lost approximately 8 GB during the day when running pve-kernel-4.13.8-2-pve:
[Image: kvm5a_pve-kernel-4.13.8-2-pve.jpg]

Note that the max and last available memory values dropped by 2723.84 MB.


When running pve-kernel-4.13.8-3-pve:
[Image: kvm5a_pve-kernel-4.13.8-3-pve.jpg]

Note that the max and last available memory values only dropped by 348.16 MB.


We previously had an issue with the Intel ixgbe driver on PVE 4.4 when running two 10GbE ports as an active/backup bond connected to a legacy Linux bridge (i.e. not OVS); a sketch of that layout follows the snippet below. The ports would flap until we disabled the offload acceleration features on the physical ports themselves:
Code:
/etc/rc.local:
  # Buggy Intel network card drivers:
  ethtool -K eth0 tso off gso off gro off;
  ethtool -K eth1 tso off gso off gro off;
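
For context, the bond/bridge layout in question looks roughly like the following /etc/network/interfaces sketch; the interface names, addresses and bond options here are illustrative rather than copied from our actual configuration:
Code:
  auto bond0
  iface bond0 inet manual
      bond-slaves eth0 eth1
      bond-mode active-backup
      bond-miimon 100

  auto vmbr0
  iface vmbr0 inet static
      address 192.0.2.10
      netmask 255.255.255.0
      gateway 192.0.2.1
      bridge-ports bond0
      bridge-stp off
      bridge-fd 0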

I assume the updated ixgbe driver (v5.3.3) in 4.13.8-27 and later, and/or other kernel changes, fixed that issue, so we no longer have to disable the 'tcp-segmentation-offload', 'generic-segmentation-offload' or 'generic-receive-offload' acceleration features. This in turn results in fewer buffer overruns, and hence less memory leaking, than when we were running 4.13.4-26.
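
For anyone who wants to double-check, the current state of those three offload features can be read back with ethtool (the interface names are of course system specific):
Code:
  # show only the three offload settings we used to disable:
  ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'
  ethtool -k eth1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'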

To summarise:
  • 4.13.8-27 includes updated ixgbe drivers, which make the underlying memory leak far less likely to be triggered
  • 4.13.8-29 includes patches which address a memory leak in vhost network traffic when buffers overrun (a quick check of the running kernel build follows below)
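
A quick way to confirm which kernel ABI is booted and which pve-kernel build is installed (the versions shown in the comments are simply the ones discussed in this thread):
Code:
  uname -r
  # e.g. 4.13.8-3-pve
  pveversion -v | grep pve-kernel
  # e.g. pve-kernel-4.13.8-3-pve: 4.13.8-29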

Again, many thanks for the prompt attention and for building a kernel with the patches!

PS: The memory leaks were relatively gradual, so I'll try to remember to post a memory/network utilisation graph that demonstrates the problem more clearly in a couple of days.
 
As promised, herewith an updated graphical representation of the memory leak that was fixed with 4.13.8-29:
[Image: kvm5a_memory_leak_kernels.jpg]
  • Beginning until 2017-11-16: 4.4.67-1-pve (no problem)
  • 2017-11-16 until 2017-12-01: 4.13.4-1-pve (massive memory leak)
  • 2017-12-01 until 2017-12-05: 4.13.8-2-pve (still leaking, but much more slowly, with the offloading features re-enabled)
  • 2017-12-05 until end: 4.13.8-3-pve (no problem)
We live migrated the virtual routers and restarted the host on the following dates (an example migration command follows the list):
  • 2017-11-16 00:43
  • 2017-11-24 00:30
  • 2017-11-30 00:37
  • 2017-12-01 00:34
  • 2017-12-05 00:35
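
For reference, moving a guest off the host before each reboot is just a standard Proxmox online migration; the VM ID and target node name below are placeholders:
Code:
  # live migrate a running guest to another cluster node, then reboot this host:
  qm migrate 101 kvm5b --online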

Herewith a zoomed-in view covering 30 November to 6 December:
[Image: kvm5a_memory_leak_kernels_zoom.jpg]

PS: No leaks during periods of low network traffic utilisation.
 