PVE 5.1 - Memory leak

A new kernel is available on pvetest (pve-kernel-4.13.8-3-pve with version 4.13.8-29):
  • cherry-pick KVM fix for old CPUs
  • cherry-pick / backport IB fixes
  • cherry-pick vhost perf regression and mem-leak fix
  • cherry-pick final Windows BSOD fix
 
Many thanks, the patches appear to have worked for us!

We last rebooted the host on Sunday night, so the memory utilisation graph doesn't show a massive dip, but still one of approximately 8GB during the day when running pve-kernel-4.13.8-2-pve:
kvm5a_pve-kernel-4.13.8-2-pve.jpg

Note: max and last available memory reduced by 2723.84 MB.


When running pve-kernel-4.13.8-3-pve:
kvm5a_pve-kernel-4.13.8-3-pve.jpg

Note: max and last available memory only reduced by 348.16 MB.
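To put the two figures side by side, here is a quick sketch of the arithmetic. The function name `leak_reduction` is made up for illustration, and the comparison is only rough, since the two graphs do not necessarily cover identical periods or traffic levels:

```shell
# Hypothetical helper: compares two "memory reduced by" figures (in MB).
# $1 = reduction seen on the older kernel, $2 = reduction seen on the newer one.
leak_reduction() {
  awk -v prev="$1" -v curr="$2" 'BEGIN {
    diff = prev - curr
    printf "%.2f MB less (%.1f%% reduction)\n", diff, diff / prev * 100
  }'
}

# Figures taken from the two graphs above.
leak_reduction 2723.84 348.16
```

With the numbers from the graphs this prints `2375.68 MB less (87.2% reduction)`.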


We previously had an issue with the Intel ixgbe driver on PVE 4.4 when running two 10GbE ports as an active/backup bond, where the bond was then connected to a legacy Linux bridge (i.e. not OVS). The ports would flap until we disabled the offload acceleration settings on the physical ports themselves:
Code:
/etc/rc.local:
  # Buggy Intel network card drivers:
  ethtool -K eth0 tso off gso off gro off;
  ethtool -K eth1 tso off gso off gro off;

I assume the updated ixgbe drivers (v5.3.3) in 4.13.8-27 and later, and/or kernel changes, fixed the issue, so we no longer have to disable the 'tcp-segmentation-offload', 'generic-segmentation-offload' or 'generic-receive-offload' acceleration features. This in turn results in fewer buffer overruns, and hence less memory leaking, than when we were running 4.13.4-26.
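For anyone wanting to check which state their ports are in, here is a minimal sketch. The helper name `offload_state` is made up, and the sample lines fed to it are only illustrative of the shape of `ethtool -k` output, not captured from our host:

```shell
# Hypothetical helper: reads `ethtool -k <iface>` output on stdin and keeps
# only the three offload features discussed above.
offload_state() {
  grep -E '^[[:space:]]*(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload):'
}

# Illustrative sample input; on a real host you would instead run:
#   ethtool -k eth0 | offload_state
printf '%s\n' \
  'rx-checksumming: on' \
  'tcp-segmentation-offload: on' \
  'generic-segmentation-offload: on' \
  'generic-receive-offload: off' | offload_state
```

If the features were previously switched off via /etc/rc.local as shown earlier, removing those lines and running `ethtool -K eth0 tso on gso on gro on` (and the same for eth1) should turn them back on, assuming the driver behaves for you as it now does for us.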

To summarise:
  • 4.13.8-27 includes updated ixgbe drivers which reduce how often the underlying memory leak occurs
  • 4.13.8-29 includes patches which address a memory leak on vhost network traffic when buffers overrun

Again, many thanks for the prompt attention and for building a kernel with the patches!

PS: The memory leaks were relatively gradual so I'll try to remember to post a memory/network utilisation graph which demonstrates the problem in a more pronounced way in a couple of days.
 
As promised, herewith an updated graph of the memory leak that was fixed with 4.13.8-29:
kvm5a_memory_leak_kernels.jpg
  • Beginning until 2017-11-16: 4.4.67-1-pve (no problem)
  • 2017-11-16 until 2017-12-01: 4.13.4-1-pve (massive memory leak)
  • 2017-12-01 until 2017-12-05: 4.13.8-2-pve (still leaking, but much less, with offloading features enabled)
  • 2017-12-05 until end: 4.13.8-3-pve (no problem)
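As a rough way to see where a given host falls in the timeline above, something like the following could be used. The function name `leak_status` is made up, and the mapping only reflects what we observed on this one host, so treat it as a sketch rather than a definitive list of affected kernels:

```shell
# Hypothetical helper: maps a kernel release string (as printed by `uname -r`)
# to the behaviour observed on our host, per the timeline above.
leak_status() {
  case "$1" in
    4.4.67-1-pve) echo "no problem" ;;
    4.13.4-1-pve) echo "massive memory leak" ;;
    4.13.8-2-pve) echo "still leaking, but much less" ;;
    4.13.8-3-pve) echo "no problem" ;;
    *)            echo "not covered by this report" ;;
  esac
}

# Classify the kernel this host is currently running.
leak_status "$(uname -r)"
```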
We live migrated virtual routers and restarted the host on the following dates:
  • 2017-11-16 00:43
  • 2017-11-24 00:30
  • 2017-11-30 00:37
  • 2017-12-01 00:34
  • 2017-12-05 00:35

Herewith a zoom between the 30th of November and the 6th of December:
kvm5a_memory_leak_kernels_zoom.jpg

PS: No leaks during periods of low network traffic utilisation.
 