PVE 5.1 - Memory leak

David Herselman · Dec 4, 2017

Thanks, this thread however primarily deals with vhost/tap/tun memory leak issues. Proposed patches have been acknowledged and are awaiting review and testing.

Fabian, any chance of someone at Proxmox applying these to a test kernel?

https://patchwork.codeaurora.org/patch/393569/
https://patchwork.codeaurora.org/patch/393565/
https://patchwork.codeaurora.org/patch/393567/

fabian · Dec 4, 2017

there will be something available later today!

fabian · Dec 4, 2017

new kernel is available on pvetest (pve-kernel-4.13.8-3-pve with version 4.13.8-29):

cherry-pick KVM fix for old CPUs
cherry-pick / backport IB fixes
cherry-pick vhost perf regression and mem-leak fix
cherry-pick final Windows BSOD fix

David Herselman · Dec 5, 2017

Many thanks, the patches appear to have worked for us!

We last rebooted the host on Sunday night, so the memory utilisation graph doesn't show a massive dip, but the dip is still approximately 8GB during the day when running pve-kernel-4.13.8-2-pve:

Note max and last available memory reduced by 2723.84 MB.

When running pve-kernel-4.13.8-3-pve:

Note max and last available memory only reduced by 348.16 MB.
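
For anyone who wants to follow the same figure without the graphs, here is a minimal sketch that reads MemAvailable from /proc/meminfo and puts two measured drops in proportion. The function names `mem_available_mb` and `reduction_pct` are hypothetical helpers, not part of any Proxmox tooling, and dividing the kB value by 1024 is assumed to match the MB unit the graphs use.

```shell
#!/bin/sh
# Hypothetical helper: print MemAvailable in MB (kB / 1024) from a
# meminfo-style file. Defaults to /proc/meminfo; the optional argument
# only exists so the function can be pointed at a saved copy.
mem_available_mb() {
    awk '/^MemAvailable:/ { printf "%.2f\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: percentage by which the drop in available memory
# shrank between two runs (arguments: drop before, drop after, same unit).
reduction_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}
```

Calling `mem_available_mb` once a minute, for example from a `while sleep 60` loop, gives a crude log of available memory over the day. `reduction_pct 2723.84 348.16` shows that the drop under 4.13.8-3-pve is about 87% smaller than under 4.13.8-2-pve.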

We previously had an issue with the Intel ixgbe driver on PVE 4.4 when running two 10GbE ports as an active/backup bond, where the bond was then connected to a legacy Linux bridge (i.e. not OVS). The ports would flap until we disabled the offload acceleration settings on the physical ports themselves:

Code:

/etc/rc.local:
  # Buggy Intel network card drivers:
  ethtool -K eth0 tso off gso off gro off;
  ethtool -K eth1 tso off gso off gro off;

I assume the updated ixgbe drivers (v5.3.3) in 4.13.8-27 and later and/or kernel changes fixed the issue, so we no longer have to disable the 'tcp-segmentation-offload', 'generic-segmentation-offload' or 'generic-receive-offload' acceleration features. This in turn results in fewer buffer overruns, and hence less memory leaking, than when we were running 4.13.4-26.
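
To confirm which state the three offload features named above are actually in, the output of `ethtool -k` (lowercase k, which lists features rather than setting them) can be filtered. A minimal sketch follows; `offload_state` is a hypothetical helper name, and `eth0` is simply the interface name from the rc.local snippet above.

```shell
#!/bin/sh
# Hypothetical helper: reduce `ethtool -k <iface>` output (read from stdin)
# to the three offload features discussed above, printed as name=state.
offload_state() {
    awk -F': *' '
        $1 ~ /^(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload)$/ {
            print $1 "=" $2
        }'
}

# Example (needs ethtool and a real interface):
#   ethtool -k eth0 | offload_state
```

Reading from stdin keeps the helper usable on saved `ethtool -k` output, for instance when comparing hosts that are still running the old rc.local workaround with hosts that are not.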

To summarise:

4.13.8-27 includes updated ixgbe drivers which reduce how often the underlying memory leak problem occurs
4.13.8-29 includes patches which address a memory leak on vhost network traffic when buffers overrun

Again, many thanks for the prompt attention and for building a kernel with the patches!

PS: The memory leaks were relatively gradual so I'll try to remember to post a memory/network utilisation graph which demonstrates the problem in a more pronounced way in a couple of days.

David Herselman · Dec 11, 2017

As promised, herewith an updated graphical representation of the memory leak that was fixed with 4.13.8-29:

Beginning until 2017-11-16: 4.4.67-1-pve (no problem)
2017-11-16 until 2017-12-01: 4.13.4-1-pve (massive memory leak)
2017-12-01 until 2017-12-05: 4.13.8-2-pve (still leaking, but much less, with offloading features enabled)
2017-12-05 until end: 4.13.8-3-pve (no problem)

We live migrated virtual routers and restarted the host on the following dates:

2017-11-16 00:43
2017-11-24 00:30
2017-11-30 00:37
2017-12-01 00:34
2017-12-05 00:35

Herewith a zoom between the 30th of November and the 6th of December:

PS: No leaks during periods of low network traffic utilisation.
