[SOLVED] Memory leak after updating PVE 5 to 6

kolesya

Well-Known Member
Howdy. We've updated our cluster from PVE version 5 to 6. Some time later one of the nodes got an unexpected reboot. We then noticed that one VM (KVM process) on it was eating about 1% of the node's memory every 3 hours. We are not using containers, Ceph, ZFS, etc. All other nodes and VMs look fine. All VM images are stored on FreeNAS connected via NFS. We tried switching ballooning off/on, but it didn't help. Memory usage inside the VM looks OK.
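For reference, this is roughly how we toggled ballooning from the host (just a sketch, assuming the affected VM is VMID 110; the VM likely needs a power cycle for the change to fully apply):

Code:
# disable the balloon device for the VM (0 = no ballooning)
qm set 110 --balloon 0
# re-enable it later with the previous target in MiB
qm set 110 --balloon 20480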
 
pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-3
pve-kernel-helper: 6.2-3
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-17
pve-kernel-4.15.18-28-pve: 4.15.18-56
pve-kernel-4.15.18-26-pve: 4.15.18-54
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-8
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
qm config 110
agent: 1
balloon: 20480
boot: cdn
bootdisk: virtio0
cores: 8
ide2: none,media=cdrom
memory: 24576
name: v2
net0: virtio=A2:C7:C2:38:77:EE,bridge=vmbr2
numa: 0
ostype: l26
parent: beforeUpdate
scsihw: virtio-scsi-pci
smbios1: uuid=c59a6b17-7832-4522-a137-9278149eea41
sockets: 1
virtio0: sas-ssd:110/vm-110-disk-0.qcow2,size=100G
 
Which memory does the VM leak, RSS? Or does "just" the virtual one get bigger?
top -bn1 -p $(cat /run/qemu-server/110.pid)

What's running inside the VM? I.e., which distro/version and is there a guest-agent running?
It would be nice to know if there's anything that makes this VM special compared to the others, which are just fine.
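If you want a record of how it grows over time, a minimal sketch (the log path is just an example) would be to sample the process periodically:

Code:
# sample elapsed time, VIRT and RSS (KiB) of the kvm process for VM 110 every 10 minutes
while true; do
    ps -o etime=,vsz=,rss= -p "$(cat /run/qemu-server/110.pid)" >> /root/vm110-mem.log
    sleep 600
done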
 
Code:
root@node5:~# top -bn1 -p $(cat /run/qemu-server/110.pid)
top - 13:02:02 up  3:46,  1 user,  load average: 3,85, 3,21, 3,08
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15,9 us,  2,6 sy,  0,0 ni, 77,3 id,  2,1 wa,  0,0 hi,  2,1 si,  0,0 st
MiB Mem :  96661,7 total,  57046,6 free,  38735,2 used,    879,9 buff/cache
MiB Swap:   8192,0 total,   8192,0 free,      0,0 used.  57137,6 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 5753 root      20   0   26,3g  21,2g  10040 S 233,3  22,4 306:01.62 kvm

inside vm:

lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 9.12 (stretch)
Release: 9.12
Codename: stretch
 
So migrating this VM to another node and back 2-3 times per week looks like the only solution for us, at least for now.
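The workaround itself is just an online migration back and forth, roughly like this (a sketch; node6 is only an example target, the VM actually lives on node5):

Code:
# move VM 110 to another node while it keeps running
qm migrate 110 node6 --online
# ... later, run the reverse from node6
qm migrate 110 node5 --online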
 
No guest agent running in that VM? At least you have agent: 1 set in the config.

Ballooning could be a real suspect here, but you said you already tried disabling it.
Do the other VMs also use VirtioBlk for their disks (virtioX vs scsiX)?
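A quick way to compare that across guests (assuming the standard config location on the node) would be:

Code:
# list the disk bus (virtioX/scsiX/ideX/sataX) entries of every VM config on this node
grep -E '^(virtio|scsi|ide|sata)[0-9]+:' /etc/pve/qemu-server/*.conf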
 
inside vm:
ps aux | grep qemu
root 800 0.0 0.0 22516 2660 ? Ss 02:50 0:18 /usr/sbin/qemu-ga --daemonize -m virtio-serial -p /dev/virtio-ports/org.qemu.guest_agent.0

All other VMs also use VirtioBlk.
 
QEMU VM memory leaks normally happen "on the outside", i.e., from something like disk I/O threads, the host-side end of the QEMU guest agent, or the like.
So things to try out could be:
* disabling the guest agent temporarily.
* moving the disk to another bus/controller (VirtIO SCSI; this may need adaptations to how the VM mounts the disk, though, as it switches from /dev/vda to /dev/sda)
* switching machine type to q35

These are just guesses, but in cases where only a single VM has issues they may be worth trying (see the sketch below).
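A rough sketch of those changes from the host, assuming VMID 110 and that the VM is powered off while you apply them; the disk reassignment is only illustrative, and the guest may need its mounts/bootloader adjusted for the /dev/vda to /dev/sda change:

Code:
# 1) temporarily disable the guest agent option
qm set 110 --agent 0

# 2) re-attach the existing disk as VirtIO SCSI instead of VirtIO block
qm set 110 --delete virtio0
qm set 110 --scsi0 sas-ssd:110/vm-110-disk-0.qcow2
qm set 110 --bootdisk scsi0

# 3) switch the machine type to q35
qm set 110 --machine q35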
 
OK, it's a very special VM and we can't switch it off every day to apply config changes. I'll try everything you're advising ASAP and will give feedback here.
 
But anyway, there were no problems with the same VM config before the PVE update to the 6.x branch.
 

You could also clone it and experiment with the clone (if that even has the same issues).
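A minimal sketch of that, assuming a free VMID (9110) and a name for the test clone:

Code:
# full clone of VM 110 to experiment with (new VMID/name are just examples)
qm clone 110 9110 --name v2-test --full
qm start 9110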
 
