[SOLVED] Memory leak after updating PVE 5 to 6

kolesya

Well-Known Member
Howdy. We've updated our cluster from PVE version 5 to 6. Some time later one of the nodes got an unexpected reboot. We then noticed that one VM (KVM process) on it was eating about 1% of the node's memory every 3 hours. We are not using containers, Ceph, ZFS, etc. All other nodes and VMs look fine. All VM images are stored on FreeNAS connected via NFS. We tried switching ballooning off/on, but it didn't help. Memory usage inside the VM looks OK.
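For reference, this is roughly how we toggled ballooning from the host (just a sketch, assuming the affected VM is VMID 110; the VM likely needs a power cycle for the change to fully apply):

Code:
# disable the balloon device for the VM (0 = no ballooning)
qm set 110 --balloon 0
# re-enable it later with the previous target in MiB
qm set 110 --balloon 20480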
 
pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-3
pve-kernel-helper: 6.2-3
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-17
pve-kernel-4.15.18-28-pve: 4.15.18-56
pve-kernel-4.15.18-26-pve: 4.15.18-54
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-8
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
qm config 110
agent: 1
balloon: 20480
boot: cdn
bootdisk: virtio0
cores: 8
ide2: none,media=cdrom
memory: 24576
name: v2
net0: virtio=A2:C7:C2:38:77:EE,bridge=vmbr2
numa: 0
ostype: l26
parent: beforeUpdate
scsihw: virtio-scsi-pci
smbios1: uuid=c59a6b17-7832-4522-a137-9278149eea41
sockets: 1
virtio0: sas-ssd:110/vm-110-disk-0.qcow2,size=100G
 
Which memory does the VM leak, RSS? Or does "just" the virtual one get bigger?
top -bn1 -p $(cat /run/qemu-server/110.pid)

What's running inside the VM? I.e., which distro/version and is there a guest-agent running?
It would be nice to know if there's anything that makes this VM special compared to the others, which are just fine.
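If you want a record of how it grows over time, a minimal sketch (the log path is just an example) would be to sample the process periodically:

Code:
# sample elapsed time, VIRT and RSS (KiB) of the kvm process for VM 110 every 10 minutes
while true; do
    ps -o etime=,vsz=,rss= -p "$(cat /run/qemu-server/110.pid)" >> /root/vm110-mem.log
    sleep 600
done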
 
Code:
root@node5:~# top -bn1 -p $(cat /run/qemu-server/110.pid)
top - 13:02:02 up  3:46,  1 user,  load average: 3,85, 3,21, 3,08
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15,9 us,  2,6 sy,  0,0 ni, 77,3 id,  2,1 wa,  0,0 hi,  2,1 si,  0,0 st
MiB Mem :  96661,7 total,  57046,6 free,  38735,2 used,    879,9 buff/cache
MiB Swap:   8192,0 total,   8192,0 free,      0,0 used.  57137,6 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 5753 root      20   0   26,3g  21,2g  10040 S 233,3  22,4 306:01.62 kvm

inside vm:

lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 9.12 (stretch)
Release: 9.12
Codename: stretch
 
So migrating this VM to another node and back 2-3 times per week looks like the only solution for us, at least for now.
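The workaround itself is just an online migration back and forth, roughly like this (a sketch; node6 is only an example target, the VM actually lives on node5):

Code:
# move VM 110 to another node while it keeps running
qm migrate 110 node6 --online
# ... later, run the reverse from node6
qm migrate 110 node5 --online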
 
No guest agent running in that VM? At least you have agent: 1 set in the config.

Ballooning could be a real suspect here, but you said you already tried disabling it.
Do the other VMs also use VirtioBlk for their disks (virtioX vs scsiX)?
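A quick way to compare that across guests (assuming the standard config location on the node) would be:

Code:
# list the disk bus (virtioX/scsiX/ideX/sataX) entries of every VM config on this node
grep -E '^(virtio|scsi|ide|sata)[0-9]+:' /etc/pve/qemu-server/*.conf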
 
inside vm:
ps aux | grep qemu
root 800 0.0 0.0 22516 2660 ? Ss 02:50 0:18 /usr/sbin/qemu-ga --daemonize -m virtio-serial -p /dev/virtio-ports/org.qemu.guest_agent.0

All other VMs also use VirtioBlk.
 
QEMU VM memory leaks normally happen "on the outside", i.e., from something like disk I/O threads, the host-side end of the QEMU guest agent, or the like.
So things to try out could be:
* disabling the guest agent temporarily.
* moving the disk to another bus/controller (VirtIO SCSI; this may need adaptations to how the VM mounts the disk, though, as it switches from /dev/vda to /dev/sda)
* switching machine type to q35

These are just guesses, but in cases where only a single VM has issues they may be worth trying (see the sketch below).
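A rough sketch of those changes from the host, assuming VMID 110 and that the VM is powered off while you apply them; the disk reassignment is only illustrative, and the guest may need its mounts/bootloader adjusted for the /dev/vda to /dev/sda change:

Code:
# 1) temporarily disable the guest agent option
qm set 110 --agent 0

# 2) re-attach the existing disk as VirtIO SCSI instead of VirtIO block
qm set 110 --delete virtio0
qm set 110 --scsi0 sas-ssd:110/vm-110-disk-0.qcow2
qm set 110 --bootdisk scsi0

# 3) switch the machine type to q35
qm set 110 --machine q35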
 
OK, it's a very special VM and we can't switch it off every day to apply config changes. I'll try everything you're advising ASAP and will give feedback here.
 
But anyway, there were no problems with the same VM config before the PVE update to the 6.x branch.
 

You could also clone it and experiment with the clone (if that even has the same issues).
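A minimal sketch of that, assuming a free VMID (9110) and a name for the test clone:

Code:
# full clone of VM 110 to experiment with (new VMID/name are just examples)
qm clone 110 9110 --name v2-test --full
qm start 9110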
 
