VM down because of OOM killer - finding the actual reason

Saahib
Hi,
I found that only VM on server is going down randomly. From dmesg, I can find :

Code:
[Thu Mar 23 02:51:34 2023] Out of memory: Killed process 538560 (kvm) total-vm:125506644kB, anon-rss:113157280kB, file-rss:3988kB, shmem-rss:4kB, UID:0 pgtables:237624kB oom_score_adj:0
[Sun Mar 26 05:09:06 2023] Out of memory: Killed process 1087978 (kvm) total-vm:127067788kB, anon-rss:113982728kB, file-rss:28kB, shmem-rss:0kB, UID:0 pgtables:240364kB oom_score_adj:0

This node has 128GB of RAM and only one VM, which originally had 125GB assigned to it; for now there is no other VM running on the node. The VM runs cPanel + CloudLinux with a couple of low-traffic websites.

The interesting part is that, from monitoring, I can see the total memory usage of the VM was no more than 12-15GB when it was killed by the OOM killer.

Can anyone help me figure out how to troubleshoot this, and how to avoid it happening in the future?

Code:
qm config 5001
agent: 1,freeze-fs-on-backup=0
boot: order=scsi0
cipassword: **********
ciuser: root
cores: 40
cpu: host
ipconfig0: ip=x.x.x.x,gw=x.x.x.1
localtime: 0
machine: q35
memory: 96000
meta: creation-qemu=7.1.0,ctime=1677322719
name: ugi-nl-cl
net0: virtio=E6:10:88:42:58:7A,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-5001-disk-0,cache=writeback,discard=on,size=665360M,ssd=1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=9163a2fc-5fc7-4be7-a844-26e6196798e1
sockets: 1
sshkeys:
vga: qxl
vmgenid: ee77a50f-a636-48a6-b112-2eaa71cd31ca
I have since reduced the assigned memory to ~96GB.
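(For reference, assuming the standard qm CLI, that change can be applied like this; the value is in MiB, matching the config above.)

Bash:
# reduce the VM's assigned memory to ~96 GB (qm expects MiB)
qm set 5001 --memory 96000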

Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-helper: 7.2-14
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-3
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.3
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20221111-2
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.11-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
 
Hello,

Thank you for sharing the output of pveversion and the VM config!

May I ask if you have swap enabled on that node (you can check with `cat /etc/fstab`)? If yes, I would try disabling swap on the node and see if that does the trick. (This has been the solution for some other users as well.)

To disable swap, follow these steps:
1. Issue the following command in the terminal:
Bash:
swapoff -a
2. Edit the `/etc/fstab` file and comment out the swap line as follows:
Bash:
#/dev/pve/swap none swap sw 0 0
3. Reboot the server.
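After the reboot, a quick check that swap is really off (using the standard util-linux tools) could look like this:

Bash:
# prints nothing if no swap device is active
swapon --show
# the "Swap:" line should show 0B total
free -h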
 
Thanks, yes, swap is enabled; I had heard that some swap helps keep Linux stable even if you have plenty of unused RAM. It would be helpful if someone could help me understand why this is happening: is it something inside the VM, or the host node itself?
 
You also enabled writeback as the cache mode. Keep in mind that every write the VM does will be cached in the host's RAM too, increasing the RAM overhead.
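A possible way to switch the disk back, assuming the usual qm syntax and reusing the drive string from your config above (the VM will likely need a full stop/start for the cache change to take effect):

Bash:
# set the disk cache mode to none; the other options are copied from the existing config
qm set 5001 --scsi0 local-lvm:vm-5001-disk-0,cache=none,discard=on,size=665360M,ssd=1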
 
The journal/dmesg should also contain more information about the memory situation at the time(s) the process was killed. The one line you posted already indicates that it was using 113GB of memory though, not 12-15GB. Note that "usage" from the guest's point of view is relatively meaningless (and might also differ depending on the guest OS, ..); what counts is how much is used on the hypervisor.
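Something along these lines should pull the full OOM report out of the kernel log, assuming systemd-journald is in use (adjust the time window to the timestamps you posted):

Bash:
# kernel messages around the first OOM kill, with some context
journalctl -k --since "2023-03-23 02:00" --until "2023-03-23 03:30" | grep -i -B 5 -A 30 "out of memory"
# or straight from the ring buffer, if it hasn't rotated out yet
dmesg -T | grep -i -A 30 "out of memory"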
 
The journal/dmesg should also contain more information about the memory situation at the time(s) the process was killed. The one line you posted already indicates that it was using 113GB of memory though, not 12-15GB. Note that "usage" from the guest's point of view is relatively meaningless (and might also differ depending on the guest OS, ..); what counts is how much is used on the hypervisor.
Yes, it's a basic cPanel setup with low-traffic websites; there is no way it can use that much memory. So what was causing that usage? Do you mean the writeback cache can eat that much RAM?
 
The interesting part is that, from monitoring, I can see the total memory usage of the VM was no more than 12-15GB when it was killed by the OOM killer.

Then your monitoring is wrong, as the OOM killer is reporting a very different value.
 
Also, memory used by the VM process doesn't necessarily mean memory actually used by the guest (OS). It could be that at some point it did use that much (also for caching inside the VM!), or that something touched a lot of small bits of memory distributed over most of the guest RAM, or ...
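One way to compare the two views, assuming the usual Proxmox pidfile location for QEMU guests:

Bash:
# host-side resident set size (RSS, in KiB) of the kvm process for VM 5001
ps -o pid,rss,comm -p "$(cat /var/run/qemu-server/5001.pid)"
# memory figures as Proxmox itself reports them (maxmem, balloon, ...)
qm status 5001 --verbose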
 
Also, memory used by the VM process doesn't necessarily mean memory actually used by the guest (OS). It could be that at some point it did use that much (also for caching inside the VM!), or that something touched a lot of small bits of memory distributed over most of the guest RAM, or ...
I just noticed that with writeback cache enabled, the VM uses a lot more memory than is shown inside the VM. E.g. currently the VM reports using only 1.6 GB RAM, but if I check the process on the host, it's about 33 GB. I then disabled writeback, and after a while the VM showed about 2GB inside while the process was about 3.5 GB. So it means the writeback cache really can eat a lot of RAM?
 
Jup. Like I said, this comes on top of what the guest OS is using. And if the host storage can't keep up with the writes, this can add up.

It's evident that it eats a lot of RAM; however, if I look at disk I/O wait on the host, it's almost zero.
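(For reference, one way to watch host-side I/O wait and per-device utilisation; iostat comes from the sysstat package, which may need to be installed.)

Bash:
# extended device stats every 2 seconds; watch %util and the CPU iowait column
iostat -x 2
# or, without extra packages, the "wa" column here
vmstat 2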
 
the VM uses a lot more memory than is shown inside the VM. E.g. currently the VM reports using only 1.6 GB RAM, but if I check the process on the host, it's about 33 GB

If you assign 96 GB of RAM to your VM, chances are high that it is using it.

What kind of VM is this, and what tool shows that it only uses 1.6 GB of that?
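If the 1.6 GB figure is the "used" column from free inside the guest, keep in mind that buffers/cache don't show up there, but those pages still count as touched memory on the hypervisor side. A quick check inside the guest:

Bash:
# inside the guest: compare "used" with "buff/cache" - both are backed by host RAM
free -h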
 
