VM down because of OOM killer - finding the actual reason

Saahib
Hi,
I found that only VM on server is going down randomly. From dmesg, I can find :

Code:
[Thu Mar 23 02:51:34 2023] Out of memory: Killed process 538560 (kvm) total-vm:125506644kB, anon-rss:113157280kB, file-rss:3988kB, shmem-rss:4kB, UID:0 pgtables:237624kB oom_score_adj:0
[Sun Mar 26 05:09:06 2023] Out of memory: Killed process 1087978 (kvm) total-vm:127067788kB, anon-rss:113982728kB, file-rss:28kB, shmem-rss:0kB, UID:0 pgtables:240364kB oom_score_adj:0

This node has 128GB of RAM and only one VM, which originally had 125GB assigned to it; for now there is no other VM running on the node. The VM runs cPanel + CloudLinux with a couple of low-traffic websites.

The interesting part is that, from monitoring, I can see the total memory usage of the VM was no more than 12-15GB when it was killed by the OOM killer.

Can anyone help me figure out how to troubleshoot this, and how to avoid it happening in the future?

Code:
qm config 5001
agent: 1,freeze-fs-on-backup=0
boot: order=scsi0
cipassword: **********
ciuser: root
cores: 40
cpu: host
ipconfig0: ip=x.x.x.x,gw=x.x.x.1
localtime: 0
machine: q35
memory: 96000
meta: creation-qemu=7.1.0,ctime=1677322719
name: ugi-nl-cl
net0: virtio=E6:10:88:42:58:7A,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-5001-disk-0,cache=writeback,discard=on,size=665360M,ssd=1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=9163a2fc-5fc7-4be7-a844-26e6196798e1
sockets: 1
sshkeys:
vga: qxl
vmgenid: ee77a50f-a636-48a6-b112-2eaa71cd31ca
I have since reduced the assigned memory to ~96GB.
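(For reference, assuming the standard qm CLI, that change can be applied like this; the value is in MiB, matching the config above.)

Bash:
# reduce the VM's assigned memory to ~96 GB (qm expects MiB)
qm set 5001 --memory 96000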

Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-helper: 7.2-14
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-3
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.3
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20221111-2
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.11-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
 
Hello,

Thank you for sharing the output of pveversion and the VM config!

May I ask if you have swap enabled on that node (you can check with `cat /etc/fstab`)? If yes, I would try disabling swap on the node and see if that does the trick. (This has been the solution for some other users as well.)

To disable swap, follow these steps:
1. Issue the following command in the terminal:
Bash:
swapoff -a
2. Edit the `/etc/fstab` file and comment out the swap line as follows:
Bash:
#/dev/pve/swap none swap sw 0 0
3. Reboot the server.
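After the reboot, a quick check that swap is really off (using the standard util-linux tools) could look like this:

Bash:
# prints nothing if no swap device is active
swapon --show
# the "Swap:" line should show 0B total
free -h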
 
Thanks, yes, swap is enabled; I had heard that some swap helps keep Linux stable even if you have plenty of unused RAM. It would be helpful if someone could help me understand why this is happening: is it something inside the VM, or the host node itself?
 
You also enabled writeback as the cache mode. Keep in mind that every write the VM does will be cached in the host's RAM too, increasing the RAM overhead.
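A possible way to switch the disk back, assuming the usual qm syntax and reusing the drive string from your config above (the VM will likely need a full stop/start for the cache change to take effect):

Bash:
# set the disk cache mode to none; the other options are copied from the existing config
qm set 5001 --scsi0 local-lvm:vm-5001-disk-0,cache=none,discard=on,size=665360M,ssd=1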
 
The journal/dmesg should also contain more information about the memory situation at the time(s) the process was killed. The one line you posted already indicates that it was using 113GB of memory though, not 12-15GB. Note that "usage" from the guest's point of view is relatively meaningless (and might also differ depending on the guest OS, ..); what counts is how much is used on the hypervisor.
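Something along these lines should pull the full OOM report out of the kernel log, assuming systemd-journald is in use (adjust the time window to the timestamps you posted):

Bash:
# kernel messages around the first OOM kill, with some context
journalctl -k --since "2023-03-23 02:00" --until "2023-03-23 03:30" | grep -i -B 5 -A 30 "out of memory"
# or straight from the ring buffer, if it hasn't rotated out yet
dmesg -T | grep -i -A 30 "out of memory"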
 
The journal/dmesg should also contain more information about the memory situation at the time(s) the process was killed. The one line you posted already indicates that it was using 113GB of memory though, not 12-15GB. Note that "usage" from the guest's point of view is relatively meaningless (and might also differ depending on the guest OS, ..); what counts is how much is used on the hypervisor.
Yes, it's a basic cPanel setup with low-traffic websites; there is no way it can use that much memory. So what was causing that usage? Do you mean the writeback cache can eat that much RAM?
 
The interesting part is that, from monitoring, I can see the total memory usage of the VM was no more than 12-15GB when it was killed by the OOM killer.

Then your monitoring is wrong, as the OOM killer is reporting a very different value.
 
Also, memory used by the VM process doesn't necessarily mean memory actually used by the guest (OS). It could be that at some point it did use that much (also for caching inside the VM!), or that something touched a lot of small bits of memory distributed over most of the guest RAM, or ...
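One way to compare the two views, assuming the usual Proxmox pidfile location for QEMU guests:

Bash:
# host-side resident set size (RSS, in KiB) of the kvm process for VM 5001
ps -o pid,rss,comm -p "$(cat /var/run/qemu-server/5001.pid)"
# memory figures as Proxmox itself reports them (maxmem, balloon, ...)
qm status 5001 --verbose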
 
Also, memory used by the VM process doesn't necessarily mean memory actually used by the guest (OS). It could be that at some point it did use that much (also for caching inside the VM!), or that something touched a lot of small bits of memory distributed over most of the guest RAM, or ...
I just noticed that with writeback cache enabled, the VM uses a lot more memory than is shown inside the VM. E.g. currently the VM reports using only 1.6 GB RAM, but if I check the process on the host, it's about 33 GB. I then disabled writeback, and after a while the VM showed about 2GB inside while the process was about 3.5 GB. So it means the writeback cache really can eat a lot of RAM?
 
Jup. Like I said, this comes on top of what the guest OS is using. And if the host storage can't keep up with the writes, this can add up.

It's evident that it eats a lot of RAM; however, if I look at disk I/O wait on the host, it's almost zero.
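(For reference, one way to watch host-side I/O wait and per-device utilisation; iostat comes from the sysstat package, which may need to be installed.)

Bash:
# extended device stats every 2 seconds; watch %util and the CPU iowait column
iostat -x 2
# or, without extra packages, the "wa" column here
vmstat 2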
 
the VM uses a lot more memory than is shown inside the VM. E.g. currently the VM reports using only 1.6 GB RAM, but if I check the process on the host, it's about 33 GB

If you assign 96 GB of RAM to your VM, chances are high that it is using it.

What kind of VM is this, and what tool shows that it only uses 1.6 GB of that?
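If the 1.6 GB figure is the "used" column from free inside the guest, keep in mind that buffers/cache don't show up there, but those pages still count as touched memory on the hypervisor side. A quick check inside the guest:

Bash:
# inside the guest: compare "used" with "buff/cache" - both are backed by host RAM
free -h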
 
