Out of memory: Killed process [...] on multiple VMs

acapoprox

New Member
Sep 9, 2024
Hi guys,
this is our situation. We migrated from VMware this year.
This is our hardware:
Blade enclosure: HP c7000
Storage: Dell EMC VNX5200 (HBA)
7 nodes in the blade enclosure: HP BL460c Gen9 blade servers (128 GB RAM each)

Storage was configured following instructions similar to these:

blog.mohsen.co/proxmox-shared-storage-with-fc-san-multipath-and-17a10e4edd8d

HA works smoothly. Backup is perfect.
We have approximately 50 virtual servers in the production environment (10 MS Windows, 40 Linux).

The virtual machines "migrated" from VMware were, for the most part, Oracle Linux 8-9.
We have some Ubuntu servers too.

After some days, on some of the Oracle Linux VMs only (6-7 at least), we started seeing the error described in the subject (inside the VM; no errors on the Proxmox nodes).
Sometimes a "secondary" process is killed, but on some occasions a central service (MySQL, Asterisk) was killed, causing business problems.
We absolutely need to fix this, and we certainly don't want to migrate to another platform (XCP-ng or similar).

All the VMs have the guest agent properly installed (qemu-guest-agent).
All VMs are configured with "balloon memory".
Now, after all these problems, we are trying to disable ballooning, but we need help.
Have any of you had similar problems?
If so, how did you solve them?

Thanks in advance, and sorry for my bad English.

Emanuele.
 
Hi @acapoprox, welcome to the forum.

If the OOM event happens inside the VM, that usually means that you've run out of allocated memory in the VM. While you may have identical settings between the old ESXi setup and the new PVE, there are always differences between hypervisors and virtual machine interactions.

The easy "fix" is to increase the memory allocated to the VMs. The next step would be to implement monitoring via Grafana, Nagios, or a similar product.

You should search and read through the many articles on OOM killer troubleshooting, e.g.: https://serverfault.com/questions/134669/how-to-diagnose-causes-of-oom-killer-killing-processes
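
As a starting point for that kind of diagnosis, here is a minimal Python sketch that pulls the killed process and its memory figures out of the guest's kernel log. The field layout in the regular expression is an assumption based on what common 4.18+/5.x kernels print, and the sample line is made up for illustration, so adjust both to what your VMs actually log.

```python
import re

# Pattern for the kernel's OOM-kill summary line. The field layout is an
# assumption based on common 4.18+/5.x kernels; adjust it if your log differs.
OOM_LINE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?total-vm:(?P<total_vm_kb>\d+)kB"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def parse_oom_kills(log_text):
    """Return one dict per OOM kill found in the given kernel log text."""
    kills = []
    for line in log_text.splitlines():
        m = OOM_LINE.search(line)
        if m:
            kills.append({
                "pid": int(m["pid"]),
                "name": m["name"],
                "total_vm_mib": int(m["total_vm_kb"]) // 1024,
                "anon_rss_mib": int(m["anon_rss_kb"]) // 1024,
            })
    return kills


# Made-up sample line for illustration only (not taken from this thread).
SAMPLE = (
    "Nov 11 11:02:13 vm1 kernel: Out of memory: Killed process 2314 (mysqld) "
    "total-vm:4194304kB, anon-rss:2097152kB, file-rss:0kB, shmem-rss:0kB"
)

if __name__ == "__main__":
    for kill in parse_oom_kills(SAMPLE):
        print(kill)
```

Inside an affected VM you could feed it the output of `journalctl -k` or `dmesg` instead of the sample. The lines the kernel prints just before the kill (the per-process memory table) are usually worth reading as well, since the killed process is not necessarily the one that used up the memory.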


 
After some days, on some of the Oracle Linux VMs only (6-7 at least), we started seeing the error described in the subject (inside the VM; no errors on the Proxmox nodes).
All VMs are configured with "balloon memory".
Please be aware that, for VMs configured with ballooning, Proxmox will (forcefully) take memory away from the VMs once memory usage on the host reaches 80%: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_memory
Maybe the minimum memory you set for the VMs is too low for the software inside them? Or maybe you need to give the Oracle VMs more "Shares" (so Proxmox will take less memory from them and more from other VMs)?
Maybe change the KSM setting so it starts looking for memory to share before the default 80% (so Proxmox might not have to use ballooning as much)? https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_memory
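
To check those settings across many VMs, a small Python sketch along these lines might help. It reads the `memory`, `balloon` and `shares` keys from the text of a VM config (on a PVE node these live under /etc/pve/qemu-server/) and summarizes how much memory the host is allowed to reclaim. The sample config, its values and the function names are hypothetical, and the handling of snapshot sections is an assumption about the file layout.

```python
def parse_memory_settings(conf_text):
    """Extract memory, balloon and shares (all in MiB or unitless) from a
    qemu-server style config text.

    Only the first section is read: parsing stops at the first "[...]" header,
    assumed here to start a snapshot section.
    """
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            break
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key in ("memory", "balloon", "shares"):
            settings[key] = int(value.strip())
    return settings


def describe_ballooning(settings):
    """Summarize what the memory settings allow the host to do."""
    memory = settings.get("memory")
    balloon = settings.get("balloon")
    if balloon == 0:
        return "balloon device disabled"
    if balloon is None or memory is None or balloon >= memory:
        return "no automatic ballooning (no minimum below the maximum)"
    return f"host may reclaim up to {memory - balloon} MiB (minimum {balloon} MiB)"


# Hypothetical config for illustration; names and values are not from this thread.
SAMPLE_CONF = """\
# example VM
balloon: 2048
cores: 4
memory: 8192
name: oracle-db
shares: 1000

[snap1]
memory: 4096
"""

if __name__ == "__main__":
    print(describe_ballooning(parse_memory_settings(SAMPLE_CONF)))
```

If the minimum turns out to be well below what MySQL or Asterisk needs at peak, raising `balloon` or setting it to 0 for those VMs would be the first thing to test.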
 
Hi guys, thanks for the quick answers.
This is the first thing we did: we increased the vRAM of the virtual machines where the problem occurred.
First +2 GB, then another 2 GB, and so on.
This did not solve our problems.
There is one thing I did not mention:
on the nodes we have a monitoring system that eventually balances memory usage across all the nodes. If a node exceeds 65 percent, the load is recalculated and the virtual machines are redistributed across the remaining 6 nodes.
In any case, the nodes never exceed 60%; most of the time memory usage on each node is around 55%, while CPU usage (20 x Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (1 Socket)) is around 20%.


Screenshot 2024-11-11 152519.png


Another note: we already have a monitoring system that covers our entire infrastructure (Centreon).
It monitors everything: nodes, virtual machines, UPS, etc.
It warns us about problems of any kind (disk space, memory above 80%, temperature, high CPU usage and CPU load, etc.), and from what we have noticed, there are no memory spikes when the "Out of memory: Killed process" event is triggered.
Below is an example:
At about 11 o'clock we received the notification of the kill, but the memory graph of the server looks normal.
We are speechless.
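
One thing that could be checked from inside an affected VM is whether the guest currently sees less RAM than it was configured with, which on Linux guests is typically how an inflated balloon shows up in /proc/meminfo. The Python sketch below compares `MemTotal` with a configured maximum. The sample text and the 8192 MiB figure are hypothetical, and a gap of a few hundred MiB is normal because the kernel reserves some memory for itself. Short spikes between two polling intervals are another possible reason for a flat graph; both ideas are assumptions, not something confirmed in this thread.

```python
def read_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def balloon_gap_mib(meminfo, configured_mib):
    """Configured maximum minus the guest's MemTotal, in MiB (floored at 0)."""
    total_mib = meminfo["MemTotal"] // 1024
    return max(0, configured_mib - total_mib)


# Hypothetical sample for illustration; not measured on any VM in this thread.
SAMPLE_MEMINFO = """\
MemTotal:        6029312 kB
MemFree:          204800 kB
MemAvailable:     512000 kB
"""

CONFIGURED_MIB = 8192  # assumed "memory" value of the VM

if __name__ == "__main__":
    info = read_meminfo(SAMPLE_MEMINFO)
    print("gap to configured RAM:", balloon_gap_mib(info, CONFIGURED_MIB), "MiB")
```

On a real guest, replace the sample with the contents of /proc/meminfo and log the result every few seconds around the time the kills usually happen.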

PS: This never happened in 6 years of VMware on the same hardware.

Screenshot 2024-11-11 151828.png





Screenshot 2024-11-11 152145.png
 
Hi Emanuele,

Have you considered memory virtualization/pooling as a possibility? Take a look at kove.com
 
Please be aware that, for VMs configured with ballooning, Proxmox will (forcefully) take memory away from the VMs once memory usage on the host reaches 80%: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_memory
Maybe the minimum memory you set for the VMs is too low for the software inside them? Or maybe you need to give the Oracle VMs more "Shares" (so Proxmox will take less memory from them and more from other VMs)?
Maybe change the KSM setting so it starts looking for memory to share before the default 80% (so Proxmox might not have to use ballooning as much)? https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_memory
I will try these solutions ASAP.

Thanks!
 
