Observation: upgrade from 7.x to 8.x, non-stop OOM killing

darkpixel

Apparently our internal testing missed this, but we recently (over the holiday break) upgraded a client with 36 locations from Proxmox 7.x to 8.x.

They have been using Proxmox for a *long* time. These servers started out running Proxmox 5.x and have been upgraded at every step since. Occasionally a server needed a wipe/reinstall, or a new location was acquired, but most of them have followed that upgrade path.

Nearly all the servers have 128 GB RAM (with maybe 4 having 64 GB) and host 3 Windows VMs. The first is Server 2019 with 16 GB RAM, the second is Server 2019 with 8 GB RAM, and the third is a Windows 10 or 11 VM with 4 GB RAM. This comes to ~28 GB RAM allocated to the VMs.

We have *never* had an OOM event take down a VM in all these years. But after upgrading to 8.x a few weeks ago, we have had 20 instances of the OOM killer being activated and knifing a random VM in the back. We've had to reduce the "16 GB VM" at every location to 8 GB and that seems to solve the issue.

I have no idea why 3 VMs totaling ~28 GB of RAM are running the box out of memory, but it doesn't seem to matter if the hypervisor has 64 GB or 128 GB of RAM.

I'm looking at one of the 128 GB machines right now. The three VMs are using 12.8%, 6.4%, and 3.2% of the RAM, but top also shows:

GiB Mem : 125.7 total, 26.2 free, 99.1 used, 1.4 buff/cache
GiB Swap: 0.0 total, 0.0 free, 0.0 used. 26.6 avail Mem

Only 1.4 GiB is being used by buffers/cache, 99.1 GiB shows as used, and there's 26.2 GiB free (26.6 GiB available).

128 GB RAM - 28 GB allocated to the VMs should leave around 100 GB free for use or cache.
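To put a number on the gap, here is a rough sketch of the arithmetic using only the figures from the top output quoted above. It assumes the per-VM percentages are top's %MEM column, i.e. each kvm process's resident memory as a share of total physical RAM; the variable names are just for illustration.

```python
# Figures copied from the top output quoted above (all in GiB).
total_gib = 125.7
used_gib = 99.1

# Assumption: these are top's %MEM values for the three kvm processes.
vm_mem_percent = [12.8, 6.4, 3.2]

# Resident memory the three VMs account for.
vm_resident_gib = sum(p / 100 * total_gib for p in vm_mem_percent)

# "Used" memory that the VM processes do not explain.
unaccounted_gib = used_gib - vm_resident_gib

print(f"VMs resident: {vm_resident_gib:.1f} GiB")
print(f"Unaccounted:  {unaccounted_gib:.1f} GiB")
```

By this estimate the VMs hold roughly 28 GiB, which matches what is allocated to them, and roughly 71 GiB of "used" memory is not attributed to any of the three VM processes.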

I'm not sure what broke in Proxmox, Debian or qemu, but the only thing that recovers all that "missing memory" is stopping the VMs or killing /usr/bin/kvm.
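For anyone who wants to check the same thing on their own host, the following is a minimal sketch, not a definitive diagnostic. It compares what the kernel counts as used (MemTotal minus MemAvailable from /proc/meminfo) with the summed VmRSS of every process named kvm. The helper names are made up for this example, and it assumes the VM processes show up with the name kvm, as they do in top here.

```python
import re
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}; most values are in kB."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"([^:]+):\s+(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def rss_kb(status_text: str) -> int:
    """Return VmRSS in kB from /proc/<pid>/status text (0 if absent)."""
    m = re.search(r"^VmRSS:\s+(\d+) kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else 0


def kvm_rss_kb(proc: Path = Path("/proc")) -> int:
    """Sum VmRSS over all processes whose name is 'kvm' (assumed VM name)."""
    total = 0
    for status in proc.glob("[0-9]*/status"):
        try:
            text = status.read_text()
        except OSError:
            continue  # process exited or is not readable
        if re.search(r"^Name:\s+kvm$", text, re.MULTILINE):
            total += rss_kb(text)
    return total


if __name__ == "__main__":
    meminfo_path = Path("/proc/meminfo")
    if meminfo_path.exists():  # Linux only
        mem = parse_meminfo(meminfo_path.read_text())
        used_kb = mem["MemTotal"] - mem["MemAvailable"]
        vm_kb = kvm_rss_kb()
        gib = 1024 * 1024  # kB per GiB
        print(f"used (total - available): {used_kb / gib:.1f} GiB")
        print(f"kvm resident:             {vm_kb / gib:.1f} GiB")
        print(f"not in kvm RSS:           {(used_kb - vm_kb) / gib:.1f} GiB")
```

A large "not in kvm RSS" figure would mean the memory is counted as used without appearing in the resident set of any VM process, which is the pattern described above.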

Anyways, just an observation.
 
In the past 30 minutes, I had two more offices die. In each case I had to drop the VM's memory from 16 GB to 8 GB and boot it back up.

I'm also seeing a pattern where this OOM kill and restart of the VM is frequently followed an hour or so later by the hypervisor spontaneously rebooting with nothing in syslog to indicate the problem. After that, everything runs fine (or at least it has been running fine since this first started happening after the upgrade).
 
