Apparently our internal testing missed this...but we recently (over the holiday break) upgraded a client with 36 locations from Proxmox 7.x to 8.x.
They have been using Proxmox for a *long* time. These servers started out running Proxmox 5.x and have been upgraded all along the way. Occasionally a server needed a wipe/reinstall, or a new location was acquired, but most of them have been upgraded in place the whole time.
Nearly all the servers have 128 GB RAM (with maybe 4 having 64 GB) and host 3 Windows VMs: one Server 2019 with 16 GB RAM, another Server 2019 with 8 GB RAM, and a Windows 10 or 11 VM with 4 GB RAM. That comes to ~28 GB RAM allocated to the VMs.
We have *never* had an OOM event take down a VM in all these years. But after upgrading to 8.x a few weeks ago, we have had 20 instances of the OOM killer being activated and knifing a random VM in the back. We've had to reduce the "16 GB VM" at every location to 8 GB and that seems to solve the issue.
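For anyone wanting to confirm the same thing on their own nodes, this is just standard kernel-log grepping, nothing Proxmox-specific (and the exact message wording varies a bit by kernel version):

# count OOM killer events since the current boot
dmesg -T | grep -ci "out of memory"

# same thing from the journal, including which process got killed (it's the VM's kvm process)
journalctl -k | grep -Ei "invoked oom-killer|Out of memory: Killed process"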
I have no idea why 3 VMs totaling ~28 GB of RAM are running the box out of memory, but it doesn't seem to matter if the hypervisor has 64 GB or 128 GB of RAM.
I'm looking at one of the 128 GB machines right now. The three VMs are using 12.8%, 6.4%, and 3.2% of the RAM, but top also shows:
GiB Mem : 125.7 total, 26.2 free, 99.1 used, 1.4 buff/cache
GiB Swap: 0.0 total, 0.0 free, 0.0 used. 26.6 avail Mem
Only 1.4 GB is being used by buffers/cache, 99.1 GB is used by processes, and only about 26 GB is free/available.
128 GB RAM - 28 GB allocated to the VMs should leave around 100 GB free for use or cache.
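In case anyone wants to compare notes, here's roughly what I've been looking at to see who is actually holding that memory; the kvm processes' RSS lines up with the guest sizes, and the gap doesn't show up in buff/cache either:

# per-process resident memory, biggest first (the three kvm processes sit at the top)
ps -eo pid,rss,vsz,comm --sort=-rss | head -n 10

# kernel-side breakdown, in case the memory is hiding in slab / page tables rather than in processes
grep -E '^MemTotal|^MemFree|^MemAvailable|^Buffers|^Cached|^Slab|^KernelStack|^PageTables' /proc/meminfo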
I'm not sure what broke in Proxmox, Debian, or QEMU, but the only thing that recovers all that "missing memory" is stopping the VMs or killing /usr/bin/kvm.
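"Stopping the VMs" here just means the normal Proxmox CLI, nothing exotic (100 is a placeholder VMID):

# list the VMs on the node and their IDs
qm list

# clean guest shutdown and restart; the host's "used" memory drops back to where it should be
qm shutdown 100
qm start 100

# hard stop if the guest hangs (this terminates the /usr/bin/kvm process directly)
qm stop 100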
Anyways, just an observation.