Unexpected shutdown, out of memory?

CelticWebs

I've had an unexpected shutdown on a VM that hosts a web server twice now, both at around 6am, so obviously the first thing I need to do is find what's running at 6am on those days. The shutdown looks more like a kill than a clean shutdown. From syslog it appears the system ran out of memory and killed the VM process. Is this the whole system running out of memory, or just the VM? Is there a way to make this VM high priority so it will kill anything other than this one VM?

System spec is Proxmox 8.1.3 with 128GB of RAM.

This is what I found in syslog.

Code:
cat /var/log/syslog | grep oom
2023-12-31T05:58:18.900734+00:00 prox380 kernel: [827342.672879] pve-firewall invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2023-12-31T05:58:18.900984+00:00 prox380 kernel: [827342.673962]  oom_kill_process+0x10d/0x1c0
2023-12-31T05:58:18.926437+00:00 prox380 kernel: [827342.700567] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
2023-12-31T05:58:18.958981+00:00 prox380 kernel: [827342.721038] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=pve-firewall.service,mems_allowed=0-1,global_oom,task_memcg=/qemu.slice/104.scope,task=kvm,pid=1556670,uid=0
2023-12-31T05:58:18.958982+00:00 prox380 kernel: [827342.721849] Out of memory: Killed process 1556670 (kvm) total-vm:111162224kB, anon-rss:91578576kB, file-rss:2660kB, shmem-rss:0kB, UID:0 pgtables:182384kB oom_score_adj:0
2023-12-31T05:58:19.787619+00:00 prox380 systemd[1]: 104.scope: Failed with result 'oom-kill'.
2023-12-31T05:58:30.501982+00:00 prox380 kernel: [827354.283802] oom_reaper: reaped process 1556670 (kvm), now anon-rss:0kB, file-rss:160kB, shmem-rss:0kB
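
To work out what actually fires around 6am, my plan is to go through the usual scheduling places on the host first, roughly like this (standard Debian locations only, so it won't catch anything scheduled inside the guests):

Code:
# systemd timers and when they last/next fire
systemctl list-timers --all
# classic cron entries for root plus the system-wide ones
crontab -l
grep -v '^#' /etc/crontab
ls /etc/cron.d/ /etc/cron.daily/ /etc/cron.hourly/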

Would simply reducing the allocated memory for a VM be likely to help this? There are only 3 VMs on this node. The total memory allocated between them is admittedly about 32GB more than the actual system has, but 2 of the 3 VMs don't really use any memory, even though it's allocated.
 
That is what happens when you overcommit memory.
I thought Proxmox was good at managing this, forcing VMs to release memory that was just cached; I assumed it would work it all out. I'm 13% overcommitted. Looks like reducing the VMs' memory, or possibly increasing available system memory, will be a simple fix then?
 
No. Without enabling ballooning, installing the guest agent and setting the Min RAM lower than the Max RAM, PVE won't do anything to release RAM that is in use by caching (there's a rough CLI sketch at the end of this post).

And this isn't that great either. You only move the OOM from the host to the guest OS, so your guest OS would maybe kill your web server's process instead of the PVE host killing the whole VM.

If you want to be sure nothing gets killed, don't overprovision your RAM.
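
For completeness, the ballooning setup mentioned above can also be done from the host CLI; a rough sketch for the VM from your log (VMID 104, the sizes are only example values) would be:

Code:
# max 64G assigned, PVE may balloon it down towards 32G when the host is under memory pressure
qm set 104 --memory 65536 --balloon 32768
# enable the guest agent option; the agent/balloon driver still has to run inside the guest
qm set 104 --agent enabled=1

The Min RAM field in the GUI corresponds to the balloon value here.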
 
Hi Dunuin, thanks again for your responses to my endless newbie questions; I think you've answered on every thread and it's much appreciated.

Are you saying that if ballooning had been enabled on each VM, this would not have happened? I've currently reduced the memory on the VMs: 64GB on the one in question, 16GB on another and 32GB on the last, which leaves 16GB of the system's RAM unallocated. Hoping this fixes the situation until I add more memory in a week or two.
 
I think the problem is not that easy. I manage a few small Proxmox environments and sometimes one of them gets those OOMs.

On the one hand, you have the "regain memory from the VM back to the host" mechanisms like ballooning or KSM.
On the other hand, there is the ZFS ARC, which by default uses up to 50% of your host memory (but you can limit that to a fixed amount).

In the default configuration, Kernel Samepage Merging will regain your unused VM memory and give it back to the host (alternatively you can use ballooning for the same effect), and the ZFS ARC will use that for filesystem caching. If a VM then needs a lot of memory again, KSM will allocate the memory back and the ZFS ARC will release memory to satisfy it.

Normally this works fairly reliably if you have fast SSDs. But if you use slow consumer SSDs or slow HDDs and a process in your VM acquires a lot of memory very quickly, combined with a large disk write, then the ZFS ARC cannot release the memory fast enough.
You have then overprovisioned your system memory by accident.

A solution would be to limit VM memory + ZFS ARC + Proxmox host memory (1 GB or so) to a value smaller than your RAM. But this will be a waste of memory most of the time. Only do it once the problem has actually occurred, because it possibly never will if you use fast SSDs, since ZFS can then drop the cache fast enough.
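
As a concrete example of that kind of limit, the ARC maximum is capped via a ZFS module option; the 16 GiB below is just a placeholder, pick whatever fits your VM memory plus host overhead budget:

Code:
# cap the ZFS ARC at 16 GiB (value in bytes: 16 * 1024^3)
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
# rebuild the initramfs so the option is already active when the zfs module loads at boot
update-initramfs -u
# apply it right away without a reboot (the ARC may take a little while to shrink)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max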

The OOMs on the machines I manage always happen on the machines with the cheapest SSDs (BX500, Kingston NV2, etc.). I would never use such SSDs in a server again. It was a big mistake.
 
Thanks for the detailed response.

The system is part of a small 2-server cluster; this VM is running on a node with the following spec:

HPE DL380 Gen 9
CPU E5-2620 v3 @ 2.40GHz
128GB 2400MHz DDR4 Memory
2 x 500GB Enterprise SATA SSD in Mirror for OS (500GB Raid)
4 x 500GB Enterprise SATA SSD in stripe/mirror for VMs (1TB Raid)
Bonded 4x1Gb Ethernet connected to a 10Gb switch with a 10Gb uplink.

I'm hoping that the issue will now be fixed until I can add some more memory.

@ivenae, is the setting for the ZFS ARC available in the GUI, before I go hunting for terminal commands to see what it's up to?
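
For reference, the current ARC state can at least be read from a shell on the node with something like this (assuming ZFS is what these VMs sit on):

Code:
# current ARC size and the min/max targets, in bytes
grep -E '^(size|c_min|c_max)' /proc/spl/kstat/zfs/arcstats
# or the full human-readable report from zfsutils
arc_summary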
 
That KSM would regain RAM from VMs would be new to me. As far as I know it runs only on the host level and deduplicates identical pages. So it doesn't release VM RAM, it just kind of compacts it by deduplicating the ARC and the VMs' caches in case both cache the same data.
 
I have systems with 64 GB memory in total. The VMs have 2 GB + 6 GB + 16 GB + 24 GB = 48 GB in sum.
KSM shows 27 GB regained.
AFAIK KSM does not regain from host cache, but it does on all other memory.

The effect is that KSM is similar to ballooning.
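
If you want to cross-check that figure on the host, the KSM counters are exposed under /sys; with the usual 4 KiB pages something like this should land close to the value the GUI reports:

Code:
# pages currently shared by KSM
cat /sys/kernel/mm/ksm/pages_sharing
# rough conversion to GiB (assumes 4 KiB pages)
awk '{printf "%.1f GiB\n", $1 * 4096 / 1024 / 1024 / 1024}' /sys/kernel/mm/ksm/pages_sharing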
 
