Random Guest Shutdowns

mcit

Active Member
May 16, 2010
Hi Everyone

I have recently migrated an ageing host to a new server. I have always used hardware RAID, but this new machine has no hardware RAID card, so I have gone for a ZFS mirror. It is running across 2x 1.8TB NVMe SSDs on a Xeon E-2236 with 64GB of memory.

I have a total of 9 guests running simultaneously.
All guests combined are allocated a total of 59GB of memory.

I am seeing an issue where randomly I notice that a guest has powered off. I see nothing in the GUI to show a shutdown was requested, and nothing in the guest logs to show a shutdown command was issued.

Is it possible that the proxmox host is running low on memory, and is powering off a guest to save itself, and if so, is that the expected behavior? I am not aware of having this issue on any of my other nodes. I have over 40 proxmox hosts in service currently, but only this one using ZFS for the data store.

Matthew
 
If you check the syslogs, do you see the OOM killer getting active? If so, then this is definitely a situation where you need more RAM.
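Something like this should surface them (a sketch; assumes journald as on current PVE, with a fallback to the classic syslog file):

```shell
# Search the kernel log for OOM-killer activity. journald covers current
# PVE releases; older installs may still write /var/log/syslog.
{ journalctl -k --no-pager 2>/dev/null; cat /var/log/syslog 2>/dev/null; } \
  | grep -iE 'out of memory|oom[-_ ]?kill' \
  || echo "no OOM entries found"
```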

Did you check on the PVE node summary how much memory is free? Or even better, do you have some external performance monitoring set up to keep an eye on the system stats?

Are the guests configured with a fixed amount of RAM or do you use ballooning?

Using memory ballooning can make the situation worse in a low-memory environment. This is because by default ZFS will take free memory up to 50% of the installed memory when needed. This is mainly used for the cache (ARC) to serve read requests much faster. ZFS will free up the memory if other processes require more, but freeing it can take too long, which causes problems.

If you have some external performance monitoring set up, make sure to also monitor the ZFS ARC to get an idea how much memory it is using and how many requests it can serve without having to access the underlying disks.
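Even without full monitoring, the tools bundled with ZFS give a quick look (a sketch; `arcstat` and `arc_summary` ship with zfsutils-linux on PVE, and the raw counters live in procfs):

```shell
# Live sampling and one-shot summaries (run these interactively):
#   arcstat 1        # size, hit%, miss% once per second
#   arc_summary      # detailed ARC breakdown
# The raw counters behind both tools are in /proc/spl/kstat/zfs/arcstats:
awk '/^(size|c_max|hits|misses) /{printf "%s=%s\n", $1, $3}' \
    /proc/spl/kstat/zfs/arcstats 2>/dev/null \
  || echo "no ZFS module loaded"
```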


In general, I recommend leaving a bit more RAM for the system if you use ZFS, rather than giving it all to the VMs, as you will benefit from faster read operations when the data is cached by ZFS.
 
Thanks Aaron

I have found the OOM entries in the syslog that coincide with the shutdowns. I see entries like

Out of memory: Killed process 31263 (kvm) total-vm:20341864kB, anon-rss:17988224kB, file-rss:2248kB, shmem-rss:4kB, UID:0 pgtables:38676kB oom_score_adj:0

When this process kicks in, does it just terminate the first KVM process it sees, so the guest that is shut down is random? Is there a way to control which guest is shut down if memory gets too low?

In answer to your questions:

The summary page currently says: RAM usage 97.10% (60.92 GiB of 62.74 GiB) and KSM sharing 21.48 GiB.

I do not use Ballooning, everything is fixed size.

I do have a PRTG monitor on the system for simple uptime, I will see if I can extend this for memory tracking. However, the issue appears to be explained by what you have advised. I may need to simply add more memory or move a guest or two off to another system.

Thanks for your help, at least I now have a cause for the issue that I can verify.

Matthew
 
When this process kicks in, does it just terminate the first KVM process it sees, so the guest that is shut down is random? Is there a way to control which guest is shut down if memory gets too low?
The OOM killer weighs a few metrics to decide which process is the best one to kill to alleviate the low-memory situation. There is no direct way to control which one.
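If you're curious how the kernel ranks the candidates, you can peek at the per-process "badness" scores; this is just an illustration using standard /proc paths. `oom_score_adj` can nudge the ranking down or up, but it cannot reliably pick an exact victim:

```shell
# Print the OOM "badness" score for each kvm process (PVE names the QEMU
# processes "kvm"); the highest oom_score is the likeliest victim.
# oom_score_adj ranges from -1000 (never kill) to 1000 (kill first).
for pid in $(pgrep -x kvm); do
    printf 'pid=%s score=%s adj=%s\n' "$pid" \
        "$(cat /proc/"$pid"/oom_score)" "$(cat /proc/"$pid"/oom_score_adj)"
done
```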

The summary page currently says: RAM usage 97.10% (60.92 GiB of 62.74 GiB) and KSM sharing 21.48 GiB.
Okay, with KSM sharing 21.5 GiB, memory is quite a bit overprovisioned: several VMs are sharing pages because their contents are identical. And even with that, almost all RAM is in use. Now if a VM's memory contents change and can no longer be shared, it needs dedicated memory again -> even less memory is available -> the OOM killer has to step in.
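You can see how much KSM is currently deduplicating straight from sysfs (a sketch; assumes the standard 4 KiB page size on x86_64):

```shell
# pages_sharing = guest pages currently backed by a single shared copy.
# Multiply by the 4 KiB page size to get the saving; falls back to 0 if
# KSM is not available on this kernel.
pages=$(cat /sys/kernel/mm/ksm/pages_sharing 2>/dev/null || echo 0)
echo "KSM is currently saving $((pages * 4 / 1024)) MiB"
```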

I do not use Ballooning, everything is fixed size.
Okay, so 59G is allocated to the VMs, which is well past the 80% memory usage at which KSM starts to work. KSM then frees up a bit of memory, which might be used by ZFS. We cannot say for sure, because for that we'd need to check the size of the ARC (arcstat).

You could try to limit the ARC size of ZFS, but honestly there isn't much memory to reclaim anyway, and ZFS will need roughly 1 GiB per 1 TiB of storage space regardless; anything beyond that is used for the cache.
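For reference, should you want to try it anyway, capping the ARC looks roughly like this (a sketch; the 4 GiB value is just an example, not a recommendation for this box):

```shell
# Persist the cap across reboots (zfs_arc_max is in bytes):
echo "options zfs zfs_arc_max=$((4 * 1024 * 1024 * 1024))" \
    > /etc/modprobe.d/zfs.conf
update-initramfs -u   # needed when the root pool is ZFS

# Apply immediately without a reboot:
echo $((4 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
```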

In that situation I would add more RAM to give the system more room to work with.
 
Thanks Aaron for the detailed response.

I am investigating my options for additional memory on this server. It sounds like that will solve my issues.
 
