Can PVE ignore the cached RAM of a VM in its usage calculation?

Since migrating all my VMs from an old Hyper-V server to PVE v8 I've had very little trouble, which is great, but there is one thing I cannot seem to get sorted.

I have one VM, running Fedora Server 38 and hosting a few containers and a MariaDB server, where PVE over time alerts me that memory use is over 90%. I've tried various ways to mitigate the issue, e.g. reducing MariaDB's caching, but the problem persists. I start the VM and for hours the memory use sits around 4 GB of 10 GB, but then, according to PVE, it rises back up to what you see below:

[attached memory-usage screenshots]

If the swap was under heavy use then I would be concerned, but it's not, so PVE telling me there's a problem is wrong (I use Zabbix to monitor my servers including PVE, and that's where I get the alerts from). My choices so far have been to reboot, use the monitor to lower then raise the max balloon size, or force a cache cleanup on the server. I know I could "adjust" the trigger on the PVE checks, but that just hides the issue, and I would rather PVE reported it like everything else.
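(For reference, the post doesn't spell out how the cache cleanup is done; a minimal sketch, assuming the usual Linux drop_caches mechanism inside the guest, would be:)

```python
# Minimal sketch, assuming the "cache cleanup" means dropping the Linux page
# cache via /proc/sys/vm/drop_caches (the post does not give the exact steps).
# Needs root inside the guest; the kernel reclaims cache on demand anyway, so
# this is cosmetic rather than necessary.
import os

os.sync()  # flush dirty pages first so clean pages can actually be dropped
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")  # 3 = drop page cache plus dentries and inodes
```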

I never had to do this when the VM was running on Hyper-V, where RAM usage always reported the non-cached amount.
 
Please read this thread to see why it's displayed the way it is.

If the swap was under heavy use then I would be concerned, but it's not, so PVE telling me there's a problem is wrong (I use Zabbix to monitor my servers including PVE, and that's where I get the alerts from). My choices so far have been to reboot, use the monitor to lower then raise the max balloon size, or force a cache cleanup on the server.
Please try to understand that this is NORMAL and nothing is wrong. You WANT all of your RAM to be used.
 
Please read this thread to see why it's displayed the way it is.


Please try to understand that this is NORMAL and nothing is wrong. You WANT all of your RAM to be used.

I'm perfectly aware that we want to use all the RAM; I was glad to see Windows finally adopt this attitude many years ago. But cached is not the same as in use: the cache is transient, and it relinquishes its memory, as a priority, whenever processes actually need it.
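To put numbers on that: inside a Linux guest the reclaimable cache is already accounted for in the kernel's MemAvailable estimate, so the two views can be compared directly. A small sketch (assuming a Linux guest exposing /proc/meminfo; this is not something from the thread itself):

```python
# Sketch: contrast "used" (which counts page cache) with what the kernel says
# is actually available to applications. Assumes a Linux guest with
# /proc/meminfo; MemAvailable exists on kernels >= 3.14.
def meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            values[key] = int(rest.strip().split()[0])  # values are in kB
    return values

m = meminfo()
total = m["MemTotal"]
naive_used = total - m["MemFree"]                             # includes cache -> looks "full"
real_pressure = total - m.get("MemAvailable", m["MemFree"])   # cache already excluded

print(f"used incl. cache:   {naive_used / 2**20:.1f} GiB ({100 * naive_used / total:.0f}%)")
print(f"used excl. reclaim: {real_pressure / 2**20:.1f} GiB ({100 * real_pressure / total:.0f}%)")
```

The Hyper-V and VMware figures described in this thread are much closer to the second number, which is essentially the difference being argued about here.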

Yes, it seems to be normal for PVE, but as I said, it's not in Hyper-V, nor in VMware, which is what I manage in my day job.

[Screenshots: VMware memory summary and guest htop output]
 

Do those two images correlate? VMware shows 1 GB used, the VM itself 5 GB used.
It shows 1 GB active, not used, i.e. it is excluding the cache as that is transient. This seems to be a philosophy difference over which metric is important. The OP is arguing there is no point alerting that memory is at 90% when the bulk of it is cache; it should only alert when there is true memory exhaustion. This is the way Hyper-V and VMware do it. Reading the thread you posted, it sounds like this would be an upstream change in the guest tools?
 
Or you just monitor the guest with Zabbix, where you get RAM usage and cache separately, and disable that trigger on the PVE side.
 
Or you just monitor the guest with Zabbix,
Indeed, I don't have a horse in this race; I'm just articulating what I believe the OP is getting at.

For my homelab I have been wondering what I should use. It sounds like you favor Zabbix over things like Checkmk etc. Is it easy to get started with?

--edit-- scratch that, I remember it being a ridiculous number of containers, with crappy compose examples and an insane number of mount points, so I moved on... if there is an easy-button version I would be interested.
 
It shows 1 GB active, not used, i.e. it is excluding the cache as that is transient.
The cache is already excluded from the 5 GB; the cache filled the whole RAM up to 20 GB (as seen in the htop output, including swapping), so the VMware 1 GB makes absolutely no sense.

This seems to be a philosophy difference over which metric is important. The OP is arguing there is no point alerting that memory is at 90% when the bulk of it is cache; it should only alert when there is true memory exhaustion. This is the way Hyper-V and VMware do it.
A hypervisor should not do alerting, and PVE does not. Monitoring should always be external.

I really don't like displaying this "memory value" at all, because even the one that is shown is too low (as was seen in the other threads). There is additional RAM for GPU and other device virtualization, KVM process overhead, etc., which can blow a VM up to 110-115% of its configured memory. On the other hand, KSM potentially deduplicates it again, so you're totally lost. It can only be wrong; the only question is what kind of wrong you prefer.

IMHO: the important metric for a hypervisor is the ACTUAL used memory. That's what is relevant, not what the guest thinks it uses. This is the memory that runs out first. Imagine you have 10 Windows VMs with such "fake numbers" and think, yeah, I can add 10 more because they only use 10% of what I configured, but you get OOMs or heavy swapping after adding just 2, because you compared the wrong numbers. As a hypervisor administrator, I just want to see how much room I have for new VMs without overprovisioning. If you're the actual service or VM admin, I can see that you want your own numbers, the same ones you would see inside the guest's Windows Task Manager.
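To make the host-side view concrete: the memory the node actually commits to a VM is the resident set of its KVM process, which can be read straight from /proc on the node. A rough sketch (matching processes by their command line is a heuristic of mine, not an official PVE interface):

```python
# Rough sketch: sum the resident set size (VmRSS) of QEMU/KVM processes on a
# PVE node to see the memory the host is actually committing to running VMs.
# Assumption: matching by "/usr/bin/kvm" or "qemu-system" in the command line
# is a heuristic, not an official PVE interface.
import os

def rss_kib(pid: str) -> int:
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # kB
    return 0

total_kib = 0
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmd = f.read()
        if b"/usr/bin/kvm" in cmd or b"qemu-system" in cmd:
            total_kib += rss_kib(pid)
    except OSError:
        continue  # process exited while we were scanning

print(f"RSS committed to running VMs: {total_kib / 2**20:.1f} GiB")
```

Comparing that sum against the node's total RAM (minus whatever KSM has merged back) gives the headroom for new VMs, independent of what the guests report internally.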

Reading the thread you posted, it sounds like this would be an upstream change in the guest tools?
The guest tools have been around for decades, and the fact that this has not happened yet is a good indicator that it never will. People like me (and the others who answered) are not interested in the "lies" (harsh, I know ;) ) that VMware and Hyper-V tell about memory usage. Just because they do this, why should other hypervisors? For Windows, you need(ed?) to install that strange service so that it reports the memory value Windows wants to see. I never understood this.
 
what's your preferred poison when it comes to monitoring tools?
Proxmox comes with an integrated and pre-prepared "outlet" for a metric server. --> https://pve.proxmox.com/pve-docs/pve-admin-guide.html#external_metric_server

The obvious approach is to run an InfluxDB and a Grafana instance (off-cluster if possible). There might be easier solutions than this; I run it just because dashboards like this --> https://grafana.com/grafana/dashboards/19119-proxmox-ve-cluster-flux/ are both beautiful and helpful.

Most other solutions (I am using Zabbix) require modifications on each node, either installing an "agent" or at least activating/configuring SNMP.


Probably all of this is overkill for just keeping an eye on limits like Temperature too high / Disk full / Sysload skyrocketing...

Have fun!
 
Do those two images correlate? VMware shows 1 GB used, the VM itself 5 GB used.
Yeah, they normally do; it's just that VMware tends to be very dynamic, often to its detriment. E.g. when rebooting a VM while VMware Tools is not running, our alerting kicks off far too often, and it's quite difficult to fine-tune. I just wish it knew the difference between "dead" and "not running because of a reboot, so it will be back soon".

But back to the original topic: thanks all for the feedback. I suppose, as someone coming from two environments where these alerts rarely happen to one where they happen all the time, I need to adjust the monitoring. Giving the VM more RAM won't help, as it will simply be used up - which I tried by going from 8 GB to 10 GB. I don't want to simply stop monitoring the memory usage of each VM through PVE, so I will probably change the Zabbix trigger to warn at 98% and go critical at 99% - we'll see how that goes, and hopefully that's one less alert to worry about.
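For a sense of scale on those thresholds (simple arithmetic, using the 10 GB figure from above):

```python
# Quick arithmetic: headroom left when an alert fires, for the 10 GB VM above.
total_gb = 10
for pct in (90, 98, 99):
    free_gb = total_gb * (100 - pct) / 100
    print(f"trigger at {pct}%: fires when less than {free_gb:.1f} GB is still free")
```

At 98-99% the trigger only fires when the guest is genuinely close to exhaustion, which matches the intent described in the earlier replies.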
 
