PVE host swapping issue

Pavel Hruška · Mar 28, 2019

Hello all, I know this is a never-ending question, but I sometimes run into a swapping issue that I really do not understand.

I have two hosts in a cluster (not HA), each with 96GB of RAM. I keep enough RAM free on each to be able to migrate all VMs to one host or the other if needed, so the total provisioned RAM of all VMs in this cluster is no greater than roughly 90GB.

I noticed higher swap usage on node 2 yesterday: about 8GB of swap was in use. So I started wondering why this happened. What looks strange to me is that the host has ~40GB of free RAM while running (see the attached image) and yet it swaps quite a lot. I set vm.swappiness to 1 on both nodes when they were freshly installed. The first node is running with no swap used at all...

There is one VM with 40GB of RAM assigned that runs with high RAM usage (>95%) and had ballooning set to 32GB/40GB (I've just changed it to a fixed 40GB for now, with the balloon still enabled). I saw that swapping decreased a lot when I lowered the VM's RAM usage. It also seems strange to me that the host swaps when the guest is running out of RAM...

I am not even able to finish *swapoff -a* on the second node, because it hangs on the last ~10MB of swap used and never completes. And as you can see, swap usage has risen to ~350MB since this morning...

Any ideas? Can the ballooning device cause the host to swap when the guest is running low on RAM, even when the host has enough free RAM available? Or why did the kernel swap out 8GB of RAM when there was ~40GB of RAM available?
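
One way to narrow this down is to check which processes actually own the swapped-out pages, for example whether it is the kvm process of the 40GB VM. The sketch below reads the VmSwap field from /proc/<pid>/status; the function name swap_by_process and its optional directory argument are made up for illustration and are not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# List processes by swapped-out memory, largest first.
# The optional argument only exists so the function can be pointed at a
# test directory; on a real host the default /proc is what you want.
swap_by_process() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # VmSwap is reported in kB; kernel threads have no VmSwap line.
        awk '/^Name:/ {name=$2} /^VmSwap:/ {swap=$2}
             END {if (swap > 0) printf "%d kB\t%s\n", swap, name}' "$status" 2>/dev/null
    done | sort -rn
}

# Show the ten biggest swap users on this machine.
swap_by_process | head -n 10
```

If one kvm process accounts for most of the swap, the VMID can be found from its command line in /proc/<pid>/cmdline.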

Andrew Hart · Mar 28, 2019

It is my opinion that Linux will use filesystem cache and buffers in preference to rarely used memory, i.e. if you do a lot of filesystem work then some other things will get swapped out. It could be that your backup is pushing things into swap.

As for "because it kinda hangs on last ~10MB of swap" - no idea what that is.

Pavel Hruška · Mar 28, 2019

I am not sure whether swapping can occur under memory pressure on a single NUMA node, because each of those cluster nodes (hosts) in fact has two CPUs, i.e. two NUMA nodes with 48GB of RAM each...
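
To check whether one NUMA node is short on memory while the host as a whole still looks fine, the per-node counters under /sys/devices/system/node can be compared. This is a minimal sketch; the function name numa_free and its optional directory argument are hypothetical.

```shell
#!/bin/sh
# Print free and total memory per NUMA node in MiB.
# Lines in nodeN/meminfo look like: "Node 0 MemFree:  1048576 kB".
numa_free() {
    node_root="${1:-/sys/devices/system/node}"
    for f in "$node_root"/node[0-9]*/meminfo; do
        [ -r "$f" ] || continue
        awk '$3 == "MemTotal:" {node=$2; total=$4}
             $3 == "MemFree:"  {free=$4}
             END {printf "node%s: %d MiB free of %d MiB\n", node, free / 1024, total / 1024}' "$f"
    done
}

numa_free
```

A large imbalance between the nodes would at least be consistent with per-node pressure, although it does not prove that this is what triggers the swapping.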

To Andrew: Well I am not able to turn off swap with *swapoff -a*. It never releases those few remaining megabytes and the command never completes.

e100 · Mar 28, 2019

I have wrestled with this problem for years and never found a great solution. Most of my nodes are NUMA too.

Changing swappiness never prevented it.
Any process that is idle will end up having its RAM swapped to disk if the kernel thinks that the RAM would be better used for buffer/cache.
In my case I mostly noticed VMs that were idle all weekend performing poorly on Monday because all their RAM was swapped to disk during weekend backups.

I finally decided to use zRam and this helped quite a bit.
zRam is preferred over disk, so any swapping went to zRam, but recently zRam has caused numerous crashes on my systems.
So now I am running without any swap configured.

Pavel Hruška · Mar 28, 2019

Thank you for your reply, e100!

For some reason I am a bit scared to run without swap; I just don't want the kernel to kill my VM because it thinks the VM is a good candidate for freeing up memory when needed. But I will give it a try... What is your experience with such a setup? How do you provision RAM? Do you leave some (or a significant amount of) RAM free to avoid memory pressure? Do you have your VMs configured as NUMA aware?

e100 · Mar 28, 2019

We currently have 20 nodes in production with no swap.
The nodes range from 32GB RAM to 256GB RAM with the majority of them having 128GB RAM.

Most of the VMs are configured NUMA aware. I usually do not set it if the VM uses little RAM and very few cores.

I cannot recall ever having a VM get killed by OOM.
I try not to allocate any more than 80% of RAM to VMs, sometimes less depending on the particular node's configuration.
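
As a rough illustration of the 80% rule, the budget for a given host can be derived from MemTotal in /proc/meminfo. The function name vm_ram_budget_mib and its arguments are hypothetical; the percentage is simply the figure mentioned above.

```shell
#!/bin/sh
# Print the given percentage of the host's total RAM in MiB.
# Arguments: percentage (default 80), meminfo file (default /proc/meminfo).
vm_ram_budget_mib() {
    percent="${1:-80}"
    meminfo="${2:-/proc/meminfo}"
    # MemTotal is reported in kB.
    awk -v pct="$percent" '/^MemTotal:/ {printf "%d\n", $2 / 1024 * pct / 100}' "$meminfo"
}

if [ -r /proc/meminfo ]; then
    echo "RAM budget for VMs: $(vm_ram_budget_mib 80) MiB"
fi
```

The sum of the memory configured for all VMs on the node can then be compared against this number.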

I also do not run swap in most of my VMs; each VM is given the amount of RAM needed to perform its tasks.
This is where I sometimes do see some OOMs. We either fix the process using too much RAM or allocate more to prevent future occurrences.
Memory ballooning has only ever caused us issues, so we do not use it.

We set zfs_arc_max when using ZFS and make sure to leave plenty of RAM for ZFS.
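
For reference, zfs_arc_max is given in bytes. The sketch below converts a GiB figure and prints the matching module option; the 8 GiB value and the helper name gib_to_bytes are assumptions for illustration, not values from this thread.

```shell
#!/bin/sh
# Convert a GiB figure to the byte value zfs_arc_max expects.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# 8 GiB is an illustrative value, not a recommendation.
arc_max=$(gib_to_bytes 8)
echo "options zfs zfs_arc_max=$arc_max"

# To apply (as root; the first command overwrites any existing zfs.conf):
#   echo "options zfs zfs_arc_max=$arc_max" > /etc/modprobe.d/zfs.conf   # persistent
#   echo "$arc_max" > /sys/module/zfs/parameters/zfs_arc_max            # running system
#   update-initramfs -u -k all                                          # when root is on ZFS
```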

On NUMA systems we set:
kernel/mm/ksm/merge_across_nodes=0

This has prevented lots of memory related issues, especially memory allocation failures caused by memory fragmentation during backups:
vm.min_free_kbytes = 4194304

These have also helped with RAM usage during backups:
vm.dirty_ratio = 3
vm.dirty_background_ratio = 1

I encourage you to read documentation about the settings mentioned and decide for yourself if changing them is appropriate and what values to use.
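
A sketch of how the values quoted above could be made persistent and applied. The file name 90-pve-memory.conf and the function name render_sysctl are hypothetical. Note that 4194304 kB is 4 GiB, which is a lot on a small host, and that the kernel only accepts a change to merge_across_nodes while no pages are KSM-merged.

```shell
#!/bin/sh
# Print the vm.* values from the post above as a sysctl.d fragment.
render_sysctl() {
    cat <<'EOF'
vm.min_free_kbytes = 4194304
vm.dirty_ratio = 3
vm.dirty_background_ratio = 1
EOF
}

render_sysctl

# To apply (as root):
#   render_sysctl > /etc/sysctl.d/90-pve-memory.conf   # hypothetical file name
#   sysctl --system
#
# KSM setting on NUMA hosts (pages must be unmerged before the change):
#   echo 2 > /sys/kernel/mm/ksm/run                  # stop KSM and unmerge all pages
#   echo 0 > /sys/kernel/mm/ksm/merge_across_nodes
#   echo 1 > /sys/kernel/mm/ksm/run                  # start KSM again
```

If a service such as ksmtuned manages /sys/kernel/mm/ksm/run on the host, it may change the run state on its own, so the KSM part is best treated as an assumption to verify locally.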

Andrew Hart · Mar 28, 2019

"In my case I mostly noticed VMs that were idle all weekend performing poorly on Monday because all their RAM was swapped to disk during weekend backups."

THIS! And if you get this all you can do is turn swap off completely or keep the VM artificially busy.

Pavel Hruška · Mar 28, 2019

Hello e100, thank you for your very informative reply.

Do I get it right that your VMs are Linux machines (as you talk about OOMs in VMs)? I have Windows VMs, but I think that does not matter for this topic.

I will use your examples as a starting point to dig deeper into virtual memory (vm.*) and its features and settings, thank you.

As for now I'm gonna try to run hosts with swap disabled.

e100 · Mar 28, 2019

Yes, we have very few Windows servers. We do leave swap enabled for Windows but usually set it to a specific size instead of allowing Windows to manage it.

klcstaysbusy · Jul 14, 2021

I know this is an old thread, but I've been dealing with this issue too for a few years.

In my case, I have about a dozen VMs actively running in production, but the elephant in the room among them is a Windows VM that is allocated 64GB of RAM (PVE has 512GB). Most other guests are Debian-based and use <16GB of RAM. For this particular Windows VM, ballooning is disabled, and it is the only guest with NUMA enabled.

From watching Webmin (installed on the PVE host): after backups of all guest systems, including this Windows guest, complete, swap usage instantly tips from less than 1% (before startup of this NUMA-enabled guest) to over 99% (after startup of the NUMA-enabled guest).

I will heed the recommendations by @e100, but up until now, crude as it may seem, the best way to avoid issues caused by swap exhaustion (left unattended, I have seen interruption and corruption of guests' primary virtual disks) has been to reboot the host (after shutting down the guest systems).

After the reboot, on initial system startup (host or guests), there are no issues, nor are there issues between shutdown and startup of this NUMA-enabled guest. However, from my observations, the recipe that triggers the issue is post-backup (backup type = shutdown), on startup of the NUMA host.

e100 · Jul 19, 2021

klcstaysbusy said:
PVE has 512GB

With that much RAM, just disable swap on the host.

Much has changed since I last posted on this thread.
We no longer use zram on hosts or in guests.
On any host with more than 64GB of RAM we completely disable swap.
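
A minimal sketch of disabling swap so that it also stays off after a reboot, assuming swap is configured through /etc/fstab. The helper name comment_out_swap is hypothetical; it only prints the modified file so the result can be reviewed before anything is overwritten.

```shell
#!/bin/sh
# Print an fstab-style file with active swap entries commented out.
# Does not modify the file itself.
comment_out_swap() {
    fstab="${1:-/etc/fstab}"
    # Field 3 is the filesystem type; skip lines that are already comments.
    awk '!/^[ \t]*#/ && $3 == "swap" {print "#" $0; next} {print}' "$fstab"
}

# Preview the result for this host.
if [ -r /etc/fstab ]; then
    comment_out_swap /etc/fstab
fi

# To apply (as root):
#   swapoff -a
#   cp /etc/fstab /etc/fstab.bak
#   comment_out_swap /etc/fstab.bak > /etc/fstab
```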

I have never seen a situation where swap on the Proxmox host was beneficial, but I have seen countless times where it caused all sorts of problems.
On the host we still tune the sysctl settings mentioned in my post above https://forum.proxmox.com/threads/pve-host-swapping-issue.52854/#post-244515

glauco.lins · Oct 13, 2021

Old thread, but you can keep VM memory from being swapped by enabling hugepages in the VM config.
This way, VM memory is never swapped out.

Check whether your processor supports hugepages first:

Code:

grep -oE "(pse|pdpe1g)" /proc/cpuinfo | sort | uniq

# pse      2MB hugepages
# pdpe1g   1GB Hugepages

Enable hugepages in the VM config (NUMA must be enabled in the VM processor options):

Code:

qm set $vmid -numa 1
# Use only ONE of the following two lines, depending on the page size you want
qm set $vmid -hugepages 2        #2MB hugepages (VM memory must be multiple of 2MB)
qm set $vmid -hugepages 1024     #1GB hugepages (VM memory must be multiple of 1024MB)

That done, pre-allocate hugepages at boot, so you can shut down and start VMs at any time.
If you do not pre-allocate, you may run into errors like "could not allocate enough hugepages" when you start or reboot a VM using QEMU.

Pre-allocated pages won't be available to the host and will only be used by VMs with the hugepages option.

Code:

nano /etc/default/grub

# Edit the line to include hugepages options
# Replace NUMBER-OF-PAGES with the amount of memory needed
# default_hugepagesz=1G and hugepagesz=1G are only needed if you intend to use 1GB hugepages
#
# GRUB_CMDLINE_LINUX_DEFAULT="[...] default_hugepagesz=1G hugepagesz=1G hugepages=NUMBER-OF-PAGES transparent_hugepage=always"
#
# save file

update-grub

reboot
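
The NUMBER-OF-PAGES value above follows from the total memory of the VMs that will use hugepages divided by the page size. A small sketch of that arithmetic; the function name hugepages_needed and the 64 GiB example are illustrative assumptions.

```shell
#!/bin/sh
# Number of hugepages needed to back a given amount of VM memory.
# Arguments: total VM memory in MiB, hugepage size in MiB (2 or 1024).
hugepages_needed() {
    mem_mib="$1"
    page_mib="$2"
    # Round up so that a partial page still gets a full hugepage.
    echo $(( (mem_mib + page_mib - 1) / page_mib ))
}

hugepages_needed 65536 1024   # 64 GiB of VM memory with 1GB pages
hugepages_needed 65536 2      # the same memory with 2MB pages
```

For 64 GiB this gives 64 pages of 1GB or 32768 pages of 2MB.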

With the command below, you can check your hugepage allocation and utilization:

Code:

grep Huge /proc/meminfo
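
The same counters can be condensed into one line. This is a sketch only; the function name hugepage_usage and its optional file argument are hypothetical.

```shell
#!/bin/sh
# Summarize hugepage usage from a meminfo-style file.
hugepage_usage() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^HugePages_Total:/ {total=$2}
         /^HugePages_Free:/  {free=$2}
         /^Hugepagesize:/    {size_kb=$2}
         END {printf "%d of %d hugepages in use (%d MiB each)\n", total - free, total, size_kb / 1024}' "$meminfo"
}

if [ -r /proc/meminfo ]; then
    hugepage_usage
fi
```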
