How should Transparent Hugepages be configured?

acziryak

New Member
Jul 26, 2023
Hello all,

Over the past couple of days, I have been reading up on THP in order to implement it in our production environment as a method of optimization.

It turns out that it's highly customizable, but with a dearth of information on best practices for setup.

After setting up our latest cluster (version 7.x), I see that THP seems to be enabled in `madvise` mode:

Code:
root@pve-node-12:~# cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
root@pve-node-12:~# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

The madvise(2) system call documentation states about MADV_HUGEPAGE:
This feature is primarily aimed at applications that use
large mappings of data and access large regions of that
memory at a time (e.g., virtualization systems such as
QEMU).
Great! So what's the difference between always and madvise? Percona's blog explains:
HugePages are not for every application. For example, an application that wants to allocate only one byte of data would be better off using a 4k page rather than a huge one. That way, memory is more efficiently used. To prevent this, one option is to configure THP to “madvise”. By doing this, HugePages are disabled system-wide but are available to applications that make a madvise call to allocate THP in the madvise memory region.
I'm assuming that PVE's QEMU/KVM virtualization takes appropriate advantage of madvise, but I could not verify this.
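
For what it's worth, one way to check this (just a sketch, assuming a VM with VMID 100 as an example; PVE keeps the QEMU PID in /var/run/qemu-server/<vmid>.pid) would be to look at AnonHugePages for the QEMU process:

Code:
# sketch: see whether a VM's QEMU process is actually backed by THPs (VMID 100 is an example)
QEMU_PID=$(cat /var/run/qemu-server/100.pid)
grep AnonHugePages /proc/"$QEMU_PID"/smaps_rollup   # non-zero => guest RAM is THP-backed
grep AnonHugePages /proc/meminfo                    # host-wide total for comparison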

Q: Would the PVE hypervisor or any other program that PVE runs benefit from using always instead?

The kernel shipped with PVE 7 also offers two additional values for the defrag setting: defer and defer+madvise.
For some reason, the default strategy to respond to THP fault fallbacks
is still just madvise, meaning stall if the program wants transparent
hugepages, but don’t trigger a background reclaim / compaction if THP
begins to fail allocations. This creates a snowball effect where we
still use the THP code paths, but we almost always fail once a system
has been active and busy for a while.

The option “defer” was created for interactive systems where THP can
still improve performance. If we have to fallback to a regular page due
to an allocation failure or anything else, we will trigger a background
reclaim and compaction so future THP attempts succeed and previous
attempts eventually have their smaller pages combined without stalling
running applications.

We still want madvise to stall applications that explicitly want THP,
so defer+madvise does make a ton of sense. Make it the default for
interactive systems, especially if the kernel maintainer left
transparent hugepages on “always”.

Reasoning and details in the original patch:

https://lwn.net/Articles/711248/
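
If anyone wants to experiment with this, switching at runtime looks to be just a sysfs write; to persist it across reboots I'm assuming something like the sysfsutils package is the cleanest route on Debian/PVE:

Code:
# sketch: try defer+madvise at runtime (takes effect immediately, not persistent)
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
cat /sys/kernel/mm/transparent_hugepage/defrag    # should now show [defer+madvise]

# assumed persistence route: the sysfsutils package applies /etc/sysfs.conf at boot
apt install sysfsutils
echo 'kernel/mm/transparent_hugepage/defrag = defer+madvise' >> /etc/sysfs.conf
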
Q: Would this be a beneficial defrag method for PVE to use? If so, in what ways might it be beneficial?

There also appear to be additional THP-related tunables mentioned in the kernel documentation that should be considered:
khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during page faults, it
should be worthwhile to invoke defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1:

echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass:

/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core):

/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure, to throttle the next allocation attempt:

/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

max_ptes_none specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page.

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value means programs may end up using additional memory;
a lower value means gaining less THP performance. The effect of
max_ptes_none on CPU time is negligible and can be ignored.

max_ptes_swap specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page.

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.
These settings do have default values, but documentation on benchmarking them, or on intelligent ways to derive what the correct values should be, is very sparse.
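
In the absence of better guidance, my current plan is to leave the defaults alone and just watch the THP counters while the hosts are under normal load; a rising fallback/failure count would be my signal to revisit the defrag strategy or the khugepaged tunables (that's my own reading of the counters, not official guidance). Something like:

Code:
# sketch: dump the current khugepaged defaults and watch the THP counters
grep . /sys/kernel/mm/transparent_hugepage/khugepaged/*
grep thp_ /proc/vmstat        # compare thp_fault_alloc vs thp_fault_fallback,
                              # and thp_collapse_alloc vs thp_collapse_alloc_failed
# sample twice to see how quickly fallbacks accumulate while the host is busy
grep thp_fault /proc/vmstat; sleep 60; grep thp_fault /proc/vmstat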

Q: Should any of these be re-evaluated on new installs? And if so, how?

Lastly, THP is apparently not recommended for databases:
Conversely, workloads with sparse memory access patterns (like databases) may perform poorly with THP. In such cases it may be preferable to disable THP.
We have 2 PVE hosts in our cluster that are each dedicated to hosting just one DB VM. Since these hosts will not be hosting any other VMs, would enabling THP on them have the same negative ramifications as described above? In other words...

Q: should THP be turned off on those hosts that will only be running DB VMs?
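
If the answer turns out to be yes, my plan for those two hosts (assuming they boot via GRUB; systemd-boot installs would presumably use /etc/kernel/cmdline and proxmox-boot-tool refresh instead) would be the kernel command line:

Code:
# sketch: disable THP at boot on the DB-only hosts
# add transparent_hugepage=never to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
# (hosts booting via systemd-boot take it in /etc/kernel/cmdline instead)
update-grub        # or: proxmox-boot-tool refresh
# after the reboot, verify:
cat /sys/kernel/mm/transparent_hugepage/enabled    # expect: always madvise [never]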

Even if I don't get direct answers to these questions, I hope this sparks a beneficial conversation.

Regards,
-- Andrew
 
By default some form of THP is used. I know because I felt the downsides a while back: I was getting OOMs forcefully shutting down VMs (the host had 32 GiB of RAM, with several GiB free when this occurred).

Setting THP to never in the kernel went a long way toward stabilising the issue. This post also explains things better:

https://forum.proxmox.com/threads/hugepages-or-anon-hugepages.117580/post-508888

I wouldn't touch it unless you have hundreds of GiB of RAM with tens of GiB not utilised.
 
I prefer 'madvise' when there's a lot of RAM, 'never' when RAM is low; it's better than OOM, lol
 
