Hello all,
Over the past couple of days, I have been reading up on THP (transparent hugepages) with the aim of enabling it in our production environment as an optimization.
It turns out that it's highly customizable, but there is a dearth of information on best practices for setting it up.
After setting up our latest cluster (version 7.X), I see that THP seems to be enabled in `madvise` mode:
Code:
root@pve-node-12:~# cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
root@pve-node-12:~# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
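For what it's worth, these defaults look changeable at runtime for testing, and the mode can be persisted across reboots via the kernel command line. A rough sketch, assuming a standard GRUB-based PVE install (systemd-boot setups would need their own equivalent):

Code:
# runtime change for testing only -- does not survive a reboot
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# to persist, add transparent_hugepage=always (or madvise / never)
# to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
update-grub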
The system call madvise(2): MADV_HUGEPAGE documentation states:
Quote:
This feature is primarily aimed at applications that use large mappings of data and access large regions of that memory at a time (e.g., virtualization systems such as QEMU).

Great! So what's the difference between `always` and `madvise`? Percona's blog explains:

Quote:
HugePages are not for every application. For example, an application that wants to allocate only one byte of data would be better off using a 4k page rather than a huge one. That way, memory is more efficiently used. To prevent this, one option is to configure THP to "madvise". By doing this, HugePages are disabled system-wide but are available to applications that make a madvise call to allocate THP in the madvise memory region.

I'm assuming that PVE's virtualization takes appropriate advantage of `madvise`, but could not verify.
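The closest I've gotten to checking this myself is looking at the AnonHugePages counters while a VM is running; if QEMU's guest memory is THP-backed, they should be non-zero. A rough sketch (the VMID 100 is just a placeholder, and the pid file location is simply what I see on my node):

Code:
# system-wide: how much anonymous memory is currently THP-backed
grep AnonHugePages /proc/meminfo

# per-VM: sum AnonHugePages over the QEMU process's mappings
pid=$(cat /var/run/qemu-server/100.pid)
awk '/AnonHugePages/ {sum += $2} END {print sum+0 " kB"}' /proc/$pid/smaps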
Q: Would the PVE hypervisor or any other program that PVE runs benefit from using `always` instead?

There is also another setting that's available in the kernel shipped with PVE 7: `defer` and `defer+madvise`.

Quote:
For some reason, the default strategy to respond to THP fault fallbacks
is still just madvise, meaning stall if the program wants transparent
hugepages, but don’t trigger a background reclaim / compaction if THP
begins to fail allocations. This creates a snowball effect where we
still use the THP code paths, but we almost always fail once a system
has been active and busy for a while.
The option “defer” was created for interactive systems where THP can
still improve performance. If we have to fallback to a regular page due
to an allocation failure or anything else, we will trigger a background
reclaim and compaction so future THP attempts succeed and previous
attempts eventually have their smaller pages combined without stalling
running applications.
We still want madvise to stall applications that explicitly want THP,
so defer+madvise does make a ton of sense. Make it the default for
interactive systems, especially if the kernel maintainer left
transparent hugepages on “always”.
Reasoning and details in the original patch:
https://lwn.net/Articles/711248/

Q: Would this be a beneficial defrag method for PVE to use? If so, in what ways might it be beneficial?
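I don't have benchmark numbers, but I'm guessing the way to evaluate this would be to watch the THP fault and compaction counters on a busy node before and after switching the policy; a steadily growing thp_fault_fallback with little compaction activity looks like the snowball effect described above. A read-only sketch:

Code:
# current defrag policy
cat /sys/kernel/mm/transparent_hugepage/defrag

# THP allocation successes/failures and compaction activity
grep -E 'thp_fault_alloc|thp_fault_fallback|thp_collapse_alloc|compact_stall|compact_fail|compact_success' /proc/vmstat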
There also look to be additional THP tunables mentioned in the kernel documentation that should be considered:
Quote:
khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shutdown if it's set to "never".
khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1:
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
You can also control how many pages khugepaged should scan at each
pass:
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core):
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
and how many milliseconds to wait in khugepaged if there's an hugepage
allocation failure to throttle the next allocation attempt.
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
max_ptes_none specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page.
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
A higher value leads to use additional memory for programs.
A lower value leads to gain less thp performance. Value of
max_ptes_none can waste cpu time very little, you can
ignore it.
max_ptes_swap specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page.
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting fewer pages being collapsed into
THPs, and lower memory access performance.

These settings do look to have default values set, but there is very sparse documentation on benchmarking them or on intelligent ways to derive what the correct values should be.
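For anyone else digging into this, the current values can at least be dumped in one go; this is read-only and doesn't change anything:

Code:
# each khugepaged tunable with its current value
grep -H . /sys/kernel/mm/transparent_hugepage/khugepaged/*

# how much collapsing khugepaged has actually done so far
grep thp_collapse /proc/vmstat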
Q: Should any of these be re-evaluated on new installs? And if so, how?
Lastly, this is apparently not recommended for databases:
Quote:
Conversely, workloads with sparse memory access patterns (like databases) may perform poorly with THP. In such cases it may be preferable to disable THP.

We have 2 PVE hosts in our cluster that are dedicated to hosting just one DB VM each. Since these hosts will not be hosting any other VMs, would enabling THP on them have the same negative ramifications as described above? In other words...

Q: Should THP be turned off on those hosts that will only be running DB VMs?
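If the answer turns out to be yes, I assume it would just be the inverse of the earlier snippet, applied only on those two nodes:

Code:
# on the two DB-only hosts
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# and persist with transparent_hugepage=never on the kernel command line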
Even if I am not able to get direct answers to these questions, I hope this can at least spark a beneficial conversation.
Regards,
-- Andrew