Hugepages and Multiple VMs

jstruebel

New Member
Apr 8, 2017
I'm trying to run multiple KVM VMs with hugepages enabled on Proxmox VE 4.4-12/e71b7a74. I've followed instructions on the web to set vm.nr_hugepages via sysctl in order to reserve the amount of memory that I plan to use with my VMs. However, once I start the first VM, the number of free hugepages gets set to 0, and I get the error below when I try to start my second VM.

TASK ERROR: start failed: hugepage allocation failed at /usr/share/perl5/PVE/QemuServer/Memory.pm line 532.

If I run sysctl to re-reserve my pool of hugepages, I'm then able to start the second VM, but again the number of free hugepages gets set back to 0.
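For reference, this is roughly what I'm doing to reserve and check the pool (the page count is only an example, assuming the default 2 MB hugepage size):

# reserve 8192 x 2 MB hugepages (16 GB) for the VMs -- size is only an example
sysctl vm.nr_hugepages=8192

# check what was actually reserved and how many pages are still free;
# HugePages_Free drops to 0 as soon as the first VM starts
grep -i huge /proc/meminfo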

After digging around the internet for several days trying to figure out what was going on, I think that the patch in http://pve.proxmox.com/pipermail/pve-devel/2016-June/021585.html is what introduced this behavior. Since this behavior only works with a single VM that uses hugepages, I suggest that it be modified to not set the number of free hugepages to 0, in order to allow multiple VMs with hugepages.

Thanks,
Jonathan
 
Sorry for the necropost, but I wanted to post a resolution for people who stumble upon this issue in the future (as it's still a problem in 6.1).

Quick Fix:
You can patch the problem by removing the calls to PVE::QemuServer::Memory::hugepages_allocate() and PVE::QemuServer::Memory::hugepages_pre_deallocate() from vm_start() in /usr/share/perl5/PVE/QemuServer.pm, in the code section that checks the VM's hugepages config directive.
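If it helps, you can locate the calls like this (line numbers shift between PVE versions, so grep for them rather than relying on a fixed line; you'll presumably also need to restart pvedaemon so the patched module gets reloaded):

# find the two hugepage calls inside vm_start()
grep -n 'hugepages_allocate\|hugepages_pre_deallocate' /usr/share/perl5/PVE/QemuServer.pm

# reload the patched module for GUI/API-triggered starts
systemctl restart pvedaemon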

Description of the Problem:
In the link OP posted, the original change was made so that memory utilization would show up appropriately once VMs are up and running. The problem with how Proxmox does it is that, by setting nr_hugepages lower than the configured value, the kernel will more than likely not be able to get those pages back later, as hugepages need to be allocated from physically contiguous memory (as per https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt), and I quote:

The success or failure of huge page allocation depends on the amount of
physically contiguous memory that is present in system at the time of the
allocation attempt. If the kernel is unable to allocate huge pages from
some nodes in a NUMA system, it will attempt to make up the difference by
allocating extra pages on other nodes with sufficient available contiguous
memory, if any.

The issue here is that as VMs do <things>, or any other applications run, they'll create page cache, allocate memory for themselves, or otherwise break up the contiguous sections of memory that were previously reserved for hugepages. If a VM were to go through a stop/start cycle, the likelihood of it being able to start without the 'hugepage allocation failed' message decreases the longer the host system spends in steady-state operation. Without the fix, the system would have to be restarted and all VMs started en masse at boot time to get the memory they are configured to use.
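You can watch this fragmentation happen on a running host. /proc/buddyinfo lists how many free blocks of each order are left per NUMA node; on x86_64 with 4 KB base pages, order 9 corresponds to one 2 MB hugepage, and those higher-order columns tend to dry up the longer the host runs:

# columns are free block counts per order (order 0 = 4 KB ... order 10 = 4 MB);
# once the order-9+ counts reach 0, new 2 MB hugepages can't be allocated on that node
cat /proc/buddyinfo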

Recommended Fix:
To correct the original problem patched in June 2016, PVE will need to be a little more clever: it should compare each NUMA node's free_hugepages value against that node's nr_hugepages value (plus compute the non-hugepage used/available RAM) to calculate the available memory, rather than fiddling with vm.nr_hugepages directly and impacting guest reliability.
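For example, those per-node values are already exposed in sysfs, so the check could read them directly (a sketch only, assuming 2 MB hugepages; the hugepages-2048kB directory name changes with the page size):

# per-NUMA-node hugepage accounting PVE could consult instead of rewriting vm.nr_hugepages
for node in /sys/devices/system/node/node*; do
    total=$(cat "$node/hugepages/hugepages-2048kB/nr_hugepages")
    free=$(cat "$node/hugepages/hugepages-2048kB/free_hugepages")
    echo "$(basename "$node"): total=$total free=$free"
done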
 
Oh, that is a much better fix! Thank you for sharing it; I'm implementing it across our cluster now!
 
We are getting this same error sometimes (on a multi-VM PVE 6.2 setup with NUMA) when starting VMs via the GUI, but a manual start usually succeeds if we run the command shown by qm showcmd <id>.

I don't quite follow the explanation above of why this happens, but if you can work around it by simply commenting out hugepages_allocate() and hugepages_pre_deallocate(), is there a reason not to do it, i.e. why are they there in the first place? Couldn't it simply attempt to start the VM and show an error if it fails, instead of trying to preallocate the pages in advance?

BTW: Sometimes trying to make a 'stopped'-type backup of a VM gives the same error. The real kicker: it succeeds if the VM is running! Is this a simple bug, or does PVE really need to allocate the full amount of hugepages during a backup even when the VM is not running?
 
