Hugepages and Multiple VMs

jstruebel

New Member
Apr 8, 2017
I'm trying to run multiple KVM VMs with hugepages enabled on Proxmox VE 4.4-12/e71b7a74. I've followed instructions on the web to set vm.nr_hugepages via sysctl in order to reserve the amount of memory that I plan to use with my VMs. However, once I start the first VM, the number of free hugepages gets set to 0, and I get the error below when I try to start my second VM.

TASK ERROR: start failed: hugepage allocation failed at /usr/share/perl5/PVE/QemuServer/Memory.pm line 532.

If I run sysctl to re-reserve my pool of hugepages, I'm then able to start the second VM, but again the number of free hugepages gets set back to 0.
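For reference, this is roughly what I'm doing to reserve and check the pool (the page count is only an example, assuming the default 2 MB hugepage size):

# reserve 8192 x 2 MB hugepages (16 GB) for the VMs -- size is only an example
sysctl vm.nr_hugepages=8192

# check what was actually reserved and how many pages are still free;
# HugePages_Free drops to 0 as soon as the first VM starts
grep -i huge /proc/meminfo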

After digging around the internet for several days trying to figure out what was going on, I think that the patch in http://pve.proxmox.com/pipermail/pve-devel/2016-June/021585.html is what introduced this behavior. Since this behavior only works with a single VM that uses hugepages, I suggest that it be modified to not set the number of free hugepages to 0, in order to allow multiple VMs with hugepages.

Thanks,
Jonathan
 
Sorry for the necropost, but I wanted to post a resolution for people who stumble upon this issue in the future (as it's still a problem in 6.1).

Quick Fix:
You can patch the problem by removing the calls to PVE::QemuServer::Memory::hugepages_allocate() and PVE::QemuServer::Memory::hugepages_pre_deallocate() from vm_start() in /usr/share/perl5/PVE/QemuServer.pm, in the code section that checks the VM's hugepages config directive.
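If it helps, you can locate the calls like this (line numbers shift between PVE versions, so grep for them rather than relying on a fixed line; you'll presumably also need to restart pvedaemon so the patched module gets reloaded):

# find the two hugepage calls inside vm_start()
grep -n 'hugepages_allocate\|hugepages_pre_deallocate' /usr/share/perl5/PVE/QemuServer.pm

# reload the patched module for GUI/API-triggered starts
systemctl restart pvedaemon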

Description of the Problem:
In the link OP posted, the original change was made so that memory utilization would show up appropriately once VMs are up and running. The problem with how Proxmox does it is that, by setting nr_hugepages lower than the configured value, the kernel will more than likely not be able to get those pages back later, as hugepages need to be allocated from physically contiguous memory (as per https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt), and I quote:

The success or failure of huge page allocation depends on the amount of
physically contiguous memory that is present in system at the time of the
allocation attempt. If the kernel is unable to allocate huge pages from
some nodes in a NUMA system, it will attempt to make up the difference by
allocating extra pages on other nodes with sufficient available contiguous
memory, if any.

The issue here is that as VMs do <things>, or any other applications run, they'll create page cache, allocate memory for themselves, or otherwise break up the contiguous sections of memory that were previously reserved for hugepages. If a VM were to go through a stop/start cycle, the likelihood of it being able to start without the 'hugepage allocation failed' message decreases the longer the host system spends in steady-state operation. Without the fix, the system would have to be restarted and all VMs started en masse at boot time to get the memory they are configured to use.
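You can watch this fragmentation happen on a running host. /proc/buddyinfo lists how many free blocks of each order are left per NUMA node; on x86_64 with 4 KB base pages, order 9 corresponds to one 2 MB hugepage, and those higher-order columns tend to dry up the longer the host runs:

# columns are free block counts per order (order 0 = 4 KB ... order 10 = 4 MB);
# once the order-9+ counts reach 0, new 2 MB hugepages can't be allocated on that node
cat /proc/buddyinfo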

Recommended Fix:
To correct the original problem patched in June 2016, PVE will need to be a little more clever: it should compare each NUMA node's free_hugepages value against that node's nr_hugepages value (plus compute the non-hugepage used/available RAM) to calculate the available memory, rather than fiddling with vm.nr_hugepages directly and impacting guest reliability.
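For example, those per-node values are already exposed in sysfs, so the check could read them directly (a sketch only, assuming 2 MB hugepages; the hugepages-2048kB directory name changes with the page size):

# per-NUMA-node hugepage accounting PVE could consult instead of rewriting vm.nr_hugepages
for node in /sys/devices/system/node/node*; do
    total=$(cat "$node/hugepages/hugepages-2048kB/nr_hugepages")
    free=$(cat "$node/hugepages/hugepages-2048kB/free_hugepages")
    echo "$(basename "$node"): total=$total free=$free"
done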
 
Oh, that is a much better fix! Thank you for sharing it; I'm implementing it across our cluster now!
 
We are getting this same error sometimes (on a multi-VM PVE 6.2 setup with NUMA) when starting VMs via the GUI, but a manual start usually succeeds if we run the command shown by qm showcmd <id>.

I don't quite follow the explanation above of why this happens, but if you can work around it by simply commenting out hugepages_allocate() and hugepages_pre_deallocate(), is there a reason not to do it, i.e. why are they there in the first place? Couldn't it simply attempt to start the VM and show an error if it fails, instead of trying to preallocate the pages in advance?

BTW: Sometimes trying to make a 'stopped'-type backup of a VM gives the same error. The real kicker: it succeeds if the VM is running! Is this a simple bug, or does PVE really need to allocate the full amount of hugepages during a backup even when the VM is not running?
 
