PVE crashes running VMs because it is not checking free host memory before starting VMs

Larion

Mar 6, 2021
I'm coming from Hyper-V and am testing Proxmox for a production environment.
I have found a big problem with how PVE manages host/guest memory, which leads to VMs crashing:


Tested Scenario 1:
Single Host with 128gb RAM

Host with 2 Win10 VMs with 96gb memory each in stopped state

Starting VM1 - no issues
Host memory 100gb allocated, ~20gb free
Starting VM2 -> NO CHECK BY PVE HOW MUCH HOST MEMORY IS LEFT
VM2 starts !!!

VM1 crashes !!!

Is this expected behaviour ?
Why is VM2 allowed to start ?


----------------------------------------------------

Tested Scenario 2:
Proxmox cluster with 3 nodes (128gb host RAM each), hyper-converged with Ceph
HA group with all nodes set up (default settings; no node priority, max restart&relocate=1)
All VMs are Win10

Node1:
VM1 with 96gb RAM in started state
~20gb free Node1 memory

Node2:
VM2 with 96gb RAM in started state
~20gb free Node2 memory

Node3:
VM3 with 96gb RAM in started state
~20gb free Node3 memory


> cutting power of Node1 to test migration behaviour
Result:
VM1 migrated to Node2
VM1 gets started on Node2
already running VM2 on Node2 crashes because its memory is allocated to VM1 !!!

VM2 gets restarted (because of HA group setting)
VM1 crashes!
VM1 gets restarted
VM2 crashes
VM2 gets restarted
VM1 crashes again
and so on in a loop!


Is this expected behaviour ?
Why is the migrated VM1 allowed to start ?
How can I stop that from happening ?

----------------------------------------------------

This is a gigantic problem...
Why is there no check of free host memory before VMs are allowed to start ???
Other hypervisors do this!!!
Is there a setting I have overlooked ?
Is this a priority on the roadmap ? (which it fracking should be...)
 
Single Host with 128gb RAM

Host with 2 Win10 VMs with 96gb memory each in stopped state

Memory over-allocation can never work, though, so why configure that setup in the first place if you plan to run both VMs at the same time?

Is this expected behaviour ?
Why is VM2 allowed to start ?
You cannot tell for sure whether there's enough memory left: there may be barely enough at check time but no longer at the actual start, and the reverse can be true too. Some basic checks can still be added but, as said, they won't be 100% accurate. They do not exist yet because this has not really been a problem for our users.

Is this expected behaviour ?

Your test is a bit meaningless: you configured HA for a case that can never work.

How can I stop that from happening ?
Design your setup so that there are enough resources for the scenarios you want to be able to run successfully.

If you want HA and fail-over to work, you need enough memory to allow both VMs to run at the same time.
In your case the actual fix is getting >> 192 GiB of memory.
 
This is a test scenario for testing worst-case edge behaviour.
I thought I made that clear.

The use case is that you have dozens or even hundreds of VMs running in an HA setup.
In case of a node failure you want migration to work without putting running VMs in danger (guest databases are thankful for that).
You cannot always make sure to keep an Excel sheet of your VM resources up to date.
That's manual labor.
IT guys hate manual labor!
We use software to automate things or prevent human error.

Philosophy aside, I'm trying to find a solution.
I tried:
nano /etc/sysctl.conf
vm.overcommit_memory = 0
which had no effect.

Does a pve.conf file exist where I can set things right?


It should be easy for you to add a hardware resource check prior to VM start attempts.
(read the VM's qemu.conf file for the memory size; check free -m on the host; if free -m on the host is smaller than the VM's memory, don't start the VM ...)
There could be an on/off toggle for this check, so that users who want to overcommit memory can do so.
A user-defined value for the amount of memory to be reserved for the host could be implemented too.
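
The check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how PVE works: it reads the `memory:` key (in MiB) as found in a VM config under /etc/pve/qemu-server/, uses MemAvailable from /proc/meminfo instead of the "free" column of free -m, and assumes a 512 MiB default when the key is missing. The names `host_reserve_mib` and `check_enabled` are hypothetical stand-ins for the suggested reserve value and on/off toggle. Such a check is best-effort only, since memory use can change between the check and the actual start.

```python
import re


def mem_available_mib(meminfo_text: str) -> int:
    """Return the MemAvailable value of /proc/meminfo-style text, in MiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in meminfo text")
    return int(match.group(1)) // 1024


def vm_memory_mib(vm_config_text: str, default_mib: int = 512) -> int:
    """Return the `memory:` value (MiB) from the top section of a VM config."""
    # Snapshot sections start with "[name]"; only the part before them is current.
    current = re.split(r"^\[", vm_config_text, maxsplit=1, flags=re.MULTILINE)[0]
    # Assumption: the value is a plain integer, optionally prefixed by "current=".
    match = re.search(r"^memory:\s*(?:current=)?(\d+)", current, re.MULTILINE)
    return int(match.group(1)) if match else default_mib


def may_start(vm_mib: int, available_mib: int,
              host_reserve_mib: int = 2048, check_enabled: bool = True) -> bool:
    """Allow the start only if the VM plus the host reserve fits into free memory."""
    if not check_enabled:
        return True  # the user explicitly wants to overcommit
    return vm_mib + host_reserve_mib <= available_mib
```

With the numbers from scenario 1 (a 96gb VM and ~20gb available), `may_start(98304, 20480)` refuses the start, while `check_enabled=False` lets it through.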
 
This is a test scenario for testing worst case edge behaviour.
I thought i made that clear.
That is clear, but it's a bogus test. Three VMs on three nodes, where every node is sized so that it can host only that single VM, is not a worst-case edge test; it's an artificial test that matches no real scenario.

The use case is you have dozens or even hundreds of VMs running in a HA-setup.
You still need to set up the nodes in such a way that they can cope with taking over the load from a failed node; otherwise you did not set up HA.

The HA services from a failed node are already spread across the other nodes; they won't all be recovered on the same node.

Does a pve.conf file exist, where I can set things right
There's nothing to "set right". If you want two VMs to be able to run, then provide enough memory so that one can actually fail over.

It should be easy for you to add a hardware ressource check prior to VM start attempts.
(read VM qemu.conf file for memory size; check free -m on Host; if free -m on Host smaller than qemu dont start vm ....)
There could be a on/off toggle switch for this check, so that users who want to overcommit memory can do so....
A user defined value for the amount of memory to be reserved for the host could be implemented too.
I already agreed that checks can be added, so not really sure what you're arguing for...
Some basic check can be still added, as said they won't be 100% accurate...
If you have concrete patches already then check out the developer documentation about how to send them:
https://pve.proxmox.com/wiki/Developer_Documentation
 
I was hoping for this problem's severity to be acknowledged.
Your answers aren't exactly diplomatic, considering I am a potential paying customer.
And you apparently expect this potential new customer to be familiar with the Proxmox source code and supply a patch.
Very strange... since you don't even give a hint of where to look and what to patch.
Also, you could have said: "We will put this on the roadmap for version 6.X.X" or something.

As I said, I am considering Proxmox for a production environment and was rather shocked that such killer issues still exist in the latest version v6.3.6.
Yes, the problem's source is human error or bad VM resource planning or whatever, but other hypervisors prevent data loss in such cases.
This can happen more easily than you might think if, for example, several admins work on the cluster and one who is not as familiar with the system is not careful enough. Or, for example, two admins migrate VMs independently of each other without planning every step beforehand...

Imagine a VM with a database getting crashed by a starting or migrating Win10-VM (allocates all memory right from the start)...
Not much fun cleaning that up, if even possible.
 
I was hoping for this problems severity to be acknowledged.
I already stated that this check can be added, but noted that it will only ever be best-effort: there's no global memory allocation lock, so there will always be a race. So, again:
I already agreed that checks can be added, so not really sure what you're arguing for...

Your answers arent exactly diplomatic, considering I am a potential paying customer.
FYI, the customer portal is over there: https://my.proxmox.com/en if you're eligible for support you can open a ticket there, I heard that they are very good and diplomatic at answering :)

And you apparently expect this potential new customer to be familiar with the proxmox source code and supply a patch.
Very strange...since you don't even give a hint of where to look and what to patch.
Where exactly did I expect that? I wrote, and I quote, "If you have concrete patches already", and pointed to the documentation about how to submit them.

This can happen easier than you might think, if for expample, several admins work on the cluster and one not as familiar with the system is not careful enough. Or for example, two admins migrate VMs independently from each other without planning every step beforehand...
That means you need some change management, as otherwise you'll still run into issues independent of this check. If admins work uncoordinated on shared resources, then no matter how many checks and handrails there are, it will result in problems sooner or later.
The same goes for resource planning: a good plan for the resources and failure cases that a setup is expected to handle is one of the most important things to do. If HA or migration is expected, then add enough memory/CPU resources for that to be at least theoretically possible. For that, it's important to test as closely as possible to the real world.

As said, the check can and will be added, but it's fighting symptoms. If you want that case to work, you will still need more memory in the end, and then your test wouldn't have failed in the first place.
 
It should be easy for you to add a hardware ressource check prior to VM start attempts.
(read VM qemu.conf file for memory size; check free -m on Host; if free -m on Host smaller than qemu dont start vm ....)

To clarify, this is a band-aid at best. In setups with both VMs and CTs, with VMs set up so that they can use more or less memory over time, and with dynamic memory workloads in general, a check at start is rather useless. It would only fix your artificial test case, not the ones more likely to happen in real systems, i.e., where no VM is started but the memory workload shifts into overuse and the OOM (out of memory) killer is invoked.

So, a better, more general solution that would actually fix some behaviour could be using the OOM score adjustments. I.e., one could decrease the score value for long-running VMs, so that non-VM processes or freshly started VMs are more likely to be killed if remaining memory becomes so low that the kernel needs to invoke the OOM killer in the first place. That holds even if it means that some PVE management process would be killed; while not ideal, that would be better than killing a potential production VM.
In the next major release we may have the building blocks ready for user-space OOM killer management, which could make this a bit easier and less hacky.
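
A rough sketch of the OOM score idea, not an existing PVE feature: the kernel exposes /proc/<pid>/oom_score_adj with values from -1000 to 1000, where lower values make a process less likely to be picked by the OOM killer. The one-hour threshold and the two adjustment values below are hypothetical, and finding the QEMU PID of each VM is left out. Lowering the value normally requires root privileges.

```python
from pathlib import Path

OOM_ADJ_MIN, OOM_ADJ_MAX = -1000, 1000  # kernel limits for oom_score_adj


def oom_adj_for_uptime(uptime_s: float, protect_after_s: float = 3600.0,
                       protected_adj: int = -500, fresh_adj: int = 300) -> int:
    """Pick a lower (safer) adjustment for long-running VMs than for fresh ones."""
    adj = protected_adj if uptime_s >= protect_after_s else fresh_adj
    # Clamp to the range the kernel accepts.
    return max(OOM_ADJ_MIN, min(OOM_ADJ_MAX, adj))


def apply_oom_adj(pid: int, adj: int, proc_root: str = "/proc") -> None:
    """Write the adjustment for one process; `proc_root` is overridable for testing."""
    Path(proc_root, str(pid), "oom_score_adj").write_text(f"{adj}\n")
```

Under these assumptions, a VM that has been up for two hours would get -500 and a VM started ten seconds ago would get 300, so the fresh one is the more likely OOM victim.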
 
I'm generally not a fan of reviving old threads, but I'm pretty much in the same boat as Larion, and this thread is exactly what I was curious about as well. The only difference is that I'm coming from VMware, but the principle is still the same.

Is there a bug report or any status on your "more general solution" idea of killing the most recently started VM versus killing older VMs? From a UX perspective, either prevent me from starting an 8G VM with 2G left on the host, suggest I start it on a different host (if in a cluster), or kill *that* VM versus killing older ones.

If not, I'm more than happy to submit something to track, but the concern is around someone doing something quickly, not sitting down to do the math before they click the *START* button, and then having random VMs powered off. And if the killed ones are set up for HA, really fun things happen...
 
Whilst this is an older thread now, I just wanted to chime in that I have experienced this issue as well. My example is a bit more realistic: I have a cloud-hosted Proxmox host with 4GB RAM that was running some basic VPN services. Normally only 60% of the memory is utilised, and the largest VM averages around 400MB of RAM used. One of my colleagues duplicated some VMs for upgrade purposes. Unfortunately, he accidentally assigned 2GB to a VM (the default) instead of 512MB and crashed the host in the middle of the day, cutting off a team of people. Since we have a BTRFS RAID 1, there was no swap, and the whole thing died and had to be forcefully restarted. A simple check to see if starting a VM would break the whole host would be nice.
 
4.5 years later.
Is there any community-accepted workaround for this?
GPT suggests either setting a cgroup MemoryMax for pve-qemu.slice (which I assume doesn’t really stop the OOM killer) or using a custom hook script that checks free memory before starting or creating VMs.
Before I try those, just wondering if there’s a proven or commonly used workaround people rely on?
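
For reference, the hook script variant could look roughly like the sketch below. It is untested against a real cluster and rests on these assumptions: PVE calls a VM's hookscript with the VM ID and the phase as arguments, a non-zero exit code during pre-start aborts the start, the VM config lives at /etc/pve/qemu-server/<vmid>.conf with a `memory:` value in MiB (512 assumed when absent), and MemAvailable is a usable estimate. `HOST_RESERVE_MIB` is a made-up safety margin. The check stays best-effort and only applies to VMs that have the hook configured.

```python
#!/usr/bin/env python3
import re
import sys

HOST_RESERVE_MIB = 1024  # hypothetical safety margin kept free for the host


def first_int(pattern, text, default=None):
    """Return the first integer captured by `pattern`, or `default` if absent."""
    match = re.search(pattern, text, re.MULTILINE)
    if match:
        return int(match.group(1))
    if default is None:
        raise ValueError(f"pattern not found: {pattern}")
    return default


def main(argv, meminfo_path="/proc/meminfo", conf_dir="/etc/pve/qemu-server"):
    """Return the hook's exit code; non-zero during pre-start aborts the start."""
    vmid, phase = argv[1], argv[2]
    if phase != "pre-start":
        return 0  # other phases (post-start, pre-stop, post-stop) pass through
    with open(meminfo_path) as f:
        available_mib = first_int(r"^MemAvailable:\s+(\d+)\s+kB", f.read()) // 1024
    with open(f"{conf_dir}/{vmid}.conf") as f:
        current = f.read().split("\n[", 1)[0]  # ignore snapshot sections
    vm_mib = first_int(r"^memory:\s*(?:current=)?(\d+)", current, default=512)
    if vm_mib + HOST_RESERVE_MIB > available_mib:
        print(f"VM {vmid}: needs {vm_mib} MiB, only {available_mib} MiB available",
              file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(main(sys.argv))
```

If saved as an executable snippet (the file name memcheck.py is just an example), it would be attached per VM with something like `qm set <vmid> --hookscript local:snippets/memcheck.py`, assuming the storage allows the snippets content type.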
 
Before I try those, just wondering if there’s a proven or commonly used workaround people rely on?
Besides common sense like "don't overprovision" or "looking"?

The only one I can come up with is to use hugepages everywhere. If those are not available, the VM will not start.
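
To see in advance whether a hugepage-backed VM would fit, the free pool can be read from /proc/meminfo. A minimal sketch, assuming the VM is configured to use hugepages of the host's default size and that the `HugePages_Free` and `Hugepagesize` fields are present:

```python
import re


def fits_in_free_hugepages(meminfo_text: str, vm_mib: int) -> bool:
    """Check whether `vm_mib` MiB of guest RAM fits into the free hugepage pool."""
    free = re.search(r"^HugePages_Free:\s+(\d+)", meminfo_text, re.MULTILINE)
    size = re.search(r"^Hugepagesize:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if free is None or size is None:
        raise ValueError("hugepage fields not found in meminfo text")
    page_kb = int(size.group(1))
    # Round up: a partially used page still occupies a whole hugepage.
    pages_needed = -(-(vm_mib * 1024) // page_kb)
    return pages_needed <= int(free.group(1))
```

Because the pool is reserved up front, the answer does not depend on how much ordinary memory other processes happen to be using at that moment.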
 
Before I try those, just wondering if there’s a proven or commonly used workaround people rely on?

Use ballooning memory, so the VM isn't using 100% of the allocated RAM all the time. It's been a long, long time since I've actually been able to max out a node. Even with 24GB of RAM, you can run quite a few VMs with ballooning enabled.