Resource management on a set of VMs / containers?

hkais

Well-Known Member
Feb 3, 2020
Hi there,

I am pretty fresh with Proxmox, but I have been using VMware for over 15 years, so I am not entirely new to virtual environments.

I was looking for a way to manage sets of VMs.
E.g. I would like to define:

realtime VMs:
- low disk IO requirements
- high net IO requirements
- very high CPU requirements (on bursts)

critical VMs:
- high disk IO
- medium net IO
- high CPU

I also had high, normal, low, and best-effort groups.

With this grouping, my admins could immediately see which VMs are critical and have to be handled very carefully (depending on the operation being performed on the node).


With VMware I had an option to define resource pools and assign VMs to them. But on Proxmox the only thing I have found is per-VM configuration, not group-wise.
So I could pull the realtime VMs, which are mostly idle, into a low-priority group and move them to realtime on defined schedules. This was resource-efficient and, from the end-user perspective, very successful.

What I do not want to do is mess around with the VMs' settings directly.

How can I define such "resource groups" (afaik cgroups) and assign VMs and containers to them?
 
Hi,

this is currently not implemented. There's an open issue on the bugtracker [1], but I'm not sure if this is exactly what you need. Otherwise, feel free to open a feature request on the bugtracker for others to discuss :).

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=1141
 
Hi there,

Just a follow-up after 2.5 years: has there been any progress on making VMs more reliable in terms of prioritizing them, e.g. for CPU and IO?
This is one of the crucial features, compared to VMware, needed to finally kick VMware out.

Any updates on this?
 
Hi fabian,

Any plans to incorporate the patches? The feature is a must-have for reliable QoS on top of Proxmox.
Anyway, I am wondering why it is not higher on the community's wish list.

Is it too critical for Proxmox to integrate? Afaik I would assume it uses the commonly known cgroups for such kernel-level operations, or is it a different solution?
 
cgroups don't really work here, because they are not cluster-aware (although cgroups are used for containers to limit resource usage locally). like I said, I or somebody else needs to pick the series up, rebase it, incorporate the feedback and resubmit it. the feature is still on our todo list.
 
I am not that deep into the code, but this may not be the issue, if I understand the problem correctly.
So, just to make sure I understand:

1. locally on one machine, the feature works via cgroups and LXC
2. in clustered mode, the feature is currently not available due to objections about cgroups in clustered environments

Can you help me out of the dark?
Why not sync the cgroup structure across all nodes and use common cgroup names throughout the cluster? Then the feature would be easy to incorporate.
And the other feature, cluster-wide RRD, would of course give better utilization of all nodes. But the major pain is not utilizing the nodes better; the pain is having NO control when specific VMs run amok and tear down other, more critical VMs.
This latter issue could easily be addressed by extending the current node-level features in a first step, and doing the bigger, more complex change in a second step.

Maybe I am not understanding something here?
 
cgroups are a feature provided by the kernel, and the kernel only knows about local resources. pools are cluster-wide entities. there is some rough code in the patches I linked, but it is neither complete nor would it work with the current code. like I said, somebody needs to pick them up and pull them over the finish line, there is no "simple" fix for this.
 
Can you clarify the two requirements,
cluster-wide usage and local resources?
Why do you require binding these two requirements together?
I am not getting the need for this strict coupling.
 
pools are the only groups of guests PVE has. pools are cluster wide. so any limits placed on pools need to be cluster wide as well.
 
pools are the only groups of guests PVE has. pools are cluster wide. so any limits placed on pools need to be cluster wide as well.
Okay, I get this. But I am not getting the other parts. Let us separate the pooling from the resource handling.

But it is still unclear to me whether you see cgroups with pools as a means for better utilization across cluster nodes, or as a feature to ensure a better experience for the users of the guest systems in extreme situations on a node (high CPU, high disk IO, high network IO, high memory).

IMHO there is no need to couple pools to resource handling (CPU, IO, memory), if I get it correctly. Of course it would be nicer if we could define pools and simply assign those pools to resource groups.

So let me try to narrow it down more specifically.
Let us assume we define the following CPU cgroup rules:
- realtime
- high
- normal
- low

Afaik in most cases this simple rule set would be sufficient.

Now we need to replicate these rule configs on all nodes. Afaik this should not be a big issue, since config replication seems to be very well implemented in Proxmox.

Now we need to ensure that the cgroups are initialized identically at the OS level on each node. Afaik this may be a task that requires some programming, or is this already done?

And finally we have the option to tell the VMs and LXC containers which of these cgroups to reside in.
If a failover or handover to a different node takes place, this should work, since the cgroups are named identically.
And currently the cgroups would be coupled to the pools, if I understand you correctly.

So the first question is: do we need this coupling of cgroups to pools? IMHO it could be decoupled in a first step.
If it works well decoupled, it should be easier to integrate into pools in a later "v2".

I hope I have expressed my thoughts a bit better.


If I still haven't got it right, could you narrow down the issues you currently see with cgroups and pools? Especially in terms of CRUD operations across the cluster. Maybe then it is easier to understand where you see the problems with cgroups in pools, e.g. with a few examples that explain the problem better.
 
we already have per-guest limits (those work on the node level, either in Qemu or with cgroups). anything that covers more than one guest needs to be cluster-wide by design, so cannot simply use cgroups.

it makes no sense to say guest A and B and C form a group X that can at most use 2G of RAM total, if that only works as long as they run on the same node..
 
we already have per-guest limits (those work on the node level, either in Qemu or with cgroups). anything that covers more than one guest needs to be cluster-wide by design, so cannot simply use cgroups.

it makes no sense to say guest A and B and C form a group X that can at most use 2G of RAM total, if that only works as long as they run on the same node..
I fully agree that it does not make sense to limit RAM usage cluster-wide.
But what about CPU and IO utilization?

So, a specific question:
If you have a set of "realtime/very high" priority VMs, which shall always be prioritized over all others, while the resources shall remain available to others as long as they are free, how would you solve this in current Proxmox?
It shall be considered for disk IO and for CPU.
How can this be solved in Proxmox currently, without requiring its HA/clustering features?

And how would you implement the "high priority" and "low priority" VMs?
Real-life examples:
- realtime prio => e.g. video and audio conferencing VMs
- high prio => e.g. web applications with end users on them
- normal prio => default for all VMs
- low prio => e.g. CI/CD build server environments; whether a build takes 10 or 15 minutes does not really matter, but it shall finish as soon as possible

Now my questions to understand the current situation.

The best approach established in our business is to share resources by priorities: as long as no other consumer is present, give the full resources to the requestor; but as soon as a resource is contended by multiple parties, behavior shall be prioritized according to a rule set.

Limiting does the opposite: it always caps the resource at the configured maximum instead of letting priorities decide.
Take the disk limits that are implemented: here you can limit to 2MB/s, for example. Even if the system is idle and could provide 1000GB/s, the guest will never exceed this limit of 2MB/s.
This approach may be helpful in some corner cases, but for self-organizing resource handling it is the worst solution to choose.
Look at the IP level when QoS was introduced: there were two competing solutions, IntServ (limits) and DiffServ (rules). Guess which one won? You can still use IntServ solutions somehow, but nearly everything QoS-relevant has chosen a DiffServ implementation, as it allows utilizing the full available resources in any case.


Now back to a simpler issue, CPU usage.
Right now, if I understand you correctly, a limit is possible, but the system always caps at the configured bound? Have I got that right?
There is no option to say "use as much as possible, but if you get into overload, limit to this amount"?
The way it would be helpful is:
use the CPU best-effort, except when there are many competing users of the CPU, in which case limit it proportionally, or some other rule?

To be sure I understand you correctly: the per-guest limits are the hard disk-IO limits I mentioned, e.g. 2MB/s, and for CPUs you refer to the cpulimits?
But I do not see an option here to share resources by rules. Or am I missing something?
 
I think I now understand the issue - we were talking about something entirely different when we were talking about limits.

you can already give guests a certain "weight" for CPU usage (which is only relative to other guests on the same node):

https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_cpu_resource_limits
https://pve.proxmox.com/pve-docs/chapter-pct.html#pct_cpu

and for example use tags or pools to group the guests visually.

the same is not true for memory (how should that even work? once you give memory to a guest you can't easily take it away again..) or I/O (where it might be possible to implement something like this, but it would be very involved to support it across all the storage types/backends we have).
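for example, applying such relative weights from the CLI could look like this (a sketch with hypothetical VMIDs 100/101 and container 200; with cgroup v2 the default cpuunits weight is 100):

```shell
# cpuunits is a relative weight: under CPU contention a guest with 400 gets
# roughly 4x the CPU time of a default (100) guest; on an idle host every
# guest may still use all of its cores.
if command -v qm >/dev/null 2>&1; then
  qm set 100 --cpuunits 400   # "realtime" class VM
  qm set 101 --cpuunits 50    # "low" class VM
  pct set 200 --cpuunits 200  # "high" class container
fi
```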
 
Thank you for the links!
I rechecked cpuunits, which indeed seems to be a relative setting. This looks promising.

We could script around this to handle our needs.
E.g. if we could use the tags prio-realtime, prio-high, prio-normal, prio-low,
we could assign the shares based on some predefined rules.

Is there a command or script to assign cpuunits by VM tags?
E.g. 400 for realtime, 200 for high, 100 for normal, 50 for low?


About the disk thing, I would like to follow up.
We removed all NAS and SAN storage and went DAS-only, so I cannot say much about it anymore. But from memory I would say we could limit disk bandwidth on SANs and NASs as well. Also on our Ceph test setups, afaik, we could define limits at the disk level.
So to get this behavior, the most reliable approach would be to let the admin set a max bandwidth per node and have a shares/units mechanism like the one cpuunits provides.
Or, if you want something a bit more fancy and automagic: measure the average and max bandwidth over time and persist these values.
So you could define a rule:
total available bandwidth to share = sum(persistedMaxBandwidths) / samplesPersisted

E.g. we measure over 7d * 24h = 168 samples, and assume that during this week our storage was upgraded with more disks for more bandwidth:
day 1-4: all samples around 1200MB/s
day 5-7: all samples around 1950MB/s
We would have roughly 1521MB/s after the 7th day.
The max value would be smoothed over time and would "automagically" converge to the real maximum as the system is measured. So a self-managing and easy-to-understand system.
Alternatively, the admin can override this behavior by setting the value fully manually, e.g. 1500MB/s as max.

Based on this you could now introduce a dynamic diskunits value, since the limits would only kick in when the requested total exceeds the available bandwidth.
E.g. 100 default, 1000 realtime, 400 high, and so on (max 1000).

I hope I could explain the mechanism. Afaik the implementation should not depend on any particular disk system, since it only measures and persists values that are already available.

Would the solution be perfect? IMHO no, but it would be simple to understand and simple to use. IMHO it is also not very complex to implement.
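To illustrate the smoothing with the example numbers above (96 hourly samples at ~1200MB/s for days 1-4, 72 samples at ~1950MB/s for days 5-7):

```shell
# average the persisted hourly max-bandwidth samples over a 7-day window
awk 'BEGIN {
  for (i = 0; i < 96; i++) { sum += 1200; n++ }  # day 1-4
  for (i = 0; i < 72; i++) { sum += 1950; n++ }  # day 5-7
  printf "%.0f MB/s\n", sum / n                  # -> 1521 MB/s
}'
```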

Similar behavior could be implemented at the network level too, but there the kernel already provides stacks that could be utilized out of the box.
At the end of the 90s I worked on CBQ/DiffServ, which made it possible to "share" the network by prioritizing and re-queuing, which also made networking more reliable.
 
Is there a command or script to assign cpuunits by VM tags?
E.g. 400 for realtime, 200 for high, 100 for normal, 50 for low?
no, you'd have to script that yourself
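a rough sketch of what such a script could look like (untested; the prio-* tag names and the units mapping are taken from your example, everything else is an assumption):

```shell
#!/bin/sh
# sketch: map priority tags to cpuunits and apply them to every VM on a node
units_for_tags() {
  case "$1" in
    *prio-realtime*) echo 400 ;;
    *prio-high*)     echo 200 ;;
    *prio-low*)      echo 50  ;;
    *)               echo 100 ;; # prio-normal / untagged
  esac
}

# only run where the qm CLI exists, i.e. on an actual PVE node
if command -v qm >/dev/null 2>&1; then
  for vmid in $(qm list | awk 'NR > 1 { print $1 }'); do
    # tags are stored semicolon-separated in the guest config
    tags=$(qm config "$vmid" | awk -F': ' '/^tags:/ { print $2 }')
    qm set "$vmid" --cpuunits "$(units_for_tags "$tags")"
  done
fi
```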
About the disk thing, I would like to follow up.
We removed all NAS and SAN storage and went DAS-only, so I cannot say much about it anymore. But from memory I would say we could limit disk bandwidth on SANs and NASs as well. Also on our Ceph test setups, afaik, we could define limits at the disk level.
So to get this behavior, the most reliable approach would be to let the admin set a max bandwidth per node and have a shares/units mechanism like the one cpuunits provides.
Or, if you want something a bit more fancy and automagic: measure the average and max bandwidth over time and persist these values.
So you could define a rule:
total available bandwidth to share = sum(persistedMaxBandwidths) / samplesPersisted

E.g. we measure over 7d * 24h = 168 samples, and assume that during this week our storage was upgraded with more disks for more bandwidth:
day 1-4: all samples around 1200MB/s
day 5-7: all samples around 1950MB/s
We would have roughly 1521MB/s after the 7th day.
The max value would be smoothed over time and would "automagically" converge to the real maximum as the system is measured. So a self-managing and easy-to-understand system.
Alternatively, the admin can override this behavior by setting the value fully manually, e.g. 1500MB/s as max.

Based on this you could now introduce a dynamic diskunits value, since the limits would only kick in when the requested total exceeds the available bandwidth.
E.g. 100 default, 1000 realtime, 400 high, and so on (max 1000).

I hope I could explain the mechanism. Afaik the implementation should not depend on any particular disk system, since it only measures and persists values that are already available.

Would the solution be perfect? IMHO no, but it would be simple to understand and simple to use. IMHO it is also not very complex to implement.

implementing such a feedback loop is really tricky and hard to get right (just look at how many variants and iterations of I/O schedulers there are in the Linux kernel). you can already set bwlimits on virtual disks and try to implement it yourself to match your particular requirements ;)
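for reference, setting such a per-disk bwlimit could look like this (a sketch; VMID, storage and volume names are hypothetical):

```shell
# hard-caps the virtual disk at 50 MB/s read and write; these are absolute
# limits, not shares, as discussed above
if command -v qm >/dev/null 2>&1; then
  qm set 100 --scsi0 local-lvm:vm-100-disk-0,mbps_rd=50,mbps_wr=50
fi
```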

Similar behavior could be implemented at the network level too, but there the kernel already provides stacks that could be utilized out of the box.
At the end of the 90s I worked on CBQ/DiffServ, which made it possible to "share" the network by prioritizing and re-queuing, which also made networking more reliable.

similarly, for vnics we also have tc-based bwlimits ;)
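for example (a sketch with a hypothetical VMID; rate= is given in MB/s):

```shell
# caps the vnic at 50 MB/s, enforced with tc on the host
if command -v qm >/dev/null 2>&1; then
  qm set 100 --net0 virtio,bridge=vmbr0,rate=50
fi
```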