Questions about the dynamic CRS

Cookiefamily

Renowned Member
Jan 29, 2020
Hello,

I noticed that a dynamic mode was introduced to the CRS (yay! Been waiting for this for so long and it really comes in handy now with us planning our migration away from VMware, thank you so much Team!!!).
I enabled it on all my testing environments and it seemed to work pretty well.

There are two modes the CRS can run in, TOPSIS and "Brute Force" with the latter one being the default. What are the differences in practice between the two modes? Are there scenarios where you should choose one over the other?

In my small test clusters it distributed the load really well, but they are under almost no CPU or memory load, so I couldn't yet try the scenarios where ProxLB from credativ fails in our production environment.
The issue there was VMs with big "imbalances": a lot of RAM and little CPU, or vice versa.
What metrics does the CRS take into account? Both memory and CPU? How does it weigh between those (or does it do any weighting at all)?
 
Hi!

Thanks for the feedback!

The issue was VMs with big "imbalances" of a lot of RAM and little CPU and vice versa.
What metrics does the CRS take into account? Both memory and CPU? How does it weigh between those (or does it do any weighting at all)?
The load balancer takes both memory and CPU into account. As for the weighting, see the next paragraphs.

There are two modes the CRS can run in, TOPSIS and "Brute Force" with the latter one being the default. What are the differences in practice between the two modes? Are there scenarios where you should choose one over the other?
The load balancer can score the balancing migrations by either one of these methods.

The brute-force method (as in 'greedily find the best balancing migration') currently weighs average CPU load and memory usage equally. The weighting might change in the future, but equal weights are a well-balanced starting point, as both resources (CPU and memory) can cause pressure and therefore degrade resource utilization over time.
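As a rough sketch of what such a greedy pass could look like (illustrative Python only, not the actual pve-ha-manager code; all names and the exact scoring are assumptions):

```python
# Toy 'brute force' balancing pass: try every (guest, target) pair and pick
# the migration after which the remaining cluster imbalance is smallest.
# CPU and memory usage (as fractions of capacity) are weighted equally.

def node_load(node):
    return 0.5 * node["cpu"] + 0.5 * node["mem"]

def imbalance(nodes):
    loads = [node_load(n) for n in nodes.values()]
    return max(loads) - min(loads)

def best_migration(nodes, guests):
    """Greedily pick the (guest, target) pair that minimizes remaining imbalance."""
    best_score, best_move = imbalance(nodes), None
    for name, g in guests.items():
        for target in nodes:
            if target == g["node"]:
                continue
            # Simulate the migration on a copy of the node usage table.
            trial = {n: dict(v) for n, v in nodes.items()}
            trial[g["node"]]["cpu"] -= g["cpu"]
            trial[g["node"]]["mem"] -= g["mem"]
            trial[target]["cpu"] += g["cpu"]
            trial[target]["mem"] += g["mem"]
            score = imbalance(trial)
            if score < best_score:
                best_score, best_move = score, (name, target)
    return best_move

nodes = {"pve1": {"cpu": 0.8, "mem": 0.7}, "pve2": {"cpu": 0.2, "mem": 0.1}}
guests = {"vm100": {"node": "pve1", "cpu": 0.3, "mem": 0.3}}
```

On the toy data above, `best_migration(nodes, guests)` picks moving `vm100` to `pve2`, which balances both nodes exactly.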

The TOPSIS method weighs memory as more important than CPU load: a 5:1 ratio for average CPU/memory usage and a 10:5 ratio for CPU/memory peaks, to reflect that memory is a truly limited resource, while high CPU pressure 'only' degrades processing time. This is also the method already used for scoring nodes when starting new HA resources (if rebalance-on-start is enabled).

The TOPSIS method might be helpful for more memory-bound workloads. However, since an equal balance between both resources works well for many applications, and CPU pressure is often the more common problem, the brute-force method is the current default.
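As an illustration of how a TOPSIS-style scoring could rank candidate target nodes, here is a toy Python sketch (not the actual implementation; the weights only mirror the ratios above, interpreted as memory being weighted over CPU, and all names are made up):

```python
import math

# Toy TOPSIS ranking of candidate target nodes. All four criteria are usage
# fractions where lower is better; the weights reflect memory being more
# important than CPU (an interpretation of the ratios discussed above).
WEIGHTS = {"cpu_avg": 1, "mem_avg": 5, "cpu_peak": 5, "mem_peak": 10}

def topsis_rank(candidates):
    """candidates: {node: {criterion: usage fraction}}; best node first."""
    crits = list(WEIGHTS)
    # Vector-normalize each criterion column, then apply the weights.
    norm = {c: math.sqrt(sum(v[c] ** 2 for v in candidates.values())) or 1
            for c in crits}
    scored = {n: {c: WEIGHTS[c] * v[c] / norm[c] for c in crits}
              for n, v in candidates.items()}
    # All criteria are costs, so the ideal point is the per-criterion minimum.
    best = {c: min(s[c] for s in scored.values()) for c in crits}
    worst = {c: max(s[c] for s in scored.values()) for c in crits}

    def closeness(s):
        d_best = math.sqrt(sum((s[c] - best[c]) ** 2 for c in crits))
        d_worst = math.sqrt(sum((s[c] - worst[c]) ** 2 for c in crits))
        return d_worst / ((d_best + d_worst) or 1)

    return sorted(candidates, key=lambda n: closeness(scored[n]), reverse=True)

cands = {
    "a": {"cpu_avg": 0.2, "mem_avg": 0.8, "cpu_peak": 0.3, "mem_peak": 0.9},
    "b": {"cpu_avg": 0.6, "mem_avg": 0.3, "cpu_peak": 0.7, "mem_peak": 0.4},
}
```

With these weights, `topsis_rank(cands)` puts node `b` first: its lower memory usage outweighs its higher CPU usage.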

Hope this helps!

PS: There is a patch series in review, which overhauls the CRS section itself and adds documentation for the new load balancing system here [0]. External feedback on these patches is also very welcome, in case things could be made clearer or certain things should be elaborated on more!

[0] https://lore.proxmox.com/pve-devel/20260415091635.162224-20-d.kral@proxmox.com/
 
Using the GUI to set the HA scheduling to "dynamic load" and checking "Automatically rebalance HA resources" leads to this error:

Code:
crs: invalid format - format error
crs.ha: value 'dynamic' does not have a value in the enumeration 'basic, static'
crs.ha-auto-rebalance: property is not defined in schema and the schema does not allow additional properties

This is on a PVE 9.1.9 (enterprise repo) server. `pve-ha-manager` is installed as version 5.1.3 though? The patch is for 5.2.0+
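For reference (judging from the schema in the error message), pve-ha-manager 5.1.3 seems to only accept the older values in /etc/pve/datacenter.cfg, e.g.:

```
# old schema: 'ha' only allows 'basic' or 'static',
# and there is no ha-auto-rebalance property yet
crs: ha=static,ha-rebalance-on-start=1
```

so the new `dynamic` mode presumably can't be configured until 5.2.0 lands.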
 
Also: the docs seem to be missing the difference between the dynamic and the static load scheduler, and the differences between the "brute force" and "TOPSIS" methods. Google AI gave me an answer, but I don't know if it's correct.

There are more patches in the patch series that explain these. Sorry!
 
Thanks for the insights! Have you thought about an exclude option for containers, as these get automatically moved, which causes downtime?
 
@dakralex Thank you very much for the answer! That clears things up a lot.

As for the modes, I think we will just need to try the modes and see what happens. TOPSIS sounds best for our production clusters as they are usually memory limited.

One thing I would have as feedback: it is way more "expensive" to do live migrations of VMs with vGPU resources, as a migration halts them for extended periods of time (8 GB takes ~6 s for us, 24 GB ~20 s, etc.). So ideally those would be moved last, as long as there are better options for shuffling VMs around.
For now I guess I can create an affinity group to "pin" them to one host with higher priority, so they don't get rebalanced except in the case of host failures.
 
This is on a PVE 9.1.9 (enterprise repo) server. `pve-ha-manager` is installed as version 5.1.3 though? The patch is for 5.2.0+
Yes, some recent security fixes for the pve-manager package forced us to ship it to all repositories earlier; that package already includes the load balancer options in the web interface, while we're still waiting to move pve-ha-manager 5.2.0 and pve-cluster 9.1.2 to the enterprise repositories as well.

See this post [0] for a little more information.

[0] https://forum.proxmox.com/threads/183143/#post-850798
 
Thanks for the insights! Have you thought about an exclude option for containers, as these get automatically moved, which causes downtime?
Yes, good idea! We already thought about this during development, but were focusing on the core feature first. Feel free to create a Bugzilla entry [0] for this in the meantime. As this is relatively trivial to implement and, as you said, moving containers causes downtime while they are restarted on the target host, this should be included relatively fast.

[0] https://bugzilla.proxmox.com/enter_bug.cgi?product=pve&component=HA
 
As for the modes, I think we will just need to try the modes and see what happens. TOPSIS sounds best for our production clusters as they are usually memory limited.

One thing I would have as feedback: it is way more "expensive" to do live migrations of VMs with vGPU resources, as a migration halts them for extended periods of time (8 GB takes ~6 s for us, 24 GB ~20 s, etc.). So ideally those would be moved last, as long as there are better options for shuffling VMs around.
For now I guess I can create an affinity group to "pin" them to one host with higher priority, so they don't get rebalanced except in the case of host failures.
Thanks for the feedback!

We also thought about including more terms in the "cost function" of a migration, though we mainly focused on the core feature first.

The current load balancing implementation focuses on reducing the imbalance between the nodes as much as possible. To minimize the total number of migrations needed to rebalance the HA resources within the cluster, the larger the imbalance, the more 'expensive' (in terms of memory, etc., and therefore migration time) the chosen balancing migrations usually are. Without knowing the rest of the cluster, in these situations it might be the best option to move these 'heavy' HA resources first. It could become expensive, though, if this turns into a transient state where these HA resources are moved quite often. Does this occur for you?
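As a toy numeric illustration of that trade-off (numbers invented): with a 40 GB memory imbalance between two nodes, one 'heavy' 20 GB migration can close the gap completely, while several small 8 GB migrations are needed and still leave a gap:

```python
# Two nodes with 60 GB and 20 GB of memory in use: a 40 GB imbalance.
node_a, node_b = 60, 20

# One heavy 20 GB migration closes the gap in a single step:
assert (node_a - 20) - (node_b + 20) == 0

# Balancing with 8 GB guests instead takes more migrations; each move
# shrinks the gap by 16 GB, and we stop before a move would overshoot.
moves = 0
while node_a - node_b > 2 * 8:
    node_a -= 8
    node_b += 8
    moves += 1
```

Here two 8 GB migrations still leave an 8 GB gap that a single 20 GB migration would have eliminated.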
 
Yes, good idea! We already thought about this during development, but were focusing on the core feature first. Feel free to create a Bugzilla entry [0] for this in the meantime. As this is relatively trivial to implement and, as you said, moving containers causes downtime while they are restarted on the target host, this should be included relatively fast.

[0] https://bugzilla.proxmox.com/enter_bug.cgi?product=pve&component=HA
Thanks! Done: https://bugzilla.proxmox.com/show_bug.cgi?id=7557
 
Hi Daniel, thanks for the reply!
Though it might be expensive if this becomes a transient state, where these HA resources are moved quite often. Does this occur for you?
not right now, no. I was just thinking ahead to the future :D
Right now it is only active on two small 3-node test clusters, one of which has some NVIDIA GPUs with NVAIE (yes, it works the same as "normal" vGPUs; we didn't experience any issues apart from the usual NVIDIA licensing hell you run into everywhere). Those clusters and the load on them are pretty static, so it only migrates when we create some new VMs.
Over the next year we will migrate ~900 VMs and ~50 hosts from VMware to Proxmox VE. The dynamic CRS will definitely be helpful in the bigger clusters and in clusters where K8s automatically autoscales workers. I hope it is stable enough once we get to the larger migrations; currently we are still just testing, planning, and adapting code.

But especially in bigger clusters, imbalance might happen more often. There are currently some ways to tune the dynamic CRS, such as the minimum imbalance improvement and the threshold, so it doesn't go too crazy, and I read on one of the mailing lists that there are plans for "better" statistics to filter out short peaks etc., which should also help.
Migrations are always "costly" in terms of performance. For normal VM migrations it isn't too big of a deal as long as it doesn't happen too often; freezes are sub-1 s. For vGPUs the freeze while copying the VFIO memory is just a lot longer, so you might actually notice it.
I think the suggestion @jsterr made would also help for now: we could just not actively migrate the VMs that have GPUs attached.

One more question about the calculations: I know the dynamic CRS only migrates resources managed by the HA Manager. But how does it calculate the host resource usage? Does it only take the HA-managed resources into account, or the "general" host CPU/memory usage, which also includes non-HA VMs and other processes like Ceph?

Thank you!
 