[TUTORIAL] ProxLB - (Re)Balance your CT/VM workloads across nodes in your Proxmox cluster

gyptazy
Mar 25, 2024

Introduction

ProxLB (PLB) is an advanced tool designed to enhance the efficiency and performance of Proxmox clusters by optimizing the distribution of virtual machines (VMs) and containers (CTs) across the cluster nodes, using the Proxmox API. ProxLB gathers and analyzes a comprehensive set of resource metrics from both the cluster nodes and the running VMs. These metrics include CPU usage, memory consumption, and disk utilization, with a specific focus on local disk resources. So basically, it is something like VMware's DRS.

PLB collects resource usage data from each node in the Proxmox cluster, including CPU, (local) disk and memory utilization. Additionally, it gathers resource usage statistics from all running VMs/CTs, ensuring a granular understanding of the cluster's workload distribution.
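To give a rough idea of what this data collection looks like, here is a minimal sketch using the proxmoxer Python library. This is not ProxLB's actual code; the host name and credentials are placeholders you would replace with your own.

```python
from proxmoxer import ProxmoxAPI

# Placeholder connection details - adjust for your own cluster.
proxmox = ProxmoxAPI("pve01.example.com", user="root@pam",
                     password="secret", verify_ssl=False)

node_metrics = {}
for node in proxmox.nodes.get():
    name = node["node"]
    status = proxmox.nodes(name).status.get()
    node_metrics[name] = {
        "cpu_used": status["cpu"],                  # fraction between 0 and 1
        "mem_used": status["memory"]["used"],
        "mem_total": status["memory"]["total"],
    }
    # Resource usage of all running VMs on this node (CTs would use .lxc instead of .qemu).
    for vm in proxmox.nodes(name).qemu.get():
        if vm["status"] == "running":
            print(name, vm["vmid"], vm["name"], vm["cpu"], vm["mem"], vm["maxmem"])
```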

Intelligent rebalancing is a key feature of ProxLB where it re-balances VMs based on their memory, disk or CPU usage, ensuring that no node is overburdened while others remain underutilized. The rebalancing capabilities of PLB significantly enhance cluster performance and reliability. By ensuring that resources are evenly distributed, PLB helps prevent any single node from becoming a performance bottleneck, improving the reliability and stability of the cluster. Efficient rebalancing leads to better utilization of available resources, potentially reducing the need for additional hardware investments and lowering operational costs.

Automated rebalancing reduces the need for manual actions, allowing operators to focus on other critical tasks, thereby increasing operational efficiency.

How does it work?

ProxLB is a load-balancing system designed to optimize the distribution of virtual machines (VMs) and containers (CTs) across a cluster. It works by first gathering resource usage metrics from all nodes in the cluster through the Proxmox API. This includes detailed resource metrics for each VM and CT on every node. ProxLB then evaluates the difference between the maximum and minimum resource usage of the nodes, referred to as "Balanciness." If this difference exceeds a predefined threshold (which is configurable), the system initiates the rebalancing process.

Before starting any migrations, ProxLB validates that rebalancing actions are necessary and beneficial. Depending on the selected balancing mode (CPU, memory, or disk), it creates a balancing matrix. This matrix sorts the VMs by their maximum used or assigned resources, identifying the VM with the highest usage. ProxLB then places this VM on the node with the most free resources of the selected balancing type. This process runs recursively until the operator-defined Balanciness is achieved. Balancing can be based on either the used or the maximum assigned resources of the VMs/CTs.
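As a rough illustration of this logic, here is a simplified sketch (not ProxLB's actual implementation): the node/guest dictionaries, the memory-only view and the threshold value are assumptions made only for this example.

```python
# Simplified sketch of the rebalancing idea described above.

BALANCINESS_THRESHOLD = 0.10  # configurable spread before rebalancing kicks in

def balanciness(nodes):
    """Spread between the most and least memory-loaded node (0..1)."""
    usage = [n["mem_used"] / n["mem_total"] for n in nodes.values()]
    return max(usage) - min(usage)

def plan_migrations(nodes, guests):
    """Greedy plan: repeatedly move the heaviest guest to the freest node."""
    plan = []
    # Sort guests by used memory, heaviest first.
    candidates = sorted(guests, key=lambda g: g["mem_used"], reverse=True)
    for guest in candidates:
        if balanciness(nodes) <= BALANCINESS_THRESHOLD:
            break
        target_name = max(nodes, key=lambda n: nodes[n]["mem_total"] - nodes[n]["mem_used"])
        if target_name == guest["node"]:
            continue
        # Apply the move to the in-memory model so the next iteration sees it.
        nodes[guest["node"]]["mem_used"] -= guest["mem_used"]
        nodes[target_name]["mem_used"] += guest["mem_used"]
        plan.append((guest["vmid"], guest["node"], target_name))
        guest["node"] = target_name
    return plan
```

In a dry run such a plan would simply be printed (or emitted as JSON); in normal operation each entry would be turned into a migration call against the Proxmox API.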

Features

  • Free & Open-Source
  • GPLv3
  • Rebalance the cluster by:
    • Memory
    • Disk (only local storage)
    • CPU
  • Performing
    • Periodically
    • One-shot solution
  • Types
    • Rebalance only VMs
    • Rebalance only CTs
    • Rebalance all (VMs and CTs)
  • Filter
    • Exclude nodes
    • Exclude virtual machines
  • Grouping (affinity/anti-affinity) - see the sketch after this list
    • Include groups (VMs that are rebalanced to nodes together)
    • Exclude groups (VMs that must run on different nodes)
    • Ignore groups (VMs that should be untouched)
  • Dry-run support
    • Human readable output in CLI
    • JSON output for further parsing
  • Migrate VM workloads away (e.g. maintenance preparation)
  • Fully based on Proxmox API
  • Usage
    • One-Shot (one-shot)
    • Periodically (daemon)
    • Proxmox Web GUI Integration (optional)
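To illustrate how anti-affinity grouping can be taken into account during placement, here is a small sketch. The tag names ("exclude:db" etc.) are purely hypothetical and not necessarily what ProxLB uses for its groups.

```python
# Illustrative only: a simple anti-affinity check before placing a guest on a node.

def allowed_on_node(guest, node_name, guests):
    """Return False if an anti-affinity peer of `guest` already runs on `node_name`."""
    peers_on_node = [g for g in guests
                     if g["node"] == node_name and g["vmid"] != guest["vmid"]]
    for tag in guest.get("tags", []):
        if tag.startswith("exclude:") and any(tag in p.get("tags", []) for p in peers_on_node):
            return False
    return True
```

Include groups would work the other way around (members are always resolved to the same target node), and ignore groups are simply skipped by the planner.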

Upcoming Features

See the Notes section below for details on what is currently planned: automatic node patching, DPM-like power management, and ProxLB's own API.

Where to get it?

As written on GitHub, you can simply use the code from there, but there are also additional options:
  • Distribution Package(s)
    • .deb
    • .rpm
    • .pkg (FreeBSD)
  • Repository
    • Debian
  • Container Image
    • Docker
    • Podman
This should fit most needs, and it also shows that because ProxLB only uses the Proxmox API, it can run on almost any system that can reach that API.

Notes

Personally, I would really like to have automatic node patching included. The idea is that nodes periodically check on their own for new available packages and whether installing them would require a reboot. If no reboot is needed, the packages can simply be installed. If a reboot is needed, the CT/VM workloads first need to be shifted to other nodes in the cluster; once it is validated that all workloads have been migrated away (which may take some time, especially with local storage), the upgrade can be performed. Since I personally prefer to only use the API, and not everything needed for this is exposed there, I currently have to patch the API, which is really a dirty way and would require the additional package proxlb-additions. So this is still a draft and listed as an upcoming feature, and I am looking for a better solution that works purely by invoking the API.
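The migration part of that workflow is already possible purely through the API. A rough sketch with proxmoxer (host, credentials and target node are placeholders; guests on local storage may additionally need the "with-local-disks" migration option):

```python
from proxmoxer import ProxmoxAPI

# Placeholder connection details.
proxmox = ProxmoxAPI("pve01.example.com", user="root@pam",
                     password="secret", verify_ssl=False)

def evacuate_node(source, target):
    """Live-migrate all running VMs off `source` before a reboot/upgrade."""
    for vm in proxmox.nodes(source).qemu.get():
        if vm["status"] == "running":
            proxmox.nodes(source).qemu(vm["vmid"]).migrate.post(
                target=target,
                online=1,  # live migration
            )

# Pending package updates can at least be listed via the API ...
updates = proxmox.nodes("pve01").apt.update.get()
print(f"{len(updates)} packages can be upgraded on pve01")
# ... but actually installing them (and rebooting) is not exposed there,
# which is exactly the gap described above.
```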

Another important and often requested feature is DPM from the VMware world, where nodes can determine whether they are still needed in the cluster or can be shut down to save power. In this case, we need to validate that there are enough resources available to fit the workload on the remaining nodes (there will be some additional parameters to allow overprovisioning, and also to make sure that there are still enough free resources so that one or more nodes can still fail) and then migrate the workloads accordingly. A hook can be used to re-enable the nodes via Wake-on-LAN (WoL) requests.
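A sketch of the capacity check such a feature would need; the overprovisioning factor and failure headroom are illustrative parameters, not existing ProxLB options.

```python
# Rough sketch: can the remaining nodes absorb a candidate node's memory load?

def can_power_down(candidate, nodes, overprovision=1.0, failures_to_tolerate=1):
    """Return True if the other nodes can absorb `candidate`'s used memory
    while still leaving headroom for `failures_to_tolerate` node outages."""
    others = {name: n for name, n in nodes.items() if name != candidate}
    if len(others) <= failures_to_tolerate:
        return False
    # Reserve the capacity of the largest remaining nodes as failure headroom.
    by_capacity = sorted(others.values(), key=lambda n: n["mem_total"], reverse=True)
    reserve = sum(n["mem_total"] for n in by_capacity[:failures_to_tolerate])
    free = sum(n["mem_total"] * overprovision - n["mem_used"] for n in others.values())
    return free - reserve >= nodes[candidate]["mem_used"]
```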

Another big topic is ProxLB's own API, which was already partially integrated in a custom adaptation, but the "not invented here" approach was not such a good idea and I should stick to stable frameworks. This API interface allows collecting metrics and dry-run test outputs, but also other things such as obtaining the best node for placing new CT or VM workloads.

I'm happy to hear your opinions about this project - let me know if you're missing something or if this could fit your needs. Maybe you have already found the Reddit thread about it, where I was advised to also mention it here because it might be helpful for many of you.

Resources

Source: https://github.com/gyptazy/ProxLB
Blogpost: https://gyptazy.ch/blog/proxlb-rebalance-vm-workloads-across-nodes-in-proxmox-clusters/
 
This looks promising. Thank you for working on it.

  • Have you checked with Proxmox if you can use the prefix prox or if they object?
  • Is there a patching of PVE required to get the new menu entry visible?
  • You wrote it uses the API: do you need to run it as root, or is there a special internal user with only limited permissions?
 
