HA Proxmox VE

devis

Member
Mar 2, 2023
Dear forum members,

I have encountered several questions regarding the operation of High Availability in Proxmox VE and would like to discuss them with you:


  • Why does pve-ha-crm failover processing take more than 60 seconds, with a significant increase in CPU usage? This leads to the loss of the master node and the inability to elect a new master. What would you recommend to address this issue?

  • When initiating a failover, the hardware configuration is not taken into account, leading to memory and CPU overload on nodes when attempting to start the cluster. How can the cluster settings be optimized for better resource allocation?

  • After the failover process, all virtual machines automatically start on the working servers, causing them to be overloaded in terms of memory and CPU. This may be related to how we allocate resources for virtual machines, and Proxmox may not adequately predict when loads reach their limits. How can we better organize resource allocation to avoid overloads?


I would appreciate any ideas and solutions regarding these questions!

Sincerely,
Devis
 
Hello! I'll refrain from commenting on your first issue, as I'm not that familiar with what happens behind the scenes on such an occasion and do not want to speculate.

Problem #2 could be addressed with Cluster Resource Scheduling, explained here.
Problem #3 could be addressed by adding a startup delay to your VMs, explained here. This would give you a staggered start-up of VMs.
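To make the staggered start-up concrete, here is a minimal sketch that only builds the `qm set` commands and does not run them. The function name, the VM IDs and the 30-second delay are made up for illustration. One caveat: as far as I know, the Proxmox VE documentation says that guests managed by the HA stack skip the start-on-boot and boot order options, so this may only help for guests that are not HA resources.

```python
from typing import List


def staggered_startup_commands(vmids: List[int], delay_seconds: int = 30) -> List[str]:
    """Build one `qm set` command per VM with an increasing start order.

    `up=N` asks Proxmox VE to wait N seconds after starting this guest
    before starting the next one in the order.
    (Hypothetical helper: it only returns strings, nothing is executed.)
    """
    commands = []
    for order, vmid in enumerate(vmids, start=1):
        commands.append(f"qm set {vmid} --startup order={order},up={delay_seconds}")
    return commands


if __name__ == "__main__":
    # Example VM IDs are assumptions, replace them with your own.
    for cmd in staggered_startup_commands([101, 102, 103]):
        print(cmd)
```

Review the printed commands before running them on a node.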
 
Hi,
@paulsk answered that nicely, I'd say! If you have further needs in terms of load balancing and don't want to wait for the upcoming CRS evolution, there is a nice tool that can help you load balance using live metrics: ProxLB!

As for the high CPU usage, I assume you have a lot of VMs and nodes there? Maybe use HA groups, so the CRM has less information to deal with and a simpler decision to make in case of a failover. And check its logs!
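Before tuning groups or adding a balancer, it may be worth checking whether the surviving nodes can absorb a failed node's VMs at all. Below is a rough sketch of such a check by configured memory only. It is a simple greedy placement (largest VM first, onto the node with the most free memory), not how CRS or ProxLB actually decide, and all names and numbers are invented for the example.

```python
from typing import Dict, Optional


def place_after_failure(
    node_free_gib: Dict[str, float],
    failed_vms_gib: Dict[str, float],
) -> Optional[Dict[str, str]]:
    """Greedy placement of a failed node's VMs onto surviving nodes.

    Returns a VM -> node mapping, or None if some VM does not fit anywhere.
    Only configured memory is considered (an assumption of this sketch).
    """
    free = dict(node_free_gib)  # copy so the caller's data is untouched
    placement: Dict[str, str] = {}
    # Place the largest VMs first; they are the hardest to fit.
    for vm, need in sorted(failed_vms_gib.items(), key=lambda kv: kv[1], reverse=True):
        if not free:
            return None
        # Pick the surviving node with the most free memory.
        target = max(free, key=lambda node: free[node])
        if free[target] < need:
            return None  # if the emptiest node cannot take it, no node can
        free[target] -= need
        placement[vm] = target
    return placement


if __name__ == "__main__":
    survivors = {"pve2": 64.0, "pve3": 48.0}               # free GiB per node
    orphans = {"vm-a": 32.0, "vm-b": 40.0, "vm-c": 24.0}   # GiB per VM
    print(place_after_failure(survivors, orphans))
```

If this returns `None` for any single-node failure, the cluster is overcommitted for failover, and no scheduler setting will fix that without freeing or adding memory.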
 
Hello, thanks for your answers. Yes, that is all correct: we have a large number of virtual machines in a cluster of 24 nodes.

Let's consider a solution with ProxLB.
 
Thank you very much, Gilou and paulsk.
Judging by the description, this is indeed a very interesting project, and we will try implementing it within our infrastructure.

However, one question arises: why doesn’t the official Proxmox team include these developments in their repositories and provide out-of-the-box support? It seems to me that this is something users are eagerly anticipating, along with the feature for managing multiple clusters from a single web interface.
 
This is due to two reasons:

Proxmox VE is still relatively new compared to other hypervisors.
This is also why things like SDN via the GUI were only introduced in versions 8 and 8.1.

The second reason is VMware.
Since Broadcom pushed out a lot of customers, those customers needed a new hypervisor.
A lot of them have changed to Proxmox VE or are in the process of changing to Proxmox VE.
So the Proxmox team needed to dedicate their staff to helping with, and creating tools for, the transition from ESXi to Proxmox VE, which reduced the time available to work on other things.

And in some cases it is also not a big priority, since only a small part of the users will use it, and thus it gets pushed back as other things have more priority.
 
I guess it's because they have no influence over what a third party does, so it would be difficult to take responsibility for all possible use cases and edge cases.
For example, a lot of folks coming over from VMware expect something like VMFS for shared storage, which is quite uncommon in the Proxmox VE world. But since it's Linux, people can use an unsupported cluster file system like OCFS2 or GFS. In theory, Proxmox Server Solutions could add them to their supported file systems, and I'm sure that would win them some customers they wouldn't have otherwise. The trouble is just that neither is developed by them, and they probably don't have the resources to put their own developers on it if any problems arise. Thus it's far easier to avoid the hassle and say: "These are the options we have a stake in, and thus they are officially supported. But it's Linux and open source, so you can use anything that works for you; just don't expect support for each edge case."

As I said: I'm not employed by or otherwise related to Proxmox Server Solutions GmbH, but this is what my approach would be if I were in their place.
 
And in some cases it is also not a big priority, since only a small part of the users will use it, and thus it gets pushed back as other things have more priority.
This. Working on things like the mentioned multi-cluster management is probably more important for businesses considering migrating from VMware to Proxmox than things like ProxLB, so it also has higher priority for the developers.