HA migration on node shutdown: Selection of failover nodes?

Aug 20, 2020
I am evaluating a 5-node cluster (nodeA to nodeE) where every VM is bound to run on a specific node by assigning it the HA group I have created for that node. Example:

Code:
group: prefer-nodeA
    nodes nodeA:1
    nofailback 0
    restricted 0

My datacenter.cfg contains ha: shutdown_policy=migrate to allow live migration during node maintenance.
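For reference, this is the line in /etc/pve/datacenter.cfg:

Code:
ha: shutdown_policy=migrate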

When I shut down nodeA, I expect all VMs running on that node to be more or less evenly distributed across the other nodes; however, all VMs are migrated to nodeB only.

Is this behaviour expected? Is there maybe even a chance to change the placement algorithm?
 
You can do some experiments using the HA simulator. For example, if I have a 3 node cluster with all 6 virtual machines on node 1 and turn node 1 off, then node 2 & node 3 each get 3 virtual machines assigned.
Code:
info    09:22:43     hardware: execute power node1 off
info    09:22:43     hardware: execute network node1 off
info    09:24:41    node3/crm: got lock 'ha_manager_lock'
info    09:24:41    node3/crm: status change slave => master
info    09:24:41    node3/crm: node 'node1': state changed from 'online' => 'unknown'
info    09:25:41    node3/crm: service 'vm:101': state changed from 'started' to 'fence'
info    09:25:41    node3/crm: service 'vm:102': state changed from 'started' to 'fence'
info    09:25:41    node3/crm: service 'vm:103': state changed from 'started' to 'fence'
info    09:25:41    node3/crm: service 'vm:104': state changed from 'started' to 'fence'
info    09:25:41    node3/crm: service 'vm:105': state changed from 'started' to 'fence'
info    09:25:41    node3/crm: service 'vm:106': state changed from 'started' to 'fence'
info    09:25:41    node3/crm: node 'node1': state changed from 'unknown' => 'fence'
email   09:25:41    node3/crm: FENCE: Try to fence node 'node1'
info    09:25:41    node3/crm: got lock 'ha_agent_node1_lock'
info    09:25:41    node3/crm: fencing: acknowledged - got agent lock for node 'node1'
info    09:25:41    node3/crm: node 'node1': state changed from 'fence' => 'unknown'
email   09:25:41    node3/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node1'
info    09:25:41    node3/crm: recover service 'vm:101' from fenced node 'node1' to node 'node2'
info    09:25:41    node3/crm: service 'vm:101': state changed from 'fence' to 'started'  (node = node2)
info    09:25:41    node3/crm: recover service 'vm:102' from fenced node 'node1' to node 'node2'
info    09:25:41    node3/crm: service 'vm:102': state changed from 'fence' to 'started'  (node = node2)
info    09:25:41    node3/crm: recover service 'vm:103' from fenced node 'node1' to node 'node3'
info    09:25:41    node3/crm: service 'vm:103': state changed from 'fence' to 'started'  (node = node3)
info    09:25:41    node3/crm: recover service 'vm:104' from fenced node 'node1' to node 'node3'
info    09:25:41    node3/crm: service 'vm:104': state changed from 'fence' to 'started'  (node = node3)
info    09:25:41    node3/crm: recover service 'vm:105' from fenced node 'node1' to node 'node2'
info    09:25:41    node3/crm: service 'vm:105': state changed from 'fence' to 'started'  (node = node2)
info    09:25:41    node3/crm: recover service 'vm:106' from fenced node 'node1' to node 'node3'
info    09:25:41    node3/crm: service 'vm:106': state changed from 'fence' to 'started'  (node = node3)
info    09:25:41    node2/lrm: starting service vm:101
info    09:25:41    node2/lrm: starting service vm:105
info    09:25:42    node3/lrm: starting service vm:103
info    09:25:42    node3/lrm: starting service vm:104
info    09:25:42    node3/lrm: starting service vm:106
info    09:25:43    node2/lrm: service status vm:101 started
info    09:25:43    node2/lrm: service status vm:105 started
info    09:25:44    node3/lrm: service status vm:103 started
info    09:25:44    node3/lrm: service status vm:104 started
info    09:25:44    node3/lrm: service status vm:106 started
info    09:25:51    node2/lrm: starting service vm:101
info    09:25:52    node3/lrm: starting service vm:103
info    09:25:52    node3/lrm: starting service vm:104
info    09:25:53    node2/lrm: service status vm:101 started
info    09:25:54    node3/lrm: service status vm:103 started
info    09:25:54    node3/lrm: service status vm:104 started
info    09:26:02    node3/lrm: starting service vm:103
info    09:26:04    node3/lrm: service status vm:103 started


I expect all VMs running on that node to be more or less evenly distributed (...) Is there maybe even a chance to change the placement algorithm?
You can Ctrl+F the High Availability chapter of our documentation for "equally split". I think that could be what you're looking for.
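For example, a group that prefers nodeA but lists the other nodes as equally weighted fallbacks could look something like this (the priorities here are purely illustrative):

Code:
group: prefer-nodeA
    nodes nodeA:2,nodeB:1,nodeC:1,nodeD:1,nodeE:1
    nofailback 0
    restricted 0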

How many are "all"?
 
Hi Dominic, thanks for looking into this issue.

The simulator behaves as expected, and I am aware of the documentation you mentioned. However, when I hit "Reboot" on NodeA, all 6 VMs currently running on that node are migrated to NodeB only.

Wouldn't you expect at least one VM to be migrated to NodeC, D, or E?
 
Wouldn't you expect at least one VM to be migrated to NodeC, D, or E?
Currently, the target node for such migrations should be determined by the number of guests on each node of the cluster. Does NodeB have fewer guests than NodeC to NodeE?
 
Does NodeB have fewer guests than NodeC to NodeE?

Indeed, NodeB does not have more VMs running than the other nodes, and after the migration it definitely has more... My practical problem is that the nodes are sized very differently (in terms of sockets/cores and RAM), so the number of running guests is not a very fitting heuristic; in my case, the weakest node gets overloaded with VMs during a node shutdown.

Could you point me to the part of the source code that implements that heuristic? My colleagues and I would love to have a look at how it could be optimized.

Your input is highly appreciated!
Best regards,
Michael
 
There is a function my $recover_fenced_service = sub { ... in pve-ha-manager that should be a good starting point.
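Roughly, the count-based selection can be pictured as follows. This is only an illustrative sketch with made-up names, not the actual pve-ha-manager code:

Code:
#!/usr/bin/perl
use strict;
use warnings;

# Simplified sketch: pick the online node that currently runs the fewest
# HA services. Function and variable names are made up for illustration.
sub select_target_node {
    my ($online_nodes, $service_count) = @_;

    my @candidates = sort {
        ($service_count->{$a} // 0) <=> ($service_count->{$b} // 0)
            || $a cmp $b    # deterministic tie-break by node name
    } @$online_nodes;

    return $candidates[0];
}

# Example: nodeB and nodeC both run 2 guests, nodeD runs 5, nodeE runs 7.
my $count = { nodeB => 2, nodeC => 2, nodeD => 5, nodeE => 7 };
print select_target_node( [qw(nodeB nodeC nodeD nodeE)], $count ), "\n";    # nodeB

Counting guests ignores both guest size and node size, which is exactly why very unevenly sized nodes run into the problem described above.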
 
Just to close and wrap up this issue:

The "Basic Scheduler" takes the number of running resources into account to find a suitable migration host. On our very differently sized hosts that lead to problems.

The newer "Static-Load Scheduler" (see https://pve.proxmox.com/wiki/High_Availability#ha_manager_crs) uses CPU and memory usage information and thus reflects our situation much better.

We have had no issues of this kind since switching to the "Static-Load Scheduler" feature (which basically resembles the idea we had in mind when thinking about an optimization).
 
