Migration loop when removing host from HA group

Hello guys,

I am reinstalling two Proxmox hypervisors to new hardware, so I need to migrate all VMs off the old hypervisors. I lowered the HA priority of the given hosts (from 1 to 0) and VMs started to get migrated out (as expected). Unfortunately, the migration of a single VM (ID 204) to the target hypervisor failed (unable to allocate enough memory, which is completely OK and expected), but the CRM daemon didn't try to choose another hypervisor from the cluster and got stuck in an endless loop.
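
For reference, lowering a node's priority can be done by editing /etc/pve/ha/groups.cfg or via ha-manager. A sketch with a shortened node list (the real group has 18 nodes; a node listed without ":<priority>" defaults to priority 0):
Code:
-> ha-manager groupset cluster --nodes "ovirt7:1,ovirt8,ovirt9:1"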

The migration task details were as expected (migration started, failed to start the VM on the target host, migration canceled):
Code:
task started by HA resource agent
2022-09-12 13:05:46 use dedicated network address for sending migration traffic (10.30.40.34)
2022-09-12 13:05:47 starting migration of VM 204 to node 'ovirt17' (10.30.40.34)
2022-09-12 13:05:47 starting VM 204 on remote node 'ovirt17'
2022-09-12 13:05:50 [ovirt17] kvm: cannot set up guest memory 'pc.ram': Cannot allocate memory
2022-09-12 13:05:51 [ovirt17] start failed: QEMU exited with code 1
2022-09-12 13:05:51 ERROR: online migrate failure - remote command failed with exit code 255
2022-09-12 13:05:51 aborting phase 2 - cleanup resources
2022-09-12 13:05:51 migrate_cancel
2022-09-12 13:05:53 ERROR: migration finished with problems (duration 00:00:07)
TASK ERROR: migration problems
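
The memory failure itself is genuine and expected; free memory on the target can be double-checked with e.g.:
Code:
-> free -h
-> pvesh get /nodes/ovirt17/status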

The VM is in the HA group 'cluster', which consists of every hypervisor in the cluster, so there is plenty of room in the cluster as a whole.
Code:
-> ha-manager config | grep vm:204 -A 4
vm:204
    group cluster
    max_relocate 2
    state started
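
As far as I understand, max_relocate only limits relocation attempts after a service fails to start, so it may simply never apply to a failed live migration like this one. If it mattered, it could presumably be raised like this (sketch):
Code:
-> ha-manager set vm:204 --max_relocate 5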

The VM was in the migrate state, as expected:
Code:
-> ha-manager status | grep 204
service vm:204 (ovirt2, migrate)

I took a look into the CRM status file, where I cannot see anything wrong:
Code:
-> cat /etc/pve/ha/manager_status  | jq '.service_status."vm:204"'
{
  "target": "ovirt17",
  "uid": "kZY7FM3JZ5u3yiFZxs5J0w",
  "node": "ovirt8",
  "state": "migrate"
}
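
To watch it cycle, something like this works:
Code:
-> watch -n 5 "jq '.service_status.\"vm:204\"' /etc/pve/ha/manager_status"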

Here is some log output from the active CRM:
Code:
-> journalctl -u pve-ha-crm | cat
-- Logs begin at Sun 2022-09-11 21:22:10 CEST, end at Mon 2022-09-12 13:26:20 CEST. --
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:113' to node 'ovirt16' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:113': state changed from 'started' to 'migrate'  (node = ovirt7, target = ovirt16)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:115' to node 'ovirt3' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:115': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt3)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:118' to node 'ovirt9' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:118': state changed from 'started' to 'migrate'  (node = ovirt7, target = ovirt9)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:130' to node 'ovirt16' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:130': state changed from 'started' to 'migrate'  (node = ovirt7, target = ovirt16)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:135' to node 'ovirt3' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:135': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt3)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:149' to node 'ovirt4' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:149': state changed from 'started' to 'migrate'  (node = ovirt7, target = ovirt4)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:151' to node 'ovirt9' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:151': state changed from 'started' to 'migrate'  (node = ovirt7, target = ovirt9)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:168' to node 'ovirt13' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:168': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt13)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:196' to node 'ovirt14' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:196': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt14)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:198' to node 'ovirt15' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:198': state changed from 'started' to 'migrate'  (node = ovirt7, target = ovirt15)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:199' to node 'ovirt16' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:199': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt16)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:204' to node 'ovirt17' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt17)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:206' to node 'ovirt3' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:206': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt3)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:209' to node 'ovirt4' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:209': state changed from 'started' to 'migrate'  (node = ovirt7, target = ovirt4)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:220' to node 'ovirt9' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:220': state changed from 'started' to 'migrate'  (node = ovirt7, target = ovirt9)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:221' to node 'ovirt10' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:221': state changed from 'started' to 'migrate'  (node = ovirt7, target = ovirt10)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: migrate service 'vm:224' to node 'ovirt11' (running)
Sep 12 12:50:01 ovirt15 pve-ha-crm[10330]: service 'vm:224': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt11)
Sep 12 12:50:21 ovirt15 pve-ha-crm[10330]: service 'vm:113': state changed from 'migrate' to 'started'  (node = ovirt16)
Sep 12 12:50:21 ovirt15 pve-ha-crm[10330]: service 'vm:118': state changed from 'migrate' to 'started'  (node = ovirt9)
Sep 12 12:50:33 ovirt15 pve-ha-crm[10330]: service 'vm:115': state changed from 'migrate' to 'started'  (node = ovirt3)
Sep 12 12:50:33 ovirt15 pve-ha-crm[10330]: service 'vm:130': state changed from 'migrate' to 'started'  (node = ovirt16)
Sep 12 12:50:33 ovirt15 pve-ha-crm[10330]: service 'vm:135': state changed from 'migrate' to 'started'  (node = ovirt3)
Sep 12 12:50:33 ovirt15 pve-ha-crm[10330]: service 'vm:168': state changed from 'migrate' to 'started'  (node = ovirt13)
Sep 12 12:50:43 ovirt15 pve-ha-crm[10330]: service 'vm:204' - migration failed (exit code 1)
Sep 12 12:50:43 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'migrate' to 'started'  (node = ovirt8)
Sep 12 12:50:43 ovirt15 pve-ha-crm[10330]: migrate service 'vm:204' to node 'ovirt17' (running)
Sep 12 12:50:43 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt17)
Sep 12 12:50:53 ovirt15 pve-ha-crm[10330]: service 'vm:149': state changed from 'migrate' to 'started'  (node = ovirt4)
Sep 12 12:50:53 ovirt15 pve-ha-crm[10330]: service 'vm:151': state changed from 'migrate' to 'started'  (node = ovirt9)
Sep 12 12:50:53 ovirt15 pve-ha-crm[10330]: service 'vm:196': state changed from 'migrate' to 'started'  (node = ovirt14)
Sep 12 12:50:53 ovirt15 pve-ha-crm[10330]: service 'vm:198': state changed from 'migrate' to 'started'  (node = ovirt15)
Sep 12 12:51:03 ovirt15 pve-ha-crm[10330]: service 'vm:199': state changed from 'migrate' to 'started'  (node = ovirt16)
Sep 12 12:51:03 ovirt15 pve-ha-crm[10330]: service 'vm:204' - migration failed (exit code 1)
Sep 12 12:51:03 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'migrate' to 'started'  (node = ovirt8)
Sep 12 12:51:03 ovirt15 pve-ha-crm[10330]: migrate service 'vm:204' to node 'ovirt17' (running)
Sep 12 12:51:03 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt17)
Sep 12 12:51:13 ovirt15 pve-ha-crm[10330]: service 'vm:206': state changed from 'migrate' to 'started'  (node = ovirt3)
Sep 12 12:51:13 ovirt15 pve-ha-crm[10330]: service 'vm:220': state changed from 'migrate' to 'started'  (node = ovirt9)
Sep 12 12:51:23 ovirt15 pve-ha-crm[10330]: service 'vm:204' - migration failed (exit code 1)
Sep 12 12:51:23 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'migrate' to 'started'  (node = ovirt8)
Sep 12 12:51:23 ovirt15 pve-ha-crm[10330]: service 'vm:221': state changed from 'migrate' to 'started'  (node = ovirt10)
Sep 12 12:51:23 ovirt15 pve-ha-crm[10330]: service 'vm:224': state changed from 'migrate' to 'started'  (node = ovirt11)
Sep 12 12:51:24 ovirt15 pve-ha-crm[10330]: migrate service 'vm:204' to node 'ovirt17' (running)
Sep 12 12:51:24 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt17)
Sep 12 12:51:43 ovirt15 pve-ha-crm[10330]: service 'vm:204' - migration failed (exit code 1)
Sep 12 12:51:43 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'migrate' to 'started'  (node = ovirt8)
Sep 12 12:51:43 ovirt15 pve-ha-crm[10330]: service 'vm:209': state changed from 'migrate' to 'started'  (node = ovirt4)
Sep 12 12:51:43 ovirt15 pve-ha-crm[10330]: migrate service 'vm:204' to node 'ovirt17' (running)
Sep 12 12:51:43 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt17)
Sep 12 12:52:03 ovirt15 pve-ha-crm[10330]: service 'vm:204' - migration failed (exit code 1)
Sep 12 12:52:03 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'migrate' to 'started'  (node = ovirt8)
Sep 12 12:52:03 ovirt15 pve-ha-crm[10330]: migrate service 'vm:204' to node 'ovirt17' (running)
Sep 12 12:52:03 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt17)
Sep 12 12:52:23 ovirt15 pve-ha-crm[10330]: service 'vm:204' - migration failed (exit code 1)
Sep 12 12:52:23 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'migrate' to 'started'  (node = ovirt8)
Sep 12 12:52:23 ovirt15 pve-ha-crm[10330]: migrate service 'vm:204' to node 'ovirt17' (running)
Sep 12 12:52:23 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt17)
Sep 12 12:52:43 ovirt15 pve-ha-crm[10330]: service 'vm:204' - migration failed (exit code 1)
Sep 12 12:52:43 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'migrate' to 'started'  (node = ovirt8)
Sep 12 12:52:44 ovirt15 pve-ha-crm[10330]: migrate service 'vm:204' to node 'ovirt17' (running)
Sep 12 12:52:44 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt17)
Sep 12 12:53:03 ovirt15 pve-ha-crm[10330]: service 'vm:204' - migration failed (exit code 1)
Sep 12 12:53:03 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'migrate' to 'started'  (node = ovirt8)
Sep 12 12:53:03 ovirt15 pve-ha-crm[10330]: migrate service 'vm:204' to node 'ovirt17' (running)
Sep 12 12:53:03 ovirt15 pve-ha-crm[10330]: service 'vm:204': state changed from 'started' to 'migrate'  (node = ovirt8, target = ovirt17)

As a last resort I tried stopping the leader CRM to force a re-election, but nothing really changed and the new leader just continued the migration loop.
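
Roughly what I did (on the node that was the current CRM master; sketch):
Code:
-> systemctl restart pve-ha-crm
-> ha-manager status | head -n 2   # shows which node took over as master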

I am using an older version of Proxmox:
Code:
-> pveversion
pve-manager/6.4-13/9f411e79 (running kernel: 5.4.174-2-pve)

Do you have any clue what I might be doing wrong? I am pretty sure this worked absolutely flawlessly before, so this migration loop was quite a surprise to me, but I am unable to find any misconfiguration.
 
Hi,
can you share the output of cat /etc/pve/ha/groups.cfg? Are there multiple nodes with the same priority? It seems that the CRM selects a node once and then always tries to migrate to that node, rather than switching targets at some point.
 
Sure, all hypervisors are in the given group 'cluster' with priority 1. When I want to do maintenance on some host, I just set its priority to 0.

The interesting part of groups.cfg looked like this:
Code:
group: cluster
    comment Whole cluster group
    nodes ovirt9:1,ovirt17:1,ovirt14:1,ovirt11:1,ovirt3:1,ovirt12:1,ovirt6:1,ovirt4:1,ovirt16:1,ovirt13:1,ovirt10:1,ovirt2:1,ovirt5:1,ovirt1:1,ovirt18:1,ovirt8,ovirt7:1,ovirt15:1
    nofailback 0
    restricted 0

All VMs that were migrated away were also in the same HA group 'cluster'. Every other VM worked on the first try; only this one VM had an issue because it didn't fit on the chosen target.
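
In the meantime, a manual way to break the loop might be to request migration to a different target yourself (untested sketch; ovirt9 is just an arbitrary example):
Code:
-> ha-manager migrate vm:204 ovirt9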