[SOLVED] HA migration stuck / does nothing

johanng

Member
Jan 9, 2021
Hi there,

Since today, I have had an issue with HA online migration. When I try to migrate a running VM, I get the following output:

Code:
Requesting HA migration for VM 127 to node srv1
TASK OK

After that, nothing happens.

If I shut down the VM, I can migrate it with HA. And if I remove the HA entry for the VM, I can also migrate it, even while it is online.



I recently tried to change the corosync ring IP. Because there was an issue, I switched back to the old versions of /etc/pve/corosync.conf and /etc/hosts.

Does anyone have an idea?
 
Postscript:

It seems that the GUI is not well synced with the CLI; the states shown in the GUI and on the CLI are not the same.
(Screenshots attached: Bildschirmfoto 2021-08-13 um 23.30.51.png, Bildschirmfoto 2021-08-13 um 23.31.11.png)
 
I have found out a few more things:

ha-manager status -v shows that node 1 (srv1) is permanently stuck in "fence" mode:


Code:
"srv1" : {
         "mode" : "active",
         "results" : {
            "25njyxxxxxxxDPIZFYQPw" : {
               "exit_code" : 0,
               "sid" : "vm:127",
               "state" : "started"
            }
         },
         "state" : "active",
         "timestamp" : 1629829993
      },
      "srv2" : {
         "mode" : "active",
         "results" : {
            "UZOxxxxxxf0QopsNzdSQ" : {
               "exit_code" : 7,
               "sid" : "ct:116",
               "state" : "started"
            }
         },
         "state" : "active",
         "timestamp" : 1629829996
      },
      "srv3" : {
         "mode" : "active",
         "results" : {},
         "state" : "wait_for_agent_lock",
         "timestamp" : 1629829993
      }


In addition, the "results" for srv1 and srv2 list VMs and CTs that were never configured for HA via the GUI. srv3 has been stuck in "wait_for_agent_lock" mode for days.
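For comparison, the HA resources that are actually configured can be listed with standard PVE commands like these (shown here only as a suggestion, not something I ran in the original post):

Code:
# list configured HA resources
ha-manager config
# or inspect the raw config file
cat /etc/pve/ha/resources.cfg
# and compare against what the managers report
ha-manager status -v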


Can someone please help me?
 
When I start a VM migration with HA, the log says:

Code:
Aug 24 20:41:06 srv2 pve-ha-crm[2634]: crm command error - node not online: migrate vm:105 srv1

I don't understand why. Quorum is okay and the servers can ping each other.
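For reference, quorum and basic connectivity can be checked with standard tools like these (srv1 is just an example target):

Code:
# cluster membership and quorum
pvecm status
# corosync link status
corosync-cfgtool -s
# basic reachability
ping -c 3 srv1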



Sorry for posting so much, but I don't know what to do.
 
Hi,
could you share the full output of ha-manager status -v?

On the relevant nodes (endpoints of the migration or HA master), is there anything interesting in /var/log/syslog? Did you already try restarting the HA services with systemctl restart pve-ha-crm.service pve-ha-lrm.service? Please also provide the output of pveversion -v.
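For reference, a minimal sketch of those checks (standard PVE/Debian tools; the grep pattern is just one way to filter the log):

Code:
# show recent HA-related syslog entries on the migration endpoints and the HA master
grep -iE 'pve-ha-(crm|lrm)' /var/log/syslog | tail -n 50

# restart the HA services
systemctl restart pve-ha-crm.service pve-ha-lrm.service

# collect version information
pveversion -v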

If there are no services on a node, wait_for_agent_lock is its idle state.
 
Hi,

I recently restarted the nodes due to updates. srv1 has now been in the "fence" state for about 2 weeks.

HA migration etc. works on srv2 and srv3, which is why all VMs are currently on these nodes.

Full output of
Code:
ha-manager status -v

Code:
root@srv2:~# ha-manager status -v
quorum OK
master srv2 (active, Thu Oct  7 20:45:47 2021)
lrm srv1 (active, Thu Oct  7 20:45:44 2021)
lrm srv2 (active, Thu Oct  7 20:45:52 2021)
lrm srv3 (active, Thu Oct  7 20:45:44 2021)
service ct:107 (srv3, stopped)
service ct:110 (srv2, stopped)
service ct:111 (srv2, started)
service ct:112 (srv3, started)
service ct:114 (srv2, started)
service ct:115 (srv2, stopped)
service ct:116 (srv2, started)
service ct:117 (srv2, stopped)
service ct:119 (srv2, started)
service ct:120 (srv2, stopped)
service ct:121 (srv2, stopped)
service ct:122 (srv2, started)
service ct:124 (srv2, stopped)
service ct:125 (srv2, started)
service ct:126 (srv2, started)
service ct:151 (srv3, started)
service vm:100 (srv2, stopped)
service vm:101 (srv3, started)
service vm:102 (srv3, stopped)
service vm:103 (srv3, started)
service vm:104 (srv2, started)
service vm:105 (srv3, stopped)
service vm:106 (srv3, started)
service vm:108 (srv2, started)
service vm:109 (srv3, started)
service vm:113 (srv3, started)
service vm:118 (srv3, started)
service vm:127 (srv2, stopped)
service vm:152 (srv3, started)
full cluster state:
{
   "lrm_status" : {
      "srv1" : {
         "mode" : "active",
         "results" : {
            "IttCKFmEqRR0oQppcCK/dw" : {
               "exit_code" : 0,
               "sid" : "vm:118",
               "state" : "started"
            },
            "pVtvNn/ke6wGqbPy8T00SQ" : {
               "exit_code" : 0,
               "sid" : "ct:120",
               "state" : "stopped"
            }
         },
         "state" : "active",
         "timestamp" : 1633632344
      },
      "srv2" : {
         "mode" : "active",
         "results" : {
            "5gZ+cGO2HAca6BHjQLwOSQ" : {
               "exit_code" : 0,
               "sid" : "ct:111",
               "state" : "started"
            },
            "8jU8EMFHFCmE3gkNGMG1cw" : {
               "exit_code" : 0,
               "sid" : "ct:114",
               "state" : "started"
            },
            "Bg1b/Fq7ZIn2Up8t9jvHZg" : {
               "exit_code" : 0,
               "sid" : "ct:115",
               "state" : "stopped"
            },
            "Cun557sFiGdukKHrNINgyw" : {
               "exit_code" : 0,
               "sid" : "ct:121",
               "state" : "stopped"
            },
            "EFg5yCXr5ixVyl0rgGtdiw" : {
               "exit_code" : 0,
               "sid" : "ct:119",
               "state" : "started"
            },
            "HFbk3wL93BZyICge4V/D4w" : {
               "exit_code" : 0,
               "sid" : "vm:100",
               "state" : "stopped"
            },
            "KUs7Mw6xhjmIcFGF/c7UCA" : {
               "exit_code" : 0,
               "sid" : "ct:116",
               "state" : "started"
            },
            "MU7FkOF+ggDsi/NzG2/WCA" : {
               "exit_code" : 0,
               "sid" : "ct:120",
               "state" : "stopped"
            },
            "NYHWPcZ9Z1d6G/ih0GBJyw" : {
               "exit_code" : 0,
               "sid" : "ct:125",
               "state" : "started"
            },
            "NvVKVQ2m0BSOBuyX65XtjA" : {
               "exit_code" : 0,
               "sid" : "ct:126",
               "state" : "started"
            },
            "Qm+sOIYL3SCtelhOHYLtmg" : {
               "exit_code" : 0,
               "sid" : "vm:108",
               "state" : "started"
            },
            "cBwwULZhFhwb80/aqYA/aw" : {
               "exit_code" : 0,
               "sid" : "ct:117",
               "state" : "stopped"
            },
            "gbAws1VvZW6/oXUfluiGbg" : {
               "exit_code" : 0,
               "sid" : "vm:127",
               "state" : "stopped"
            },
            "hgIGXFjTY+VdL0uT6TJ27g" : {
               "exit_code" : 0,
               "sid" : "ct:110",
               "state" : "stopped"
            },
            "j5A0UFAbwzbcroDQs/M8Zw" : {
               "exit_code" : 0,
               "sid" : "ct:122",
               "state" : "started"
            },
            "sqfexfgijfMdVcL+Zl98Lw" : {
               "exit_code" : 0,
               "sid" : "ct:124",
               "state" : "stopped"
            },
            "u0wlhH07y73mOIwQFlpECA" : {
               "exit_code" : 0,
               "sid" : "vm:104",
               "state" : "started"
            }
         },
         "state" : "active",
         "timestamp" : 1633632352
      },
      "srv3" : {
         "mode" : "active",
         "results" : {
            "7eMlItnAtPgHJr10Leml3A" : {
               "exit_code" : 0,
               "sid" : "vm:101",
               "state" : "started"
            },
            "94E3RtSxW3ckMqV1YhP3Lw" : {
               "exit_code" : 0,
               "sid" : "vm:106",
               "state" : "started"
            },
            "Av+CNxI2ZzxR5e1Vqybp0A" : {
               "exit_code" : 0,
               "sid" : "vm:113",
               "state" : "started"
            },
            "Io8L9xr3064Dc3Y1P4IcZQ" : {
               "exit_code" : 0,
               "sid" : "vm:102",
               "state" : "stopped"
            },
            "JYRFepCENRlJKRwprdK2pQ" : {
               "exit_code" : 0,
               "sid" : "vm:118",
               "state" : "started"
            },
            "PU57sfk+pF7GCteIPJ7scg" : {
               "exit_code" : 0,
               "sid" : "ct:112",
               "state" : "started"
            },
            "SWBdzgnNiNOugcS1nQHnHw" : {
               "exit_code" : 0,
               "sid" : "vm:103",
               "state" : "started"
            },
            "TOq09/GoFHvV5FtH9ah8lw" : {
               "exit_code" : 0,
               "sid" : "vm:105",
               "state" : "stopped"
            },
            "kjeaA8o17lNQVMvjOMMFJw" : {
               "exit_code" : 0,
               "sid" : "ct:107",
               "state" : "stopped"
            },
            "o8A4biCqJ4FfWvrDponYOA" : {
               "exit_code" : 0,
               "sid" : "vm:152",
               "state" : "started"
            },
            "r6e7M2RvXRueEj7y2T5I+g" : {
               "exit_code" : 0,
               "sid" : "ct:151",
               "state" : "started"
            },
            "wj4TZ9V9nMg2eI36//EhcQ" : {
               "exit_code" : 0,
               "sid" : "vm:109",
               "state" : "started"
            }
         },
         "state" : "active",
         "timestamp" : 1633632344
      }
   },
   "manager_status" : {
      "master_node" : "srv2",
      "node_status" : {
         "srv1" : "fence",
         "srv2" : "online",
         "srv3" : "online"
      },
      "service_status" : {
         "ct:107" : {
            "node" : "srv3",
            "state" : "stopped",
            "uid" : "kjeaA8o17lNQVMvjOMMFJw"
         },
         "ct:110" : {
            "node" : "srv2",
            "state" : "stopped",
            "uid" : "hgIGXFjTY+VdL0uT6TJ27g"
         },
         "ct:111" : {
            "node" : "srv2",
            "running" : 1,
            "state" : "started",
            "uid" : "5gZ+cGO2HAca6BHjQLwOSQ"
         },
         "ct:112" : {
            "node" : "srv3",
            "running" : 1,
            "state" : "started",
            "uid" : "lSlMUFTHf/ynyN8j0DyNjA"
         },
         "ct:114" : {
            "node" : "srv2",
            "running" : 1,
            "state" : "started",
            "uid" : "8jU8EMFHFCmE3gkNGMG1cw"
         },
         "ct:115" : {
            "node" : "srv2",
            "state" : "stopped",
            "uid" : "Bg1b/Fq7ZIn2Up8t9jvHZg"
         },
         "ct:116" : {
            "node" : "srv2",
            "running" : 1,
            "state" : "started",
            "uid" : "KUs7Mw6xhjmIcFGF/c7UCA"
         },
         "ct:117" : {
            "node" : "srv2",
            "state" : "stopped",
            "uid" : "cBwwULZhFhwb80/aqYA/aw"
         },
         "ct:119" : {
            "node" : "srv2",
            "running" : 1,
            "state" : "started",
            "uid" : "EFg5yCXr5ixVyl0rgGtdiw"
         },
         "ct:120" : {
            "node" : "srv2",
            "state" : "stopped",
            "uid" : "MU7FkOF+ggDsi/NzG2/WCA"
         },
         "ct:121" : {
            "node" : "srv2",
            "state" : "stopped",
            "uid" : "Cun557sFiGdukKHrNINgyw"
         },
         "ct:122" : {
            "node" : "srv2",
            "running" : 1,
            "state" : "started",
            "uid" : "j5A0UFAbwzbcroDQs/M8Zw"
         },
         "ct:124" : {
            "node" : "srv2",
            "state" : "stopped",
            "uid" : "sqfexfgijfMdVcL+Zl98Lw"
         },
         "ct:125" : {
            "node" : "srv2",
            "running" : 1,
            "state" : "started",
            "uid" : "NYHWPcZ9Z1d6G/ih0GBJyw"
         },
         "ct:126" : {
            "node" : "srv2",
            "running" : 1,
            "state" : "started",
            "uid" : "NvVKVQ2m0BSOBuyX65XtjA"
         },
         "ct:151" : {
            "node" : "srv3",
            "running" : 1,
            "state" : "started",
            "uid" : "7HpJYP2IzTLwkXB+MyBaFQ"
         },
         "vm:100" : {
            "node" : "srv2",
            "state" : "stopped",
            "uid" : "HFbk3wL93BZyICge4V/D4w"
         },
         "vm:101" : {
            "node" : "srv3",
            "running" : 1,
            "state" : "started",
            "uid" : "lmpzgzxVHz+oNCIub7cKmg"
         },
         "vm:102" : {
            "node" : "srv3",
            "state" : "stopped",
            "uid" : "Io8L9xr3064Dc3Y1P4IcZQ"
         },
         "vm:103" : {
            "node" : "srv3",
            "running" : 1,
            "state" : "started",
            "uid" : "1QiDCArnWEWGqXvP9qu2eg"
         },
         "vm:104" : {
            "node" : "srv2",
            "running" : 1,
            "state" : "started",
            "uid" : "u0wlhH07y73mOIwQFlpECA"
         },
         "vm:105" : {
            "node" : "srv3",
            "state" : "stopped",
            "uid" : "TOq09/GoFHvV5FtH9ah8lw"
         },
         "vm:106" : {
            "node" : "srv3",
            "running" : 1,
            "state" : "started",
            "uid" : "lZLVcWUNe7w4EiSJApyaxg"
         },
         "vm:108" : {
            "node" : "srv2",
            "running" : 1,
            "state" : "started",
            "uid" : "Qm+sOIYL3SCtelhOHYLtmg"
         },
         "vm:109" : {
            "node" : "srv3",
            "running" : 1,
            "state" : "started",
            "uid" : "BB3cHdg4hh+OlsTPV1jV+w"
         },
         "vm:113" : {
            "node" : "srv3",
            "running" : 1,
            "state" : "started",
            "uid" : "sOHdiCI+9yTQG9OSyhvjgg"
         },
         "vm:118" : {
            "node" : "srv3",
            "running" : 1,
            "state" : "started",
            "uid" : "uif7361JSqJb0HSMUzXwqg"
         },
         "vm:127" : {
            "node" : "srv2",
            "state" : "stopped",
            "uid" : "gbAws1VvZW6/oXUfluiGbg"
         },
         "vm:152" : {
            "node" : "srv3",
            "running" : 1,
            "state" : "started",
            "uid" : "ToJcc0X6Eb5rTSqm60a/Zg"
         }
      },
      "timestamp" : 1633632347
   },
   "quorum" : {
      "node" : "srv2",
      "quorate" : "1"
   }
}

I restarted pve-ha-crm.service and pve-ha-lrm.service multiple times, and I also rebooted the nodes.
 
Output of
Code:
pveversion -v

srv1:
Code:
root@srv1:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-5
pve-kernel-helper: 6.4-5
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.14-pve1~bpo10
ceph-fuse: 15.2.14-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1


srv2:
Code:
root@srv2:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-5
pve-kernel-helper: 6.4-5
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.14-pve1~bpo10
ceph-fuse: 15.2.14-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1


srv3:
Code:
root@srv3:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-5
pve-kernel-helper: 6.4-5
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.14-pve1~bpo10
ceph-fuse: 15.2.14-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1


Here is the syslog output from when I added a new HA service and tried to migrate it from srv2 to srv1:
Code:
Oct  7 20:53:57 srv2 pve-ha-crm[2439]: adding new service 'ct:123' on node 'srv2'
Oct  7 20:54:00 srv2 systemd[1]: Starting Proxmox VE replication runner...
Oct  7 20:54:01 srv2 systemd[1]: pvesr.service: Succeeded.
Oct  7 20:54:01 srv2 systemd[1]: Started Proxmox VE replication runner.
Oct  7 20:54:10 srv2 pvedaemon[146860]: <root@pam> starting task UPID:srv2:00229A8B:1407ADBC:615F4252:hamigrate:123:root@pam:
Oct  7 20:54:11 srv2 pvedaemon[146860]: <root@pam> end task UPID:srv2:00229A8B:1407ADBC:615F4252:hamigrate:123:root@pam: OK
Oct  7 20:54:17 srv2 pve-ha-crm[2439]: crm command error - node not online: migrate ct:123 srv1


Each node can ping the others, and /etc/hosts also seems to be configured correctly.
 
I think I was able to reproduce the issue and have sent a patch for discussion. It has not been reviewed yet, so if you'd like to try it, do so at your own risk!

Alternatively, you could stop pve-ha-crm.service on all nodes and, after making sure they have stopped, edit the file /etc/pve/ha/manager_status and change the node's status from fence to unknown. After starting the services again, the node should be recognized as online again.
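A minimal sketch of that procedure (the editor is just an example; adapt the node name to your cluster):

Code:
# on every node: stop the CRM and verify it has really stopped
systemctl stop pve-ha-crm.service
systemctl status pve-ha-crm.service

# on one node: edit the cluster-wide manager status (lives on pmxcfs) and change
#   "node_status" : { "srv1" : "fence", ... }
# to
#   "node_status" : { "srv1" : "unknown", ... }
nano /etc/pve/ha/manager_status

# on every node: start the CRM again
systemctl start pve-ha-crm.service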
 
Thanks for your reply.

I want to try the approach of stopping pve-ha-crm.service first.

Removing all HA services and groups should be fine and has no effect on running VMs and CTs, right?
 
Thank you!!

Stopping pve-ha-crm.service and editing the manager_status has solved the problem.

The status changed from unknown to online when I created an HA group and added a service.

Thanks for your help. I hope this thread helps others too.

Regards,
johann
 
