I'm getting a live migration failure between any two Proxmox nodes in a 5-node cluster where
one node is running proxmox-ve 4.0-16 and the other is running 4.0-19. The end of
the rather unhelpful error message is no doubt familiar:
Nov 04 16:54:17 ERROR: online migrate failure - aborting
Nov 04 16:54:17 aborting phase 2 - cleanup resources
Nov 04 16:54:17 migrate_cancel
Nov 04 16:54:18 ERROR: migration finished with problems (duration 00:00:06)
TASK ERROR: migration problems
This tells us exactly nothing useful, and the various logs on the system don't reveal anything helpful either. It would
be nice to know what the "migration problems" actually are! I'm using iSCSI for storage, so it's not a local-storage
issue.
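For anyone else hunting for the real error: the summary above is just the truncated task log. The full per-task logs live under /var/log/pve/tasks/ on each node, and in my experience the actual QEMU-level reason often lands in the *target* node's log for the incoming start task rather than in the source's migration log. A sketch of where to look (task-type names and paths are what I believe PVE 4.x uses; adjust as needed):

```shell
# On the source node: find the failed migration task in the task index
# (each line is a UPID; the matching log file sits under /var/log/pve/tasks/<hash>/)
grep qmigrate /var/log/pve/tasks/index 2>/dev/null | tail -n 3

# On the target node: the matching incoming-start task usually names the
# actual failure (device mismatch, CPU flag mismatch, version mismatch, ...)
grep qmstart /var/log/pve/tasks/index 2>/dev/null | tail -n 3

# Check the journal around the failure timestamp on both nodes as well
journalctl --since "16:54" --until "16:55" 2>/dev/null | tail -n 20
```

Off a Proxmox node these just print nothing, but on the cluster they narrow things down much faster than the GUI's summary.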
Older node:
proxmox-ve: 4.0-16 (running kernel: 4.2.2-1-pve)
pve-manager: 4.0-50 (running version: 4.0-50/d3a6b7e5)
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-23
qemu-server: 4.0-31
pve-firmware: 1.1-7
libpve-common-perl: 4.0-32
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-27
pve-libspice-server1: 0.12.5-1
vncterm: 1.2-1
pve-qemu-kvm: 2.4-10
pve-container: 1.0-10
pve-firewall: 2.0-12
pve-ha-manager: 1.0-10
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.3-1
lxcfs: 0.9-pve2
cgmanager: 0.37-pve2
criu: 1.6.0-1
zfsutils: 0.6.5-pve4~jessie
Newer node:
proxmox-ve: 4.0-19 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-19
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-20
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve4~jessie
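Before attempting a cross-version live migration, it's worth diffing the two nodes' package lists to see exactly what's drifted. A minimal sketch using excerpts of the two `pveversion -v` outputs above (pasted inline here for illustration; in practice you'd capture each node's full output, e.g. via `ssh <node> pveversion -v`):

```shell
# Save each node's `pveversion -v` output, then diff to spot mismatches.
# These excerpts are taken from the two listings above.
cat > /tmp/old-node.txt <<'EOF'
proxmox-ve: 4.0-16
qemu-server: 4.0-31
pve-qemu-kvm: 2.4-10
libpve-storage-perl: 4.0-27
EOF
cat > /tmp/new-node.txt <<'EOF'
proxmox-ve: 4.0-19
qemu-server: 4.0-35
pve-qemu-kvm: 2.4-12
libpve-storage-perl: 4.0-29
EOF
# Any output here means the nodes disagree on package versions
diff /tmp/old-node.txt /tmp/new-node.txt
```

In this case qemu-server and pve-qemu-kvm both jumped between releases, which is exactly the sort of pair you'd suspect for a live-migration break.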
I tried running the older kernel (4.2.2-1-pve) on the newer node, since it was still installed, but that didn't fix things.
In desperation, I upgraded the older node to the newer node's package versions and - bingo - live migration finally worked.
So it looks like one of the recent Proxmox packages breaks live migration (not for the first time!) between it and slightly
older 4.0 nodes, and I've now got to do a whole load of offline migrations to get all my nodes up to the 4.0-19 release... grrr!
My suspicion is that very little QA is done between two consecutive releases (whether between two minor releases,
as here, or between the last release of one major version and the first of the next) to ensure that
live migration works both backwards and forwards. Live migration is *crucial* for upgrades so that downtime can be
avoided. All the Proxmox upgrade docs tell you to migrate your VMs away before upgrading a cluster node, and yet half the
time live migration between releases seems to be broken and we have to suffer downtime during upgrades :-(