Can't live migrate after dist-upgrade

churnd

Active Member
Aug 11, 2013
43
2
28
I have a 3-node cluster and all nodes were on the latest version of Proxmox v4. Last night I got an email that upgrades were available in the Enterprise repo. I live-migrated all VMs off of one node, dist-upgraded, rebooted, then tried to live-migrate back and it would not work. I had to shut down the VMs to migrate them back to their original node. That's the second time I've had this problem that I can recall... the last time was a few months ago and there have been a few update cycles since.

The same behavior was observed with the remaining two nodes. They could not live migrate to a node that had just been upgraded. Right now all nodes are upgraded since we shut the VMs down. The version we're currently on is pve-manager/4.1-13/cfb599fb (running kernel: 4.2.8-1-pve). The version we were on before was the latest version prior to that.

What information can I provide to help pinpoint why this happens? We'd like to avoid running into this problem again in the future.
 
qemu has been upgraded to 2.5 (pve-qemu-kvm package) with the last updates; that could explain the bug.

Do you have the migration problem only newnode -> oldnode (which could be normal), or also oldnode -> newnode (which would seem strange)?

Can you send the migration task log?
 
I also see it randomly (not every time) between old hosts (patched yesterday for the glibc issue), so it's not tied to qemu-kvm 2.5 vs 2.4 alone...

root@n2:~# pveversion -verbose
proxmox-ve: 4.1-34 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c)
pve-kernel-4.2.6-1-pve: 4.2.6-34
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-30
qemu-server: 4.0-46
pve-firmware: 1.1-7
libpve-common-perl: 4.0-43
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-21
pve-container: 1.0-37
pve-firewall: 2.0-15
pve-ha-manager: 1.0-18
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
openvswitch-switch: 2.3.2-2
 
In the task log:
"cannot assign requested address"
That seems to be a problem with the ssh tunnel creation (so not related to the qemu version after all).

As a workaround, can you try adding the following to

/etc/pve/datacenter.cfg
migration_unsecure: 1

This disables the ssh tunnel for migration. Can you tell me if migration works then? (Just to be sure it's not qemu-version related.)
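
For example, from a shell on any node (datacenter.cfg is shared cluster-wide via /etc/pve, so editing it once should be enough; appending with echo or any editor works):

echo "migration_unsecure: 1" >> /etc/pve/datacenter.cfg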
 
The "cannot assign requested address" message has 'always' been logged in PVE 4.x, see other posts about this; migration still works :)

With migration_unsecure set, if it's not using ssh, will it try rsh? That probably doesn't work between my nodes.

I still see the migration failure between 'old' hosts as of yesterday's patch (qemu-kvm 2.4.4); before that, I didn't.
 
With migration_unsecure: 1 set in datacenter.cfg, it still fails from oldhost to newhost:

task started by HA resource agent
Feb 18 13:48:13 starting migration of VM 204 to node 'n2' (10.45.71.2)
Feb 18 13:48:13 copying disk images
Feb 18 13:48:13 starting VM 204 on remote node 'n2'
Feb 18 13:48:19 starting ssh migration tunnel
Feb 18 13:48:19 starting online/live migration on 10.45.71.2:60000
Feb 18 13:48:19 migrate_set_speed: 8589934592
Feb 18 13:48:19 migrate_set_downtime: 0.1
Feb 18 13:48:21 ERROR: online migrate failure - aborting
Feb 18 13:48:21 aborting phase 2 - cleanup resources
Feb 18 13:48:21 migrate_cancel
Feb 18 13:48:24 ERROR: migration finished with problems (duration 00:00:16)
TASK ERROR: migration problems
 
From oldhost to oldhost, it worked fine, just as with secure migration:

Feb 18 13:51:05 starting migration of VM 204 to node 'n5' (10.45.71.5)
Feb 18 13:51:05 copying disk images
Feb 18 13:51:05 starting VM 204 on remote node 'n5'
Feb 18 13:51:10 starting ssh migration tunnel
Feb 18 13:51:11 starting online/live migration on 10.45.71.5:60001
Feb 18 13:51:11 migrate_set_speed: 8589934592
Feb 18 13:51:11 migrate_set_downtime: 0.1
Feb 18 13:51:13 migration status: active (transferred 2273201437, remaining 504193024), total 4304740352)
Feb 18 13:51:13 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
Feb 18 13:51:15 migration speed: 1024.00 MB/s - downtime 79 ms
Feb 18 13:51:15 migration status: completed
Feb 18 13:51:19 migration finished successfully (duration 00:00:19)
TASK OK
 
Maybe a bug in qemu-kvm 2.4.4, as it also sometimes locks up on any connection to a VM running this version; the only remediation then seems to be shutting it down with a stop operation...
 
I have also faced peculiarities with qemu 2.4 and its backward compatibility. Running nested virtualization with qemu 2.4 on top of qemu < 2.4 was impossible - the qemu 2.4 guest was unable to boot on both a qemu 2.3 and a qemu 2.2 host.
 
My cluster is fully upgraded now, so I can't reproduce. My experience was exactly like stefws described above: I migrated VMs off node1 to node2 (both were on the prior version), upgraded node1 and rebooted, then could not migrate from node2 back to node1. The errors I saw were just like what he posted.

Where can I find the migration task log, as it seems to have been removed from the webUI?
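
(I'd guess they still end up on disk on each node, presumably under /var/log/pve/tasks; something like

find /var/log/pve/tasks -name '*qmigrate*'

ought to list the migration task logs if that layout is right, but a pointer would be welcome.)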
 
I can confirm that I had the very same issue. I migrated VMs to one node, updated, rebooted, and couldn't migrate back. Same error as everyone else. Once I got all the nodes rebooted and on the same version, migration started working as expected. So far no more issues.
 
Does the back migration (newnode -> oldnode) work if you only update the pve-qemu-kvm package on the oldnode?
(apt-get install pve-qemu-kvm should be enough to upgrade only qemu.)
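
Roughly, on the oldnode (assuming the repo is already configured there):

apt-get update
apt-get install pve-qemu-kvm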



Note:

And just to be sure: the VM was not stopped/started on the new node before you tried to migrate back? Because in that case it is normal that migration doesn't work.

What should work is:

migrate from oldnode (qemu 2.4) to newnode (qemu 2.5 with machine config 2.4), then migrate back (qemu 2.5 with machine config 2.4) to oldnode (qemu 2.4).
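
If you want to double-check which machine type a running VM actually uses on the target node, something like this should show it (rough sketch, assuming the guest process is named 'kvm' as usual):

ps -ww -C kvm -o args= | grep -o -- '-machine [^ ]*'

If nothing is printed, the VM was started without an explicit machine type.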
 
OK, the qemu team has sent a patch (that was fast).
I have built a pve-qemu-kvm package with the patch.

Can you download:
http://odisoweb1.odiso.net/pve-qemu-kvm_2.5-6_amd64.deb

then install it on your "newnode" with
dpkg -i pve-qemu-kvm_2.5-6_amd64.deb


Then migrate a VM from the "oldnode" (qemu 2.4) to the "newnode" with the patched qemu 2.5,

then try to migrate back to the oldnode.
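
Afterwards,

dpkg -s pve-qemu-kvm | grep Version

should confirm the patched build (2.5-6) is installed. Note that VMs that are already running keep using the old binary until they are restarted or migrated.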
 
Bugger, I saw this too late; I've already patched all hypervisor nodes by migrating my VMs offline :/
Hopefully it's a fix/patch for others ending up in this scenario.

Currently live migration seems to work again with all nodes patched and rebooted.
But is there any potential value in this patch compared with the current enterprise repo, which has these:
qemu-server: 4.0-55
pve-qemu-kvm: 2.5-5
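
(To compare what each node actually has installed, running

pveversion -v | grep -E 'qemu-server|pve-qemu-kvm'

on every node should show it.)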
 
