No online migration possible to upgraded cluster nodes

woodstock

Hi all,

The latest package updates also include a new kernel, and today I observed the following:

While upgrading all nodes in our 8-node cluster (version 4.1), I was unable to do an online migration to already upgraded nodes.
This applied to HA and non-HA VMs and also to containers.
Offline migrations did work.

Questions:
Is this expected or did I miss something?
How can we do a rolling upgrade of all cluster nodes without downtime?

Thanks.

Versions before upgrade:
Code:
proxmox-ve: 4.1-34 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c)
pve-kernel-4.2.6-1-pve: 4.2.6-34
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-30
qemu-server: 4.0-46
pve-firmware: 1.1-7
libpve-common-perl: 4.0-43
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-21
pve-container: 1.0-37
pve-firewall: 2.0-15
pve-ha-manager: 1.0-18
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie

Versions after upgrade:
Code:
proxmox-ve: 4.1-37 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-13 (running version: 4.1-13/cfb599fb)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.2.8-1-pve: 4.2.8-37
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-32
qemu-server: 4.0-55
pve-firmware: 1.1-7
libpve-common-perl: 4.0-48
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-40
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-5
pve-container: 1.0-44
pve-firewall: 2.0-17
pve-ha-manager: 1.0-21
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
 
Sorry, I forgot to test from the console, so all I have are the task details from the GUI.
Not sure if that helps:

Code:
task started by HA resource agent
Feb 22 09:55:43 starting migration of VM 101 to node 'pve***2' (xxx.yyy.zzz.12)
Feb 22 09:55:43 copying disk images
Feb 22 09:55:43 starting VM 101 on remote node 'pve***2'
Feb 22 09:55:46 starting ssh migration tunnel
Feb 22 09:55:46 starting online/live migration on localhost:60000
Feb 22 09:55:46 migrate_set_speed: 8589934592
Feb 22 09:55:46 migrate_set_downtime: 0.1
Feb 22 09:55:48 ERROR: online migrate failure - aborting
Feb 22 09:55:48 aborting phase 2 - cleanup resources
Feb 22 09:55:48 migrate_cancel
Feb 22 09:55:49 ERROR: migration finished with problems (duration 00:00:06)
TASK ERROR: migration problems

Code:
Feb 22 09:32:12 starting migration of VM 103 to node 'pve***3' (xxx.yyy.zzz.13)
Feb 22 09:32:12 copying disk images
Feb 22 09:32:12 starting VM 103 on remote node 'pve***3'
Feb 22 09:32:13 starting ssh migration tunnel
Feb 22 09:32:14 starting online/live migration on localhost:60000
Feb 22 09:32:14 migrate_set_speed: 8589934592
Feb 22 09:32:14 migrate_set_downtime: 0.1
Feb 22 09:32:16 ERROR: online migrate failure - aborting
Feb 22 09:32:16 aborting phase 2 - cleanup resources
Feb 22 09:32:16 migrate_cancel
Feb 22 09:32:17 ERROR: migration finished with problems (duration 00:00:05)
TASK ERROR: migration problems
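For a future attempt, the same migration could also be run from the console to capture the full output; a minimal sketch, assuming VM 101 and a target node named 'pve2' (both taken from the masked logs above, adjust as needed):

Code:
# run on the source node; prints the complete migration log to the terminal
qm migrate 101 pve2 --online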
 
OK, there is a chance that this is related to the qemu upgrade from 2.4 to 2.5.

We recently found a bug, but that one was about migrating back from qemu 2.5 to 2.4; still, it might be related.
Can you try to install this on the target node:
http://odisoweb1.odiso.net:/pve-qemu-kvm_2.5-7_amd64.deb

and test?
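A minimal sketch of installing that test package on the target node (URL and filename exactly as given above; running VMs keep their old qemu binary, only newly started processes pick up the new one):

Code:
# on the target node
wget http://odisoweb1.odiso.net:/pve-qemu-kvm_2.5-7_amd64.deb
dpkg -i pve-qemu-kvm_2.5-7_amd64.deb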



Before doing that, if you want to help with debugging,

you can check the kvm command line on the source host:

(ps aux | grep kvm | grep <vmid>)

Then, on the target host, copy/paste that kvm command line, add at the end

-machine pc-i440fx-2.4 -incoming tcp:hostipaddress:60000 -S

remove "-daemonize", and prepend PVE_MIGRATED_FROM="yourfirstnodehostname" at the beginning:


PVE_MIGRATED_FROM="node1" kvm ........ -machine pc-i440fx-2.4 -incoming tcp:hostipaddress:60000 -S


This should start the VM in paused mode, waiting for the migration (keep your ssh console open here).


Then, on the source node, in the VM monitor in the GUI, send
"migrate tcp:targethostip:60000"

This should start the migration, and if something goes wrong, you should see the error output where the target process crashes.
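Putting those steps together, a rough sketch (node names, VM ID, IP and port are placeholders based on the logs above; the kvm arguments elided with "........" must be the ones copied from the source node):

Code:
# 1) source node: grab the full kvm command line of the VM
ps aux | grep kvm | grep 101

# 2) target node: paste that command line, drop "-daemonize",
#    prepend PVE_MIGRATED_FROM and append the machine/incoming options
PVE_MIGRATED_FROM="pve1" kvm ........ -machine pc-i440fx-2.4 -incoming tcp:xxx.yyy.zzz.12:60000 -S

# 3) source node: open the VM monitor (GUI monitor tab, or "qm monitor 101") and send:
migrate tcp:xxx.yyy.zzz.12:60000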
 
Can you try the following: on the source node, edit
/usr/share/perl5/PVE/QemuServer.pm

and find this sub:
Code:
sub qemu_machine_pxe {
    my ($vmid, $conf, $machine) = @_;

    $machine =  PVE::QemuServer::get_current_qemu_machine($vmid) if !$machine;

    foreach my $opt (keys %$conf) {
        next if $opt !~ m/^net(\d+)$/;
        my $net = PVE::QemuServer::parse_net($conf->{$opt});
        next if !$net;
        my $romfile = PVE::QemuServer::vm_mon_cmd_nocheck($vmid, 'qom-get', path => $opt, property => 'romfile');
        return $machine.".pxe" if $romfile =~ m/pxe/;
        last;
    }

}

and add "return $machine;" at the end:

Code:
sub qemu_machine_pxe {
    my ($vmid, $conf, $machine) = @_;

    $machine =  PVE::QemuServer::get_current_qemu_machine($vmid) if !$machine;

    foreach my $opt (keys %$conf) {
        next if $opt !~ m/^net(\d+)$/;
        my $net = PVE::QemuServer::parse_net($conf->{$opt});
        next if !$net;
        my $romfile = PVE::QemuServer::vm_mon_cmd_nocheck($vmid, 'qom-get', path => $opt, property => 'romfile');
        return $machine.".pxe" if $romfile =~ m/pxe/;
        last;
    }
    return $machine;
}

Then restart pvedaemon:

/etc/init.d/pvedaemon restart

and start the migration again
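Before restarting, a quick syntax check of the edited module can catch accidental typos; a minimal sketch:

Code:
perl -c /usr/share/perl5/PVE/QemuServer.pm    # should print "syntax OK"
/etc/init.d/pvedaemon restart                 # or: systemctl restart pvedaemon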