Live migration reliability

Thread starter FinnTux
Start date Oct 6, 2011

F

FinnTux

Renowned Member

Oct 6, 2011

#1

Anyone tested live migration yet? I have problems with migration reliability. Especially ubuntu 11.10 beta2 (and xubuntu 11.10 beta2) seems to be a bit tricky. Migration succeeds often, sometimes VM is frozen after migration but quite often VM just shuts down on the target host. Web UI says "TASK OK" in all cases. Setting lover migration speed seems to help a bit and so does setting migration downtime to 0 (default is 1 second). There seems to be no difference between virtio vs ide or virtio vs e1000.

I have two node cluster and two types of shared storage. Drbd and iSCSI target both as backend for logical volumes. Same problem with both. Host machines are quite different though. Other is AMD Athlon X4 and the other one is Dual Core 2 E8500. I realize best practice is to use indentical machines but on the other hand migration is supposed to work across CPU vendors (and it does but not reliably).

I'm not sure that this is purely proxmox problem because I noticed the same thing when I was developing my own management script.

So, could someone test (preferably ubuntu 11.10 beta2) and report how live migration works.

Thanks

tom

Proxmox Staff Member

Staff member

Oct 6, 2011

#2

See https://bugzilla.proxmox.com/show_bug.cgi?id=7

can you test without a nic?

F

FinnTux

Renowned Member

Oct 6, 2011

#3

First migration without NIC and VM froze (speed 100 and downtime 1)

[TABLE="class: x-grid-table x-grid-table-resizer, width: 777"]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:12 starting migration of VM 104 to node 'vmsrv2' (192.168.2.172)

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:12 copying disk images

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:13 starting VM on remote node 'vmsrv2'

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

*** EHCI support is under development ***

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:13 starting migration tunnel

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:13 starting online/live migration

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:15 migration status: active (transferred 104567KB, remaining 509056KB), total 803136KB)

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:17 migration status: active (transferred 217207KB, remaining 398120KB), total 803136KB)

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:19 migration status: active (transferred 326743KB, remaining 289996KB), total 803136KB)

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:21 migration status: active (transferred 435327KB, remaining 182460KB), total 803136KB)

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:23 migration status: active (transferred 548312KB, remaining 67768KB), total 803136KB)

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:25 migration status: completed

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:25 migration speed: 64.00 MB/s

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

Oct 06 22:46:27 migration finished successfuly (duration 00:00:15)

[/TD]
[/TR]
[TR="class: x-grid-row x-selectable"]
[TD="class: x-grid-cell x-grid-cell-gridcolumn-14776 x-grid-cell-first"]

TASK OK

[/TD]
[/TR]
[/TABLE]

F

FinnTux

Renowned Member

Oct 6, 2011

#4

Problem still there. No NIC, speed 50 and downtime 0.

J

janzun

New Member

Oct 7, 2011

#5

Be carefull between AMD and Intel KVM live migration, i've experienced strange behaviours in some cases (maybe NX extension?)

I dont remember the direction, but it works for me only in one way and, in the opposite, VM is frozen (Windows XP often), or lost network (Linux often)... so they need a reboot. In the other hand, it works great for same CPU type.

In my case, I know its not a Proxmox error, but a KVM issue. Same thing in vmware hypervisor, for example.

Thats my experience, however yours can be different.

Regars,

Jesús Feliz Fernández.

F

FinnTux

Renowned Member

Oct 7, 2011

#6

janzun said:
I dont remember the direction, but it works for me only in one way and, in the opposite, VM is frozen (Windows XP often), or lost network (Linux often)... so they need a reboot. In the other hand, it works great for same CPU type.

Yes, I'm aware of the possible problems when migrating between different CPUs. On the other hand it occasionally works either way but hangs/crashes way too often.

Just tested ubuntu 11.10beta migration from E8500 to Q6600 and the same problem persist. VM was stopped after migration.

F

Frido Roose

Guest

Jan 4, 2012

#7

Hey, I also noticed unreliable online migrations from time to time. Guest is a SLES 11 SP1 machine. Also tried without a NIC attached, but the results are the same: the web gui tells the migration is ok, and then within 10 seconds later, the virtual machine freezes (I've seen 100% guest cpu time at this point) or appears to be no longer running.

I don't get much information from the logs on the target host, except for the messages below in red.

Is there a way to add more verbose logging, or perhaps another log file?

Online migration test that failed:

Proxmox webgui task log:
Jan 04 16:50:17 starting migration of VM 100 to node 'raiti' (192.168.110.55)
Jan 04 16:50:17 copying disk images
Jan 04 16:50:17 starting VM 100 on remote node 'raiti'
Jan 04 16:50:17 starting migration tunnel
Jan 04 16:50:18 starting online/live migration on port 60000
Jan 04 16:50:20 migration status: active (transferred 116894KB, remaining 154184KB), total 541056KB)
Jan 04 16:50:22 migration status: completed
Jan 04 16:50:22 migration speed: 128.00 MB/s
Jan 04 16:50:23 migration finished successfuly (duration 00:00:07)
TASK OK

daemon.log:
Jan 4 16:50:33 raiti pmxcfs[1805]: [status] received log
Jan 4 16:50:34 raiti qm[3153]: <root@pam> starting task UPID:raiti:00000C52:0002C08B:4F04754A:qmstart:100:root@pam:
Jan 4 16:50:34 raiti qm[3154]: start VM 100: UPID:raiti:00000C52:0002C08B:4F04754A:qmstart:100:root@pam:
Jan 4 16:50:34 raiti multipathd: dm-4: add map (uevent)
Jan 4 16:50:34 raiti qm[3153]: <root@pam> end task UPID:raiti:00000C52:0002C08B:4F04754A:qmstart:100:root@pam: OK
Jan 4 16:50:40 raiti pmxcfs[1805]: [status] received log

syslog:
Jan 4 16:50:33 raiti pmxcfs[1805]: [status] received log
Jan 4 16:50:34 raiti qm[3153]: <root@pam> starting task UPID:raiti:00000C52:0002C08B:4F04754A:qmstart:100:root@pam:
Jan 4 16:50:34 raiti qm[3154]: start VM 100: UPID:raiti:00000C52:0002C08B:4F04754A:qmstart:100:root@pam:
Jan 4 16:50:34 raiti multipathd: dm-4: add map (uevent)
Jan 4 16:50:34 raiti kernel: device tap100i0 entered promiscuous mode
Jan 4 16:50:34 raiti kernel: vmbr0: port 2(tap100i0) entering forwarding state
Jan 4 16:50:34 raiti qm[3153]: <root@pam> end task UPID:raiti:00000C52:0002C08B:4F04754A:qmstart:100:root@pam: OK
Jan 4 16:50:38 raiti kernel: vmbr0: port 2(tap100i0) entering disabled state
Jan 4 16:50:38 raiti kernel: vmbr0: port 2(tap100i0) entering disabled state
Jan 4 16:50:40 raiti pmxcfs[1805]: [status] received log

Another online migration test that failed:

Proxmox webgui task log:
Jan 04 17:10:34 starting migration of VM 100 to node 'raiti' (192.168.110.55)
Jan 04 17:10:34 copying disk images
Jan 04 17:10:34 starting VM 100 on remote node 'raiti'
Jan 04 17:10:34 starting migration tunnel
Jan 04 17:10:35 starting online/live migration on port 60000
Jan 04 17:10:37 migration status: active (transferred 115124KB, remaining 146652KB), total 541056KB)
Jan 04 17:10:39 migration status: completed
Jan 04 17:10:39 migration speed: 128.00 MB/s
Jan 04 17:10:40 migration finished successfuly (duration 00:00:07)
TASK OK

syslog:
Jan 4 17:10:50 raiti pmxcfs[1805]: [status] received log
Jan 4 17:10:50 raiti qm[5503]: <root@pam> starting task UPID:raiti:00001580:00049BD8:4F047A0A:qmstart:100:root@pam:
Jan 4 17:10:50 raiti qm[5504]: start VM 100: UPID:raiti:00001580:00049BD8:4F047A0A:qmstart:100:root@pam:
Jan 4 17:10:51 raiti multipathd: dm-4: add map (uevent)
Jan 4 17:10:51 raiti qm[5503]: <root@pam> end task UPID:raiti:00001580:00049BD8:4F047A0A:qmstart:100:root@pam: OK
Jan 4 17:10:51 raiti ntpd[1732]: bind(24) AF_INET6 fe80::c488:5fff:fe4e:ac5b%9#123 flags 0x11 failed: Cannot assign requested address
Jan 4 17:10:51 raiti ntpd[1732]: unable to create socket on tap100i0 (12) for fe80::c488:5fff:fe4e:ac5b#123
Jan 4 17:10:51 raiti ntpd[1732]: failed to init interface for address fe80::c488:5fff:fe4e:ac5b
Jan 4 17:10:57 raiti pmxcfs[1805]: [status] received log

Last edited by a moderator: Jan 4, 2012

L

lievenva

New Member

Jan 6, 2012

#8

Can we do anything to help debugging the problem mentioned by Frido in previous post?

Thanks,

Lieven

F

FinnTux

Renowned Member

Jan 7, 2012

#9

Just for the record, I tested live migration on the same hardware using my own script, stock debian 2.6.32 kernel and qemu-kvm-1.0 and have not had s single failed migration. Shared storage, incremental block and full block migration work ok.

E

e100

Renowned Member

Proxmox Subscriber

Jan 8, 2012

#10

I posted a "fix" here: https://bugzilla.proxmox.com/show_bug.cgi?id=7#c3

The proxmox code has changed quite a bit since then so the info on the bug is not helpful.

Edit /usr/share/perl5/PVE/AbstractMigrate.pm
About line 171 add "sleep (2);"

Code:

       # vm is now owned by other node
        # Note: there is no VM config file on the local node anymore

        if ($self->{running}) {

            &$eval_int($self, sub { $self->phase2($self->{vmid}); });
            my $phase2err = $@;
            [B]sleep (2);[/B]
            if ($phase2err) {
                $self->{errors} = 1;
                $self->log('err', "online migrate failure - $phase2err");
            }

After making the edit you need to reboot.
Do this on at least two nodes, then test live migration between them.

I tested with a 2008 R2 VM that was running "ping -t 192.168.1.1"(ping the gateway) while it was being live migrated back and forth.
Live migration worked flawless for me once I made the edits and rebooted.

When I was testing this I noticed that the higher I set migrate_speed the more often live migration failed.
The first thought I had was a race condition since speed seemed to influence failure rate.
Thus I concluded slowing the final stages of the migration down would fix the problem.

I do not feel this is a proper fix, but maybe this can lead to discovering the root cause so a proper fix can be created.

dietmar

Proxmox Staff Member

Staff member

Jan 8, 2012

#11

e100 said:
I do not feel this is a proper fix, but maybe this can lead to discovering the root cause so a proper fix can be created.

I am also unable to see why that helps.

F

Frido Roose

Guest

Jan 9, 2012

#12

Thanks,

I can confirm that the sleep-workaround provided by e100 actually does seem to help! This test was done with Linux guests (SLES11 SP1).
Where before the machine wouldn't keep on running after a live migration once in 3-4 times, now the problem didn't occur anymore (tried up to 15 live migrations of the same VM without a problem).

K

ksosez

Guest

Jan 9, 2012

#13

Can we get that fix checked in? It would help a lot of us.

E

e100

Renowned Member

Proxmox Subscriber

Jan 9, 2012

#14

sleep(1); seems to work ok too.

A

alain

Renowned Member

Jan 9, 2012

#15

FinnTux said:
Just for the record, I tested live migration on the same hardware using my own script, stock debian 2.6.32 kernel and qemu-kvm-1.0 and have not had s single failed migration. Shared storage, incremental block and full block migration work ok.

So the problem coud be due to the openvz patches ? As I am only running KVM, and there are regularly bugs due to openvz patches, I would prefer a stock 2.6.32 RHEL kernel, in a separate branch, as it was the case for 2.6.35...

Alain

dietmar

Proxmox Staff Member

Staff member

Jan 9, 2012

#16

alain said:
I would prefer a stock 2.6.32 RHEL kernel, in a separate branch, as it was the case for 2.6.35...

Maintaining a kernel is really much work. so we do not plan to maintain another kernel.

E

e100

Renowned Member

Proxmox Subscriber

Jan 9, 2012

#17

I tried adding the sleep(2); to various locations in the code.
see my comments near the end of /usr/share/perl5/PVE/QemuMigrate.pm

Code:

sub phase3_cleanup {
    my ($self, $vmid, $err) = @_;

    my $conf = $self->{vmconf};

    # always stop local VM
    eval { PVE::QemuServer::vm_stop($self->{storecfg}, $vmid, 1, 1); };
    if (my $err = $@) {
        $self->log('err', "stopping vm failed - $err");
        $self->{errors} = 1;
    }

    [B][I]#sleep here seems to fix it[/I][/B]
    # always deactivate volumes - avoid lvm LVs to be active on several nodes
    eval {
        my $vollist = PVE::QemuServer::get_vm_volumes($conf);
        PVE::Storage::deactivate_volumes($self->{storecfg}, $vollist);
    };
    if (my $err = $@) {
        $self->log('err', $err);
        $self->{errors} = 1;
    }
    [B][I]#sleep here seems to fix it[/I][/B]
    # clear migrate lock
    my $cmd = [ @{$self->{rem_ssh}}, 'qm', 'unlock', $vmid ];
    $self->cmd_logerr($cmd, errmsg => "failed to clear migrate lock");
    [B][I]#sleep here seems to fix it[/I][/B]
}
sub final_cleanup {
    my ($self, $vmid) = @_;
    [I][B]#sleep here leads to failure[/B][/I]
    # nothing to do
}

1;

I doubt I have run enough trials for the above to be conclusive, maybe some others can help confirm.
If we know where the specific location where the pause helps we should be able to identify why it helps and fix it properly.

E

e100

Renowned Member

Proxmox Subscriber

Jan 10, 2012

#18

I have had some time to run more trials and I am much more confident on the location of the problem.

in phase3_cleanup in /usr/share/perl5/PVE/QemuMigrate.pm

Code:

    # clear migrate lock
    my $cmd = [ @{$self->{rem_ssh}}, 'qm', 'unlock', $vmid ];
    $self->cmd_logerr($cmd, errmsg => "failed to clear migrate lock");

If I sleep before that migration works fine.
If I sleep after that it sometimes works.

I captured some data from syslog during success and during a failure.
There are some slight differences.

syslog is the same on the source node for failure and success.
On the destination node they are slightly different

Working migration with sleep(2); before the block of code above:

Code:

Jan  9 20:20:28 vm5 qm[11322]: <root@pam> end task UPID:vm5:00002C3B:0004B7F2:4F0B925C:qmstart:100:root@pam: OK
Jan  9 20:20:38 vm5 pmxcfs[1793]: [status] received log
Jan  9 20:20:38 vm5 kernel: tap100i0: no IPv6 routers present

Failure where no sleep(2); was added to the code:

Code:

Jan  9 20:26:16 vm5 qm[12624]: <root@pam> end task UPID:vm5:00003151:00053FC7:4F0B93B8:qmstart:100:root@pam: OK
Jan  9 20:26:23 vm5 kernel: vmbr0: port 2(tap100i0) entering disabled state
Jan  9 20:26:23 vm5 kernel: vmbr0: port 2(tap100i0) entering disabled state
Jan  9 20:26:24 vm5 pmxcfs[1793]: [status] received log

The block of code above is changing the config file for the vm to remove the lock.
Maybe some latency in the pmxcfs is causing the failure.

pmxcfs does not exist on FinnTux's stock debian system that live migrates fine.
pmxcfs does not exist on Proxmox 1.9 that live migrates fine.
From my testing it appears the issue randomly happens around the time something is being edited on the pmxcfs.

dietmar

Proxmox Staff Member

Staff member

Jan 10, 2012

#19

e100 said:
If I sleep before that migration works fine.
If I sleep after that it sometimes works.

I don't really understand that. Migration is already finished at that point, and if 'qm unlock' fails we would get an error message? Is there any hint in the task log?

E

e100

Renowned Member

Proxmox Subscriber

Jan 10, 2012

#20

dietmar said:
I don't really understand that. Migration is already finished at that point, and if 'qm unlock' fails we would get an error message? Is there any hint in the task log?

We have both already admitted that the sleep makes no sense regardless of where it happens, the fact remains that the sleep does seem to fix it.

I have taken the time to find the section of code that seems to be responsible for the failure.
If I am wrong that is easy to proove.
Let's not get stuck on the fact it makes no sense, we need to find the root cause of the problem and then make sense of it.

You must log in or register to reply here.

Share:

Bluesky LinkedIn Reddit Email Link

Top Bottom

We value your privacy

We use essential cookies to make this site work, and optional cookies to enhance your experience.

See further information and configure your preferences

Accept all cookies Reject optional cookies
Essential cookies

These cookies are required to enable core functionality such as security, network management, and accessibility. You may not reject these.

Optional cookies

We deliver enhanced functionality for your browsing experience by setting these cookies. If you reject them, enhanced functionality will be unavailable.

Third-party cookies

Cookies set by third parties may be required to power functionality in conjunction with our captcha service provider for security and anti-spam purposes.

Detailed cookie usage

Privacy policy