Live migration reliability

e100, as you noticed, I also found earlier that the following message appeared when the migration failed:

kernel: vmbr0: port 2(tap100i0) entering disabled state

But I suspect it's the consequence of the machine 'crashing', and not the cause.
 
The significant difference I see is that when it works, the order is:
pmxcfs sees an update.
Then the network interface comes up and detects that no IPv6 routers exist for it.

But when it fails, the order is:
The network interface fails to come up (and it never seems to recover, since it never reports that no IPv6 routers exist).
Then pmxcfs sees an update.

When I delay the write to pmxcfs that removes the lock from the VM config file, the order of the syslog events changes.

Lastly, in your logs the pmxcfs update came 2 seconds after the network failure.
In my case it was 1 second after the network failure.

Yet when it is working OK, the pmxcfs update comes before the network is up and within the same second, not 1 or 2 seconds after.
 

Please can you test if it always works when you comment out the 'qm unlock' code (you will need to manually unlock after each migration then)?
 
The solution suggested by e100 worked well for me too (add sleep(2); to /usr/share/perl5/PVE/AbstractMigrate.pm).
I had problems migrating SBS2011 from node 1 to node 2, but not vice versa. The migration completed correctly, but the VM failed to restart even though the log reported the task as completed without errors. W2008R2 was migrating fine both ways even without the patch.
I am running Proxmox 2.0-18/16283a5a on 2 identical HP DL380-G7 with a single E5645.
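
For reference, the workaround amounts to pausing briefly before the migrate lock is cleared. A minimal sketch, assuming the insertion point (the exact line in AbstractMigrate.pm differs between 2.0 builds, so the placement below is only illustrative):
Code:
    # Hypothetical placement -- in the final phase of the migration,
    # shortly before the remote 'qm unlock' is issued:
    sleep(2);   # give the destination a moment to settle before the lock is cleared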
 
Please can you test if it always works when you comment out the 'qm unlock' code (you will need to manually unlock after each migration then)?

I ran the test by commenting out the unlock code on 2 Proxmox hosts:
Code:
    # clear migrate lock
    #my $cmd = [ @{$self->{rem_ssh}}, 'qm', 'unlock', $vmid ];
    #$self->cmd_logerr($cmd, errmsg => "failed to clear migrate lock");

After the first online migration, the virtual machine ended up in a 'stopped' state. This was in the destination host's syslog:
Code:
Jan 11 15:55:45 yamu pmxcfs[2802]: [status] received log
Jan 11 15:55:46 yamu pmxcfs[2802]: [status] received log
Jan 11 15:56:26 yamu pmxcfs[2802]: [status] received log
Jan 11 15:56:27 yamu qm[3685]: start VM 101: UPID:yamu:00000E65:0000D799:4F0DA31B:qmstart:101:root@pam:
Jan 11 15:56:27 yamu qm[3684]: <root@pam> starting task UPID:yamu:00000E65:0000D799:4F0DA31B:qmstart:101:root@pam:
Jan 11 15:56:28 yamu kernel: device tap101i0 entered promiscuous mode
Jan 11 15:56:28 yamu kernel: vmbr314: port 2(tap101i0) entering forwarding state
Jan 11 15:56:28 yamu qm[3684]: <root@pam> end task UPID:yamu:00000E65:0000D799:4F0DA31B:qmstart:101:root@pam: OK
Jan 11 15:56:31 yamu kernel: vmbr314: port 2(tap101i0) entering disabled state
Jan 11 15:56:31 yamu kernel: vmbr314: port 2(tap101i0) entering disabled state
Jan 11 15:56:32 yamu pmxcfs[2802]: [status] received log

This was in the source host's syslog:
Code:
Jan 11 15:56:31 raiti pvedaemon[3452]: <root@pam> starting task UPID:raiti:00000ECC:000099A5:4F0DA31F:qmigrate:101:root@pam:
Jan 11 15:56:32 raiti pmxcfs[2798]: [status] received log
Jan 11 15:56:33 raiti pmxcfs[2798]: [status] received log
Jan 11 15:56:36 raiti kernel: vmbr314: port 2(tap101i0) entering disabled state
Jan 11 15:56:36 raiti kernel: vmbr314: port 2(tap101i0) entering disabled state
Jan 11 15:56:37 raiti multipathd: dm-10: remove map (uevent)
Jan 11 15:56:37 raiti multipathd: dm-10: devmap not registered, can't remove
Jan 11 15:56:37 raiti multipathd: dm-10: remove map (uevent)
Jan 11 15:56:37 raiti multipathd: dm-10: devmap not registered, can't remove
Jan 11 15:56:37 raiti pvedaemon[3452]: <root@pam> end task UPID:raiti:00000ECC:000099A5:4F0DA31F:qmigrate:101:root@pam: OK

And this was in the web GUI:
Code:
Jan 11 15:56:32 starting migration of VM 101 to node 'yamu' (192.168.108.24)
Jan 11 15:56:32 copying disk images
Jan 11 15:56:32 starting VM 101 on remote node 'yamu'
Jan 11 15:56:33 starting migration tunnel
Jan 11 15:56:33 starting online/live migration on port 60000
Jan 11 15:56:36 migration status: completed
Jan 11 15:56:36 migration speed: 170.67 MB/s
Jan 11 15:56:37 migration finished successfuly (duration 00:00:06)
TASK OK

The same VM works perfectly with the sleep in the code.
 
The problem is in SSH; look in your /var/log/auth.log on the destination when the migration fails. You will see this:
Code:
fatal: Write failed: Connection reset by peer
or this:
Code:
fatal: Write failed: Broken pipe

I have not found the cause yet, but I suspect some sort of ARP issue, likely related to the bridge that the VM network connects to.
 
I have not identified the issue, but I am quite confident it has something to do with the bridge.
There are various bug reports in other distros describing similar problems related to ARP and bridge interfaces.

I can use brctl to change various settings that make the migration fail more often, but I have not found any settings that make it better.
 
What is the overall duration of such a failed migration task? Is the failure somehow related to the duration?
 
After many hours of testing I think the root cause of the problem has been identified.
It is not related to the bridge as I previously speculated.

Once the migration is completed, the migration code sends "quit\n" to the destination server over the migration tunnel.
The destination then responds, and the ssh tunnel can be destroyed.
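
In rough terms, the intended shutdown handshake looks like this (a minimal sketch based on the finish_tunnel code quoted further down; the read of the destination's reply is my assumption, the shipped code does not explicitly wait for it):
Code:
    # Illustrative sketch of the handshake; $writer/$reader are the pipes
    # to the ssh tunnel process, as in finish_tunnel/finish_command_pipe.
    print $writer "quit\n";   # tell the remote side the migration is finished
    $writer->flush();
    my $ack = <$reader>;      # assumption: wait for the destination's response
    # only after this point is it safe to close the pipes and reap the ssh client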

The migration fails because the ssh tunnel is terminated before the destination has a chance to receive, process, and respond to the "quit\n".
I know this is the situation because, when it fails, /var/log/auth.log on the destination always has the sshd daemon reporting:
Code:
sshd[144301]: fatal: Write failed: Connection reset by peer
The only reason the ssh daemon on the destination would report "Connection reset by peer" is if the source server terminated the ssh client before the server had sent its final response, or if there is some sort of network connectivity issue as I previously speculated.

My assumption is that the destination treats a tunnel that breaks before the quit has been processed as a sign that something went wrong, and therefore terminates the VM just as it would for any other failure.

We can verify my assumption in one of two ways.
We could terminate the tunnel before the quit; that should cause the migration to always fail.
Sure, this does not fix anything, but it does prove that killing the ssh client before the destination processes the quit causes the VM to be terminated.

The other method to verify the assumption is to simply comment out the section of code that terminates the ssh tunnel and let it die on its own.
This should prevent the migration failure, since the destination will always be able to process the quit.


So, for all of you who can test this, here is what you need to do:

To make the migration always fail (because the tunnel is terminated before the quit is sent):
1. Start with unmodified code
2. Edit /usr/share/perl5/PVE/QemuMigrate.pm, around line 89, and comment out the lines shown commented out below:
Code:
sub finish_tunnel {
    my ($self, $tunnel) = @_;

    my $writer = $tunnel->{writer};

#    eval {
#        PVE::Tools::run_with_timeout(30, sub {
#            print $writer "quit\n";
#            $writer->flush();
#        });
#    };
#    my $err = $@;

    $self->finish_command_pipe($tunnel);

#    die $err if $err;
}
3. Reboot
4. Test live migration and report your success or failure here


To make the migration always work (because the tunnel is never terminated before the quit is processed):
1. Start with unmodified code
2. Edit /usr/share/perl5/PVE/QemuMigrate.pm, around line 52, and comment out the line shown commented out below:
Code:
sub finish_command_pipe {
    my ($self, $cmdpipe) = @_;

    my $writer = $cmdpipe->{writer};
    my $reader = $cmdpipe->{reader};

    $writer->close();
    $reader->close();

    my $cpid = $cmdpipe->{pid};

    #kill(15, $cpid) if kill(0, $cpid);

    waitpid($cpid, 0);
}
3. Reboot
4. Test live migration and report your success or failure here
 
But the tunnel is not needed at all after the quit - so how can that make the VM fail?

As I mentioned, the problem manifests when the tunnel is terminated before the quit is processed by the remote side.
Try my code changes: terminating the tunnel without sending the quit causes the exact same problem we are complaining about.
The output in the GUI says everything worked fine, yet the VM is stopped on the destination node.

Then try my fix, where we simply do not kill it; once the quit is processed, the ssh client will die on its own.
No more migration failures.
 
Hi e100,
very good work! I have tried it without the kill... and my test of migrating one VM (with heavy IO inside) ran 30 times without trouble.

Udo
 

Glad to hear it works for you too!

I feel a proper fix is to have a loop that waits for the ssh client to die on its own, but kills the ssh client if it takes too long, say 10 seconds.
That should ensure we never end up with a deadlock in the event something unexpected does happen.
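
A minimal sketch of such a loop for finish_command_pipe, assuming waitpid with WNOHANG and a 10-second limit (these details are my own assumptions, not code from this thread):
Code:
    use POSIX ":sys_wait_h";    # for WNOHANG; normally at the top of QemuMigrate.pm

    # wait up to ~10 seconds for the ssh client to exit on its own
    my $reaped = 0;
    for (my $i = 0; $i < 10; $i++) {
        if (waitpid($cpid, WNOHANG) == $cpid) {
            $reaped = 1;
            last;
        }
        sleep(1);
    }
    # fall back to killing it so we can never deadlock
    if (!$reaped) {
        kill(15, $cpid) if kill(0, $cpid);
        waitpid($cpid, 0);
    }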
 

OK, I will try to add such a loop next week.
 
It's working here too (with the kill instruction commented out, as suggested by e100).
Thanks e100.
 
