Live migration reliability

e100, as you noticed, I also found earlier that the following message appeared when the migration failed:

kernel: vmbr0: port 2(tap100i0) entering disabled state

But I suspect it's the consequence of the machine 'crashing', and not the cause.
 
The significant difference I see is that when it works, the order is:
pmxcfs sees an update.
Then the network interface comes up and detects that no IPv6 routers exist for it.

But when it fails, the order is:
The network interface fails to come up (and it never seems to recover, since it never reports that no IPv6 routers exist).
Then pmxcfs sees an update.

When I delay the write to pmxcfs that removes the lock from the VM config file, the order of the syslog events changes.

Lastly, in your logs the pmxcfs update came 2 seconds after the network failure.
In my case it was 1 second after the network failure.

Yet when it is working OK, the pmxcfs update comes before the network is up and within the same second, not 1 or 2 seconds after.
 

Please can you test if it always works when you comment out the 'qm unlock' code (you will need to manually unlock after each migration then)?
 
The solution suggested by e100 worked well for me too (add sleep(2); to /usr/share/perl5/PVE/AbstractMigrate.pm).
I had problems migrating SBS2011 from node 1 to node 2, but not vice versa. The migration completed correctly, but the VM failed to restart even though the log reported the task as completed without errors. W2008R2 was migrating fine both ways even without the patch.
I am running Proxmox 2.0-18/16283a5a on 2 identical HP DL380-G7 with a single E5645.
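
For reference, the workaround amounts to pausing briefly before the migrate lock is cleared. A minimal sketch, assuming the insertion point (the exact line in AbstractMigrate.pm differs between 2.0 builds, so the placement below is only illustrative):
Code:
    # Hypothetical placement -- in the final phase of the migration,
    # shortly before the remote 'qm unlock' is issued:
    sleep(2);   # give the destination a moment to settle before the lock is cleared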
 
Please can you test if it always works when you comment out the 'qm unlock' code (you will need to manually unlock after each migration then)?

I ran the test by commenting out the unlock code on 2 Proxmox hosts:
Code:
    # clear migrate lock
    #my $cmd = [ @{$self->{rem_ssh}}, 'qm', 'unlock', $vmid ];
    #$self->cmd_logerr($cmd, errmsg => "failed to clear migrate lock");

After the first online migration, the virtual machine ended up in a 'stopped' state. This was in the destination host's syslog:
Code:
Jan 11 15:55:45 yamu pmxcfs[2802]: [status] received log
Jan 11 15:55:46 yamu pmxcfs[2802]: [status] received log
Jan 11 15:56:26 yamu pmxcfs[2802]: [status] received log
Jan 11 15:56:27 yamu qm[3685]: start VM 101: UPID:yamu:00000E65:0000D799:4F0DA31B:qmstart:101:root@pam:
Jan 11 15:56:27 yamu qm[3684]: <root@pam> starting task UPID:yamu:00000E65:0000D799:4F0DA31B:qmstart:101:root@pam:
Jan 11 15:56:28 yamu kernel: device tap101i0 entered promiscuous mode
Jan 11 15:56:28 yamu kernel: vmbr314: port 2(tap101i0) entering forwarding state
Jan 11 15:56:28 yamu qm[3684]: <root@pam> end task UPID:yamu:00000E65:0000D799:4F0DA31B:qmstart:101:root@pam: OK
Jan 11 15:56:31 yamu kernel: vmbr314: port 2(tap101i0) entering disabled state
Jan 11 15:56:31 yamu kernel: vmbr314: port 2(tap101i0) entering disabled state
Jan 11 15:56:32 yamu pmxcfs[2802]: [status] received log

This was in the source host's syslog:
Code:
Jan 11 15:56:31 raiti pvedaemon[3452]: <root@pam> starting task UPID:raiti:00000ECC:000099A5:4F0DA31F:qmigrate:101:root@pam:
Jan 11 15:56:32 raiti pmxcfs[2798]: [status] received log
Jan 11 15:56:33 raiti pmxcfs[2798]: [status] received log
Jan 11 15:56:36 raiti kernel: vmbr314: port 2(tap101i0) entering disabled state
Jan 11 15:56:36 raiti kernel: vmbr314: port 2(tap101i0) entering disabled state
Jan 11 15:56:37 raiti multipathd: dm-10: remove map (uevent)
Jan 11 15:56:37 raiti multipathd: dm-10: devmap not registered, can't remove
Jan 11 15:56:37 raiti multipathd: dm-10: remove map (uevent)
Jan 11 15:56:37 raiti multipathd: dm-10: devmap not registered, can't remove
Jan 11 15:56:37 raiti pvedaemon[3452]: <root@pam> end task UPID:raiti:00000ECC:000099A5:4F0DA31F:qmigrate:101:root@pam: OK

And this was in the web GUI:
Code:
Jan 11 15:56:32 starting migration of VM 101 to node 'yamu' (192.168.108.24)
Jan 11 15:56:32 copying disk images
Jan 11 15:56:32 starting VM 101 on remote node 'yamu'
Jan 11 15:56:33 starting migration tunnel
Jan 11 15:56:33 starting online/live migration on port 60000
Jan 11 15:56:36 migration status: completed
Jan 11 15:56:36 migration speed: 170.67 MB/s
Jan 11 15:56:37 migration finished successfuly (duration 00:00:06)
TASK OK

The same VM works perfectly with the sleep in the code.
 
The problem is in SSH; look in your /var/log/auth.log on the destination when the migration fails. You will see this:
Code:
fatal: Write failed: Connection reset by peer
or this:
Code:
fatal: Write failed: Broken pipe

I have not found the cause yet, but I suspect some sort of ARP issue, likely related to the bridge that the VM network connects to.
 
I have not identified the issue, but I am quite confident it has something to do with the bridge.
There are various bug reports in other distros describing similar problems related to ARP and bridge interfaces.

I can use brctl to change various settings that make the migration fail more often, but I have not found any settings that make it better.
 
What is the overall duration of such a failed migration task? Is the failure somehow related to the duration?
 
After many hours of testing I think the root cause of the problem has been identified.
It is not related to the bridge as I previously speculated.

Once the migration is completed, the migration code sends "quit\n" to the destination server over the migration tunnel.
The destination then responds, and the ssh tunnel can be destroyed.
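
In rough terms, the intended shutdown handshake looks like this (a minimal sketch based on the finish_tunnel code quoted further down; the read of the destination's reply is my assumption, the shipped code does not explicitly wait for it):
Code:
    # Illustrative sketch of the handshake; $writer/$reader are the pipes
    # to the ssh tunnel process, as in finish_tunnel/finish_command_pipe.
    print $writer "quit\n";   # tell the remote side the migration is finished
    $writer->flush();
    my $ack = <$reader>;      # assumption: wait for the destination's response
    # only after this point is it safe to close the pipes and reap the ssh client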

The migration fails because the ssh tunnel is terminated before the destination has a chance to receive, process, and respond to the "quit\n".
I know this is the situation because, when it fails, /var/log/auth.log on the destination always has the sshd daemon reporting:
Code:
sshd[144301]: fatal: Write failed: Connection reset by peer
The only reason the ssh daemon on the destination would report "Connection reset by peer" is if the source server terminated the ssh client before the server had sent its final response, or if there is some sort of network connectivity issue as I previously speculated.

My assumption is that the destination treats a tunnel that breaks before the quit has been processed as a sign that something went wrong, and therefore terminates the VM just as it would for any other failure.

We can verify my assumption in one of two ways.
We could terminate the tunnel before the quit; that should cause the migration to always fail.
Sure, this does not fix anything, but it does prove that killing the ssh client before the destination processes the quit causes the VM to be terminated.

The other method to verify the assumption is to simply comment out the section of code that terminates the ssh tunnel and let it die on its own.
This should prevent the migration failure, since the destination will always be able to process the quit.


So, for all of you who can test this, here is what you need to do:

To make the migration always fail (because the tunnel is terminated before the quit is sent):
1. Start with unmodified code
2. Edit /usr/share/perl5/PVE/QemuMigrate.pm, around line 89, and comment out the lines shown commented out below:
Code:
sub finish_tunnel {
    my ($self, $tunnel) = @_;

    my $writer = $tunnel->{writer};

#    eval {
#        PVE::Tools::run_with_timeout(30, sub {
#            print $writer "quit\n";
#            $writer->flush();
#        });
#    };
#    my $err = $@;

    $self->finish_command_pipe($tunnel);

#    die $err if $err;
}
3. Reboot
4. Test live migration and report your success or failure here


To make the migration always work (because the tunnel is never terminated before the quit is processed):
1. Start with unmodified code
2. Edit /usr/share/perl5/PVE/QemuMigrate.pm, around line 52, and comment out the line shown commented out below:
Code:
sub finish_command_pipe {
    my ($self, $cmdpipe) = @_;

    my $writer = $cmdpipe->{writer};
    my $reader = $cmdpipe->{reader};

    $writer->close();
    $reader->close();

    my $cpid = $cmdpipe->{pid};

    #kill(15, $cpid) if kill(0, $cpid);

    waitpid($cpid, 0);
}
3. Reboot
4. Test live migration and report your success or failure here
 
But the tunnel is not needed at all after the quit - so how can that make the VM fail?

As I mentioned, the problem manifests when the tunnel is terminated before the quit is processed by the remote side.
Try my code changes: terminating the tunnel without sending the quit causes the exact same problem we are complaining about.
The output in the GUI says everything worked fine, yet the VM is stopped on the destination node.

Then try my fix, where we simply do not kill it; once the quit is processed, the ssh client will die on its own.
No more migration failures.
 
Hi e100,
very good work! I have tried it without the kill... and my test of migrating one VM (with heavy IO inside) ran 30 times without trouble.

Udo
 

Glad to hear it works for you too!

I feel a proper fix is to have a loop that waits for the ssh client to die on its own, but kills the ssh client if it takes too long, say 10 seconds.
That should ensure we never end up with a deadlock in the event something unexpected does happen.
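
A minimal sketch of such a loop for finish_command_pipe, assuming waitpid with WNOHANG and a 10-second limit (these details are my own assumptions, not code from this thread):
Code:
    use POSIX ":sys_wait_h";    # for WNOHANG; normally at the top of QemuMigrate.pm

    # wait up to ~10 seconds for the ssh client to exit on its own
    my $reaped = 0;
    for (my $i = 0; $i < 10; $i++) {
        if (waitpid($cpid, WNOHANG) == $cpid) {
            $reaped = 1;
            last;
        }
        sleep(1);
    }
    # fall back to killing it so we can never deadlock
    if (!$reaped) {
        kill(15, $cpid) if kill(0, $cpid);
        waitpid($cpid, 0);
    }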
 

OK, I will try to add such a loop next week.
 
It's working here too (with the kill instruction commented out, as suggested by e100).
Thanks e100.
 
