After many hours of testing I think the root cause of the problem has been identified.
It is not related to the bridge as I previously speculated.
Once the migration is completed the migration code sends "quit\n" to the destination server over the migration tunnel.
The destination then responds and the ssh tunnel can be destroyed.
The migration fails because the ssh tunnel is terminated before the destination has had a chance to receive, process, and respond to the "quit\n".
I know this is the case because whenever it fails, /var/log/auth.log on the destination shows sshd reporting:
Code:
sshd[144301]: fatal: Write failed: Connection reset by peer
The ssh daemon on the destination would only report "Connection reset by peer" if the source server terminated the ssh client before the server had sent its final response, or if there is some sort of network connectivity issue, as I previously speculated.
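For anyone who wants to check their own destination node for this message, here is a minimal Perl sketch. It is only an illustration: find_reset_lines is a name made up for this example, and the demo runs against an in-memory sample (the first line is the one quoted above, the second is an invented line for contrast) rather than a real log.

```perl
use strict;
use warnings;

# Hypothetical helper: return every sshd line that reports the reset
# seen in the failure case.
sub find_reset_lines {
    my ($fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, $line
            if $line =~ /sshd\[\d+\]: fatal: Write failed: Connection reset by peer/;
    }
    return @hits;
}

# Demo against an in-memory log instead of the real file.
my $sample = join "\n",
    'sshd[144301]: fatal: Write failed: Connection reset by peer',
    'sshd[144302]: Accepted publickey for root',    # invented line
    '';
open(my $fh, '<', \$sample) or die "cannot open sample: $!";
print "$_\n" for find_reset_lines($fh);
close($fh);
```

To run it against the real log, open /var/log/auth.log on the destination instead of the in-memory sample.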
My assumption is that the destination treats the tunnel breaking before the quit has been processed as a failure and, as such, terminates the VM just as it would for any other failure.
We can verify my assumption in one of two ways.
We could terminate the tunnel before the quit is sent; that should cause the migration to always fail.
Sure, this does not fix anything, but it does show that killing the ssh client before the destination processes the quit causes the VM to be terminated.
The other method to verify the assumption is to simply comment out the section of code that terminates the ssh tunnel and let it die on its own.
This should prevent the migration failure since the destination will always be able to process the quit.
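The difference between the two checks can be illustrated outside of Proxmox with a small sketch. This is not PVE code: run_tunnel is a made-up name, a plain pipe stands in for the ssh tunnel, the child process plays the destination, and the 0.3 second delay is an arbitrary stand-in for the destination being slow to process the quit.

```perl
use strict;
use warnings;
use IO::Handle;
use POSIX ();

# Hypothetical stand-in for the migration tunnel: the child plays the
# destination, the parent plays the source.
sub run_tunnel {
    my ($kill_early) = @_;

    pipe(my $reader, my $writer) or die "pipe failed: $!";

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child ("destination"): simulate latency, then process the command.
        close($writer);
        select(undef, undef, undef, 0.3);    # assumed delay, arbitrary value
        my $cmd = <$reader>;
        POSIX::_exit((defined $cmd && $cmd eq "quit\n") ? 0 : 1);
    }

    # Parent ("source"): send quit, then tear the tunnel down.
    close($reader);
    print $writer "quit\n";
    $writer->flush();
    close($writer);

    # Mirrors the kill in finish_command_pipe; skipped when $kill_early is false.
    kill(15, $pid) if $kill_early;
    waitpid($pid, 0);

    my $status = $?;
    return ($status & 127)
        ? "killed by signal " . ($status & 127)
        : "exited with code " . ($status >> 8);
}

print "early kill: ", run_tunnel(1), "\n";
print "no kill:    ", run_tunnel(0), "\n";
```

With the early kill the child should die from SIGTERM before it ever reads the quit; without it the child reads the quit and exits cleanly. Whether the real destination reacts to a broken tunnel by terminating the VM is still the assumption the tests below are meant to confirm.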
So, for all of you who can test this, here is what you need to do:
To make the migration always fail (because the tunnel is terminated before the quit is sent):
1. Start with unmodified code
2. Edit /usr/share/perl5/PVE/QemuMigrate.pm at about line 89 and comment out the lines that are shown commented out below:
Code:
sub finish_tunnel {
    my ($self, $tunnel) = @_;
    my $writer = $tunnel->{writer};
    # eval {
    #     PVE::Tools::run_with_timeout(30, sub {
    #         print $writer "quit\n";
    #         $writer->flush();
    #     });
    # };
    # my $err = $@;
    $self->finish_command_pipe($tunnel);
    # die $err if $err;
}
3. Reboot
4. Test live migration and report your success or failure here
To make the migration always work (because the tunnel is never terminated before the quit is processed):
1. Start with unmodified code
2. Edit /usr/share/perl5/PVE/QemuMigrate.pm at about line 52 and comment out the line that is shown commented out below:
Code:
sub finish_command_pipe {
    my ($self, $cmdpipe) = @_;
    my $writer = $cmdpipe->{writer};
    my $reader = $cmdpipe->{reader};
    $writer->close();
    $reader->close();
    my $cpid = $cmdpipe->{pid};
    # kill(15, $cpid) if kill(0, $cpid);
    waitpid($cpid, 0);
}
3. Reboot
4. Test live migration and report your success or failure here