Live migration reliability

OK, just uploaded a patch to the git repository:

https://git.proxmox.com/?p=qemu-ser...;hpb=e95fe75f86e81e9f9d597e1d43cd757b928813eb

Please can you test?

I like the patch, and it does seem to work.
Out of curiosity I added one line to the code to see whether that loop makes a difference.

Code:
if ($timeout) {
    for (my $i = 0; $i < $timeout; $i++) {
        $self->log('info', "Checking if tunnel exists\n");
        return if !PVE::ProcFSTools::check_process_running($cpid);
        sleep(1);
    }
}

Nearly every time I do a live migration the check happens twice:

Code:
Jan 17 12:12:51 starting migration of VM 100 to node 'vm5' (192.168.8.5)
Jan 17 12:12:51 copying disk images
Jan 17 12:12:51 starting VM 100 on remote node 'vm5'
Jan 17 12:12:51 starting migration tunnel
Jan 17 12:12:51 starting online/live migration on port 60000
Jan 17 12:12:53 migration status: active (transferred 210559KB, remaining 2169920KB), total 4211136KB)
Jan 17 12:12:56 migration status: active (transferred 480429KB, remaining 284664KB), total 4211136KB)
Jan 17 12:12:58 migration status: completed
Jan 17 12:12:58 migration speed: 585.14 MB/s
Jan 17 12:12:59 Checking if tunnel exists
Jan 17 12:13:00 Checking if tunnel exists
Jan 17 12:13:00 migration finished successfuly (duration 00:00:09)
TASK OK
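Incidentally, for readers without the source tree handy, the wait-before-kill idea behind that loop can be sketched generically as below. This is my own illustration, not the actual QemuMigrate.pm code: `wait_then_kill` is a made-up name, and process liveness is approximated with a non-blocking `waitpid` instead of `PVE::ProcFSTools::check_process_running`.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

# Generic sketch: give child $cpid up to $timeout seconds to exit on
# its own (as the migration tunnel does once migration completes),
# and only send SIGTERM if it is still alive after the grace period.
sub wait_then_kill {
    my ($cpid, $timeout) = @_;
    for my $i (1 .. $timeout) {
        # non-blocking reap; waitpid returns $cpid once the child has exited
        return 0 if waitpid($cpid, WNOHANG) == $cpid;
        sleep 1;
    }
    kill 'TERM', $cpid;    # grace period expired, terminate the tunnel
    waitpid($cpid, 0);     # reap the terminated child
    return 1;
}

# Demo: a child that exits after ~1 second is reaped, not killed.
my $pid = fork() // die "fork failed: $!";
if ($pid == 0) { sleep 1; exit 0; }
my $killed = wait_then_kill($pid, 5);
print "killed=$killed\n";    # prints killed=0
```

The point is the ordering: the tunnel gets a grace period to flush and exit on its own, and a signal is only sent once the timeout expires. The loop iterating a couple of times before the tunnel goes away matches the two "Checking if tunnel exists" lines in the log above.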

Your patch fixes the problem: no more migration failures from the tunnel being killed prematurely.

Thanks for the patch, looking forward to seeing it in the next update!
 
3. Reboot
4. Test live migration and report your success or failure here

Hi,

Good stuff in this thread, thanks!
Is there any way to get around that reboot step? I need to migrate several VMs away from a beta3 node to an rc1 node (which apparently already has the patch). I would like to avoid having to shut down all VMs first to make the patch take effect.
Thanks!

/K
 
Did that a couple of days ago without rebooting, for the same reason as yours: upgrading to RC1.

If in doubt, test migration reliability with a throwaway test VM first.

David

 
Hi,

That's what I tried, but the migrations still fail about half of the time :(
Edit: in my last test, 1 in 10 failed. The last failure was in the direction (patched) beta3 -> rc1.

I'm testing migrations back and forth between a beta3 node and an rc1 node. On the beta3 node I commented out line 52 (kill...). On the rc1 side I haven't changed anything so far.

Cheers,
K!

 
Hi, could you try the new CPU model cpu64-rhel6? (You need to run apt-get update && apt-get dist-upgrade first.)

This should improve migration compatibility between AMD and Intel.


Also, beta3 and rc1 don't use the same qemu-kvm version, which could explain the crash.
 
Hi,

I can't try the new CPU model, since it isn't supported on beta3, and it defeats the purpose of my initial question: I want to avoid shutting the VMs down. I also doubt this has anything to do with CPU models, since the beta3 and rc1 nodes run the same hardware. Moreover, when a migration fails I see the SSH tunnel failure in /var/log/auth.log, so this should be exactly the issue described by e100.

Best regards,
Koen

 
Hi Udo,

As mentioned before, a prematurely disconnected SSH tunnel is the cause of this issue, not a version mismatch.

Cheers,
Koen
 
So my question remains: is there any way to make the changes to the Perl files active without rebooting the node? Restarting some service, perhaps?
Also, another thought relating to Udo's post: what is the purpose of live migration if not disruption-free upgrades?

Cheers,
Koen
 
Yes. And why do you call the software 'beta3' and not 'ultimately-stable'?

Your point being?
I was merely pointing out that live migration is mostly used to perform upgrades without bringing down VMs, so telling me that I shouldn't, and even can't, use it for this purpose is simply bollocks, even in a beta. The source of my issue is clearly the SSH tunnel, nothing else.

Is it seriously impossible to ask a normal question here without being shot down with "IT'S A BETA"? This is _not_ helping...

/K
 
Duly noted; this was obvious from reading the entire thread as well.
I'm simply asking whether anybody knows if the provided workaround can work without rebooting the node or the VMs on it.

/K
 
I'm simply asking whether anybody knows if the provided workaround can work without rebooting the node or the VMs on it.

Yes, that can work if you update all packages, but it is untested, so I cannot say for sure.
 
Yes, that can work if you update all packages, but it is untested, so I cannot say for sure.

Consider it tested: I just live-migrated about 40 VMs away from beta3 nodes to rc1 nodes in order to complete the upgrades. No issues encountered (I just commented out the kill line, line 51 in QemuMigrate.pm). No reboots were necessary.

/K
 
Sorry to bring up this old thread, but I have exactly the same issue and a lot of live migration failures. Since this feature is essential to me, I would like to know whether this patch has been pushed to the latest stable release (which I am using), or whether I need to build from git.

Thanks
 

This particular migration issue is fixed; if the bugzilla entry is still open, I suspect someone simply forgot to close it.
If you are having migration issues, please post the details in a new thread.
 