online migration never finishes, takes vm offline

athompso

Renowned Member
Sep 13, 2013
I've got a PVE 3.1 cluster up and running now, using Sheepdog for shared storage. (I tried on five separate occasions and have yet to build a Ceph cluster successfully, so I gave up. The whole point, for me, is to run storage and VMs on the same nodes!)

Everything's updated to v3.1-24/060bd5a6 from the no-subscription repo, as I only added the enterprise repo to these servers today.

I have a VM running happily on node#1, backed by Sheepdog storage. When I attempt to "online" migrate it to node#2, the migration starts but apparently never finishes. The VM only has 512MB of RAM, but the migration has now been running for over 75 minutes. Each node has a 4-way LAG to a common switch; manual tests show that SCP between these nodes gets at least 20MB/sec using default ciphers, and ~75MB/sec using arcfour.

The "qm ... mtunnel" process is still running, and the SSH connection between the two nodes is still pumping a goodly amount of data over 75min later - what on earth is it transferring?

The VM, incidentally, is NOT responding on the network; the "online" migration has become an "offline" migration :-(.

The task log only shows this, and nothing else:
Dec 29 14:44:32 starting migration of VM 108 to node 'pve02' (192.168.160.28)
Dec 29 14:44:32 copying disk images
Dec 29 14:44:32 starting VM 108 on remote node 'pve02'
Dec 29 14:44:34 starting ssh migration tunnel
Dec 29 14:44:35 starting online/live migration on localhost:60000
Dec 29 14:44:35 migrate_set_speed: 8589934592
Dec 29 14:44:35 migrate_set_downtime: 0.1

How do I troubleshoot this migration process?

Thanks,
-Adam
 
Clicking "Stop" to cancel the migration produces no additional log output, but the status changes to "stopped: unexpected status".

Even better, all other attempts to control the VM on the original node now report "Error: VM is locked (migrate)".
 
You can unlock the VM from the command line with

Code:
qm unlock <VM Id>

e.g.

Code:
qm unlock 108


No idea what's happening with your migration, unfortunately :( Do live snapshots work?
 
I actually had to reboot Node#1 to clear the wedged KVM process; qm commands would all fail with some error about being unable to connect. I wound up editing the VM config file and deleting "Locked: migration". (Going from memory, not necessarily an exact quote.)
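For anyone in the same spot, the manual edit described above can be sketched as a small shell function. This is only a sketch under assumptions: the function name `clear_vm_lock` is made up for illustration, and it assumes the lock is stored as a line beginning with `lock:` (for example `lock: migrate`) in the VM's config file, typically `/etc/pve/qemu-server/<VM Id>.conf`. Where `qm unlock` still works, it is the better choice.

```shell
#!/bin/sh
# Hypothetical helper: remove a stale lock entry from a VM config file.
# Assumption: the lock is a line starting with "lock:" (e.g. "lock: migrate").
clear_vm_lock() {
    conf="$1"
    [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }
    tmp=$(mktemp) || return 1
    # Delete every line that begins with "lock:"; all other lines are kept as-is.
    # The result is written back with cat so the original file is rewritten in place.
    sed '/^lock:/d' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return $status
}
```

It would be called as `clear_vm_lock /etc/pve/qemu-server/108.conf`. Note that this only clears the lock flag; it does nothing about a wedged KVM process.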
 
Sheepdog is not considered stable, so I would not use that in a production environment.

Yes, I know.

I did not ask for a solution to my migration issue, I asked for suggestions on how to troubleshoot. I am not a QEMU expert, so I am not sure what to look at next, or even how to obtain debug logs.
I already know how to get this information under VMware, but I did not choose VMware for this project... so now I need to find out how to troubleshoot live migration with QEMU/KVM.
Links and pointers to documentation are welcome, as I do not see anything relevant on the PVE wiki.
 

Hi, you can try disabling the SSH tunnel for migration traffic.

You can do this by adding:
migration_unsecure: 1
to datacenter.cfg
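A minimal sketch of making that change from the shell, assuming the file lives at `/etc/pve/datacenter.cfg`. The function name `set_migration_unsecure` is hypothetical; it just adds the line if it is missing and rewrites it if it is present with another value. Keep in mind that without the tunnel the migration data crosses the network unencrypted, so this only makes sense on a trusted network.

```shell
#!/bin/sh
# Hypothetical helper: ensure "migration_unsecure: 1" is set in a config file.
# Assumption: the file uses simple "key: value" lines, one per line.
set_migration_unsecure() {
    cfg="$1"
    touch "$cfg" || return 1
    if grep -q '^migration_unsecure:' "$cfg"; then
        # Key already present: rewrite its value instead of adding a duplicate.
        tmp=$(mktemp) || return 1
        sed 's/^migration_unsecure:.*/migration_unsecure: 1/' "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
        status=$?
        rm -f "$tmp"
        return $status
    fi
    # Key absent: append it as a new line.
    echo 'migration_unsecure: 1' >> "$cfg"
}
```

It would be called as `set_migration_unsecure /etc/pve/datacenter.cfg`, and running it twice leaves a single entry.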


I'm not sure your problem is related to Sheepdog; I have never had a problem with it for live migration.
But if your VM is changing a lot of memory (like a database), your transfer speed needs to be at least as fast as the rate of memory changes.

(QEMU 1.7 has a new option to slow down the guest's vCPUs in this kind of situation; I'll try to add it to Proxmox soon.)
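To put rough numbers on the point about memory changes, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model in which the guest dirties memory at a constant rate, so the copy makes net progress of (bandwidth minus dirty rate) per second. The function name `estimate_migration` and the dirty-rate figures are hypothetical; the 512 MB and 20 MB/sec values are the ones quoted earlier in the thread. On a live system, running `info migrate` inside `qm monitor <VM Id>` should show how much RAM has been transferred and how much remains.

```shell
#!/bin/sh
# Hypothetical estimate of live-migration time under a constant dirty rate.
# Arguments: RAM size in MB, link bandwidth in MB/s, guest dirty rate in MB/s.
# All values must be whole numbers (shell integer arithmetic).
estimate_migration() {
    ram_mb="$1"; bw="$2"; dirty="$3"
    if [ "$bw" -le "$dirty" ]; then
        # Memory is dirtied at least as fast as it can be sent.
        echo "will not converge"
        return 1
    fi
    net=$(( bw - dirty ))
    # Round up to whole seconds.
    echo "$(( (ram_mb + net - 1) / net )) seconds"
}
```

With an idle guest, `estimate_migration 512 20 0` prints `26 seconds`, so a 75-minute run is far outside what memory size alone would explain. With a made-up dirty rate of 25 MB/sec, `estimate_migration 512 20 25` prints `will not converge`.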