[SOLVED] Live migration failed (Proxmox 3.3)

omen

Member
Oct 16, 2014
I tried to do a live migration of a VM a couple hours ago and it failed (full output attached).
Code:
ERROR: failed to clear migrate lock: no such VM ('104')
It was a live migration from a Proxmox node named vm6 to a node named vm4. The GUI now lists VM 104 under vm4, but displays the error "no such VM ('104') (500)" when I click on it. VM 104 no longer shows in the GUI on the original node (vm6); however, its config file still sat under vm6 in /etc/pve/, so I tried a manual move of the file:
Code:
root@vm4:~# mv /etc/pve/nodes/vm6/qemu-server/104.conf /etc/pve/nodes/vm4/qemu-server/104.conf
The move has been running for 3 hours and has still not finished. qm status shows the same error:
Code:
root@vm4:~# qm status 104
no such VM ('104')
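If I understand qm correctly, once the config file is actually visible under one node in /etc/pve again, the stale migrate lock should be clearable on that node with something like:
Code:
root@vm4:~# qm unlock 104
but with the filesystem in this state I doubt even that would work right now (this is just my guess at the normal recovery path).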
We still have quorum: there are 6 nodes in the cluster, 5 are actively running right now, and quorum requires 4.

I cannot interrupt the manual `mv' I kicked off; even `kill -9' is ignored. The VM I tried to move is not running and cannot be started ("no such VM").

EDIT: one thing I forgot to mention is that I had a large (~1TB) VM disk image from a different VM being moved during the time I tried to live migrate the above VM.

I guess the question at this point is: how do I recover?

Thanks,
Omen
 

Attachments

  • error.txt
Hi omen!
It looks like what happened is that the source or the target node lost contact with the cluster during the live migration, which would explain why you got the message "no such VM".
As long as you still have the hard disk image, you're safe though.

First of all I would check the status of the cluster (pvecm status, pvecm nodes) and the status of the cluster filesystem (service pve-cluster status), on both the source and the target node.
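Something like this, run on both vm6 and vm4:
Code:
pvecm status
pvecm nodes
service pve-cluster status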
 
We run our back-end storage for all VMs on NFS via InfiniBand, the same InfiniBand the cluster uses to talk to itself. If we had major communication problems the VMs would have stopped working.

The output of all the commands looks good (these were all run on the node I moved the VM to, though the originating node looks the same).

At this point all cluster operations are failing. I tried to change a CD on a VM running on node 3 and it failed with "can't lock file '/var/lock/qemu-server/lock-142.conf' - got timeout (500)".
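One thing I can still try as a sanity check: if I understand pmxcfs correctly, /etc/pve goes read-only when a node thinks it has lost quorum, so a throwaway write test should at least tell me whether that is what's blocking the lock files (my own guess, nothing official):
Code:
root@vm4:~# touch /etc/pve/write-test && rm /etc/pve/write-test && echo writable || echo "read-only or blocked"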

Code:
root@vm4:~# pvecm status
Version: 6.2.0
Config Version: 24
Cluster Name: METRO-HOLODECK
Cluster Id: 60725
Cluster Member: Yes
Cluster Generation: 10008
Membership state: Cluster-Member
Nodes: 5
Expected votes: 6
Total votes: 5
Node votes: 1
Quorum: 4  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: vm4
Node ID: 5
Multicast addresses: 239.192.237.35 
Node addresses: 172.16.40.4
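With vm2 down we are at 5 of 6 expected votes, which is still above the quorum of 4, so I don't think this is a quorum problem. If we ever dropped below that, my understanding is that the expected vote count can be lowered temporarily (until the missing node is back) with something like:
Code:
root@vm4:~# pvecm expected 5
though I have not needed to try that here.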

Code:
root@vm4:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M   9940   2015-11-30 09:54:48  vm1
   2   X   9880                        vm2
   3   M   9904   2015-10-29 11:52:18  vm3
   4   M   9904   2015-10-29 11:52:18  vm5
   5   M   9480   2015-01-12 15:28:19  vm4
   6   M  10000   2015-11-30 11:16:42  vm6
Hmmm, looking at this I see that vm6 (the node the VM was originally on) re-joined the cluster about 1.5 hours before I did the VM migration.
Code:
root@vm4:~# service pve-cluster status
Checking status of pve cluster filesystem: pve-cluster running.
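Before resorting to a reboot I'm tempted to just restart the cluster stack on this node, which as far as I understand on PVE 3.x would be roughly:
Code:
root@vm4:~# service pve-cluster restart
root@vm4:~# service cman restart    # only if restarting pve-cluster alone doesn't get /etc/pve back in sync
Take the order with a grain of salt; that's my reading of how the pieces fit together, not something I've verified.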
 
This morning it looked like the cluster had completely shattered. `pvecm nodes' on vm6 looked like:
Code:
Node  Sts   Inc   Joined               Name
   1   M  10000   2015-11-30 11:16:42  vm1
   2   X   9880                        vm2
   3   M  10000   2015-11-30 11:16:42  vm3
   4   M  10000   2015-11-30 11:16:42  vm5
   5   M  10000   2015-11-30 11:16:42  vm4
   6   M      4   2014-12-09 12:39:59  vm6
While it looked like this on the rest of the nodes:
Code:
Node  Sts   Inc   Joined               Name
   1   M   9940   2015-11-30 09:54:48  vm1
   2   X   9880                        vm2
   3   M   9904   2015-10-29 11:52:18  vm3
   4   M   9248   2015-01-07 14:23:13  vm5
   5   M   9904   2015-10-29 11:52:18  vm4
   6   M  10000   2015-11-30 11:16:42  vm6
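Since the two halves see completely different membership, I'm starting to suspect multicast between the nodes (which corosync relies on here). If it comes to that, I believe the usual test is omping, installed on every node (apt-get install omping) and started on all of them at roughly the same time:
Code:
root@vm4:~# omping -c 600 -i 1 -q vm1 vm3 vm4 vm5 vm6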

I took a chance, shut down all the VMs on vm4, and rebooted it. This resurrected our cluster and I can now do VM operations (create/change CD image/etc), which I could not do before the reboot. So, back to normal, until I break something again.
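For anyone else hitting this: the checks I'd run after the reboot to confirm VM 104 really is back in one piece are roughly:
Code:
root@vm4:~# ls -l /etc/pve/nodes/*/qemu-server/104.conf
root@vm4:~# qm status 104
root@vm4:~# qm unlock 104    # only needed if the config still carries the migrate lock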
 
