Proxmox 4 Migration Issues

adamb

I have a 3-node Proxmox 4 cluster on which I am running into odd issues with live migration. Up until now it has worked great. I kicked off migrations for two VMs, but the migration process never actually started and they are stuck in a "migrate" state, not doing much of anything.

How can I get them out of this state? This issue seems to affect only some of the VMs on the cluster.

root@ccsmiscrit1:~# ha-manager status
quorum OK
master ccsmiscrit2 (active, Fri Jan 22 06:51:32 2016)
lrm ccsmiscrit1 (active, Fri Jan 22 06:51:32 2016)
lrm ccsmiscrit2 (active, Fri Jan 22 06:51:33 2016)
lrm ccsmiscrit3 (active, Fri Jan 22 06:51:26 2016)
service vm:100 (ccsmiscrit1, started)
service vm:101 (ccsmiscrit2, migrate)
service vm:102 (ccsmiscrit3, started)
service vm:103 (ccsmiscrit3, started)
service vm:104 (ccsmiscrit2, started)
service vm:105 (ccsmiscrit2, migrate)
service vm:106 (ccsmiscrit1, started)
service vm:107 (ccsmiscrit3, started)
service vm:108 (ccsmiscrit1, started)
service vm:109 (ccsmiscrit2, started)
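
For anyone who wants to spot this condition without reading the list by eye, here is a minimal Python sketch that picks out services in the "migrate" state from `ha-manager status` text. It assumes the line format shown above ("service <id> (<node>, <state>)"); the function name `services_in_state` and the embedded sample are illustrative only and not part of any Proxmox tooling.

```python
import re

# Sample lines copied from the `ha-manager status` output above (trimmed).
SAMPLE = """\
quorum OK
master ccsmiscrit2 (active, Fri Jan 22 06:51:32 2016)
lrm ccsmiscrit1 (active, Fri Jan 22 06:51:32 2016)
service vm:100 (ccsmiscrit1, started)
service vm:101 (ccsmiscrit2, migrate)
service vm:104 (ccsmiscrit2, started)
service vm:105 (ccsmiscrit2, migrate)
"""

# Assumed line format: service <sid> (<node>, <state>)
SERVICE_RE = re.compile(r"^service\s+(\S+)\s+\((\S+),\s*(\w+)\)\s*$")


def services_in_state(status_text, state="migrate"):
    """Return (service_id, node) pairs whose HA state equals `state`."""
    found = []
    for line in status_text.splitlines():
        match = SERVICE_RE.match(line.strip())
        if match and match.group(3) == state:
            found.append((match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    for sid, node in services_in_state(SAMPLE):
        print(f"{sid} is in 'migrate' on {node}")
```

In practice the text would come from running `ha-manager status` on a cluster node rather than from the hard-coded sample.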

vm:104 was also just migrated, and it went from ccsmiscrit3 to ccsmiscrit2 with no issues. However, the two that are stuck aren't doing anything. No errors are reported, all the VMs share the same storage, and the cluster network has no issues.

The VMs that are stuck in this state are still running and don't seem to have any issues other than not being migrated.
 
I tried using the "Monitor" to cancel the migration with "migrate_cancel", but that does nothing.

Running "info migrate" doesn't really tell me much either.

# info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: on zero-blocks: off compress: off events: off
Migration status: failed
total time: 0 milliseconds

I have been migrating VMs on this cluster every day for weeks with no issues. This is very odd. I'm sure a reboot would fix the issue, but this is in production.

root@ccsmiscrit3:~# pveversion -v
proxmox-ve: 4.1-28 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-2 (running version: 4.1-2/78c5f4a2)
pve-kernel-4.2.6-1-pve: 4.2.6-28
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-42
pve-firmware: 1.1-7
libpve-common-perl: 4.0-42
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-18
pve-container: 1.0-35
pve-firewall: 2.0-14
pve-ha-manager: 1.0-16
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
 
On the destination node, this is being repeated in the logs over and over:

Jan 22 10:32:34 ccsmiscrit2 pve-ha-crm[3205]: migrate service 'vm:101' to node 'ccsmiscrit3' (running)
Jan 22 10:32:34 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:101': state changed from 'started' to 'migrate' (node = ccsmiscrit2, target = ccsmiscrit3)
Jan 22 10:32:37 ccsmiscrit2 pve-ha-lrm[1738]: service 'vm:101' not on this node
Jan 22 10:32:37 ccsmiscrit2 pve-ha-lrm[1740]: service 'vm:105' not on this node
Jan 22 10:32:44 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:105' - migration failed (exit code 3)
Jan 22 10:32:44 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:105': state changed from 'migrate' to 'started' (node = ccsmiscrit2)
Jan 22 10:32:44 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:101' - migration failed (exit code 3)
Jan 22 10:32:44 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:101': state changed from 'migrate' to 'started' (node = ccsmiscrit2)
Jan 22 10:32:44 ccsmiscrit2 pve-ha-crm[3205]: migrate service 'vm:105' to node 'ccsmiscrit3' (running)
Jan 22 10:32:44 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:105': state changed from 'started' to 'migrate' (node = ccsmiscrit2, target = ccsmiscrit3)
Jan 22 10:32:44 ccsmiscrit2 pve-ha-crm[3205]: migrate service 'vm:101' to node 'ccsmiscrit3' (running)
Jan 22 10:32:44 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:101': state changed from 'started' to 'migrate' (node = ccsmiscrit2, target = ccsmiscrit3)
Jan 22 10:32:47 ccsmiscrit2 pve-ha-lrm[1757]: service 'vm:101' not on this node
Jan 22 10:32:47 ccsmiscrit2 pve-ha-lrm[1756]: service 'vm:105' not on this node
Jan 22 10:32:54 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:105' - migration failed (exit code 3)
Jan 22 10:32:54 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:105': state changed from 'migrate' to 'started' (node = ccsmiscrit2)
Jan 22 10:32:54 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:101' - migration failed (exit code 3)
Jan 22 10:32:54 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:101': state changed from 'migrate' to 'started' (node = ccsmiscrit2)
Jan 22 10:32:54 ccsmiscrit2 pve-ha-crm[3205]: migrate service 'vm:105' to node 'ccsmiscrit3' (running)
Jan 22 10:32:54 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:105': state changed from 'started' to 'migrate' (node = ccsmiscrit2, target = ccsmiscrit3)
Jan 22 10:32:54 ccsmiscrit2 pve-ha-crm[3205]: migrate service 'vm:101' to node 'ccsmiscrit3' (running)
Jan 22 10:32:54 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:101': state changed from 'started' to 'migrate' (node = ccsmiscrit2, target = ccsmiscrit3)
Jan 22 10:32:57 ccsmiscrit2 pve-ha-lrm[1769]: service 'vm:105' not on this node
Jan 22 10:32:57 ccsmiscrit2 pve-ha-lrm[1770]: service 'vm:101' not on this node
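
The retry loop in that log is easier to see when the failures are tallied per service. Below is a rough Python sketch that counts "migration failed" lines by service ID and exit code. It assumes the syslog message format shown above; the function name `count_failed_migrations` and the trimmed sample are illustrative, not an existing tool.

```python
import re
from collections import Counter

# A few lines copied from the log excerpt above (trimmed).
LOG = """\
Jan 22 10:32:34 ccsmiscrit2 pve-ha-crm[3205]: migrate service 'vm:101' to node 'ccsmiscrit3' (running)
Jan 22 10:32:37 ccsmiscrit2 pve-ha-lrm[1738]: service 'vm:101' not on this node
Jan 22 10:32:44 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:105' - migration failed (exit code 3)
Jan 22 10:32:44 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:101' - migration failed (exit code 3)
Jan 22 10:32:54 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:105' - migration failed (exit code 3)
Jan 22 10:32:54 ccsmiscrit2 pve-ha-crm[3205]: service 'vm:101' - migration failed (exit code 3)
"""

# Assumed message format: service '<sid>' - migration failed (exit code <n>)
FAILED_RE = re.compile(r"service '([^']+)' - migration failed \(exit code (\d+)\)")


def count_failed_migrations(log_text):
    """Count failed migration attempts, keyed by (service_id, exit_code)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    for (sid, code), n in sorted(count_failed_migrations(LOG).items()):
        print(f"{sid}: {n} failed attempts (exit code {code})")
```

A steadily growing count for the same service, roughly every ten seconds as in the excerpt, suggests the CRM keeps re-queuing the same migration rather than giving up.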
 
I rebooted the destination node and the VMs went back into a started state. I was then able to live migrate without issues. It doesn't leave me with a warm fuzzy feeling, though. I'm going to get this cluster onto the latest packages this weekend.