cluster iscsi/lvm - live migration not working.

trey85stang

Feb 5, 2013
I can't seem to get a live migration to work. I might be missing something simple, I don't know. Offline migration works fine, but online migration starts transferring data and then, judging by the processes associated with it, just waits/hangs.

I'm not sure where to start, so I'll just post some configs and logs...

Any suggestions on where to look to troubleshoot further would be appreciated.

First here is a failed live migration attempt:
Code:
  Found duplicate PV 47dqI64G6bWoGB7h9tLWAut36DBnBNi9: using /dev/sdc not /dev/sdb
  Found duplicate PV 47dqI64G6bWoGB7h9tLWAut36DBnBNi9: using /dev/sdc not /dev/sdb
Feb 04 23:50:12 starting migration of VM 201 to node 'prox01' (redacted)
Feb 04 23:50:12 copying disk images
Feb 04 23:50:12 starting VM 201 on remote node 'prox01'
Feb 04 23:50:13 starting migration tunnel
Feb 04 23:50:13 starting online/live migration on port 60002
Feb 04 23:50:13 migrate_set_speed: 8589934592
Feb 04 23:50:13 migrate_set_downtime: 0.1
Feb 04 23:50:15 migration status: active (transferred 4251821, remaining 540758016), total 545718272) , expected downtime 0
Write failed: Broken pipe 
Feb 05 00:05:42 ERROR: online migrate failure - aborting
Feb 05 00:05:42 aborting phase 2 - cleanup resources
Feb 05 00:05:42 migrate_cancel
Feb 05 00:05:43 ERROR: migration finished with problems (duration 00:15:32)
TASK ERROR: migration problems

Does the duplicate PV have something to do with this? I am not sure, but when I migrate from prox01 to prox02 I don't see that message, yet the error is the same.
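
For anyone comparing both nodes, here is a minimal sketch for pulling the duplicate PV warnings out of a saved task log. It assumes the warning lines look exactly like the ones in the log above; `dup_pvs` is a hypothetical helper name, not an LVM or Proxmox tool.

```shell
# dup_pvs: hypothetical helper, not part of LVM or Proxmox.
# Reads log text on stdin and prints one line per distinct warning:
#   <PV UUID> <device in use> <device ignored>
dup_pvs() {
    sed -n 's/^ *Found duplicate PV \([^:]*\): using \([^ ]*\) not \([^ ]*\).*$/\1 \2 \3/p' | sort -u
}

# Example input: the two warning lines from the failed migration log
printf '%s\n' \
    '  Found duplicate PV 47dqI64G6bWoGB7h9tLWAut36DBnBNi9: using /dev/sdc not /dev/sdb' \
    '  Found duplicate PV 47dqI64G6bWoGB7h9tLWAut36DBnBNi9: using /dev/sdc not /dev/sdb' \
    | dup_pvs
```

Running it against the log from each node makes it easy to see whether only one side reports the same PV UUID on two block devices.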

storage config
Code:
root@prox01:~# cat /etc/pve/storage.cfg 
dir: local
    path /var/lib/vz
    content images,iso,vztmpl,rootdir
    maxfiles 0

iscsi: eq-iscsi-san01
    target iqn.2001-05.com.equallogic:0-8a0906-6e9890f02-9ca089df8f9510fe-proxmox-lun1
    portal redacted
    content none

lvm: a101
    vgname eqsan01
    base eq-iscsi-san01:0.0.0.scsi-36090a028f090986efe10958fdf89a09c
    shared
    content images


cluster.conf
Code:
root@prox01:~# cat /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="13" name="cluster">
  <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="redacted" login="redacted" name="prox01-ipmi" passwd="redacted" power_wait="5"/>
    <fencedevice agent="fence_ipmilan" ipaddr="redacted" login="redacted" name="prox02-ipmi" passwd="redacted" power_wait="5"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="prox01" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="prox01-ipmi"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="prox02" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="prox02-ipmi"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm/>
</cluster>



Last, the actual VM config
Code:
root@prox01:/etc/pve/nodes/prox02/qemu-server# cat 201.conf 
bootdisk: scsi0
cores: 1
ide2: none,media=cdrom
lock: migrate
memory: 512
name: node01
net0: e1000=12:72:7A:B9:C5:07,bridge=vmbr0
ostype: l26
scsi0: a101:vm-201-disk-1,size=32G
sockets: 1
 
I fixed the duplicate PV thing, but the same issue applies. I then updated to the latest test version. Same result, except the expected downtime now registers as "453" rather than 0.

Code:
# pveversion -v
pve-manager: 2.3-7 (pve-manager/2.3/1fe64d18)
running kernel: 2.6.32-18-pve
proxmox-ve-2.6.32: 2.3-88
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-18-pve: 2.6.32-88
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-8
pve-firmware: 1.0-21
libpve-common-perl: 1.0-44
libpve-access-control: 1.0-25
libpve-storage-perl: 2.3-2
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.3-18
ksm-control-daemon: 1.1-1
 
I was able to fix the duplicate PV issues, but am still having the same problem. So I tried updating to the testing version, and I am getting the same thing, although slightly different. One thing that might be of importance: all these failures happen after 15 minutes and 32 seconds. It appears whatever is supposed to be happening on port 60000 does not happen. I'm not quite sure what that is, though.
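
To put a number on how far the transfer gets before it stalls, here is a rough sketch that turns a "migration status" line from the task log into a percentage. It assumes the line format shown in the failed attempt above; `mig_progress` is a hypothetical helper name, not a Proxmox command.

```shell
# mig_progress: hypothetical helper, not a Proxmox command.
# Reads task log text on stdin and, for each "migration status" line,
# prints the transferred bytes, the total bytes, and the percentage.
mig_progress() {
    sed -n 's/.*migration status: [a-z]* (transferred \([0-9]*\), remaining \([0-9]*\)), total \([0-9]*\)).*/\1 \3/p' |
        LC_ALL=C awk '$2 > 0 { printf "%d of %d bytes (%.2f%%)\n", $1, $2, 100 * $1 / $2 }'
}

# Example input: the only status line from the failed attempt
printf '%s\n' \
    'Feb 04 23:50:15 migration status: active (transferred 4251821, remaining 540758016), total 545718272) , expected downtime 0' \
    | mig_progress
```

For the status line in the log this reports under 1% transferred, so the migration stops making progress almost immediately and then sits idle until the task fails roughly 15 minutes later.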
 
