Issue with HA live migration of OVZ containers

pdolan

Sep 12, 2012
Hi all,

I have deployed a 2-node cluster to become familiar with the Proxmox 2 HA features. I'm having some issues running a live migration of an OVZ container between cluster nodes using NFS shared storage, via both the GUI and the command line. I had a quick search through the forums here and beyond but couldn't find anything that scratched this particular itch.
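
For reference, the shared storage is a plain NFS export mounted on both nodes. The /etc/pve/storage.cfg entry looks roughly like this (server and export are redacted, so treat the exact values as placeholders):

Code:
nfs: firstnfstest
        path /mnt/pve/firstnfstest
        server <ip>
        export <export path>
        content images,rootdir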

The GUI gives me the following succinct tidbit when I attempt the migration:

Code:
Executing HA migrate for CT 109 to node vz2
Trying to migrate pvevm:109 to vz2...Temporary failure; try again
TASK ERROR: command 'clusvcadm -M pvevm:109 -m vz2' failed: exit code 255

This leaves the service in a "failed" state (it was previously "started"):

Code:
root@vz1:~# clustat
Cluster Status for digipve @ Wed Sep 12 11:52:30 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vz1                                                                 1 Online, Local, rgmanager
 vz2                                                                 2 Online, rgmanager

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 pvevm:109                                                        (vz1)                                                            failed
It also produces the following in rgmanager.log:

Code:
root@vz1:~# tail -n6 /var/log/cluster/rgmanager.log 
Sep 12 11:42:54 rgmanager [pvevm] CT 109 is running
Sep 12 11:43:24 rgmanager [pvevm] CT 109 is running
Sep 12 11:43:34 rgmanager [pvevm] CT 109 is running
Sep 12 11:43:49 rgmanager Migrating pvevm:109 to vz2
Sep 12 11:43:50 rgmanager migrate on pvevm "109" returned 1 (generic error)
Sep 12 11:43:50 rgmanager Migration of pvevm:109 to vz2 failed; return code 1

and the following in user.log:

Code:
root@vz1:~# tail -n 3 /var/log/user.log
Sep 12 14:38:16 vz1 pvevm: <root@pam> starting task UPID:vz1:000E40D5:0AE4AEF2:50509048:vzmigrate:109:root@pam:
Sep 12 14:38:17 vz1 task UPID:vz1:000E40D5:0AE4AEF2:50509048:vzmigrate:109:root@pam:: migration aborted
Sep 12 14:38:17 vz1 pvevm: <root@pam> end task UPID:vz1:000E40D5:0AE4AEF2:50509048:vzmigrate:109:root@pam: migration aborted

I've been disabling and enabling the group in order to reset this "failed" state:

Code:
root@vz1:~# clusvcadm -d pvevm:109  && clusvcadm -e pvevm:109
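
clustat can also report on just that single service, which is a quick way to confirm it's back in the "started" state afterwards:

Code:
root@vz1:~# clustat -s pvevm:109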


Attempting to run the migration via the CLI with pvectl produces the same output on the command line and in rgmanager.log:

Code:
root@vz1:~# pvectl migrate 109 vz2 --online
Executing HA migrate for CT 109 to node vz2
Trying to migrate pvevm:109 to vz2...Failure
command 'clusvcadm -M pvevm:109 -m vz2' failed: exit code 255



Stuff that may help in resolving this:

cluster.conf
Code:
root@vz1:~# cat /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="7" name="digipve">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_drac5" ipaddr="<ip>" login="root" module_name="SLOT-1" name="drac-cmc-blade1" passwd="<pass>" secure="1"/>
    <fencedevice agent="fence_drac5" ipaddr="<ip>" login="root" module_name="SLOT-2" name="drac-cmc-blade2" passwd="<pass>" secure="1"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="vz1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="drac-cmc-blade1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vz2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="drac-cmc-blade2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="109"/>
  </rm>
</cluster>

Versions
Code:
root@vz1:~# pveversion --verbose
pve-manager: 2.1-14 (pve-manager/2.1/f32f3f46)
running kernel: 2.6.32-11-pve
proxmox-ve-2.6.32: 2.0-66
pve-kernel-2.6.32-11-pve: 2.6.32-66
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.92-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.8-1
pve-cluster: 1.0-27
qemu-server: 2.0-39
pve-firmware: 1.0-18
libpve-common-perl: 1.0-30
libpve-access-control: 1.0-24
libpve-storage-perl: 2.0-31
vncterm: 1.0-3
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1


Could anyone point me in the right direction to resolve this? Is there anything else I can provide to help track down my mistake? :)
 
Hi Dietmar,

"Does online migration work when you disable HA for the VM?"

Disabled HA for the VM:
Code:
root@vz1:~# clustat
Cluster Status for digipve @ Wed Sep 12 17:16:51 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vz1                                                                 1 Online, Local
 vz2                                                                 2 Online
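
(In case it helps anyone following along: taking the CT out of HA here just means removing its pvevm entry from the <rm> section of cluster.conf and activating the new config_version, so that section ends up empty, roughly:)

Code:
  <rm>
    <!-- <pvevm autostart="1" vmid="109"/> removed -->
  </rm>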

Attempted to run the migration:

Code:
root@vz1:~# pvectl migrate 109 vz2 --online
Sep 12 17:21:55 starting migration of CT 109 to node 'vz2' (<ip>)
Sep 12 17:21:55 container is running - using online migration
Sep 12 17:21:55 container data is on shared storage 'firstnfstest'
Sep 12 17:21:55 start live migration - suspending container
Sep 12 17:21:55 dump container state
Sep 12 17:21:55 # vzctl --skiplock chkpnt 109 --dump --dumpfile /mnt/pve/firstnfstest/dump/dump.109
Sep 12 17:21:55 Setting up checkpoint...
Sep 12 17:21:55 Can not create dump file /mnt/pve/firstnfstest/dump/dump.109: No such file or directory
Sep 12 17:21:55 ERROR: Failed to dump container state: Checkpointing failed
Sep 12 17:21:55 aborting phase 1 - cleanup resources
Sep 12 17:21:55 start final cleanup
Sep 12 17:21:55 ERROR: migration aborted (duration 00:00:01): Failed to dump container state: Checkpointing failed
migration aborted

Seems like there was no "dump" folder in the appropriate location, so I created one:
Code:
root@vz1:~# mkdir /mnt/pve/firstnfstest/dump/
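
The same trick should cover any of the other subdirectories Proxmox normally expects on a directory-style storage (images, private, dump, template/cache, template/iso), in case more of them are missing; not all of them are needed, depending on which content types the storage is set up for:

Code:
root@vz1:~# mkdir -p /mnt/pve/firstnfstest/{images,private,dump,template/cache,template/iso}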

Try again:
Code:
root@vz1:~# pvectl migrate 109 vz2 --online
Sep 12 17:24:07 starting migration of CT 109 to node 'vz2' (<ip>)
Sep 12 17:24:07 container is running - using online migration
Sep 12 17:24:07 container data is on shared storage 'firstnfstest'
Sep 12 17:24:07 start live migration - suspending container
Sep 12 17:24:07 dump container state
Sep 12 17:24:07 dump 2nd level quota
Sep 12 17:24:08 initialize container on remote node 'vz2'
Sep 12 17:24:08 initializing remote quota
Sep 12 17:24:19 turn on remote quota
Sep 12 17:24:19 load 2nd level quota
Sep 12 17:24:19 starting container on remote node 'vz2'
Sep 12 17:24:19 restore container state
Sep 12 17:24:23 start final cleanup
Sep 12 17:24:23 migration finished successfuly (duration 00:00:17)

Aha, so when the NFS storage area was created, it wasn't populated with a "dump" folder.

Re-enabled HA and confirmed:

Code:
Executing HA migrate for CT 109 to node vz2
Trying to migrate pvevm:109 to vz2...Success
TASK OK
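
And just to double-check on the target node, something like this on vz2 should show CT 109 running there now:

Code:
root@vz2:~# vzlist 109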

Huzzah.

So, the issue was a missing "dump" folder on the NFS shared storage.

Thanks Dietmar, and sorry for taking up your time on such a trivial issue :)