Issue with HA live migration of OVZ containers

pdolan

New Member
Sep 12, 2012
Hi all,

I have deployed a 2-node cluster to become familiar with the Proxmox 2 HA features. I'm having some issues running a live migration of an OpenVZ container between cluster nodes on NFS shared storage, from both the GUI and the command line. I had a quick search through the forums here and beyond but couldn't find anything that scratched this particular itch.

The GUI gives me the following succinct tidbit upon migration attempt:

Code:
Executing HA migrate for CT 109 to node vz2
Trying to migrate pvevm:109 to vz2...Temporary failure; try again
TASK ERROR: command 'clusvcadm -M pvevm:109 -m vz2' failed: exit code 255

This leaves the service in a "failed" state (it was in the "started" state beforehand):

Code:
root@vz1:~# clustat
Cluster Status for digipve @ Wed Sep 12 11:52:30 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vz1                                                                 1 Online, Local, rgmanager
 vz2                                                                 2 Online, rgmanager

 Service Name                                                     Owner (Last)                                                     State         
 ------- ----                                                     ----- ------                                                     -----         
 pvevm:109                                                        (vz1)                                                            failed
It also produces the following in rgmanager.log:

Code:
root@vz1:~# tail -n6 /var/log/cluster/rgmanager.log 
Sep 12 11:42:54 rgmanager [pvevm] CT 109 is running
Sep 12 11:43:24 rgmanager [pvevm] CT 109 is running
Sep 12 11:43:34 rgmanager [pvevm] CT 109 is running
Sep 12 11:43:49 rgmanager Migrating pvevm:109 to vz2
Sep 12 11:43:50 rgmanager migrate on pvevm "109" returned 1 (generic error)
Sep 12 11:43:50 rgmanager Migration of pvevm:109 to vz2 failed; return code 1

and the following in user.log:

Code:
root@vz1:~# tail -n 3 /var/log/user.log
Sep 12 14:38:16 vz1 pvevm: <root@pam> starting task UPID:vz1:000E40D5:0AE4AEF2:50509048:vzmigrate:109:root@pam:
Sep 12 14:38:17 vz1 task UPID:vz1:000E40D5:0AE4AEF2:50509048:vzmigrate:109:root@pam:: migration aborted
Sep 12 14:38:17 vz1 pvevm: <root@pam> end task UPID:vz1:000E40D5:0AE4AEF2:50509048:vzmigrate:109:root@pam: migration aborted
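
That "migration aborted" line doesn't say much on its own. If I remember right, the full task logs live under /var/log/pve/tasks/ on the source node, so something along these lines might dig up more detail for that UPID (just a guess on my part, adjust the pattern as needed):

Code:
root@vz1:~# grep -rl 'vzmigrate:109' /var/log/pve/tasks/ | xargs -r tail -n 20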

I've been disabling and re-enabling the service to reset this "failed" state:

Code:
root@vz1:~# clusvcadm -d pvevm:109  && clusvcadm -e pvevm:109
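
A quick grep of clustat afterwards confirms whether the service is back in the "started" state:

Code:
root@vz1:~# clustat | grep pvevm:109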


Attempting the migration with pvectl on the command line produces the same output and the same rgmanager.log entries:

Code:
root@vz1:~# pvectl migrate 109 vz2 --online
Executing HA migrate for CT 109 to node vz2
Trying to migrate pvevm:109 to vz2...Failure
command 'clusvcadm -M pvevm:109 -m vz2' failed: exit code 255



Some details that may help with resolving this:

cluster.conf
Code:
root@vz1:~# cat /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="7" name="digipve">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_drac5" ipaddr="<ip>" login="root" module_name="SLOT-1" name="drac-cmc-blade1" passwd="<pass>" secure="1"/>
    <fencedevice agent="fence_drac5" ipaddr="<ip>" login="root" module_name="SLOT-2" name="drac-cmc-blade2" passwd="<pass>" secure="1"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="vz1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="drac-cmc-blade1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vz2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="drac-cmc-blade2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="109"/>
  </rm>
</cluster>
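
Side note: a basic well-formedness check before activating hand edits to that file, assuming xmllint (from libxml2-utils) is available:

Code:
root@vz1:~# xmllint --noout /etc/pve/cluster.conf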

Versions
Code:
root@vz1:~# pveversion --verbose
pve-manager: 2.1-14 (pve-manager/2.1/f32f3f46)
running kernel: 2.6.32-11-pve
proxmox-ve-2.6.32: 2.0-66
pve-kernel-2.6.32-11-pve: 2.6.32-66
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.92-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.8-1
pve-cluster: 1.0-27
qemu-server: 2.0-39
pve-firmware: 1.0-18
libpve-common-perl: 1.0-30
libpve-access-control: 1.0-24
libpve-storage-perl: 2.0-31
vncterm: 1.0-3
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1


Could anyone point me in the right direction to resolve this? Is there anything else I can provide to help track down my mistake? :)
 
Hi Dietmar,

"Does online migration work when you disable HA for the VM?"

Disabled HA for the VM:
Code:
root@vz1:~# clustat
Cluster Status for digipve @ Wed Sep 12 17:16:51 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vz1                                                                 1 Online, Local
 vz2                                                                 2 Online

Attempted to run the migration:

Code:
root@vz1:~# pvectl migrate 109 vz2 --online
Sep 12 17:21:55 starting migration of CT 109 to node 'vz2' (<ip>)
Sep 12 17:21:55 container is running - using online migration
Sep 12 17:21:55 container data is on shared storage 'firstnfstest'
Sep 12 17:21:55 start live migration - suspending container
Sep 12 17:21:55 dump container state
Sep 12 17:21:55 # vzctl --skiplock chkpnt 109 --dump --dumpfile /mnt/pve/firstnfstest/dump/dump.109
Sep 12 17:21:55 Setting up checkpoint...
Sep 12 17:21:55 Can not create dump file /mnt/pve/firstnfstest/dump/dump.109: No such file or directory
Sep 12 17:21:55 ERROR: Failed to dump container state: Checkpointing failed
Sep 12 17:21:55 aborting phase 1 - cleanup resources
Sep 12 17:21:55 start final cleanup
Sep 12 17:21:55 ERROR: migration aborted (duration 00:00:01): Failed to dump container state: Checkpointing failed
migration aborted

Seems like there was no "dump" folder in the expected location, so I created it:
Code:
root@vz1:~# mkdir /mnt/pve/firstnfstest/dump/
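
Since it's shared NFS storage, the new directory should show up on the other node too; a check along these lines would confirm it, assuming vz2 mounts the storage at the same path:

Code:
root@vz1:~# ssh vz2 ls -ld /mnt/pve/firstnfstest/dump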

Trying again:
Code:
root@vz1:~# pvectl migrate 109 vz2 --online
Sep 12 17:24:07 starting migration of CT 109 to node 'vz2' (<ip>)
Sep 12 17:24:07 container is running - using online migration
Sep 12 17:24:07 container data is on shared storage 'firstnfstest'
Sep 12 17:24:07 start live migration - suspending container
Sep 12 17:24:07 dump container state
Sep 12 17:24:07 dump 2nd level quota
Sep 12 17:24:08 initialize container on remote node 'vz2'
Sep 12 17:24:08 initializing remote quota
Sep 12 17:24:19 turn on remote quota
Sep 12 17:24:19 load 2nd level quota
Sep 12 17:24:19 starting container on remote node 'vz2'
Sep 12 17:24:19 restore container state
Sep 12 17:24:23 start final cleanup
Sep 12 17:24:23 migration finished successfuly (duration 00:00:17)

Aha, so when the NFS storage area was created it wasn't populated with a "dump" folder.

Re-enabled HA and confirmed the HA migration now works:

Code:
Executing HA migrate for CT 109 to node vz2
Trying to migrate pvevm:109 to vz2...Success
TASK OK

Huzzah.

So, the issue was a missing "dump" folder on the NFS shared storage.
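
For anyone else who hits this with a freshly added NFS storage: a rough sketch of pre-creating the usual storage subdirectories in one go (the directory names are my assumption based on the default directory-storage layout, and the mount point obviously needs to match yours):

Code:
# Pre-create the standard subdirectories on the shared store.
# Directory names assumed from the default layout; adjust the mount point.
for d in dump images private template/cache template/iso; do
    mkdir -p "/mnt/pve/firstnfstest/$d"
done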

Thanks Dietmar, sorry for taking your time on such a trivial issue :)
 
