[Proxmox 3] Migration error, configuration lost

Hello,
First, a great thanks for Proxmox! We really appreciate it... never a problem with 1.9.
I have just updated our two nodes to 3.0: no problem, great compatibility!
Unfortunately, I ran into trouble when migrating from one node to another (same hardware, same update level):
Code:
Jul 31 14:50:43 starting migration of CT 107 to node 'hn2' (xxx.yyy.www.zzz)
Jul 31 14:50:43 container is running - using online migration
Jul 31 14:50:44 starting rsync phase 1
Jul  31 14:50:44 # /usr/bin/rsync -aHAX --delete --numeric-ids --sparse  /var/lib/vz/private/107 root@xxx.yyy.www.zzz:/var/lib/vz/private
Jul 31 14:53:45 start live migration - suspending container
Jul 31 14:53:45 dump container state
Jul 31 14:53:51 copy dump file to target node
Jul 31 14:53:54 starting rsync (2nd pass)
Jul 31 14:53:54 # /usr/bin/rsync -aHAX --delete --numeric-ids /var/lib/vz/private/107 root@xxx.yyy.www.zzz:/var/lib/vz/private
Jul 31 14:54:03 dump 2nd level quota
Jul 31 14:54:03 copy 2nd level quota to target node
Jul 31 14:54:05 initialize container on remote node 'hn2'
Jul 31 14:54:05 initializing remote quota
Jul 31 14:54:05 # /usr/bin/ssh -o 'BatchMode=yes' root@xxx.yyy.www.zzz vzctl quotainit 107
Jul 31 14:54:05 ERROR: online migrate failure - Failed to initialize quota: Container config file does not exist
Jul 31 14:54:05 removing container files on local node
Jul 31 14:54:09 start final cleanup
Jul 31 14:54:09 ERROR: migration finished with problems (duration 00:03:26)
TASK ERROR: migration problems
And then the CT is gone, yet not gone: it is there and it isn't, impossible to stop, start or migrate back... no conf file in /etc/vz/conf... or maybe a Schrödinger's cat, because I kept a backup of xxx.conf and tried to copy it back, but the system told me the file already existed; so I tried to delete it and the system told me the file did not exist... lsof, ls -li, nothing works... I can touch toto but not touch xxx.conf, even though there is no xxx.conf.

Any ideas?

Thanks in advance,

L.Torlay
 
I tried to copy it back, but the system told me the file already existed; so I tried to delete it and the system told me the file did not exist...

The file exists, but inside another node's directory (/etc/pve/nodes/[nodename]/openvz).
The system is intelligent enough to check that there is only one config cluster-wide.

To move a VM config to another node, use:

# mv /etc/pve/nodes/[nodename1]/openvz/[VMID].conf /etc/pve/nodes/[nodename2]/openvz
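A quick way to see which node directory currently holds a given container's config (a sketch on my part, assuming the standard /etc/pve layout; CT 107 is the container from the log above):

Code:
# find /etc/pve/nodes -name 107.conf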
 
And what is the output of
# dpkg -l vzctl
on both nodes?

Hello, and thank you very much for your quick answer (I will test your procedure in a few hours).

Here is the version of vzctl (same on both nodes): 4.0-1pve3

L.T.
 
The procedure is certainly correct, but it failed. I think I know where, but I do not know why!

If I connect to the first node (nodename1), I can of course see /etc/pve/nodes/[nodename1]/openvz/[VMID].conf and /etc/pve/nodes/[nodename2]/openvz,
and if I connect to the second node (nodename2), it is the same thing...
But if I restore a VM on nodename1, I do not see the [VMID].conf from the second node in /etc/pve/nodes/[nodename1]/openvz/.
Is there a delay or a synchronisation problem between the two nodes?
In the web interface (nodename1's), the first node's icon is green whereas the second one is red... pvecm status seems OK on both nodes...

Did I do something wrong?

Thanks,

L.T
 
What is the output of
# pveversion -v
Make sure to update to the latest version.
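On a Proxmox VE 3.x node, updating is usually done as follows (assuming the package repositories are already configured):

Code:
# apt-get update
# apt-get dist-upgrade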

I update frequently (most recently this afternoon), and the installation was done two days ago.

Here is the result of the pveversion -v command:

Code:
pve-manager: 3.0-23 (pve-manager/3.0/957f0862)
running kernel: 2.6.32-22-pve
proxmox-ve-2.6.32: 3.0-107
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-22-pve: 2.6.32-107
lvm2: 2.02.95-pve3
clvm: 2.02.95-pve3
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-1
pve-cluster: 3.0-4
qemu-server: 3.0-20
pve-firmware: 1.0-23
libpve-common-perl: 3.0-4
libpve-access-control: 3.0-4
libpve-storage-perl: 3.0-8
vncterm: 1.1-4
vzctl: 4.0-1pve3
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-13
ksm-control-daemon: 1.1-1

For information, Proxmox runs on two Dell R710 servers. I noticed that iowait is much higher during restore than during backup (NFS share), and restoring takes twice as long as backing up (container, snapshot backup mode).

Thanks,

L.T.
 
What I am doing and what I do not understand:
- on nodename2 (which has a ghost container due to the migration failure), I stop pve-cluster; no more access to /etc/pve/nodes/, OK...
- on nodename1 (production node, 30 containers), I stop pve-cluster and copy a backup of 107.conf into /etc/pve/nodes/[nodename2]/openvz
- on nodename1, I start pve-cluster
- on nodename2, I start pve-cluster and I can see /etc/pve/nodes/[nodename2]/openvz... but my backup 107.conf is not there!

Can I rebuild the cluster without rebooting nodename1, in case something strange happened during cluster creation?

Thanks,

L.T.
 
What I am doing and what I do not understand:
- on nodename2 (which has a ghost container due to the migration failure), I stop pve-cluster; no more access to /etc/pve/nodes/, OK...
- on nodename1 (production node, 30 containers), I stop pve-cluster and copy a backup of 107.conf into /etc/pve/nodes/[nodename2]/openvz

Why do you stop pve-cluster? If you stop it, there is no cluster file system mounted at /etc/pve, so you copy the file to the local filesystem instead!
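A quick sanity check (a sketch, not from the thread): while pve-cluster is running, /etc/pve is a FUSE mount provided by pmxcfs, and you can verify that before copying anything into it. No output here means you would be writing to the local root filesystem instead:

Code:
# mount | grep /etc/pve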
 
- on nodename1 (production node, 30 containers), I stop pve-cluster and copy a backup of 107.conf into /etc/pve/nodes/[nodename2]/openvz
- on nodename1, I start pve-cluster

I guess starting pve-cluster fails, because you now have a local file inside /etc/pve/?
 
I guess starting pve-cluster fails, because you now have a local file inside /etc/pve/?
Well, no, it works (or does it?)... When pve-cluster is stopped on both nodes, there is nothing under /etc/pve on the second node, but on the first node I still see the subdirectories. I thought the second node would get the information from the master node, but I was wrong, wasn't I?
Is there a way to reinitialize the cluster? My second node is not in production.
Thanks,

L.T
 
Is cluster communication OK? Check with:

# pvecm status

You guessed right, there is a problem on nodename2:

Code:
#  pvecm status
cman_tool: Cannot open connection to cman, is it running ?
# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... /usr/sbin/ccs_config_validate: line 186: 28314 Segmentation fault      (core dumped) ccs_config_dump > $tempfile

Unable to get the configuration
corosync [MAIN  ] Corosync Cluster Engine ('1.4.5'): started and ready to provide service.
corosync [MAIN  ] Corosync built-in features: nss
corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
corosync died with signal: 11 Check cluster logs for details
[FAILED]

Today's corosync.log is empty (the last archived corosync.log contains nothing unusual):

Code:
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: corosync profile loading service
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais cluster membership service B.01.01
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais checkpoint service B.01.01
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais event service B.01.01
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais distributed locking service B.03.01
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais message service B.03.01
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: corosync CMAN membership service 2.90
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais timer service A.01.01
Aug 01 15:49:09 corosync [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:1893.

On nodename1, it still seems OK:

Code:
#  pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: VIRTU
Cluster Id: 15565
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: hn1
Node ID: 1
Multicast addresses: 239.192.60.10
Node addresses: xxx.yyy.zzz.www

Thanks,

L.T.
 
Can you please post the cluster config /etc/pve/cluster.conf?

Of course, here it is:

Code:
<?xml version="1.0"?>
<cluster name="VIRTU" config_version="4">

  <cman keyfile="/var/lib/pve-cluster/corosync.authkey">
  </cman>

  <clusternodes>
  <clusternode name="hn1" votes="1" nodeid="1"/>
  <clusternode name="hn2" votes="1" nodeid="2"/></clusternodes>

</cluster>

I am reinstalling the second node, carefully applying all updates, and I will try again to add it to the cluster to see if I did something wrong...

And after updating, here is the result of pvecm add ip_nodename1:

Code:
root@hn2:~# pvecm add www.xxx.yyy.zzz
The authenticity of host 'www.xxx.yyy.zzz (www.xxx.yyy.zzz)' can't be established.
ECDSA key fingerprint is 82:b7:b2:72:0a:88:b2:bf:1e:08:da:bb:db:27:c7:ff.
Are you sure you want to continue connecting (yes/no)? yes
root@www.xxx.yyy.zzz's password:
node hn2 already defined
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-clustercan't create shared ssh key database '/etc/pve/priv/authorized_keys'
.
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]

Do I need to delete something on nodename1?

Here is the result of pvecm status on nodename2:

Code:
root@virt-hn2:~# pvecm status
Version: 6.2.0
Config Version: 4
Cluster Name: VIRTU
Cluster Id: 15565
Cluster Member: Yes
Cluster Generation: 416
Membership state: Cluster-Member
Nodes: 1
Expected votes: 2
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: virt-hn2
Node ID: 2
Multicast addresses: 239.192.60.10
Node addresses: www.xxx.yyy.zzz

And here is the result of pvecm updatecerts:
Code:
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
no quorum - unable to update files



Thanks a lot for support,

LT
 
Seems multicast is not working correctly:

See http://pve.proxmox.com/wiki/Multicast_notes

Please test with omping.
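A minimal test (assuming the hostnames hn1 and hn2 resolve on both machines) is to run omping simultaneously on both nodes and compare the unicast and multicast loss figures; stop it with Ctrl+C after about 30 seconds:

Code:
# omping hn1 hn2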

Here is the result:

Code:
hn1 :   unicast, xmt/rcv/%loss = 37/37/0%, min/avg/max/std-dev = 0.098/0.177/0.232/0.033
hn1 : multicast, xmt/rcv/%loss = 37/37/0%, min/avg/max/std-dev = 0.120/0.195/0.246/0.030

and on the other node:

Code:
hn2 :   unicast, xmt/rcv/%loss = 37/37/0%, min/avg/max/std-dev = 0.140/0.178/0.217/0.022
hn2 : multicast, xmt/rcv/%loss = 37/37/0%, min/avg/max/std-dev = 0.171/0.199/0.230/0.016

Thanks,

L.T.
 
You were right!!!

It was a multicast problem. I added these lines from your documentation http://pve.proxmox.com/wiki/Multicast_notes:
Code:
/sbin/iptables -A INPUT -m addrtype --dst-type MULTICAST -j ACCEPT
/sbin/iptables -A INPUT -p udp -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT

and now pvecm status and pvecm updatecerts are both OK.
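As a side note (a sketch, not from the original thread): these iptables rules are lost on reboot. One common way on Debian to make them persistent is to add them as post-up hooks inside the stanza of the cluster-facing bridge in /etc/network/interfaces (vmbr0 in a default setup; this is an assumption about the network config):

Code:
post-up /sbin/iptables -A INPUT -m addrtype --dst-type MULTICAST -j ACCEPT
post-up /sbin/iptables -A INPUT -p udp -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT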

I am checking if migration is OK now...

Thanks a lot !!

L.T.
 
Everything's OK now!

Thanks a lot for the support...

L.T.

PS: For a cluster, is the support subscription per node or per cluster?
 
