[Proxmox 3] Migration error, configuration lost

Hello,
First, a great thanks for Proxmox! We really appreciate it... never a problem with 1.9.
I have just updated our two nodes to 3.0: no problem, great compatibility!
Unfortunately, I ran into trouble when migrating from one node to another (same hardware, same update level):
Code:
Jul 31 14:50:43 starting migration of CT 107 to node 'hn2' (xxx.yyy.www.zzz)
Jul 31 14:50:43 container is running - using online migration
Jul 31 14:50:44 starting rsync phase 1
Jul  31 14:50:44 # /usr/bin/rsync -aHAX --delete --numeric-ids --sparse  /var/lib/vz/private/107 root@xxx.yyy.www.zzz:/var/lib/vz/private
Jul 31 14:53:45 start live migration - suspending container
Jul 31 14:53:45 dump container state
Jul 31 14:53:51 copy dump file to target node
Jul 31 14:53:54 starting rsync (2nd pass)
Jul 31 14:53:54 # /usr/bin/rsync -aHAX --delete --numeric-ids /var/lib/vz/private/107 root@xxx.yyy.www.zzz:/var/lib/vz/private
Jul 31 14:54:03 dump 2nd level quota
Jul 31 14:54:03 copy 2nd level quota to target node
Jul 31 14:54:05 initialize container on remote node 'hn2'
Jul 31 14:54:05 initializing remote quota
Jul 31 14:54:05 # /usr/bin/ssh -o 'BatchMode=yes' root@xxx.yyy.www.zzz vzctl quotainit 107
Jul 31 14:54:05 ERROR: online migrate failure - Failed to initialize quota: Container config file does not exist
Jul 31 14:54:05 removing container files on local node
Jul 31 14:54:09 start final cleanup
Jul 31 14:54:09 ERROR: migration finished with problems (duration 00:03:26)
TASK ERROR: migration problems
And then the CT is gone, yet not gone: it is there and it isn't, impossible to stop, start or migrate back... no conf file in /etc/vz/conf... or maybe a Schrödinger's cat, because I kept a backup of xxx.conf and tried to copy it back, but the system told me the file already existed; so I tried to delete it and the system told me the file did not exist... lsof, ls -li, nothing works... I can touch toto but not touch xxx.conf, even though there is no xxx.conf.

Any ideas?

Thanks in advance,

L.Torlay
 
I tried to copy it back, but the system told me the file already existed; so I tried to delete it and the system told me the file did not exist...

The file exists, but inside another node's directory (/etc/pve/nodes/[nodename]/openvz).
The system is intelligent enough to check that there is only one config cluster-wide.

To move a VM config to another node, use:

# mv /etc/pve/nodes/[nodename1]/openvz/[VMID].conf /etc/pve/nodes/[nodename2]/openvz
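A quick way to see which node directory currently holds a given container's config (a sketch on my part, assuming the standard /etc/pve layout; CT 107 is the container from the log above):

Code:
# find /etc/pve/nodes -name 107.conf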
 
And what is the output of
# dpkg -l vzctl
on both nodes?

Hello, and thank you very much for your quick answer (I will test your procedure in a few hours).

Here is the version of vzctl (same on both nodes): 4.0-1pve3

L.T.
 
The procedure is certainly correct, but it failed. I think I know where, but I do not know why!

If I connect to the first node (nodename1), I can of course see /etc/pve/nodes/[nodename1]/openvz/[VMID].conf and /etc/pve/nodes/[nodename2]/openvz,
and if I connect to the second node (nodename2), it is the same thing...
But if I restore a VM on nodename1, I do not see the [VMID].conf from the second node in /etc/pve/nodes/[nodename1]/openvz/.
Is there a delay or a synchronisation problem between the two nodes?
In the web interface (nodename1's), the first node's icon is green whereas the second one is red... pvecm status seems OK on both nodes...

Did I do something wrong?

Thanks,

L.T
 
What is the output of
# pveversion -v
Make sure to update to the latest version.
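On a Proxmox VE 3.x node, updating is usually done as follows (assuming the package repositories are already configured):

Code:
# apt-get update
# apt-get dist-upgrade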

I update frequently (most recently this afternoon), and the installation was done two days ago.

Here is the result of the pveversion -v command:

Code:
pve-manager: 3.0-23 (pve-manager/3.0/957f0862)
running kernel: 2.6.32-22-pve
proxmox-ve-2.6.32: 3.0-107
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-22-pve: 2.6.32-107
lvm2: 2.02.95-pve3
clvm: 2.02.95-pve3
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-1
pve-cluster: 3.0-4
qemu-server: 3.0-20
pve-firmware: 1.0-23
libpve-common-perl: 3.0-4
libpve-access-control: 3.0-4
libpve-storage-perl: 3.0-8
vncterm: 1.1-4
vzctl: 4.0-1pve3
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-13
ksm-control-daemon: 1.1-1

For information, Proxmox runs on two Dell R710 servers. I noticed that iowait is much higher during restore than during backup (NFS share), and restoring takes twice as long as backing up (container, snapshot backup mode).

Thanks,

L.T.
 
What I am doing and what I do not understand:
- on nodename2 (which has a ghost container due to the migration failure), I stop pve-cluster; no more access to /etc/pve/nodes/, OK...
- on nodename1 (production node, 30 containers), I stop pve-cluster and copy a backup of 107.conf into /etc/pve/nodes/[nodename2]/openvz
- on nodename1, I start pve-cluster
- on nodename2, I start pve-cluster and I can see /etc/pve/nodes/[nodename2]/openvz... but my backup 107.conf is not there!

Can I rebuild the cluster without rebooting nodename1, in case something strange happened during cluster creation?

Thanks,

L.T.
 
What I am doing and what I do not understand:
- on nodename2 (which has a ghost container due to the migration failure), I stop pve-cluster; no more access to /etc/pve/nodes/, OK...
- on nodename1 (production node, 30 containers), I stop pve-cluster and copy a backup of 107.conf into /etc/pve/nodes/[nodename2]/openvz

Why do you stop pve-cluster? If you stop it, there is no cluster file system mounted at /etc/pve, so you copy the file to the local filesystem instead!
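A quick sanity check (a sketch, not from the thread): while pve-cluster is running, /etc/pve is a FUSE mount provided by pmxcfs, and you can verify that before copying anything into it. No output here means you would be writing to the local root filesystem instead:

Code:
# mount | grep /etc/pve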
 
- on nodename1 (production node, 30 containers), I stop pve-cluster and copy a backup of 107.conf into /etc/pve/nodes/[nodename2]/openvz
- on nodename1, I start pve-cluster

I guess starting pve-cluster fails, because you now have a local file inside /etc/pve/?
 
I guess starting pve-cluster fails, because you now have a local file inside /etc/pve/?
Well, no, it works (or does it?)... When pve-cluster is stopped on both nodes, there is nothing under /etc/pve on the second node, but on the first node I still see the subdirectories. I thought the second node would get the information from the master node, but I was wrong, wasn't I?
Is there a way to reinitialize the cluster? My second node is not in production.
Thanks,

L.T
 
Is cluster communication OK? Check with:

# pvecm status

You guessed right, there is a problem on nodename2:

Code:
#  pvecm status
cman_tool: Cannot open connection to cman, is it running ?
# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... /usr/sbin/ccs_config_validate: line 186: 28314 Segmentation fault      (core dumped) ccs_config_dump > $tempfile

Unable to get the configuration
corosync [MAIN  ] Corosync Cluster Engine ('1.4.5'): started and ready to provide service.
corosync [MAIN  ] Corosync built-in features: nss
corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
corosync died with signal: 11 Check cluster logs for details
[FAILED]

Today's corosync.log is empty (the last archived corosync.log contains nothing unusual):

Code:
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: corosync profile loading service
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais cluster membership service B.01.01
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais checkpoint service B.01.01
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais event service B.01.01
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais distributed locking service B.03.01
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais message service B.03.01
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: corosync CMAN membership service 2.90
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Aug 01 15:49:09 corosync [SERV  ] Service engine unloaded: openais timer service A.01.01
Aug 01 15:49:09 corosync [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:1893.

On nodename1, it still seems OK:

Code:
#  pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: VIRTU
Cluster Id: 15565
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: hn1
Node ID: 1
Multicast addresses: 239.192.60.10
Node addresses: xxx.yyy.zzz.www

Thanks,

L.T.
 
Can you please post the cluster config /etc/pve/cluster.conf?

Of course, here it is:

Code:
<?xml version="1.0"?>
<cluster name="VIRTU" config_version="4">

  <cman keyfile="/var/lib/pve-cluster/corosync.authkey">
  </cman>

  <clusternodes>
  <clusternode name="hn1" votes="1" nodeid="1"/>
  <clusternode name="hn2" votes="1" nodeid="2"/></clusternodes>

</cluster>

I am reinstalling the second node, carefully applying all updates, and I will try again to add it to the cluster to see if I did something wrong...

And after updating, here is the result of pvecm add ip_nodename1:

Code:
root@hn2:~# pvecm add www.xxx.yyy.zzz
The authenticity of host 'www.xxx.yyy.zzz (www.xxx.yyy.zzz)' can't be established.
ECDSA key fingerprint is 82:b7:b2:72:0a:88:b2:bf:1e:08:da:bb:db:27:c7:ff.
Are you sure you want to continue connecting (yes/no)? yes
root@www.xxx.yyy.zzz's password:
node hn2 already defined
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-clustercan't create shared ssh key database '/etc/pve/priv/authorized_keys'
.
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]

Do I need to delete something on nodename1?

Here is the result of pvecm status on nodename2:

Code:
root@virt-hn2:~# pvecm status
Version: 6.2.0
Config Version: 4
Cluster Name: VIRTU
Cluster Id: 15565
Cluster Member: Yes
Cluster Generation: 416
Membership state: Cluster-Member
Nodes: 1
Expected votes: 2
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: virt-hn2
Node ID: 2
Multicast addresses: 239.192.60.10
Node addresses: www.xxx.yyy.zzz

And here is the result of pvecm updatecerts:
Code:
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
no quorum - unable to update files



Thanks a lot for support,

LT
 
Seems multicast is not working correctly:

See http://pve.proxmox.com/wiki/Multicast_notes

Please test with omping.
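A minimal test (assuming the hostnames hn1 and hn2 resolve on both machines) is to run omping simultaneously on both nodes and compare the unicast and multicast loss figures; stop it with Ctrl+C after about 30 seconds:

Code:
# omping hn1 hn2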

Here is the result:

Code:
hn1 :   unicast, xmt/rcv/%loss = 37/37/0%, min/avg/max/std-dev = 0.098/0.177/0.232/0.033
hn1 : multicast, xmt/rcv/%loss = 37/37/0%, min/avg/max/std-dev = 0.120/0.195/0.246/0.030

and on the other node:

Code:
hn2 :   unicast, xmt/rcv/%loss = 37/37/0%, min/avg/max/std-dev = 0.140/0.178/0.217/0.022
hn2 : multicast, xmt/rcv/%loss = 37/37/0%, min/avg/max/std-dev = 0.171/0.199/0.230/0.016

Thanks,

L.T.
 
You were right!!!

It was a multicast problem. I added these lines from your documentation http://pve.proxmox.com/wiki/Multicast_notes:
Code:
/sbin/iptables -A INPUT -m addrtype --dst-type MULTICAST -j ACCEPT
/sbin/iptables -A INPUT -p udp -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT

and now pvecm status and pvecm updatecerts are both OK.
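As a side note (a sketch, not from the original thread): these iptables rules are lost on reboot. One common way on Debian to make them persistent is to add them as post-up hooks inside the stanza of the cluster-facing bridge in /etc/network/interfaces (vmbr0 in a default setup; this is an assumption about the network config):

Code:
post-up /sbin/iptables -A INPUT -m addrtype --dst-type MULTICAST -j ACCEPT
post-up /sbin/iptables -A INPUT -p udp -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT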

I am checking if migration is OK now...

Thanks a lot !!

L.T.
 
Everything's OK now!

Thanks a lot for the support...

L.T.

PS: For a cluster, is the support subscription per node or per cluster?
 
