PM 3.1 Clustering - Reinstalling after master node failed

mylesw

Renowned Member
I have a PM 3.1 cluster with about 5 node members. Over the weekend, the master server for the cluster died and we had to wipe and re-install PM on it. It's back online now, but it's no longer part of the cluster. I did a cluster create on it, and that's done, but none of the child nodes are connected to it.

What do I need to do in order to re-attach a child node to a re-installed master node in a cluster? The cluster name is the same, but of course a new SSL key was generated during re-installation.

Thanks in advance for any assistance.

Myles
 
I have a PM 3.1 cluster with about 5 node members. Over the weekend, the master server for the cluster died and we had to wipe and re-install PM on it. It's back online now, but it's no longer part of the cluster. I did a cluster create on it, and that's done, but none of the child nodes are connected to it.

Never do a cluster create when you want to join an existing cluster!

What do I need to do in order to re-attach a child node to a re-installed master node in a cluster?

I have no clue what you are talking about. There is simply no 'master' and no 'slave' in a PVE cluster.
To add a node, use 'pvecm add'
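For example, something like this on the node that should join (the IP below is just a placeholder for any node already in the cluster):

# run on the joining node, pointing at an existing cluster member
pvecm add 192.168.1.10
# then check membership
pvecm status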
 
Never do a cluster create when you want to join an existing cluster!
I have no clue what you are talking about. There is simply no 'master' and no 'slave' in a PVE cluster.
To add a node, use 'pvecm add'
Ah, you might have just solved my problem. I was under the impression that a cluster had a master node and all children join to it. Since we normally create the cluster on a master first and then do an add node on the children, it suggested to me that there was some form of hierarchy to it. Are you saying that this is just semantics - that I should just add the original master node back to the cluster again?
 
Are you saying that this is just semantics - that I should just add the original master node back to the cluster again?

Yes, just add the node again. You may need to use the --force flag if you use the same name/IP as before.
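Something along these lines, where the IP stands in for an existing cluster member:

# re-add a node that re-uses its old name/IP
pvecm add 192.168.1.10 --force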
 
Something interesting... I wiped the server that I am trying to add to the cluster and re-installed PM 3.1 on it. All fine. The server is back up and running now.

But there is no sign of clustering on this at all. Attempting to add the node to the cluster is failing. When I look in /etc/cluster there is no cluster.conf there at all.

Am I missing a step here? Shouldn't clustering be installed by default and just enabled when you add a node to the cluster?

Myles
 
Shouldn't clustering be installed by default and just enabled when you add a node to the cluster?

A fresh node has no cluster on it:
- if you join an existing cluster, it makes no sense to have a different one locally
- if you need to start a new cluster with it, you can simply create one (see the sketch below)
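For the second case, a minimal sketch (the cluster name is just a placeholder, and this is only ever run on the very first node of a new cluster):

# run once, on the first node only
pvecm create mycluster
# verify the new cluster is up
pvecm status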

Marco
 
A fresh node has no cluster on it:
- if you join an existing cluster, it makes no sense to have a different one locally
- if you need to start a new cluster with it, you can simply create one
Marco

Yes, but in this case the server was originally part of a cluster. The disk array died, but I was able to migrate all VMs off it before I had to wipe and re-install. Now that I have wiped and re-installed, I want to re-join it to the same cluster it was in before.

When I attempt to do this with pvecm add <nodename> --force, it fails with:

I/O warning : failed to load external entity "/etc/pve/cluster.conf"
ccs_tool: Error: unable to parse requested configuration file

I checked, and I found that cman is installed, but there is no definition of the cluster at all.

I think I'm missing a step here - something that sets up the cluster name it should join, and hence the cluster.conf file?
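Roughly what I am checking, in case I have a command wrong (paths as they appear on my PVE 3.1 node):

# is this node part of a cluster at all?
pvecm status
# the cluster definition distributed via /etc/pve
cat /etc/pve/cluster.conf
# state of the cman service
service cman status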

Myles
 
Thanks to everyone. I got it working. The problem was that I was using a hostname rather than the IP address of a member node.

Myles
 
I thought I was out of the woods on this, but not quite yet. So I have successfully re-installed PM 3.1 on my server and added it back to the cluster. It has the same hostname and IP as before, so it is now showing up as part of the cluster.

I am now trying to migrate VMs from the temporary server back to this one, and I'm getting this on attempting a migration:

Jul 28 13:08:29 # /usr/bin/ssh -o 'BatchMode=yes' root@xxxx /bin/true
Jul 28 13:08:29 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Jul 28 13:08:29 @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
Jul 28 13:08:29 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Jul 28 13:08:29 IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Jul 28 13:08:29 Someone could be eavesdropping on you right now (man-in-the-middle attack)!
Jul 28 13:08:29 It is also possible that a host key has just been changed.
Jul 28 13:08:29 The fingerprint for the ECDSA key sent by the remote host is
Jul 28 13:08:29 xxxxxxxxxxxxx
Jul 28 13:08:29 Please contact your system administrator.
Jul 28 13:08:29 Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Jul 28 13:08:29 Offending RSA key in /etc/ssh/ssh_known_hosts:12
Jul 28 13:08:29 ECDSA host key for xx.xx.xx.xx has changed and you have requested strict checking.
Jul 28 13:08:29 Host key verification failed.
Jul 28 13:08:29 ERROR: migration aborted (duration 00:00:00): Can't connect to destination address using public key
TASK ERROR: migration aborted

OK, what should I do now?

Myles
 
I am now trying to migrate VMs from the temporary server back to this one, and I'm getting this on attempting a migration: [...] OK, what should I do now?

SSH from one node to the other.
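Something like this from the node you are migrating from (the IP is a placeholder for the re-installed node; the known_hosts path is the one named in the error above):

# drop the stale host key entry for the re-installed node
ssh-keygen -f /root/.ssh/known_hosts -R 192.168.1.20
# connect once so the new host key gets recorded
ssh root@192.168.1.20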
 
I am now trying to migrate VMs from the temporary server back to this one, and I'm getting this on attempting a migration: [...] OK, what should I do now?

SSH from one node to the new node.
 
Never mind, fixed it. There was an entry in /root/.ssh/known_hosts on the older node that was causing the problem. I removed it and then manually SSH'd to the new node, which created a new entry. Now it works great.

Thanks
Myles
 
