After trying to add a new node, my cluster is broken - please help!

klik333

My working cluster consisted of 2 nodes. While I was adding a third node, something went wrong: the process didn't finish and just timed out.
When I started to investigate possible reasons for this unexpected result, I noticed that I had made a mistake while adding the new node to the /etc/hosts file on the main node of the cluster. The mistake was in the new node's domain name (it was the same as the main node's). I have now fixed the hosts file, but that didn't help - the cluster seems broken and I've lost web access to my third node.
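For reference, a correct /etc/hosts entry should give each node its own unique name; the addresses and names below are just placeholders:

10.0.0.1 pve-node1.example.com pve-node1   # main node
10.0.0.3 pve-node3.example.com pve-node3   # new node - must not reuse the main node's name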
I can post my conf files - just tell me which ones.
On my third node the /etc/pve folder looks very strange (see screenshot).
 

Attachments

  • 2020-02-18_18-36-06.jpg
Hi,

the easiest way is to remove the new member from the cluster, if it has already been added.
Then you have to fix your hosts file, if not already done.
After this, you can try to add the node to the cluster with the force flag.
For more information, see
man pvecm
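As a rough sketch, assuming pve-node1 is a remaining cluster member and pve-node3 is the broken new node (both names are placeholders):

pvecm delnode pve-node3        # run on a cluster member: drop the half-joined node
pvecm add pve-node1 --force    # run on the fixed node: rejoin despite leftover state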
 
Thank you for your help.
Could you please tell me how to manually remove the bad node from the cluster? Or give me a link to the guide, because it's impossible to remove it using the GUI.
I googled the official guide with those instructions. But it would be difficult to power off the node because it is on a VDS. Would it be enough to just make it unreachable over the network (by disconnecting it from the virtual switch)?
I want to be sure that I'm doing everything right, because the running cluster has 2 production servers.
The third server is quite new, and it's OK to make it unavailable.
 
After some investigation I managed to manually remove the bad node from the cluster. (I took the steps from the section "Separate a Node Without Reinstalling" of the official guide.)
But when I tried to join this node to the cluster again (now via the command line), I got stuck at a "waiting for quorum..." message (see screenshot).
None of the messages I have found on the forum gave me a clear answer as to the reason.
My assumption is that multicast is not working. But in that case, how did I manage to join these two servers into the current cluster in the first place?
The configurations on these three servers are nearly the same.
Please give me a hint where to dig.
P.S. The 'waiting for quorum...' message was displayed for several hours, until I interrupted it with Ctrl+C.
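For what it's worth, I understand multicast between the nodes can be tested with omping, run on all nodes at the same time; something like this, with placeholder hostnames:

omping -c 10000 -i 0.001 -F -q pve-node1 pve-node2 pve-node3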
 

Attachments

  • 2020-02-25_15-27-57.jpg
I am having the exact same issue here. I have been looking up and down the internet, and there is not a single solution to it. =(
 
One day I found this walkthrough somewhere, and since then I use it whenever my cluster suddenly breaks apart after adding a new node.
0. BACKUP ALL YOUR GUESTS

Be sure you have backups of everything, just in case something goes wrong.
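For example, all guests on a node can be backed up in one go with vzdump; a minimal sketch, assuming a storage named "local" that allows backups:

vzdump --all --mode snapshot --storage local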

1. Separate a Node Without Reinstalling

First, stop the corosync and pve-cluster services on the node:

systemctl stop pve-cluster
systemctl stop corosync

Start the cluster file system again in local mode:

pmxcfs -l

Delete the corosync configuration files:

rm /etc/pve/corosync.conf
rm -r /etc/corosync/*

You can now start the file system again as a normal service:

killall pmxcfs
systemctl start pve-cluster

The node is now separated from the cluster. You can delete it from any remaining node of the cluster with:

pvecm delnode oldnode
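You can then verify from any remaining node that the member list and quorum look sane:

pvecm nodes     # list remaining cluster members
pvecm status    # check quorum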

Now switch back to the separated node and delete all the remaining cluster files on it. This ensures that the node can be added to another cluster again without problems.

rm /var/lib/corosync/*

As the configuration files from the other nodes are still in the cluster file system, you may want to clean those up too. After making absolutely sure that you have the correct node name, you can simply remove the entire directory recursively from /etc/pve/nodes/NODENAME.
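That is, with NODENAME standing in for the name of the removed node:

rm -r /etc/pve/nodes/NODENAME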

The node's SSH keys will remain in the authorized_keys file. This means that the nodes can still connect to each other with public key authentication. You should fix this by removing the respective keys from the /etc/pve/priv/authorized_keys file.
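Each of those entries normally ends in a root@nodename comment, so you can locate the lines to delete with something like this (NODENAME again being a placeholder):

grep -n "root@NODENAME" /etc/pve/priv/authorized_keys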

2. Add a node (with guests) back to the cluster

On node1 (with guests)
Create a new cluster or get the join information.

On node2 (with guests)
Copy the node's guest configs to any node already in the cluster:
scp -r /etc/pve/nodes/* node1:/etc/pve/nodes/

Then remove them, or back them up somewhere else:
rm -r /etc/pve/nodes/*

Join the cluster (better from the console):
pvecm add [node1_hostname]
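If the join is refused because the node was part of a cluster before, the force flag mentioned earlier in this thread may help, e.g.:

pvecm add node1_hostname --force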

Hope it helps you.
 
