Trouble joining a cluster in a production environment!

R0bin (Montpellier), Dec 6, 2019

Hello.
I have a problem with a cluster. I'll give you the context first, then explain the problem.

Context:
I have two multi-node clusters:
- one on Proxmox V5 / Debian 9
- one on Proxmox V6 / Debian 10

My goal is to migrate all my Proxmox V5 CTs to my new cluster (Proxmox V6). Each time I have migrated all the CTs off a Proxmox V5 node, I reinstall that node with a fresh Proxmox V6 and join it to the new cluster.
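
The post doesn't say how the CTs are moved. In Proxmox V5/V6, pct migrate only works between nodes of the same cluster, so one common approach is a backup on the old node followed by a restore on the new cluster. The sketch below only prints the commands for one CT so they can be reviewed and run by hand; the helper name plan_ct_move, the dump directory layout and the storage name are assumptions, and it assumes the VMID is not already in use on the new cluster.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a backup/copy/restore plan for one CT,
# so each step can be reviewed and run by hand on the old Proxmox V5 node.
# Assumptions: root SSH from the old node to the target works, and the VMID
# is not already used in the new cluster.
plan_ct_move() {
    vmid=$1      # CT ID on the old node
    target=$2    # any node of the new Proxmox V6 cluster
    storage=$3   # storage on the target for the restored root disk
    dumpdir="/var/lib/vz/dump/move-$vmid"

    # 1. Stop-mode backup into a dedicated directory on the old node.
    echo "mkdir -p $dumpdir"
    echo "vzdump $vmid --mode stop --dumpdir $dumpdir"
    # 2. Copy the archive to the same path on the target node.
    echo "ssh root@$target mkdir -p $dumpdir"
    echo "scp $dumpdir/vzdump-lxc-$vmid-*.tar* root@$target:$dumpdir/"
    # 3. Restore it on the target under the same VMID.
    echo "ssh root@$target 'pct restore $vmid $dumpdir/vzdump-lxc-$vmid-*.tar* --storage $storage'"
}
```

For example, plan_ct_move 101 proxmox-0005 local-lvm prints the five commands for CT 101. Stop mode keeps the CT down from the backup until the restore finishes; snapshot mode avoids most of that downtime but needs snapshot-capable storage.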

My current state is:
proxmox-0001 -> OLD cluster
proxmox-0002 -> Migrating to NEW cluster
proxmox-0003 -> OLD cluster
proxmox-0004 -> NEW cluster
proxmox-0005 -> NEW cluster (first node; I created the cluster on it)
proxmox-0006 -> NEW cluster

My problem:
The join procedure seems to have failed for proxmox-0002: I got no message confirming that it succeeded, and proxmox-0002 appears in the server view on every node of the cluster, but it is shown as "broken".
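
One way to tell whether a join (pvecm add, or Join Cluster in the GUI) really completed is to look at what it should have written into the cluster filesystem: a corosync.conf under /etc/pve whose nodelist contains the node, and a per-node directory under /etc/pve/nodes. The helper below checks both; its name and the PVE_ROOT override (there only so it can be tried against a copy) are assumptions, and a passing check does not prove that corosync is actually talking to the other nodes.

```shell
#!/bin/sh
# Hypothetical helper: check the traces a completed cluster join should leave.
# PVE_ROOT defaults to the cluster filesystem mount point /etc/pve.
check_join_state() {
    node=${1:-$(hostname)}
    root=${PVE_ROOT:-/etc/pve}

    if [ ! -f "$root/corosync.conf" ]; then
        echo "$root/corosync.conf missing: this node is not part of any cluster"
        return 1
    fi
    # Each member appears in the nodelist as a "name: <node>" line.
    if ! grep -Eq "^[[:space:]]*name:[[:space:]]*$node[[:space:]]*\$" "$root/corosync.conf"; then
        echo "$node is not listed in $root/corosync.conf"
        return 1
    fi
    if [ ! -d "$root/nodes/$node" ]; then
        echo "$root/nodes/$node is missing"
        return 1
    fi
    echo "$node is listed in corosync.conf and has a nodes/ directory"
}
```

Run on proxmox-0002 itself as check_join_state, or on another member as check_join_state proxmox-0002.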

Every node is now really slow (web interface included), and commands like pct list don't respond at all...
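
Slow nodes and a pct list that never returns are consistent with the cluster filesystem (/etc/pve) being blocked, for example while the cluster has lost quorum; without quorum, /etc/pve becomes read-only. A quick check, sketched below, is to look for the Quorate line in pvecm status and to wrap commands in timeout so the shell does not hang. The function names are made up for illustration, and nothing here changes the cluster.

```shell
#!/bin/sh
# Hypothetical helper: read `pvecm status` output on stdin and succeed only
# if the "Quorum information" section reports "Quorate: Yes".
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Hypothetical read-only health check to run on each cluster node.
check_node_health() {
    # Bound both commands so a blocked /etc/pve cannot hang the shell.
    if timeout 10 pvecm status | is_quorate; then
        echo "$(hostname): quorate"
    else
        echo "$(hostname): no quorum reported (or pvecm status timed out)"
    fi
    if ! timeout 10 pct list >/dev/null; then
        echo "$(hostname): pct list failed or took longer than 10s"
    fi
}
```

Running check_node_health on each node gives a quick per-node picture of whether the cluster filesystem is responding.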

Every member of the cluster has this problem. On every node, when I navigate to DATACENTER/cluster, I see a list of my nodes, but the form offers to create a cluster (not to join one), which makes no sense.

Attachment: cluster.png

I tried reinstalling proxmox-0002, since I couldn't access the web interface and there were no CTs on it. But my cluster still seems broken: I can't create a new CT on the cluster because the node list is empty (see the other attachment).

Attachment: empty.png

How can I fix this ?

I tried rebooting the proxmox-0005 and proxmox-0002 nodes, but the cluster is still broken :(

Please note that production servers are running on this cluster.

Thank you very much for your urgent help.
 
I think I found a way to restore my cluster (by removing proxmox-0002):

First, I rebooted proxmox-0002 into rescue mode (an OVH live OS).

Then, on proxmox-0005, I ran:
Code:
pvecm delnode proxmox-0002
ls -l /etc/pve/nodes/                    # Still showed a proxmox-0002 folder, so I "removed" it
mv /etc/pve/nodes/proxmox-0002/ /root/    # Moved it into /root (kept as a backup) rather than deleting it
cat /etc/corosync/corosync.conf            # No longer contains anything about proxmox-0002
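
After pvecm delnode, it is also worth checking on every remaining node that proxmox-0002 is gone from the live corosync membership, not only from corosync.conf. The sketch below extracts node names from the membership table printed by pvecm nodes; the helper names are assumptions, and the parsing assumes the usual "Nodeid Votes Name" column layout of that output.

```shell
#!/bin/sh
# Hypothetical helper: print one node name per line from `pvecm nodes` output
# read on stdin. Data rows look like "   1   1 proxmox-0005 (local)".
member_names() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $3 }'
}

# Hypothetical check: succeed if node $1 is still a live cluster member.
still_member() {
    pvecm nodes | member_names | grep -qxF "$1"
}
```

On proxmox-0005, still_member proxmox-0002 && echo "still listed" should print nothing once the removal has gone through.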

Now I can create CTs again, and the cluster join information is back.

BUT:
Even after two reinstallations, I still can't add proxmox-0002 to my cluster, which is a problem for me.
Can you please help? If I don't find a solution, I will have to create a new cluster and migrate my production servers to it all over again...

Maybe there is persistent data in a database somewhere? I have no idea how to deal with this, but it is a real handicap.
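
On the "persistent data" question: /etc/pve is the view of the pmxcfs database (/var/lib/pve-cluster/config.db on each node), so anything left over about proxmox-0002 should show up as files under /etc/pve. A read-only way to look for leftovers before the next join attempt is to search that tree for the old node name, as sketched below. The helper name and the optional root argument are assumptions; which hits actually matter (for instance a stale SSH host key in the known_hosts file under priv/) has to be judged case by case, and any file should be backed up first, as was done with the nodes/ directory above, rather than deleted blindly.

```shell
#!/bin/sh
# Hypothetical helper: list places under ROOT (default /etc/pve) that still
# mention node NAME. Read-only: it only reports, it changes nothing.
find_node_leftovers() {
    name=$1
    root=${2:-/etc/pve}

    # A leftover per-node directory (node config, certificates, guest configs).
    if [ -d "$root/nodes/$name" ]; then
        echo "directory: $root/nodes/$name"
    fi
    # Any file that mentions the name, e.g. corosync.conf or priv/known_hosts.
    grep -rlF -- "$name" "$root" 2>/dev/null | sed 's/^/mentioned in: /'
}
```

Running find_node_leftovers proxmox-0002 on proxmox-0005 lists the candidates to review.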
 