Trying to join a new 9.2 node to an existing cluster - failing

Ron Gage

Active Member
Aug 7, 2019
Greetings from Detroit.

I am trying (unsuccessfully) to join a new 9.2 node to an existing cluster. The new node (pv01) ends up in a state that I can't recover from: the web interface won't let me log in, and the join, while not visibly erroring out, appears not to have succeeded. This has happened to me twice today, with the same procedure both times. The first time, I ended up having to completely reload the system. I'm afraid I will have to do that again, which is a major PITA since there is no VGA port on the node (I have to swap out a 10Gb network card for a VGA card to be able to reload).

Here is what I have been doing to try to join the cluster:
Swap 10Gb card out and install VGA card
Load PVE onto node
Set up management networking
Update node to latest software
Power down node
Swap VGA card for 10Gb card
Power on node
Set up storage networking (10Gb card)
Gather cluster join info from existing cluster
Plug cluster join info into new node and execute
New node does not actually join cluster
Can no longer log into new node from Web UI.

At this point the folder /etc/pve/nodes is completely gone on the new node.

root@pv01:/etc/pve# ls -l
total 1
-r--r----- 1 root www-data 443 Jun 20 16:02 corosync.conf
lr-xr-xr-x 1 root www-data 10 Dec 31 1969 local -> nodes/pv01
lr-xr-xr-x 1 root www-data 14 Dec 31 1969 lxc -> nodes/pv01/lxc
lr-xr-xr-x 1 root www-data 17 Dec 31 1969 openvz -> nodes/pv01/openvz
lr-xr-xr-x 1 root www-data 22 Dec 31 1969 qemu-server -> nodes/pv01/qemu-server
root@pv01:/etc/pve#
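
In the listing above, every symlink points into nodes/pv01, which no longer exists. A minimal sketch for spotting that condition quickly is below; the function name list_dangling_links is made up for illustration and is not part of PVE.

```shell
#!/bin/sh
# list_dangling_links DIR
# Print every symlink directly under DIR whose target does not exist.
# DIR defaults to /etc/pve. (Hypothetical helper, not a PVE tool.)
list_dangling_links() {
    dir=${1:-/etc/pve}
    for path in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: following it leads nowhere
        if [ -L "$path" ] && [ ! -e "$path" ]; then
            printf '%s -> %s\n' "$path" "$(readlink "$path")"
        fi
    done
}
```

Running list_dangling_links with no argument on the affected node should report local, lxc, openvz and qemu-server if the nodes folder is missing.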

Rebooting the new node at this point leaves the Web UI broken (something is listening on tcp/8006, but nothing can actually connect). This is likely because the SSL certs under /etc/pve/ are gone.

At a basic level, is there anything I can do to reset the node back to a "standalone" configuration so I can have another go at it without having to reload the node? For that matter, is there anything I am doing wrong here? Did I hit a bug?

Ron Gage
 
Partial solution:

I was able to get the system to at least think it was a standalone system. Here are the steps there:
systemctl stop pve-cluster
systemctl stop corosync
rm -f /var/lib/pve-cluster/.pmxcfs.lockfile -- note the leading period in the file name
pmxcfs -l
edit /etc/pve/corosync.conf - remove every node stanza that mentions another (non-local) node.
pvecm expected 1
systemctl restart pveproxy
systemctl restart pvestatd
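
For the corosync.conf edit in the steps above, here is a rough sketch of a filter that keeps only the local node's stanza. It assumes the usual layout where each node { ... } block holds a "name: <nodename>" line and no nested braces. The function name strip_remote_nodes is made up for illustration. It only writes to stdout, so the result can be reviewed before anything is copied over the real file.

```shell
#!/bin/sh
# strip_remote_nodes LOCAL_NAME CONF_FILE
# Print CONF_FILE with every "node { ... }" block dropped except the one
# whose "name:" line equals LOCAL_NAME. (Hypothetical helper; assumes
# node blocks contain no nested braces.)
strip_remote_nodes() {
    awk -v keep="$1" '
        # Start of a node block ("nodelist {" does not match this)
        /^[[:space:]]*node[[:space:]]*[{]/ { in_node = 1; buf = ""; is_local = 0 }
        in_node {
            buf = buf $0 "\n"                       # buffer the block
            if ($1 == "name:" && $2 == keep) is_local = 1
            if ($0 ~ /[}]/) {                       # end of the block
                if (is_local) printf "%s", buf      # keep only the local node
                in_node = 0
            }
            next
        }
        { print }                                   # everything else passes through
    ' "$2"
}
```

Something like strip_remote_nodes pv01 /etc/pve/corosync.conf > /tmp/corosync.conf.new would produce a candidate file to diff against the original.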

At this point, you should be able to connect to the web UI and log in as root.
The node should now behave as a standalone system.
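
The recovery steps from this post can be wrapped in a small script, sketched below. The commands are the ones listed in the post; the function names, the run helper and the DRY_RUN switch are my own additions for illustration. The lock file is spelled the way the pmxcfs daemon is named. The corosync.conf edit stays a manual step between the two functions.

```shell
#!/bin/sh
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: stop the cluster services and start pmxcfs in local mode.
standalone_prepare() {
    run systemctl stop pve-cluster &&
    run systemctl stop corosync &&
    run rm -f /var/lib/pve-cluster/.pmxcfs.lockfile &&
    run pmxcfs -l
}

# Step 2: run only after /etc/pve/corosync.conf has been edited by hand.
standalone_finish() {
    run pvecm expected 1 &&
    run systemctl restart pveproxy &&
    run systemctl restart pvestatd
}
```

With DRY_RUN=1 set, calling standalone_prepare and then standalone_finish only prints the commands, which is a cheap way to check the sequence before running it for real as root.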

Still don't know why the new node wasn't able to join the cluster, but at least it looks like I recovered the node to a somewhat normal state.
 
Adding more notes:
If you ever get to a point where ssh operations (like migrate) freeze at the start, the following fix worked for me.
Edit ~/.ssh/config and add the following line at the end of that file:
KexAlgorithms=curve25519-sha256
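
A minimal sketch for applying that setting without adding it twice is below. The function name pin_kex is made up for illustration. One deliberate difference from the plain one-line edit: it writes the option under a "Host *" stanza, on the assumption that the file might already end with a host-specific block, in which case a bare line appended at the end would only apply to that host.

```shell
#!/bin/sh
# pin_kex [CONFIG]
# Append a "Host *" stanza pinning the key exchange algorithm, unless the
# file already sets KexAlgorithms somewhere. CONFIG defaults to
# ~/.ssh/config. (Hypothetical helper.)
pin_kex() {
    cfg=${1:-$HOME/.ssh/config}
    mkdir -p "$(dirname "$cfg")"
    touch "$cfg"
    chmod 600 "$cfg"
    # Leave an existing setting alone rather than stacking a second one
    if grep -qi '^[[:space:]]*KexAlgorithms' "$cfg"; then
        echo "KexAlgorithms already set in $cfg; leaving it alone"
        return 0
    fi
    printf '\nHost *\n    KexAlgorithms curve25519-sha256\n' >> "$cfg"
}
```

Afterwards, ssh -G <hostname> prints the options ssh would actually use for that host, so the kexalgorithms line there shows whether the setting took effect.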
 