Unable to create cluster

DeanKellham

May 29, 2024
Hi all,

I've been using PVE for a while now. I have 1 system running 8.2.2 with a handful of VMs.

I've installed a fresh copy of PVE 8.2.2 onto another machine, with the intent of creating a cluster so I can replicate / migrate, and manage both nodes from one place.

I confirm both nodes are working fine independently - both accessible via ssh and web gui, etc.

As soon as I join node 2 (pve2.lan) to the cluster the web gui stops being available. I've read this is due to certificates, and can be resolved by restarting pveproxy and pve-cluster services on all nodes. I do this to no avail on pve2.lan.

I then restart the services on node 1 (pve.lan) - and now I lose web gui on here too.

I've followed steps over on https://pve.proxmox.com/wiki/Cluster_Manager to destroy the cluster, which has restored web gui on both nodes.

Can anyone help me work out what I'm doing wrong here? Is there another prerequisite to creating a cluster that I'm missing?

Everything else is default - using self-signed certs, nothing fancy.

Any help would be greatly appreciated.

Thanks,
Dean
 
Some possibly relevant information - my primary node (pve) was previously running v7 and was recently upgraded to 8 via apt. I'm toying with reinstalling Proxmox VE 8 fresh on there and restoring VMs from NAS backup to see if a fresh install will help.
 
Do you have ssh or terminal access to the node(s) still, even if the webUI isn't showing up?

If so, you can run the command ss -nltp to show which processes are listening on which ports on a system.

For example, this is a system here:

Bash:
# ss -nltp
State     Recv-Q   Send-Q   Local Address:Port   Peer Address:Port   Process                                                                                                                                         
LISTEN    0        100         127.0.0.1:25          0.0.0.0:*       users:(("master",pid=2080,fd=13))                                                                                                             
LISTEN    0        4096        127.0.0.1:85          0.0.0.0:*       users:(("pvedaemon worke",pid=2140,fd=6),("pvedaemon worke",pid=2139,fd=6),("pvedaemon worke",pid=2138,fd=6),("pvedaemon",pid=2137,fd=6))     
LISTEN    0        4096          0.0.0.0:111         0.0.0.0:*       users:(("rpcbind",pid=1560,fd=4),("systemd",pid=1,fd=35))                                                                                     
LISTEN    0        128           0.0.0.0:22          0.0.0.0:*       users:(("sshd",pid=1904,fd=3))                                                                                                                 
LISTEN    0        100             [::1]:25             [::]:*       users:(("master",pid=2080,fd=14))                                                                                                             
LISTEN    0        4096                *:8006              *:*       users:(("pveproxy worker",pid=2252,fd=6),("pveproxy worker",pid=2251,fd=6),("pveproxy worker",pid=2250,fd=6),("pveproxy",pid=2249,fd=6))       
LISTEN    0        4096                *:3128              *:*       users:(("spiceproxy work",pid=2257,fd=6),("spiceproxy",pid=2256,fd=6))                                                                         
LISTEN    0        4096             [::]:111            [::]:*       users:(("rpcbind",pid=1560,fd=6),("systemd",pid=1,fd=37))                                                                                     
LISTEN    0        128              [::]:22             [::]:*       users:(("sshd",pid=1904,fd=4))

From that we can see that "pveproxy worker" is listening on port 8006, which is what's needed.

If you have terminal access on yours, you should be able to run that command to see if something is listening on port 8006 or not.

If nothing is, that's one type of problem ;). If something is listening, though, then it's probably just an IP addressing issue, possibly fixable by editing /etc/network/interfaces.
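If the full ss listing is too noisy, something like the following narrows it down to a single port. It's only a rough sketch: `listeners_on_port` is a helper name I made up (it isn't a Proxmox tool), and it just filters whatever `ss -nltp` output you pipe into it.

```shell
#!/bin/sh
# Filter `ss -nltp` output (read from stdin) down to the sockets that are
# listening on the given TCP port. Returns non-zero if nothing matches.
# listeners_on_port is a hypothetical helper name, not part of Proxmox.
listeners_on_port() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            # Column 4 is "Local Address:Port"; the port follows the last colon,
            # which also works for IPv6 forms such as [::]:22.
            n = split($4, parts, ":")
            if (parts[n] == port) { print; found = 1 }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on a node:
#   ss -nltp | listeners_on_port 8006
```

If that prints a pveproxy line, the web service itself is up and the problem is more likely to be in how you're reaching it.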
 
So have tested some things and can confirm:
  • Both nodes have identical time set.
  • After joining to cluster pve2 appears under the Datacenter tree in pve's web GUI, however it times out when trying to pull any info.
  • After joining to cluster pve2's web gui is unresponsive and it times out when trying to connect via SSH. It does respond to a ping and is still on the network.
  • After joining to cluster pve2 is running "pveproxy worker" and listening on *:8006.
  • After joining to cluster pve2 is running "sshd" and listening on 0.0.0.0:22
I've torn the cluster down again by deleting the corosync config as I've done previously, which has restored web gui and ssh on pve2, and removed the node from the Datacenter tree on pve's web gui.

Thanks for the suggestions. Anything else I can look at?
 
As soon as I join node 2 (pve2.lan) to the cluster the web gui stops being available.

Reading this back a bit more properly (instead of just skimming ;)), um... that's expected.

So, let's say you have two potential nodes, pve and pve2, not yet joined into a cluster.

On the web interface of pve you switch to the Datacenter level, then go to the Cluster section on the right, then click the "Create cluster" (or similar wording) button.

A dialog should appear for choosing a network interface that other cluster members will communicate to this node through. So do that, and create the cluster.

After that, in the same Cluster part of the interface there should be a button to display the Join information. Click that and copy the join info to your clipboard.

Then switch to the web interface of the 2nd node, pve2. Go to the Datacenter level on that one, switch to the Cluster section there, and click the "Join cluster" button (or similar wording, I'm writing this from memory). Then paste in the join info from your clipboard, type in the root password for the first node, and choose a network interface for it to communicate with the first node over.

At this point, behind the scenes, Proxmox is setting up corosync between the 2 nodes and making sure they're using a common set of datacenter related files. As part of this join process your browser session in the 2nd node (the one joining) gets invalidated and the browser will throw up errors about an invalid cert. Just close that browser tab as it's no longer useful.

Meanwhile, in the web interface for the first node (pve) the second node (pve2) should have appeared in the list below the Datacenter level. And if you click on the second node there, it should display the info for the 2nd node without throwing any errors.
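For reference, the same create/join flow can also be done from a shell with pvecm. The sketch below only prints the commands rather than running them, and the cluster name and address are placeholders I've made up (they're not from this thread), so swap in your own.

```shell
#!/bin/sh
# Print the pvecm equivalent of the GUI steps instead of executing anything.
# CLUSTER_NAME and FIRST_NODE_ADDR are placeholder values (assumptions),
# override them in the environment to match your own setup.
CLUSTER_NAME="${CLUSTER_NAME:-homelab}"
FIRST_NODE_ADDR="${FIRST_NODE_ADDR:-192.0.2.10}"

print_cluster_steps() {
    echo "# on the first node (pve):"
    echo "pvecm create $CLUSTER_NAME"
    echo "# on the joining node (pve2), asks for the first node's root password:"
    echo "pvecm add $FIRST_NODE_ADDR"
    echo "# on either node, to check that both members are listed:"
    echo "pvecm status"
}

print_cluster_steps
```

Doing it this way can be handy when the GUI is misbehaving, since any error from the join is printed straight to the terminal.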

Following that process, does it all work that far? Or do you get weird things happening somewhere in that part of the process?
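If something weird does happen, it'd help to grab the cluster state and recent logs from whichever node you can still get a terminal on. A minimal sketch, assuming the usual PVE service names (corosync, pve-cluster, pveproxy); the two function names are just ones I picked, and anything that isn't installed gets skipped rather than erroring out.

```shell
#!/bin/sh
# Run a command only if it exists on this machine; otherwise say so.
# run_if_present and collect_cluster_diagnostics are hypothetical helper names.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@" || echo "== exited with status $?"
    else
        echo "== skipped (not installed): $1"
    fi
}

collect_cluster_diagnostics() {
    run_if_present pvecm status
    run_if_present pvecm nodes
    run_if_present systemctl --no-pager status corosync pve-cluster pveproxy
    # Logs from the current boot for the two cluster services.
    run_if_present journalctl --no-pager -b -u corosync -u pve-cluster
}

# Usage on a node (as root):
#   collect_cluster_diagnostics > /tmp/cluster-diag.txt 2>&1
```

Running it on both nodes right after a join attempt should show whether corosync ever sees the other member.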
 
So I've been running through all those steps. pve2 appears under the Datacenter tree on pve's web gui as expected, however clicking on it doesn't actually load any data, and eventually it tells me that it's timed out.

I suspect something's not configuring properly during the cluster join, but I'm a bit lost as to what.
 
As a test, I'd probably set up the second node from scratch again but use a different name for it (e.g. `pve2b` or something), as I'm thinking there might be some kind of leftover info (old ssh keys?) from the previous join attempt still hanging around.

Or just (!) recreate the whole cluster from scratch, so there's no possibility of old bits getting in the way.
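Before going that far, it might be worth checking whether anything from the earlier attempts is still on disk. This is only a sketch: `leftover_cluster_state` is a made-up helper, and the paths are the ones I'd expect a PVE cluster to use (the corosync config and authkey, plus the per-node directory under /etc/pve/nodes), so treat the list as an assumption rather than a complete one.

```shell
#!/bin/sh
# Report which cluster-related paths exist for a given node name.
# Usage: leftover_cluster_state NODE [ROOT]
# ROOT is an optional prefix (handy for testing); it defaults to the real root.
# leftover_cluster_state is a hypothetical helper name, not a Proxmox tool.
leftover_cluster_state() {
    node="$1"
    root="${2:-}"
    status=1    # 1 = nothing found, 0 = at least one path present
    for path in \
        "$root/etc/pve/corosync.conf" \
        "$root/etc/corosync/corosync.conf" \
        "$root/etc/corosync/authkey" \
        "$root/etc/pve/nodes/$node"
    do
        if [ -e "$path" ]; then
            echo "present: $path"
            status=0
        fi
    done
    return "$status"
}

# Usage on the standalone node before a fresh join attempt:
#   leftover_cluster_state pve2 || echo "no leftover cluster files found"
```

On a node that is currently in a cluster the corosync files are supposed to be there, so this is only meaningful on a node you believe is standalone, or for spotting an old node directory on the remaining node.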
 
The docs say you have to delete node2 from the cluster and reinstall it.
 