Unable to join cluster

hacman · Jun 10, 2021

Hi all,

We're setting up a new cluster on Proxmox 6.4, but the second host fails to join it and we can't work out why.

For context: there are currently 2 nodes installed, both have all the latest no-sub updates applied, and the cluster was created successfully on the first node.

When we try to add the second node to the cluster, we get the error below:

Code:

Establishing API connection with host '10.100.2.1'
TASK ERROR: 500 Can't connect to 10.100.2.1:8006

The same is returned if we do this via the CLI:

Code:

root@NODE-2:~# pvecm add 10.100.2.1
Please enter superuser (root) password for '10.100.2.1': **********
Establishing API connection with host '10.100.2.1'
500 Can't connect to 10.100.2.1:8006
root@NODE-2:~#

Both nodes can reach each other on TCP port 8006:

Code:

root@NODE-2:~# telnet 10.100.2.1 8006
Trying 10.100.2.1...
Connected to 10.100.2.1.
Escape character is '^]'.
^CConnection closed by foreign host.
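
Note that a successful telnet only shows that the port accepts TCP connections. The API connection made by `pvecm add` also has to complete a TLS handshake on that same port. As a rough sketch, the two stages can be tested separately. The helper name `probe_api_port` and its return format are made up for illustration, and certificate verification is switched off on the assumption that the node still uses its default, not publicly trusted certificate:

```python
import socket
import ssl


def probe_api_port(host, port=8006, timeout=5.0):
    """Report separately whether a TCP connect and a TLS handshake succeed.

    Hypothetical helper for illustration; not part of any Proxmox tooling.
    """
    result = {"tcp": False, "tls": False, "error": None}

    # Stage 1: plain TCP connect, which is all that telnet checks.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        result["error"] = f"tcp: {exc}"
        return result
    result["tcp"] = True

    # Stage 2: TLS handshake on the same connection.
    # Verification is disabled on purpose (assumption: default node
    # certificate). This is a reachability test, not a security check.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            result["tls"] = True
            result["version"] = tls.version()
    except OSError as exc:  # covers ssl.SSLError and timeouts
        result["error"] = f"tls: {exc}"
        sock.close()
    return result
```

Calling `probe_api_port("10.100.2.1")` from NODE-2 and getting `tcp` true but `tls` false would suggest the problem sits above the plain TCP connection rather than in basic reachability.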

pvecm status gives the below on the node where the cluster was created:

Code:

root@NODE-1:~# pvecm status
Cluster information
-------------------
Name:             [REDACTED]
Config Version:   1
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jun 10 11:59:24 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.a
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.100.2.1 (local)

Package versions:

Code:

proxmox-ve: 6.4-1 (running kernel: 5.4.119-1-pve)
pve-manager: 6.4-8 (running version: 6.4-8/185e14db)
pve-kernel-5.4: 6.4-3
pve-kernel-helper: 6.4-3
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.9-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-6
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

This has genuinely got me stumped, as we have even reinstalled both nodes on the small chance that something was damaged / broken! Does anyone have any ideas?

Any help would be very much appreciated!

Thanks,

Jon

ph0x · Jun 10, 2021

Did you apply any hardening measures before? Can you ssh from one host to the other?

hacman · Jun 10, 2021

ph0x said:
Did you apply any hardening measures before? Can you ssh from one host to the other?

SSH from one host to the other works fine. No hardening or security measures applied - it's a fresh, out-of-the-box, just-updated install on both machines.

ph0x · Jun 10, 2021

Although this always worked out of the box for me, did you try with the option --use_ssh?

hacman · Jun 10, 2021

ph0x said:
Although this always worked out of the box for me, did you try with the option --use_ssh?

I've considered trying that, but was unsure about doing so as I believe once the nodes are clustered they will still need the API to be able to work effectively.

I will try it now.

hacman · Jun 10, 2021

hacman said:
I've considered trying that, but was unsure about doing so as I believe once the nodes are clustered they will still need the API to be able to work effectively.

I will try it now.

Ok, this had the effect I sort of expected.

They clustered fine, and each node now shows up in the other's web manager.

The join command returned:

Code:

root@NODE-2:~# pvecm add 10.100.2.1 --use_ssh
root@10.100.2.1's password:
No cluster network links passed explicitly, fallback to local node IP '10.100.2.2'
copy corosync auth key
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1623333504.sql.gz'
waiting for quorum...OK
(re)generate node files
generate new node certificate
merge authorized SSH keys and known hosts
generated new node certificate, restart pveproxy and pvedaemon services
successfully added node 'NODE-2' to cluster.

pvecm status returns:

Code:

root@NODE-2:~# pvecm status
Cluster information
-------------------
Name:             [REDACTED]
Config Version:   2
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jun 10 14:59:24 2021
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.e
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.100.2.1
0x00000002          1 10.100.2.2 (local)

But when I try to browse to the other node in each node's web manager, I just get the spinning "please wait" icon and can't view or do anything on the other node, before eventually getting "Connection timed out (596)".

Very strange.
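
For reference, the vote numbers in the two `pvecm status` outputs are consistent with a simple majority rule: the quorum is more than half of the expected votes. Below is a minimal sketch that parses the `Key: value` lines and checks this. The function names are hypothetical, and the majority formula assumes default votequorum behaviour with no special options such as `two_node` set:

```python
def parse_pvecm_status(text):
    """Collect the 'Key: value' lines of `pvecm status` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip headings, separators and membership rows (no colon or no value).
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def majority(expected_votes):
    """Smallest vote count that is more than half of expected_votes."""
    return expected_votes // 2 + 1


# Votequorum section copied from the NODE-2 output in this thread.
SAMPLE = """\
Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate
"""

fields = parse_pvecm_status(SAMPLE)
expected = int(fields["Expected votes"])
# With 2 expected votes the majority is 2, matching the reported Quorum.
print(majority(expected) == int(fields["Quorum"]))
```

One consequence under that assumption: with only two nodes, both votes are needed, so the cluster would lose quorum whenever either node is down.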

ph0x · Jun 10, 2021

Indeed ...
Any firewall active maybe?

hacman · Jun 10, 2021

ph0x said:
Indeed ...
Any firewall active maybe?

Hi,

Nope - I've not changed any of the settings, so it is disabled at the datacenter level. I've just disabled it at the node level too, to be sure, and it made no difference.

ph0x · Jun 10, 2021

Then I can only imagine some trouble with addresses and/or hostnames. Any duplicates in your network? It seems like you didn't define a separate network for corosync. You should probably do that, although I don't think that this is the cause here.

hacman · Jun 10, 2021

ph0x said:
Then I can only imagine some trouble with addresses and/or hostnames. Any duplicates in your network? It seems like you didn't define a separate network for corosync. You should probably do that, although I don't think that this is the cause here.

Hi,

There is a separate network, used only for accessing the management UI and for corosync. No duplicate addresses, no duplicate hostnames. There is no packet loss, and response times between the hosts are good.

This was the first thing I checked, and part of why this issue is so perplexing. There is nothing obvious that I can see wrong.

Thanks for the help!

ph0x · Jun 10, 2021

Don't think that I helped much, since this is equally confusing for me.
