Unable to join cluster

hacman

Hi all,

We're in the process of setting up a new cluster with Proxmox VE 6.4, but for some reason we cannot get the hosts to join the cluster.

For context, there are currently 2 nodes installed. Each has had all the latest no-subscription updates applied, and the cluster was created successfully on the first node.

When we try to add the second node to the cluster, we get the error below:
Code:
Establishing API connection with host '10.100.2.1'
TASK ERROR: 500 Can't connect to 10.100.2.1:8006

The same is returned if we do this via the CLI:
Code:
root@NODE-2:~# pvecm add 10.100.2.1
Please enter superuser (root) password for '10.100.2.1': **********
Establishing API connection with host '10.100.2.1'
500 Can't connect to 10.100.2.1:8006
root@NODE-2:~#

Both nodes can speak to each other on 8006:
Code:
root@NODE-2:~# telnet 10.100.2.1 8006
Trying 10.100.2.1...
Connected to 10.100.2.1.
Escape character is '^]'.
^CConnection closed by foreign host.
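
The telnet test above only proves that the TCP handshake on 8006 completes. It does not show whether a full HTTPS exchange gets through, which is what the join step needs. As a rough sketch (not something from the Proxmox tooling), a request with curl goes one step further, and its exit code hints at where things stop. The helper name `explain_curl_exit` is made up for illustration; the exit codes are curl's documented ones.

```shell
# Hypothetical helper: translate a curl exit code into a rough diagnosis.
explain_curl_exit() {
    case "$1" in
        0)  echo "HTTPS request completed" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "TCP connection failed" ;;
        28) echo "timed out before a complete reply arrived" ;;
        35) echo "TLS handshake failed" ;;
        *)  echo "other curl error ($1)" ;;
    esac
}

# Usage on NODE-2 (not run here). -k skips certificate verification,
# since a freshly installed node presents a self-signed certificate:
#   curl -sk --max-time 10 -o /dev/null https://10.100.2.1:8006/
#   explain_curl_exit $?
```

If telnet connects but this times out or fails in the TLS handshake, the problem sits above plain TCP reachability.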

pvecm status gives the below on the node where the cluster was created:
Code:
root@NODE-1:~# pvecm status
Cluster information
-------------------
Name:             [REDACTED]
Config Version:   1
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jun 10 11:59:24 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.a
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.100.2.1 (local)
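
For anyone reading the numbers above: the Quorum value follows the usual majority rule, more than half of the expected votes. A minimal sketch of that arithmetic, assuming default votequorum settings (no two_node or similar options, which change the result); the function name `quorum` is just for illustration:

```shell
# Majority rule: floor(expected_votes / 2) + 1.
# Assumes default votequorum behaviour with no special options set.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

quorum 1   # prints 1, matching "Expected votes: 1 / Quorum: 1" above
quorum 2   # prints 2: a two-node cluster needs both nodes to stay quorate
quorum 3   # prints 2
```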

Package versions:
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.119-1-pve)
pve-manager: 6.4-8 (running version: 6.4-8/185e14db)
pve-kernel-5.4: 6.4-3
pve-kernel-helper: 6.4-3
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.9-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-6
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

This has genuinely got me stumped, as we have even reinstalled both nodes on the small chance that something was damaged / broken! Does anyone have any ideas?

Any help would be very much appreciated!

Thanks,

Jon
 
I've considered trying that, but was unsure about doing so as I believe once the nodes are clustered they will still need the API to be able to work effectively.

I will try it now.
Ok, this had the effect I sort of expected.

They clustered fine, and now show up in each other's web manager.

The cluster join command returned:
Code:
root@NODE-2:~# pvecm add 10.100.2.1 --use_ssh
root@10.100.2.1's password:
No cluster network links passed explicitly, fallback to local node IP '10.100.2.2'
copy corosync auth key
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1623333504.sql.gz'
waiting for quorum...OK
(re)generate node files
generate new node certificate
merge authorized SSH keys and known hosts
generated new node certificate, restart pveproxy and pvedaemon services
successfully added node 'NODE-2' to cluster.

pvecm status returns:
Code:
root@NODE-2:~# pvecm status
Cluster information
-------------------
Name:             [REDACTED]
Config Version:   2
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jun 10 14:59:24 2021
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.e
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.100.2.1
0x00000002          1 10.100.2.2 (local)

But when I try to browse to the other node in each node's web manager, I just get the spinning "please wait" icon and can't view or do anything on the other node, before eventually getting "Connection timed out (596)".

Very strange.
 
Then I can only imagine some trouble with addresses and/or hostnames. Any duplicates in your network? It also seems like you didn't define a separate network for corosync. You should probably do that, although I don't think that is the cause here.
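
One quick way to look for duplicate names, as a rough sketch only: feed hosts-style lines ("<ip> <name> [alias...]") into a small awk filter that prints every name mapped to more than one address. The function name `find_duplicate_names` is made up for this example, and it only sees what is in the file you give it, not DNS or what is actually live on the wire.

```shell
# Hypothetical helper: read hosts(5)-style lines on stdin and print each
# hostname that appears with more than one distinct IP address.
find_duplicate_names() {
    awk '
        { sub(/#.*/, "") }      # drop comments
        NF < 2 { next }         # skip blank or incomplete lines
        {
            for (i = 2; i <= NF; i++) {
                key = $i SUBSEP $1
                if (!(key in seen)) {
                    seen[key] = 1
                    count[$i]++
                    ips[$i] = ips[$i] " " $1
                }
            }
        }
        END {
            for (name in count)
                if (count[name] > 1)
                    print name ":" ips[name]
        }
    '
}

# Usage (not run here), on each node:
#   find_duplicate_names < /etc/hosts
```

Loopback names such as localhost may show up if they are listed for both IPv4 and IPv6; those can be ignored.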
 
Hi,

There is a separate network, used only for accessing the management UI and for corosync. There are no duplicate addresses and no duplicate hostnames. There is no packet loss, and response times between the hosts are good.

This was the first thing I checked, and part of why this issue is so perplexing. There is nothing obvious that I can see wrong. :(

Thanks for the help!