[SOLVED] Unable to join 3rd node to cluster

gurgle

Hi, I have a 2-node cluster and I'm trying to add a third node, but I keep getting errors. From the new node the join appears to succeed, but I get the error below. If I try to join the cluster using the web UI, the page hangs after I enter the join details and then won't reload, or I get a Connection Error message. On the existing cluster, the new node is added, but I'm unable to select any of its sections (Summary, Nodes, Shell, etc.) without a Communication Failure error.


pve.jpg

This is with the Proxmox 6.2 ISO, and it persists whether or not I update the new server before attempting to join the cluster. Even though the new node can't be accessed from the cluster, the cluster itself becomes extremely unstable after the join. In the web UI, nodes drop in and out and the page takes ages to load or do anything.

Can anyone help please?
 
hi,

are the first 2 nodes also running the latest version?

can you post the outputs of:
Code:
cat /etc/hosts # on all nodes
pvecm status # on all nodes
journalctl -xe # on last node
cat /etc/pve/corosync.conf # on all nodes
 
Hey thanks for the reply. Details below.

Bash:
root@host1:~# pveversion
pve-manager/6.2-4/9824574a (running kernel: 5.4.41-1-pve)

root@host1:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.0.175 host1.domain host1
192.168.0.170 host3.domain host3    # this i added manually while troubleshooting

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

root@host1:~# pvecm status
Cluster information
-------------------
Name:             constellation
Config Version:   15
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Jun 10 12:29:52 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.1417
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2     # this i changed from 3 to 2 while troubleshooting
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.175 (local)
0x00000002          1 192.168.0.192

root@host1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: host3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.0.170
  }
  node {
    name: host1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.175
  }
  node {
    name: host2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.192
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: constellation
  config_version: 15
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@host2:~# pveversion
pve-manager/6.2-4/9824574a (running kernel: 5.4.41-1-pve)

root@host2:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.0.192 host2.domain host2

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

root@host2:~# pvecm status
Cluster information
-------------------
Name:             constellation
Config Version:   15
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Jun 10 12:28:49 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.1417
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.175
0x00000002          1 192.168.0.192 (local)

root@host2:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: host3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.0.170
  }
  node {
    name: host1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.175
  }
  node {
    name: host2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.192
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: constellation
  config_version: 15
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@host3:~# pveversion
pve-manager/6.2-4/9824574a (running kernel: 5.4.34-1-pve)

root@host3:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.0.170 host3.domain host3

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

root@host3:~# pvecm status
Cluster information
-------------------
Name:             constellation
Config Version:   15
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Jun 10 12:56:30 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          1.141c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.175
0x00000002          1 192.168.0.192
0x00000003          1 192.168.0.170 (local)

root@host3:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: host3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.0.170
  }
  node {
    name: host1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.175
  }
  node {
    name: host2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.192
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: constellation
  config_version: 15
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
Code:
root@host3:~# journalctl -xe | tail -n 100
Jun 10 13:11:21 host3 pveproxy[1215]: worker 5108 finished
Jun 10 13:11:21 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:21 host3 pveproxy[1215]: worker 5111 started
Jun 10 13:11:21 host3 pveproxy[5110]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:21 host3 pveproxy[5111]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:21 host3 pveproxy[5109]: worker exit
Jun 10 13:11:21 host3 pveproxy[1215]: worker 5109 finished
Jun 10 13:11:21 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:21 host3 pveproxy[1215]: worker 5112 started
Jun 10 13:11:21 host3 pveproxy[5112]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:26 host3 pveproxy[5110]: worker exit
Jun 10 13:11:26 host3 pveproxy[5111]: worker exit
Jun 10 13:11:26 host3 pveproxy[1215]: worker 5110 finished
Jun 10 13:11:26 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:26 host3 pveproxy[1215]: worker 5134 started
Jun 10 13:11:26 host3 pveproxy[1215]: worker 5111 finished
Jun 10 13:11:26 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:26 host3 pveproxy[1215]: worker 5135 started
Jun 10 13:11:26 host3 pveproxy[5134]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:26 host3 pveproxy[5135]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:26 host3 pveproxy[5112]: worker exit
Jun 10 13:11:26 host3 pveproxy[1215]: worker 5112 finished
Jun 10 13:11:26 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:26 host3 pveproxy[1215]: worker 5136 started
Jun 10 13:11:26 host3 pveproxy[5136]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:31 host3 pveproxy[5134]: worker exit
Jun 10 13:11:31 host3 pveproxy[5135]: worker exit
Jun 10 13:11:31 host3 pveproxy[1215]: worker 5134 finished
Jun 10 13:11:31 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:31 host3 pveproxy[1215]: worker 5137 started
Jun 10 13:11:31 host3 pveproxy[1215]: worker 5135 finished
Jun 10 13:11:31 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:31 host3 pveproxy[1215]: worker 5138 started
Jun 10 13:11:31 host3 pveproxy[5137]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:31 host3 pveproxy[5138]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:31 host3 pveproxy[5136]: worker exit
Jun 10 13:11:31 host3 pveproxy[1215]: worker 5136 finished
Jun 10 13:11:31 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:31 host3 pveproxy[1215]: worker 5139 started
Jun 10 13:11:31 host3 pveproxy[5139]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:36 host3 pveproxy[5137]: worker exit
Jun 10 13:11:36 host3 pveproxy[5138]: worker exit
Jun 10 13:11:36 host3 pveproxy[1215]: worker 5137 finished
Jun 10 13:11:36 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:36 host3 pveproxy[1215]: worker 5161 started
Jun 10 13:11:36 host3 pveproxy[1215]: worker 5138 finished
Jun 10 13:11:36 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:36 host3 pveproxy[1215]: worker 5162 started
Jun 10 13:11:36 host3 pveproxy[5161]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:36 host3 pveproxy[5162]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:36 host3 pveproxy[5139]: worker exit
Jun 10 13:11:36 host3 pveproxy[1215]: worker 5139 finished
Jun 10 13:11:36 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:36 host3 pveproxy[1215]: worker 5163 started
Jun 10 13:11:36 host3 pveproxy[5163]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:41 host3 pveproxy[5161]: worker exit
Jun 10 13:11:41 host3 pveproxy[5162]: worker exit
Jun 10 13:11:41 host3 pveproxy[1215]: worker 5161 finished
Jun 10 13:11:41 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:41 host3 pveproxy[1215]: worker 5170 started
Jun 10 13:11:41 host3 pveproxy[1215]: worker 5162 finished
Jun 10 13:11:41 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:41 host3 pveproxy[1215]: worker 5171 started
Jun 10 13:11:41 host3 pveproxy[5170]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:41 host3 pveproxy[5171]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:41 host3 pveproxy[5163]: worker exit
Jun 10 13:11:41 host3 pveproxy[1215]: worker 5163 finished
Jun 10 13:11:41 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:41 host3 pveproxy[1215]: worker 5172 started
Jun 10 13:11:41 host3 pveproxy[5172]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:46 host3 pveproxy[5170]: worker exit
Jun 10 13:11:46 host3 pveproxy[5171]: worker exit
Jun 10 13:11:46 host3 pveproxy[1215]: worker 5170 finished
Jun 10 13:11:46 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:46 host3 pveproxy[1215]: worker 5194 started
Jun 10 13:11:46 host3 pveproxy[1215]: worker 5171 finished
Jun 10 13:11:46 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:46 host3 pveproxy[1215]: worker 5195 started
Jun 10 13:11:46 host3 pveproxy[5194]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:46 host3 pveproxy[5195]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:46 host3 pveproxy[5172]: worker exit
Jun 10 13:11:46 host3 pveproxy[1215]: worker 5172 finished
Jun 10 13:11:46 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:46 host3 pveproxy[1215]: worker 5196 started
Jun 10 13:11:46 host3 pveproxy[5196]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:51 host3 pveproxy[5194]: worker exit
Jun 10 13:11:51 host3 pveproxy[5195]: worker exit
Jun 10 13:11:51 host3 pveproxy[1215]: worker 5194 finished
Jun 10 13:11:51 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:51 host3 pveproxy[1215]: worker 5199 started
Jun 10 13:11:51 host3 pveproxy[1215]: worker 5195 finished
Jun 10 13:11:51 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:51 host3 pveproxy[1215]: worker 5200 started
Jun 10 13:11:51 host3 pveproxy[5199]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:51 host3 pveproxy[5200]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jun 10 13:11:51 host3 pveproxy[5196]: worker exit
Jun 10 13:11:51 host3 pveproxy[1215]: worker 5196 finished
Jun 10 13:11:51 host3 pveproxy[1215]: starting 1 worker(s)
Jun 10 13:11:51 host3 pveproxy[1215]: worker 5201 started
Jun 10 13:11:51 host3 pveproxy[5201]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
 
can you ssh from, let's say, host1 -> host2 without any user interaction (simply running ssh host2 from host1)? in your first screenshot it seems like the authorized_keys file wasn't created correctly, which can cause issues like this.

you can first try to remove host3, make sure you can ssh between host1 and host2, and then retry adding host3 to the cluster. [0]

[0]: https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
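
A non-interactive check along these lines could confirm the passwordless SSH part. This is only a sketch: `check_ssh` and `report_ssh` are hypothetical helper names, and the host names are the ones used in this thread. `BatchMode=yes` makes ssh fail instead of prompting for a password or host-key confirmation, which is the "without any user interaction" condition described above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if ssh to "$1" works with no prompts.
# BatchMode=yes disables interactive prompts; ConnectTimeout avoids long hangs.
check_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Prints one result line per target host given as an argument.
report_ssh() {
    for host in "$@"; do
        if check_ssh "$host"; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}
```

For example, run `report_ssh host2` on host1 and `report_ssh host1` on host2. A FAILED line would point at the authorized_keys or known_hosts setup rather than at corosync.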
 
I can ssh between host1 and host2 as you described.

Removing and re-adding failed:

Bash:
root@host3:~# pvecm add host1.domain
Please enter superuser (root) password for 'host1.domain': ********
detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* cluster config '/etc/pve/corosync.conf' already exists
* this host already contains virtual guests
* corosync is already running, is this node already in a cluster?!
Check if node may join a cluster failed!
root@host3:~#
 
Oh thanks. I followed those steps but it still failed.

Bash:
root@host3:~# pvecm add host1.domain
Please enter superuser (root) password for 'host1.domain': ********
detected the following error(s):
* this host already contains virtual guests
Check if node may join a cluster failed!
root@host3:~#
 
* this host already contains virtual guests

the machine needs to be clean when adding to a cluster (no VM/CT).

you can do the following:
1. take backups of guests on host3
2. destroy the guests
3. now you should be able to add back into the cluster
4. restore backups after rejoining cluster
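
The four steps above could look roughly like this for a single VM. It is a dry-run sketch that only prints commands: `plan_vm_rejoin` is a hypothetical helper, the VM ID and backup directory are placeholders, and the archive name is left open because vzdump generates it with a timestamp. Containers would need `pct` instead of `qm`/`qmrestore`.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for one VM instead of running them.
# $1 = VM ID, $2 = backup directory (both placeholders chosen by the caller).
plan_vm_rejoin() {
    vmid=$1
    dumpdir=$2
    echo "vzdump $vmid --dumpdir $dumpdir --mode stop"
    echo "qm destroy $vmid"
    echo "# join the cluster here, then restore from the archive vzdump created:"
    echo "qmrestore $dumpdir/<archive-for-$vmid> $vmid"
}
```

`plan_vm_rejoin 100 /mnt/backup` prints the sequence for a hypothetical VM 100. The backup directory is assumed to be somewhere that survives the join, and the printed commands should be reviewed before running any of them by hand.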
 
Hm, it shouldn't have anything on it; it's a brand new host. The web UI fails to load, so I can't check there.

Bash:
root@host3:~# ls /etc/pve/qemu-server
root@host3:~# ls /etc/pve/lxc
root@host3:~# qm list
root@host3:~# pct list
root@host3:~#
 
Any more ideas on this? I reinstalled again and ended up with the same issue. I've tried about half a dozen times now.
 
I reinstalled again and ended up with the same issue. I've tried about half a dozen times now.

if you've reinstalled the 3rd node completely and it still doesn't work, then there's a chance that your remaining two nodes are misconfigured in some way...

are you seeing anything else in the logs?
 
Logs aren't showing anything obvious. Syslog has this every minute, but I'm led to believe it's normal.
Bash:
Jun 23 14:17:00 host1 systemd[1]: Starting Proxmox VE replication runner...
Jun 23 14:17:00 host1 systemd[1]: pvesr.service: Succeeded.
Jun 23 14:17:00 host1 systemd[1]: Started Proxmox VE replication runner.
 
I should add - the original two nodes (host1 and host2) are only about a month old. They were installed from the same media I used for host3.
 
Syslog has this every minute but i'm led to believe its normal.
yes, it's normal.

so just to be clear can you verify the following:
* you have reinstalled host3
* it's totally empty
* cluster is functioning with host1 and host2



when you try joining now, do you get the first error message in the original post, or the 'node has guests' message?
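
To verify the "cluster is functioning" item quickly on each node, something like the following could summarize `pvecm status`. It is a sketch only: `summarize_status` is a hypothetical helper, and the field names ("Quorate:", "Nodes:") are taken from the status output posted earlier in this thread.

```shell
#!/bin/sh
# Reads `pvecm status` output on stdin and prints "quorate=<value> nodes=<n>".
# Only lines starting with "Quorate:" and "Nodes:" are used, so the
# "Flags: Quorate" and "Node ID:" lines are ignored.
summarize_status() {
    awk -F': *' '
        /^Quorate:/ { q = $2 }
        /^Nodes:/   { n = $2 }
        END { printf "quorate=%s nodes=%s\n", q, n }
    '
}
```

Used as `pvecm status | summarize_status`. With the host1 and host2 output shown earlier, this would print `quorate=Yes nodes=2`.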
 
Ok, I reinstalled host3 one more time for funsies. It has just booted, with no other changes. host1 and host2 are running as normal. I just updated both to the latest non-subscription packages and rebooted.

Is there anything I should do before attempting to join the cluster?
 
Sigh. No luck. This is as far as I got; nothing further happened after this.

Bash:
root@host3:~# pvecm add host1.domain
Please enter superuser (root) password for 'host1.domain': ********
Establishing API connection with host 'host1.domain'
The authenticity of host 'host1.domain' can't be established.
X509 SHA256 key fingerprint is ##############      # <edited
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '192.168.0.170'
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1592910830.sql.gz'
waiting for quorum...OK

I can't establish another SSH session with the box. On the host's terminal screen I got these info lines without even having logged in (hostname censored):

login.jpg

After logging in, the syslog has all sorts of weirdness (hostname censored again):

syslog.jpg


As I mentioned, my other boxes basically hang unless I shut down host3. After doing that, on host1 I find that the logs are spammed with this:
Code:
Jun 23 21:27:26 host3 corosync[1085]:   [TOTEM ] Retransmit List: 12 13 15 16 18 19 1b 1c 1e 1f 21 22 34

In the web UI, host3 appears as a server, but when it is powered on I just get Communication Failure errors when attempting to access anything on it.

:(
 
could you check /etc/corosync/corosync.conf and /etc/pve/corosync.conf on each node?

the contents of these files need to be the same in all locations
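
One way to run that comparison on a node is sketched below. `compare_conf` and `conf_version` are hypothetical helper names; the two paths are the ones mentioned above, and `config_version` is the field from the totem section posted earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper: compares two config files byte-for-byte.
# Prints "match" or "DIFFER"; returns 0 on match, 1 otherwise.
compare_conf() {
    if cmp -s "$1" "$2"; then
        echo "match: $1 $2"
    else
        echo "DIFFER: $1 $2"
        return 1
    fi
}

# Prints the config_version value from a corosync.conf-style file.
conf_version() {
    awk '$1 == "config_version:" { print $2 }' "$1"
}
```

On each node, `compare_conf /etc/corosync/corosync.conf /etc/pve/corosync.conf` checks the local pair, and `conf_version /etc/pve/corosync.conf` should print the same number everywhere (15 in the outputs posted above). Comparing across nodes still means collecting the results from each host, for example by running `sha256sum` on both files and comparing the hashes by eye.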
 
