[SOLVED] Failed to join a Node to an existing Cluster

MasterTH · Feb 4, 2021

Hi,

Today I tried to join 3 nodes to my cluster, which has existed since 2017.
It is on the current version, 6.3.3, and everything was fine.
After installing the first node and adding it, I got strange issues with the cluster itself: it lost the connections between the nodes back and forth. So I restarted some services (pvestatd, pvesr, pve-cluster and corosync) and it went back to normal. The node joined and everything was fine.
Second node - same thing...

Third node - same thing, but this time an error was shown:
end task UPID

ve5-9:0000280D:0005F92F:601BC08C:clusterjoin::root@pam: unable to generate pve ssl certificate: command 'faketime yesterday openssl x509 -req -in /tmp/pvecertreq-10253.tmp -days 730 -out /etc/pve/nodes/pve5-9/pve-ssl.pem -CAkey /etc/pve/priv/pve-root-ca.key -CA /etc/pve/pve-root-ca.pem -CAserial /etc/pve/priv/pve-root-ca.srl -extfile /tmp/pvesslconf-10253.tmp' failed: exit code 1"},

The node has joined the cluster, but I cannot manage it from another host. How can I regenerate the certificates for this node in the cluster?

Another strange thing: the Cluster tab, where the join information is normally displayed, now tells me that I don't have a cluster.

oguz · Feb 4, 2021

hi,

was your cluster updated to the latest version before you tried to add the new nodes?

pve5 and pve6 have some differences in corosync so they shouldn't be in the same cluster.

to verify you can run pveversion -v on all your nodes and compare the outputs.
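
One way to compare the outputs without reading them side by side is to save each node's `pveversion -v` output to a file and diff the files. This is only a sketch: `compare_versions` is a made-up helper name, and the node addresses in the usage comments are placeholders.

```shell
#!/bin/sh
# compare_versions is a hypothetical helper, not part of Proxmox.
# It prints the lines that differ between two saved `pveversion -v`
# outputs and prints nothing when both nodes report identical packages.
compare_versions() {
    diff "$1" "$2"
}

# Usage (placeholder addresses, adjust to your nodes):
# ssh root@192.0.2.11 pveversion -v > /tmp/node1.txt
# ssh root@192.0.2.12 pveversion -v > /tmp/node2.txt
# compare_versions /tmp/node1.txt /tmp/node2.txt
```

Differences in the pve-kernel lines are usually harmless; the corosync and pve-cluster lines are the ones worth checking first.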

MasterTH · Feb 4, 2021

hi oguz,

thank you for your reply.

pve5 is just the name of the host. I started this cluster in 2017; in the meantime we have moved up to PVE 6, but the hostnames didn't change.

Code:

root@pve5-2:/etc/pve/nodes# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.3.18-3-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-17
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.15.18-28-pve: 4.15.18-56
pve-kernel-4.15.18-27-pve: 4.15.18-55
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.10.17-3-pve: 4.10.17-23
pve-kernel-4.10.15-1-pve: 4.10.15-15
ceph: 14.2.16-pve1
ceph-fuse: 14.2.16-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

oguz · Feb 4, 2021

MasterTH said:
proxmox-ve: 6.3-1 (running kernel: 5.3.18-3-pve)

might be unrelated but it looks like you've made kernel upgrades and didn't reboot? (you have a newer kernel installed but it's not running)

maybe rebooting the nodes will fix some issues. can you try that?
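
A quick way to check for a pending reboot is to compare the running kernel with the newest installed one. The sketch below assumes GNU `sort -V` and that installed kernels show up as `/boot/vmlinuz-*` (the usual Debian layout); `newest_version` is a hypothetical helper name.

```shell
#!/bin/sh
# newest_version is a hypothetical helper: it reads kernel version
# strings on stdin and prints the highest one (GNU sort -V ordering).
newest_version() {
    sort -V | tail -n 1
}

# Usage on a node (the /boot layout is an assumption):
# newest=$(ls /boot/vmlinuz-* | sed 's|.*/vmlinuz-||' | newest_version)
# [ "$(uname -r)" = "$newest" ] || echo "running $(uname -r), newest installed is $newest"
```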

MasterTH · Feb 4, 2021

Uh - rebooting isn't possible right now. I'm adding these new nodes precisely so that I can move VMs around.
But do you think this is what causes the certificate issue? Once that is fixed, I'll move VMs around and reboot the nodes one by one.

oguz · Feb 4, 2021

MasterTH said:
but do you think this is the issue with the certificate?

did you do any manual changes to the certificate, like certbot etc. ?

can you ssh from one node to another; ssh root@ip.of.another.node should work without any interaction.

if so you can run pvecm updatecerts -f to see if it can regenerate the certificates.
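
To test the "without any interaction" part explicitly, ssh can be told to fail rather than prompt. A minimal sketch, assuming root login by key as is usual between cluster nodes; `check_ssh` is a made-up helper name and the addresses are placeholders.

```shell
#!/bin/sh
# check_ssh is a hypothetical helper, not a Proxmox tool.
# BatchMode=yes makes ssh fail instead of prompting for a password or a
# host-key confirmation; ConnectTimeout bounds the wait for dead hosts.
check_ssh() {
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true; then
        echo "$1: ok"
    else
        echo "$1: needs interaction or unreachable"
    fi
}

# Usage (placeholder addresses):
# for ip in 192.0.2.11 192.0.2.12; do check_ssh "$ip"; done
```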

MasterTH · Feb 4, 2021

> did you do any manual changes to the certificate, like certbot etc. ?
nope

> can you ssh from one node to another; ssh root@ip.of.another.node should work without any interaction.
ssh works fine, yes

> if so you can run pvecm updatecerts -f to see if it can regenerate the certificates.
only on that node that has failed?

oguz · Feb 4, 2021

updatecerts should update the certificate for all nodes

MasterTH · Feb 4, 2021

Code:

root@pve5-9:/etc/pve# pvecm updatecerts -f
(re)generate node files
generate new node certificate
unable to load Private Key
139705435645056:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: ANY PRIVATE KEY
unable to generate pve certificate request:
command 'openssl req -batch -new -config /tmp/pvesslconf-26507.tmp -key /etc/pve/nodes/pve5-9/pve-ssl.key -out /tmp/pvecertreq-26507.tmp' failed: exit code 1

I ran it on the node where the join went wrong.

Stoiko Ivanov · Feb 4, 2021

Is the cluster-filesystem mounted?

Code:

mount |grep '/etc/pve'

does the key-file exist?

Code:

stat /etc/pve/nodes/pve5-9/pve-ssl.key

what's the output of:

Code:

pvesh get /cluster/config/join

pvecm status

MasterTH · Feb 4, 2021

hi Stoiko Ivanov,

thank you for your reply:

Code:

root@pve5-9:/etc/pve# mount |grep '/etc/pve'
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

root@pve5-9:/etc/pve# stat /etc/pve/nodes/pve5-9/pve-ssl.key
  File: /etc/pve/nodes/pve5-9/pve-ssl.key
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: 39h/57d Inode: 21          Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (   33/www-data)
Access: 2021-02-04 10:43:29.000000000 +0100
Modify: 2021-02-04 10:43:29.000000000 +0100
Change: 2021-02-04 10:43:29.000000000 +0100
 Birth: -

root@pve5-9:/etc/pve# pvesh get /cluster/config/join
'/etc/pve/nodes/pve5-9/pve-ssl.pem' does not exist!

root@pve5-9:/etc/pve# pvecm status
Cluster information
-------------------
Name:             pve5
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Feb  4 16:02:33 2021
Quorum provider:  corosync_votequorum
Nodes:            9
Node ID:          0x00000009
Ring ID:          1.3d1a
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   9
Highest expected: 9
Total votes:      9
Quorum:           5
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 xx.151
0x00000002          1 xx.152
0x00000003          1 xx.153
0x00000004          1 xx.154
0x00000005          1 xx.155
0x00000006          1 xx.156
0x00000007          1 xx.157
0x00000008          1 xx.158
0x00000009          1 xx.159 (local)

Stoiko Ivanov · Feb 4, 2021

hmm - guess I can see why the updatecert generates an error:

MasterTH said:
root@pve5-9:/etc/pve# stat /etc/pve/nodes/pve5-9/pve-ssl.key File: /etc/pve/nodes/pve5-9/pve-ssl.key Size: 0 Blocks: 0 IO Block: 4096 regular empty file

the file is empty, but still exists...
the key is only generated if the file does not exist (which is not the case here), yet the certificate generation does not work with an empty key-file...

try moving the file out of the way and then regenerating the key+cert:

Code:

mv /etc/pve/nodes/pve5-9/pve-ssl.key /root/pve-ssl.key.bck
pvecm updatecerts --force

(you most likely could also simply delete it since it looks empty anyways - but I'm always a bit hesitant to delete files)
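
For anyone repeating this, it may be worth confirming that the key file really is empty before moving it. A small sketch under that assumption; `key_state` is a hypothetical helper name, and the path in the usage comments is the one from this thread.

```shell
#!/bin/sh
# key_state is a hypothetical helper: it reports whether a key file is
# missing, empty (exists with size 0), or present (has content).
key_state() {
    if [ ! -e "$1" ]; then
        echo missing
    elif [ ! -s "$1" ]; then
        echo empty
    else
        echo present
    fi
}

# Usage on the affected node:
# key=/etc/pve/nodes/pve5-9/pve-ssl.key
# if [ "$(key_state "$key")" = empty ]; then
#     mv "$key" /root/pve-ssl.key.bck && pvecm updatecerts --force
# fi
```

A non-empty key is left alone this way, so a working node is not touched by accident.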

I hope this helps!

It's still curious how the system ended up with an empty key file.

MasterTH · Feb 4, 2021

you guys are the best.

thank you

Code:

root@pve5-9:/etc/pve# mv /etc/pve/nodes/pve5-9/pve-ssl.key /root/pve-ssl.key.bck
root@pve5-9:/etc/pve# pvecm updatecerts -f
(re)generate node files
generate new node certificate
merge authorized SSH keys and known hosts

Stoiko Ivanov · Feb 4, 2021

Glad that worked - if possible could you still share the logs of the first join-attempt of the node? (maybe we can spot how this situation happened)

N0REAVER · Aug 10, 2021

Hi,
Today I added 4 new identical nodes to an existing cluster, but on the last one everything hung. I followed the same process each time: update and upgrade, reboot, join. The last node failed to join completely (invalid PVE ticket, 401): the config was copied onto it, but the node was not present in the cluster configs of the other nodes (corosync.conf, for example), even though it appeared in the datacenter list.
The whole cluster was brought to its knees while the node was online. Once I stopped networking, everything started to work again.
I tried to delete the node, but rejoining was impossible because the join detected that corosync.conf already contained the cluster join info. I was unable to remove the file (read-only, even for root).
The only difference was that Ceph was already installed and configured on the cluster but not installed on the last node.
I reinstalled the node, installed Ceph, rejoined, and all was OK ... for now.
PVE 7.0-1
