[SOLVED] Failed to join a Node to an existing Cluster

MasterTH

Hi,

Today I tried to join 3 nodes to my cluster, which has existed since 2017.
It is on the current version, 6.3.3, and everything was fine.
After installing the first node and adding it, I got strange issues with the cluster itself: it kept losing the connections between the nodes. So I restarted some services (pvestatd, pvesr, pve-cluster and corosync) and it went back to normal. The node joined and everything was fine.
Second node - same thing...

Third node - same thing, but this time an error was shown:
end task UPID:pve5-9:0000280D:0005F92F:601BC08C:clusterjoin::root@pam: unable to generate pve ssl certificate: command 'faketime yesterday openssl x509 -req -in /tmp/pvecertreq-10253.tmp -days 730 -out /etc/pve/nodes/pve5-9/pve-ssl.pem -CAkey /etc/pve/priv/pve-root-ca.key -CA /etc/pve/pve-root-ca.pem -CAserial /etc/pve/priv/pve-root-ca.srl -extfile /tmp/pvesslconf-10253.tmp' failed: exit code 1"},

The node has joined the cluster, but I cannot manage it from another host. How can I regenerate the certificates for this node in the cluster?


And another strange thing now: the Cluster tab, where I can display the join information, tells me that I don't have a cluster.
 

Attachment: 2021-02-04 13_28_10-pve5-2 - Proxmox Virtual Environment.png
hi,

was your cluster updated to the latest version before you tried to add the new nodes?

pve5 and pve6 ship different corosync major versions (2.x and 3.x), so they shouldn't be in the same cluster.

to verify you can run pveversion -v on all your nodes and compare the outputs.
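
A minimal sketch of that comparison, assuming the `pveversion -v` output of each node has first been saved to a file (for example with `ssh root@NODE pveversion -v > NODE.txt`). The file names and the helpers `pkg_version` and `same_version` are hypothetical, not part of Proxmox VE.

```shell
#!/bin/sh
# pkg_version FILE PACKAGE
# Print what a saved `pveversion -v` output lists for PACKAGE.
pkg_version() {
    # Lines look like "corosync: 3.0.4-pve1"; match on the package name,
    # then strip everything up to and including the first ": ".
    awk -F': ' -v pkg="$2" '$1 == pkg { sub(/^[^:]*: /, ""); print; exit }' "$1"
}

# same_version PACKAGE FILE...
# Return 0 if every file reports the same version for PACKAGE, 1 otherwise.
same_version() {
    pkg=$1; shift
    first=$(pkg_version "$1" "$pkg")
    for f in "$@"; do
        [ "$(pkg_version "$f" "$pkg")" = "$first" ] || return 1
    done
    return 0
}

# Example (not executed here):
#   same_version corosync pve5-2.txt pve5-9.txt || echo "corosync versions differ"
```

Checking `corosync`, `pve-cluster` and `pve-manager` this way is usually quicker than reading the full outputs side by side.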
 
hi oguz,

thank you for your reply.

pve5 is just the name of the host. I started this cluster in 2017; in the meantime we're up to pve6, but the hostname didn't change :D


Code:
root@pve5-2:/etc/pve/nodes# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.3.18-3-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-17
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.15.18-28-pve: 4.15.18-56
pve-kernel-4.15.18-27-pve: 4.15.18-55
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.10.17-3-pve: 4.10.17-23
pve-kernel-4.10.15-1-pve: 4.10.15-15
ceph: 14.2.16-pve1
ceph-fuse: 14.2.16-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 
> proxmox-ve: 6.3-1 (running kernel: 5.3.18-3-pve)
might be unrelated but it looks like you've made kernel upgrades and didn't reboot? (you have a newer kernel installed but it's not running)

maybe rebooting the nodes will fix some issues. can you try that?
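
The running-versus-installed kernel check can be sketched like this. It assumes the `pveversion -v` output format shown in this thread and GNU `sort -V` (present on Debian-based systems); the function names are hypothetical.

```shell
#!/bin/sh
# running_kernel: read `pveversion -v` output on stdin, print the running kernel.
running_kernel() {
    sed -n 's/^proxmox-ve: .*(running kernel: \(.*\))$/\1/p'
}

# newest_kernel: read `pveversion -v` output on stdin, print the newest
# installed kernel, taken from lines like "pve-kernel-5.4.78-2-pve: 5.4.78-2".
newest_kernel() {
    sed -n 's/^pve-kernel-\([0-9][0-9.]*-[0-9]*-pve\): .*/\1/p' | sort -V | tail -n 1
}

# Example (not executed here):
#   out=$(pveversion -v)
#   [ "$(printf '%s\n' "$out" | running_kernel)" = "$(printf '%s\n' "$out" | newest_kernel)" ] \
#       || echo "a newer kernel is installed but not running"
```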
 
uh - rebooting isn't possible right now. I'm adding these new nodes to be able to move VMs around.
but do you think this is the cause of the certificate issue? Once that is fixed, I'll move VMs around so I can reboot the nodes individually.
 
> but do you think this is the issue with the certificate?
did you do any manual changes to the certificate, like certbot etc. ?

can you ssh from one node to another; ssh root@ip.of.another.node should work without any interaction.

if so you can run pvecm updatecerts -f to see if it can regenerate the certificates.
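
The "without any interaction" part can be tested from a script. A rough sketch, assuming OpenSSH's `BatchMode` and `ConnectTimeout` options; the helper `unreachable_nodes` and the `SSH` override variable are hypothetical and only there to make the check easy to dry-run.

```shell
#!/bin/sh
# Command used to reach other nodes; overridable so the logic can be dry-run.
SSH=${SSH:-ssh}

# unreachable_nodes HOST...
# Print every host that does not accept a non-interactive root login.
unreachable_nodes() {
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password
        # or a host-key confirmation.
        $SSH -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true </dev/null \
            || echo "$host"
    done
}

# Example (not executed here), with placeholder addresses:
#   unreachable_nodes ip.of.node1 ip.of.node2
```

An empty result means every listed node accepted the key-based login.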
 
> did you do any manual changes to the certificate, like certbot etc. ?
nope

> can you ssh from one node to another; ssh root@ip.of.another.node should work without any interaction.
ssh works fine, yes

> if so you can run pvecm updatecerts -f to see if it can regenerate the certificates.
only on that node that has failed?
 
updatecerts should update the certificate for all nodes
 
Code:
root@pve5-9:/etc/pve# pvecm updatecerts -f
(re)generate node files
generate new node certificate
unable to load Private Key
139705435645056:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: ANY PRIVATE KEY
unable to generate pve certificate request:
command 'openssl req -batch -new -config /tmp/pvesslconf-26507.tmp -key /etc/pve/nodes/pve5-9/pve-ssl.key -out /tmp/pvecertreq-26507.tmp' failed: exit code 1

I ran it on the node where the join went wrong.
 
Is the cluster-filesystem mounted?
Code:
mount |grep '/etc/pve'

does the key-file exist?
Code:
stat /etc/pve/nodes/pve5-9/pve-ssl.key

what's the output of:
Code:
pvesh get /cluster/config/join

pvecm status
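
The key-file check can also be scripted. A minimal sketch: `key_state` is a hypothetical helper, not a Proxmox tool, and it only looks for a PEM header rather than validating the key.

```shell
#!/bin/sh
# key_state FILE
# Classify a private key file: prints "missing", "empty", "not-pem" or "ok".
key_state() {
    if [ ! -e "$1" ]; then
        echo missing
    elif [ ! -s "$1" ]; then
        # The file exists but is zero bytes long.
        echo empty
    elif ! grep -q -e '-----BEGIN .*PRIVATE KEY-----' "$1"; then
        echo not-pem
    else
        echo ok
    fi
}

# Example (not executed here):
#   key_state /etc/pve/nodes/pve5-9/pve-ssl.key
```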
 
hi stokio invano,

thank you for your reply:

Code:
root@pve5-9:/etc/pve# mount |grep '/etc/pve'
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

root@pve5-9:/etc/pve# stat /etc/pve/nodes/pve5-9/pve-ssl.key
  File: /etc/pve/nodes/pve5-9/pve-ssl.key
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: 39h/57d Inode: 21          Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (   33/www-data)
Access: 2021-02-04 10:43:29.000000000 +0100
Modify: 2021-02-04 10:43:29.000000000 +0100
Change: 2021-02-04 10:43:29.000000000 +0100
 Birth: -

root@pve5-9:/etc/pve# pvesh get /cluster/config/join
'/etc/pve/nodes/pve5-9/pve-ssl.pem' does not exist!

root@pve5-9:/etc/pve# pvecm status
Cluster information
-------------------
Name:             pve5
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Feb  4 16:02:33 2021
Quorum provider:  corosync_votequorum
Nodes:            9
Node ID:          0x00000009
Ring ID:          1.3d1a
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   9
Highest expected: 9
Total votes:      9
Quorum:           5
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 xx.151
0x00000002          1 xx.152
0x00000003          1 xx.153
0x00000004          1 xx.154
0x00000005          1 xx.155
0x00000006          1 xx.156
0x00000007          1 xx.157
0x00000008          1 xx.158
0x00000009          1 xx.159 (local)
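
Since /etc/pve is read-only on a node without quorum, it can be worth checking the `Quorate` line before touching certificates. A small sketch, assuming the `pvecm status` output format shown above; `is_quorate` is a hypothetical helper.

```shell
#!/bin/sh
# is_quorate: read `pvecm status` output on stdin; succeed if it reports quorum.
is_quorate() {
    grep -q '^Quorate:[[:space:]]*Yes'
}

# Example (not executed here):
#   pvecm status | is_quorate || echo "no quorum, /etc/pve will be read-only"
```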
 
hmm - guess I can see why updatecerts generates an error:
> root@pve5-9:/etc/pve# stat /etc/pve/nodes/pve5-9/pve-ssl.key
> File: /etc/pve/nodes/pve5-9/pve-ssl.key
> Size: 0 Blocks: 0 IO Block: 4096 regular empty file
the file is empty, but still exists...
the key is only generated if the file does not exist (which is not the case here), yet the certificate generation does not work with an empty key file...

try moving the file out of the way and then regenerating the key+cert:
Code:
mv /etc/pve/nodes/pve5-9/pve-ssl.key /root/pve-ssl.key.bck
pvecm updatecerts --force
(you most likely could also simply delete it since it looks empty anyways - but I'm always a bit hesitant to delete files)
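
For anyone scripting this, a cautious variant that moves the key only when it really is empty might look like the sketch below. `move_empty_key` is a hypothetical helper, and the paths in the example are the ones from this thread.

```shell
#!/bin/sh
# move_empty_key KEYFILE BACKUP
# Move KEYFILE to BACKUP only if it exists and is zero bytes long.
# Returns 0 if the file was moved, 1 if it was left untouched.
move_empty_key() {
    if [ -e "$1" ] && [ ! -s "$1" ]; then
        mv -- "$1" "$2"
    else
        echo "not moving $1: missing or not empty" >&2
        return 1
    fi
}

# Intended use on the affected node (not executed here):
#   move_empty_key /etc/pve/nodes/pve5-9/pve-ssl.key /root/pve-ssl.key.bck \
#       && pvecm updatecerts --force
```

The guard keeps a valid key from being moved away by accident if the script is run on the wrong node.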

I hope this helps!

It's still curious how the system ended up with an empty key file
 
you guys are the best.

thank you

Code:
root@pve5-9:/etc/pve# mv /etc/pve/nodes/pve5-9/pve-ssl.key /root/pve-ssl.key.bck
root@pve5-9:/etc/pve# pvecm updatecerts -f
(re)generate node files
generate new node certificate
merge authorized SSH keys and known hosts
 
Glad that worked - if possible could you still share the logs of the first join-attempt of the node? (maybe we can spot how this situation happened)
 
Hi,
Today I added 4 new identical nodes to an existing cluster, but on the last one everything hung. I followed the same process each time: update & upgrade, reboot, join. The last node failed to join completely (invalid PVE ticket 401). The cluster config (corosync.conf, for example) was copied to it, but the node was not present in the configs on the other cluster nodes, even though it appeared in the datacenter list.
The whole cluster was brought to its knees while the node was online. If I stopped networking, everything started to work again.
I tried to delete the node, but rejoining was impossible because the join detected that corosync.conf already contained the cluster join info. I was unable to remove the file (read-only for root).
The only difference was that Ceph was already installed and configured on the cluster but not installed on the last node.
I reinstalled the node, installed Ceph and rejoined, and all was OK ... for now.
PVE 7.0-1
 
