Failed to add first node to cluster

Cebr

New Member
Jan 20, 2025
Hello,

I have had a proxmox-ve 8.2 instance for some time, let's call it [B]alpha[/B].mydomain.com.
Alpha has been hardened (TFA + firewall) and has been running 10+ VMs for some years now.

I just built a second node, let's call it [B]beta[/B].mydomain.com, running 8.3.
Both are connected to the same switch (RTT < 1 ms).

Rushing to create my first cluster, I didn't notice the version mismatch at first.
While joining the cluster, beta complained about the TFA and asked me to use the command line.
This failed, as did my 3 other attempts (after updating both alpha and beta to the same version 8.3.2, disabling TFA, and rebuilding beta from scratch).

The latest attempt went like this (it was almost the same on every attempt):
On beta:
Code:
Establishing API connection with host [alpha IP]
Login succeeded
check cluster join API version
request addition of this node
In the background I can see the error message "permission denied - invalid PVE ticket (401)" and, after 30 s, "communication failure (0)".
After that, if I refresh my browser, my Let's Encrypt certificate has disappeared and I'm left with the self-signed one.
I can see both alpha and beta in the menu, both with a green tick.
I can access both "Summary" pages (which are updated), but most of the pages linked to beta are stuck on a "loading" screen.
"Shell" to alpha fails with: root@X.X.X.X: Permission denied (publickey,password).

On alpha:
I can see both alpha and beta, but only alpha has a green mark; beta has a grey question mark.
On the cluster page I see an error message: '/etc/pve/nodes/beta/pve-ssl.pem' does not exist! (500)
I can connect to both the alpha and beta shells.
When I check the content of /etc/pve/nodes/beta, I see that it is mostly empty. When I try to check /etc/pve/nodes/alpha, the command hangs and I need to press Ctrl+C to regain control.

On beta:
When I check the content of /etc/pve/nodes/alpha, I see that it is mostly populated. When I try to check /etc/pve/nodes/beta, the command hangs and I need to press Ctrl+C to regain control.

On alpha:
When checking the beta summary, I get the error: connection error 596 : error:0A000086:SSL routines::certificate verify failed

After a few minutes, everything goes south: both alpha and beta become unresponsive, and I need to shut down beta and issue a pvecm e 1 on alpha to regain control.
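
For context, pvecm e 1 appears to be shorthand for the documented pvecm expected 1, which lowers the expected vote count so that the one remaining node is quorate again. A minimal sketch of the majority arithmetic behind it, assuming the default of one vote per node (the function name quorum_needed is hypothetical and not part of any Proxmox tool):

```shell
#!/bin/sh
# Votes required for quorum for a given number of expected votes,
# assuming one vote per node: floor(expected / 2) + 1.
quorum_needed() {
    expected=$1
    echo $(( expected / 2 + 1 ))
}

quorum_needed 2   # two-node cluster: both nodes must be reachable
quorum_needed 1   # after "pvecm expected 1": a single node is enough
```

With two expected votes the majority is two, so a two-node cluster loses quorum as soon as one node is shut down.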

Do you have any ideas?

Regards
 

Attachments

  • 2025-01-20_22h56_36.png
  • 2025-01-20_22h57_07.png
  • 2025-01-20_23h45_29.png
I just experienced the exact same issue. I was able to break up the cluster on my "alpha" by following these instructions:

https://forum.proxmox.com/threads/delete-all-cluster-config-from-proxmox.108478/

My "beta" and "gamma" needed to be completely reinstalled.
 
Didn't work for me.
I found https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_separate_node_without_reinstall on day 1.
I used it repeatedly, but to no avail.

As I was stuck with no answer, I decided to go the other way around: since my beta node is more powerful than alpha, I migrated (export + import) all VMs there. I will rebuild alpha from scratch and then join it to beta.
 
I have the same problem and have tried everything on this forum. I didn't find any solution, and it's very random: of 7 nodes, 3 hit the same issue every time, even after reinstalling, fully updating, and checking kernel versions and everything else.

The joining nodes don't switch from "/etc/pve/local" to the cluster filesystem with /etc/pve/nodes, and the certificates can't be sent because of the missing folders and filesystems.
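
One way to see which node directories never received a certificate (the "'/etc/pve/nodes/beta/pve-ssl.pem' does not exist!" symptom from the first post) is a small check like the one below. This is only a sketch: the function name missing_node_certs is hypothetical, and it assumes the usual layout of one directory per node under /etc/pve/nodes. If /etc/pve itself hangs, as described above, this will hang too.

```shell
#!/bin/sh
# List node directories under a pmxcfs-style tree that lack pve-ssl.pem.
# The root defaults to /etc/pve; pass another path to check a copy.
missing_node_certs() {
    root=${1:-/etc/pve}
    for dir in "$root"/nodes/*/; do
        [ -d "$dir" ] || continue          # glob matched nothing
        node=$(basename "$dir")
        [ -f "$dir/pve-ssl.pem" ] || echo "$node"
    done
}

# Example (run on a cluster node):
#   findmnt /etc/pve          # is the cluster filesystem mounted there?
#   missing_node_certs        # which nodes have no certificate yet?
```

Running it on both the existing node and the joining node shows whether the two sides agree on the contents of /etc/pve/nodes.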
 
Sorry @vernongreen, I decided to nuke and rebuild the whole "cluster" from scratch (not finalized yet), so I won't be of much help.
I also reset my whole cluster, but the issue is still there on 3 of 7 nodes, all freshly installed... It's so annoying that I can't get my full functionality back...
 
Without logs from both the existing cluster node and the joining one, it will be hard to tell what's going on.
 
Which logs in particular do you need? I was looking for logs, but I didn't find anything useful. Every time, the logs for the time range of the join are missing.
 
The journal covering the time period of the join task +- a few minutes would be a great start. Any related task logs as well, of course. Please post the logs in [code] tags, not as screenshots.
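
A possible way to capture that window, sketched under a few assumptions: GNU date is available, the helper names join_window and collect_join_journal are made up for illustration, and the join time in the example is hypothetical and must be replaced with the real one.

```shell
#!/bin/sh
# Print the "since" and "until" timestamps for a window of MINUTES
# on either side of the join time (requires GNU date).
join_window() {
    center=$(date -d "$1" +%s) || return 1
    margin=$(( $2 * 60 ))
    date -d "@$(( center - margin ))" '+%Y-%m-%d %H:%M:%S'
    date -d "@$(( center + margin ))" '+%Y-%m-%d %H:%M:%S'
}

# Dump the whole journal for that window; the margin defaults to 5 minutes.
collect_join_journal() {
    since=$(join_window "$1" "${2:-5}" | sed -n 1p)
    until_=$(join_window "$1" "${2:-5}" | sed -n 2p)
    journalctl --since "$since" --until "$until_" --no-pager
}

# Example: join attempted around 22:56 (hypothetical time), 5 minute margin:
#   collect_join_journal "2025-01-20 22:56:00" 5 > "journal-$(hostname).txt"
```

Run it on both the existing cluster node and the joining node, then paste the resulting text files inside [code] tags.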