Cluster config - Stuck while join

Hi all,

I think the whole mess came up because I tried to set up a cluster between two different PVE versions, but I am not sure about that. I managed to upgrade both machines to 7.4-16, but the system still seems to be messed up.
The cluster join still fails.

This is what I get:
Cluster "host":
[Screenshots: 1694942662034.png, 1694942681050.png]
The new node shows up but does not turn green.
Error message: "pve-ssl.pem does not exist!"

Cluster "client":
[Screenshot: 1694942770888.png]
Node is stuck in the Join task
Latest output: "Request addition of this node"


In the end this failed cluster join leads to a fatal situation:
After a hard reset the host is not able to start the VMs any more, apparently because of the broken cluster configuration.
One can force the start by executing pvecm expected 1, which lowers the expected vote count so that the single node is quorate again.

The only way out is to remove the cluster and the node directories with:
Code:
rm -f /etc/pve/cluster.conf /etc/pve/corosync.conf
rm -f /etc/cluster/cluster.conf /etc/corosync/corosync.conf
rm /var/lib/pve-cluster/.pmxcfs.lockfile
<Reboot>
Code:
rm -rf /etc/pve/nodes/<nodename>

On the other node it is similar. I tried to clean up all the cluster and corosync files.

Now I am a little bit stuck.
I do have both pves on the same version.
Time is synced and both are in the same time zone (if that makes a difference).
I tried to clean up all files I thought were involved.

It is always the same: when I think I am clean and well prepared, I create a new cluster (fine), then I add the node... and it gets stuck.

Hope you can help.
 
I just did some further testing:

I swapped the network cable to be sure.

I manually copied the /etc/pve/nodes/LayProx2/pve-ssl.pem file from LayProx2 to the LayProx machine.
Now the error message in "Datacenter" -> "Cluster" has disappeared, but the connection still does not work.
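
The "pve-ssl.pem does not exist!" message refers to a per-node certificate under /etc/pve/nodes/<nodename>/. A minimal sketch for listing which node directories lack that file is shown here. The function name missing_node_certs and its optional directory argument are hypothetical helpers, not part of any Proxmox tool, and the function only reads the file system.

```shell
#!/usr/bin/env bash
# Print the name of every node directory that has no pve-ssl.pem.
# $1: directory holding one sub-directory per node
#     (assumed default: /etc/pve/nodes, the path used in this thread).
missing_node_certs() {
  local nodes_dir="${1:-/etc/pve/nodes}" node
  for node in "$nodes_dir"/*/; do
    # Skip the unexpanded glob when the directory is empty or missing.
    [ -d "$node" ] || continue
    # Report the node name when its certificate file is absent.
    [ -f "${node}pve-ssl.pem" ] || basename "$node"
  done
}

# Example call on a cluster member (prints nothing when all certificates exist):
#   missing_node_certs
```

An empty result only says that the files exist; it does not say that the certificates are valid or that they match on both nodes.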

I also tested pinging from A to B and from B to A, and both nodes see each other.

On the host I have seen different errors in the syslog. I don't know if these are related:
Code:
pveproxy[28247]: proxy detected vanished client connection
or
corosync[4642]:   [TOTEM ] Retransmit List: d e ba c7 cf d4 d7 d9 e4 eb ed ef f1 f5 f8
I am not sure what to do with that.
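
To judge whether the "Retransmit List" lines are occasional or constant, they can be counted over a time window. A rough sketch follows; count_retransmits is a hypothetical helper name, and it simply counts matching lines in whatever log text is piped into it.

```shell
#!/usr/bin/env bash
# Count corosync "Retransmit List" messages in log text read from stdin.
# Prints the number of matching lines (0 when there are none).
count_retransmits() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function usable in scripts that run with "set -e".
  grep -c 'Retransmit List:' || true
}

# Example call on a node (assumes corosync logs to the systemd journal):
#   journalctl -u corosync --since "1 hour ago" | count_retransmits
```

A steadily growing count while the cluster is idle would point at the network between the nodes rather than at the certificates.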

And another interesting thing:
After the failed node join, LayProx2 is not reachable any more, neither via web nor via SSH. But it is still pingable and the icon for LayProx2 on LayProx is still green.
 
I think I solved the problem myself...

Executing the following on both nodes has solved the problem:

Code:
rm -f /etc/pve/pve-root-ca.pem /etc/pve/priv/pve-root-ca.* /etc/pve/local/pve-ssl.*
pvecm updatecerts -f

Don't ask me why, I am sure an expert could tell...
For me it is not understandable how I managed to get such a mess in my system. ;-)
 
I have to reopen this.
It was working well and I successfully migrated one VM to the second node, but after about one day the cluster was broken again.

It is getting a little bit annoying...

Ok, back to the beginning and going through what could be wrong:

1.) Time Sync
I manually configured an NTP server on both machines and checked that the sync was successful and the time is the same.
Code:
timedatectl

2.) Certificates
I manually recreated the certificates several times with:
Code:
rm -f /etc/pve/pve-root-ca.pem /etc/pve/priv/pve-root-ca.* /etc/pve/local/pve-ssl.*
pvecm updatecerts -f

3.) DNS
I am not sure how this comes into play, but I compared
/etc/resolv.conf
on both machines.
The search domain is set to the same kind of pseudo domain (xyz.com) on both.

4.) corosync config
I am not sure about this either. I still don't understand why it should be necessary to mess around with this on a clean default installation, but I read from somebody that this may help:
Bash:
#totem {
#  cluster_name: homelab
#  config_version: 15
#  interface {
#    linknumber: 0
#  }
#  ip_version: ipv4-6
#  secauth: on
#  version: 2
#}

totem {
  cluster_name: homelab
  config_version: 16
  interface {
    ringnumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
  token: 10000
}
I am not sure what I am doing there, and changing /etc/pve/corosync.conf is kind of tricky, as described in https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_edit_corosync_conf

I think I need a break. Any help is welcome. Currently I have the following error:

permission denied - invalid PVE ticket (401)
 
I ran into the same problem and haven’t resolved it yet. One thing we have in common is a mixed environment where not all nodes have been upgraded yet; that might be the cause.

By the way, you can resolve the urgent problems by following the instructions in the pvecm manual page for having the new node leave the cluster. The problems will just reappear when you try joining again, though.
 
I made a number of changes:

* upgraded all the other members to 8
* switched out to a different network port and cable
* manually copied over the certificates

And now it is working. Unfortunately I can’t tell you which step made the difference.
 
Thanks for your input.

I successfully upgraded to Proxmox 8.
Still the same issue.

How did you manage to manually copy the certificates?
When I join the second node to the first one's cluster, the problem begins.
I can't access the web interface of node2.
I can access it via SFTP, but I can't write to /etc/pve/nodes/node1or2/ .
After executing one command via SSH, everything freezes.

Thanks.
 
My suggestion would be:

* follow the steps in man pvecm for making the node leave the cluster. This way /etc/pve starts working again
* connect the nodes to a low latency, quiet network
* try again

You may also want to clear out your failing node's directory on one of the cluster members.

While a node is in a failing state and /etc/pve is not working, you can forcefully make it work by stopping the pve-cluster service and running ‘pmxcfs -l’. Then you can copy the certificates from a working node to your problem node. But I presume this is all a big hack and not necessary if the network is working properly.
 
Finally... After days of investigation and numerous reinstalls and config hacks, I think I had a breakthrough.

I moved the second PC physically next to the first one and they are now both connected directly to the switch next to each other. And now, suddenly the cluster join is working again.

I have to say, before that both machines were connected to two different meshed WiFi APs, and although I normally experience very good bandwidth and data throughput, I am sure that the latency is not the best. However, ping was still fine, and I don't have high requirements since I basically just want to use the cluster to easily migrate VMs between nodes.
It seems that the WiFi latency kills the whole cluster and even makes the nodes unusable.

However, now that I have new knowledge, the question is how to fix that issue permanently and increase the tolerances for network latency and timeouts.
 
I just double checked my latency between the original networking ports that were not working.
In my opinion that does not look that bad.
In https://pve.proxmox.com/wiki/Cluster_Manager it is written that the default cluster config requires 5 ms and the worst case is 10 ms.
How can I raise these limits? I don't want to use the HA feature, I just want to use clustering for live VM migration.
[Screenshot: 1696149043506.png]
 
Corosync is (inherently) very sensitive to latency; unfortunately there is no way to raise this limit. 1-2 ms latency should work perfectly fine, assuming the connection is stable and there are no latency spikes. Any spike in latency could break the setup.
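
Since spikes matter more than the average, it can help to look at the max and mdev values of a longer ping run instead of single replies. The following is a minimal sketch, assuming the Linux iputils summary line format ("rtt min/avg/max/mdev = ..."); check_rtt is a hypothetical helper, and its default limit of 5 ms is simply the figure quoted earlier in this thread.

```shell
#!/usr/bin/env bash
# Read ping output on stdin and report the worst-case round trip and jitter.
# $1: maximum acceptable RTT in ms (assumed default: 5).
# Exit status: 0 = within limit, 1 = max RTT above limit, 2 = no summary found.
check_rtt() {
  local limit_ms="${1:-5}"
  awk -v limit="$limit_ms" '
    /min\/avg\/max/ {
      # Summary line looks like: rtt min/avg/max/mdev = a/b/c/d ms
      split($0, halves, "= ")
      split(halves[2], rtt, "/")
      sub(/ ms.*/, "", rtt[4])      # strip the trailing unit from mdev
      max = rtt[3]; mdev = rtt[4]; found = 1
    }
    END {
      if (!found) { print "no rtt summary found"; exit 2 }
      printf "max=%s ms mdev=%s ms\n", max, mdev
      if (max + 0 > limit + 0) exit 1
    }'
}

# Example call (the host name is a placeholder for the other cluster node):
#   ping -c 100 -i 0.2 <other-node> | check_rtt 5
```

A good average with a high max or mdev is the pattern that a quick manual ping tends to hide.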

Do note that with 2-node setups you should have a QDevice to avoid split-brain situations, see [1].

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support
 
Thanks for the note.

I think I now have managed to get everything running even on the problematic network interface.

Although I had a pretty stable and good ping, I noticed that something was wrong with my Ethernet connection. First, I noticed that the router had degraded the connection to 100 Mb/s. After playing around with the router settings, firewall settings and broadcast config, I somehow managed to get back to a stable 1 Gb/s LAN. With that step, all troubles with Proxmox clustering also disappeared.
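
A degraded link like this can be spotted on the node itself, because Linux exposes the negotiated speed in Mb/s under /sys/class/net/<interface>/speed. A small sketch follows; link_speed_ok is a hypothetical helper, the interface name in the example is a placeholder, and the third argument exists only so the function can be tried against a test directory.

```shell
#!/usr/bin/env bash
# Check whether an interface negotiated at least the expected link speed.
# $1: interface name, $2: minimum speed in Mb/s (assumed default: 1000),
# $3: sysfs root (default: /sys/class/net).
# Exit status: 0 = fast enough, 1 = slower than expected, 2 = speed unreadable.
link_speed_ok() {
  local iface="$1" min_mbps="${2:-1000}" root="${3:-/sys/class/net}"
  local speed
  speed=$(cat "$root/$iface/speed" 2>/dev/null) || {
    echo "cannot read speed for $iface"
    return 2
  }
  echo "$iface: ${speed} Mb/s"
  [ "$speed" -ge "$min_mbps" ]
}

# Example call (replace eth0 with the interface used for the cluster network):
#   link_speed_ok eth0 1000 || echo "link is slower than expected"
```

Virtual interfaces or links that are down may report -1 or fail to read, so the check is only meaningful for the physical port that carries the corosync traffic.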

In the end, what I have learned is that a stable and good ping says nothing about the quality of your network connection.

Thanks for all the help. Debugging cluster issues is not that easy, it seems.