[SOLVED] Problem with reinstalling a node of a cluster

Feb 17, 2020
Hi,

I followed this guide to reinstall a node of my cluster step-by-step:

https://pve.proxmox.com/wiki/Proxmox_VE_2.0_Cluster#Re-installing_a_cluster_node

Two things I did differently:
- The 'cman' service did not exist in my case, so stopping/starting it didn't work, but I ignored that.
- I copied the backup…tar.gz files to another host via scp and restored them afterwards, because reinstalling a PVE host would obviously destroy them. (See the sketch below.)
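For reference, the copy-out/copy-back step looked roughly like this (the archive names below are placeholders, since I trimmed the real ones):

Code:
# on proxmox2, before the reinstall: move the backup archives off the host
scp /root/backup-cluster.tar.gz /root/backup-ssh.tar.gz root@proxmox:/root/

# on proxmox2, after the reinstall: copy them back before restoring
scp root@proxmox:/root/backup-*.tar.gz /root/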

The first host is running, the second host is running, and the cluster information was correctly restored, but one host cannot connect to the other:
- proxmox (is untouched)
- proxmox2 (is reinstalled)

This is a screenshot from the untouched node:

[screenshot]

This is a screenshot from the reinstalled node:

[screenshot]

It's clear that communication does not work between the two nodes.

I think I know what's causing it, but I don't understand it:

- SSH from proxmox2 (reinstalled) -> proxmox (untouched) works fine
- SSH from proxmox (untouched) -> proxmox2 (reinstalled) throws a public key changed error

[screenshot: SSH host key warning]

Now I don't understand this, because all of the /root/.ssh files were restored from the backup, and the timestamps clearly show that they were:

[screenshot: /root/.ssh file timestamps]
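My suspicion: restoring /root/.ssh only brings back root's client keys and known_hosts, while the node's own host keys in /etc/ssh/ssh_host_* are regenerated by a reinstall, so the untouched node's stored fingerprint no longer matches. Here is how the stored fingerprints can be compared (a sketch; the cluster-wide known_hosts path assumes the stock PVE symlink from /etc/ssh/ssh_known_hosts):

Code:
# on proxmox (untouched): look up the fingerprint recorded for proxmox2
ssh-keygen -F proxmox2 -f /root/.ssh/known_hosts
ssh-keygen -F proxmox2 -f /etc/pve/priv/known_hosts

# on proxmox2 (reinstalled): the host keys that were regenerated
ls -l /etc/ssh/ssh_host_*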

Any help is greatly appreciated!
 
Hi @Stoiko Ivanov - thanks for your quick response.

I didn't notice that I was reading an old manual, thanks for pointing that out. I read the document you linked above. Let me just confirm:

In order to reinstall one cluster node, I need to do the following:

1) Remove the node from the cluster: Remove a Cluster Node

2) Then follow these instructions:

If, for whatever reason, you want this server to join the same cluster again, you have to
  • reinstall Proxmox VE on it from scratch
  • then join it, as explained in the previous section.
After removal of the node, its SSH fingerprint will still reside in the known_hosts of the other nodes. If you receive an SSH error after rejoining a node with the same IP or hostname, run pvecm updatecerts once on the re-added node to update its fingerprint cluster wide.
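So, condensed into commands, something like this (the IP placeholder is mine; each command runs on the node indicated):

Code:
# on a remaining cluster node: remove the old node
pvecm delnode proxmox2

# reinstall Proxmox VE on proxmox2 from scratch, then on proxmox2 rejoin via an existing node:
pvecm add <IP-of-proxmox>

# on proxmox2, if SSH reports a changed fingerprint after rejoining:
pvecm updatecerts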

Is that correct?
 
The reference docs describe the working procedure - so yes :)
 
@Stoiko Ivanov - OK, I think I'm having a problem here.

Since the proxmox2 cluster node was shut down before removing it, it does not show up on the command line, and I can't seem to remove it following the guide above:

Code:
root@proxmox:~#  pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 proxmox (local)

root@proxmox:~# pvecm delnode proxmox2
cluster not ready - no quorum?

However, I do see the node in the WebUI:

[screenshot: proxmox2 still shown in the WebUI]

I think this is a fairly normal situation, as a cluster node can simply die from hardware failure. What is the right way to proceed now?

Thanks!
 
it's described a bit further down in the reference documentation I linked above:
If the command failed, because the remaining node in the cluster lost quorum when the now separate node exited, you may set the expected votes to 1 as a workaround:

pvecm expected 1
And then repeat the pvecm delnode command.
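In other words, on the remaining node:

Code:
# when delnode fails with "cluster not ready - no quorum?"
pvecm expected 1
pvecm delnode proxmox2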
 
@Stoiko Ivanov - Thanks, it worked!

I did:

1) pvecm expected 1
2) pvecm delnode proxmox2
3) Reinstall proxmox2 from scratch (same hostname, same IP address)
4) Rejoin the cluster

Surprisingly enough, I didn't even get any SSH key warnings when SSH-ing from proxmox to proxmox2, so even the new SSH keys were synced automatically.
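For anyone following along, a quick sanity check after the rejoin (a sketch, output omitted):

Code:
# on either node: both nodes should be listed and the cluster quorate
pvecm status
pvecm nodes

# SSH should work in both directions without host key prompts
ssh root@proxmox2 hostname
ssh root@proxmox hostname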
 