[SOLVED] Problems after adding node(s) with same name as one(s) previously deleted

Nov 10, 2020
Hello,

This post is more of a proposal/bug report. I just felt like sharing in case anyone else encounters the same issues, since we managed to solve them.

So the problem we encountered was this:

We had a 5-node cluster (7.0) that was about to get some new hardware. For brevity, let's call the nodes node1/2/3/4/5. We migrated everything off of the last two nodes, node4 and node5, and shut them down since their life had now come to an end.
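(Side note for anyone following along: evacuating a node is just a matter of migrating each guest away, in the GUI or on the CLI. A rough sketch, where the VMID 101 and the target node1 are made-up examples:

qm migrate 101 node1 --online

Containers work the same way with pct migrate.)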

We then removed those nodes using pvecm delnode. All was good. We then provisioned some new hardware and installed Proxmox 7.0 on the two new machines, also called node4 and node5. As we joined (assisted join in the GUI) the brand new machine (node4) with freshly installed Proxmox on it, things got messy. First off, node4 became visible in the GUI with a red cross, and the GUI got unresponsive. It turned out that corosync.conf had been updated with the new node, but that was pretty much it. Nothing else seemed to have happened.
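For reference, the removal itself was done on one of the remaining cluster members with pvecm delnode node4 (and likewise for node5). If you want to compare what corosync actually sees against what the config says, these are read-only and safe to run anywhere in the cluster:

pvecm status
cat /etc/pve/corosync.conf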

So we removed the joined node4 from corosync.conf, following the documentation for that specific task, while node4 was shut down (because then everything worked like a charm again). We then noticed that the cluster configuration filesystem still had folders for the deleted nodes in /etc/pve/nodes/[node4/5]. And this is my first suggestion or appeal to the devs: shouldn't pvecm delnode also delete these folders? After removing them, joining node4 and node5 with the same names as the previously deleted ones worked like a charm. I guess what happened was that the generation of the certs and so on failed because of the existence of the folders with those same names?
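Concretely, the cleanup was something like this, run on any surviving cluster node (/etc/pve is the clustered pmxcfs, so the deletion propagates everywhere):

ls /etc/pve/nodes/
rm -r /etc/pve/nodes/node4 /etc/pve/nodes/node5

Double-check the contents first; those directories hold the per-node guest configs, so only remove them for nodes that are really gone.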

Happy as we were about joining the two new awesome nodes, we then hit another wall: ssh keys :) Or host identities, rather. Following advice from different threads here in the forums, one is told to just run pvecm updatecerts on each affected node, in this case node4 and node5, since they couldn't be accessed. The problem was that when we issued pvecm updatecerts on, for instance, node4, we could then reach that node from the others. But as we did the same on node5, the host identities changed for node4. So now node5 was working but node4 was not.

The reason for this is that pvecm delnode didn't care to remove the old identities from /etc/pve/priv/known_hosts, which each host's /etc/ssh/ssh_known_hosts points to. So every time one ran pvecm updatecerts, the old identities from the old hosts were merged back in, overwriting the new ones on the other newly added nodes. So that's the second suggestion/appeal: make pvecm delnode also remove the old identities from the known_hosts file shared across all hosts :)
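To see the stale entries for yourself before fixing anything, something like this works (ssh-keygen -F only prints matching entries, it changes nothing):

ssh-keygen -F node4 -f /etc/pve/priv/known_hosts
readlink /etc/ssh/ssh_known_hosts

The readlink should confirm that the per-host file is a symlink to /etc/pve/priv/known_hosts.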

And here's what we did to remedy this (it might even work to fiddle directly with /etc/pve/priv/known_hosts, but for some reason we did it this way, and it is confirmed to work :) ). A consolidated sketch of the commands follows the steps:
On an arbitrary node, enter the /etc/ssh/ directory.
Issue the command: ssh-keygen -f ./ssh_known_hosts -R node4 The example is for a node named node4. Tab completion works for the names inside the ssh_known_hosts file, but also for other hosts such as those in /etc/hosts; keep that in mind.
You will probably have about 3 identities for each node: hostname, FQDN, and IP. Remove the ones relating to the old nodes you deleted and are now rejoining with the same name. Note that your ssh_known_hosts file isn't a symlink anymore. Copy this file to be your new /etc/pve/priv/known_hosts. After this, you can issue pvecm updatecerts on all nodes in the cluster (at minimum the affected ones, plus the machine on which you issued the commands above) and everything will work like a charm again. And the /etc/ssh/ssh_known_hosts file on the node you chose for all your commands is once again restored to being a symlink to /etc/pve/priv/known_hosts.
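Putting the steps together, the whole remedy looked roughly like this. node4, the FQDN node4.example.com, and the IP 192.0.2.14 are placeholders; substitute whatever entries your known_hosts actually has, and repeat for node5 or any other rejoined node:

cd /etc/ssh
ssh-keygen -f ./ssh_known_hosts -R node4
ssh-keygen -f ./ssh_known_hosts -R node4.example.com
ssh-keygen -f ./ssh_known_hosts -R 192.0.2.14
cp ./ssh_known_hosts /etc/pve/priv/known_hosts
pvecm updatecerts

Then run pvecm updatecerts on the other (at least the affected) nodes as well; that is also what restores the ssh_known_hosts symlink.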

I hope this helps someone encountering related issues.

Cheers
Marcus
 
I'm running into the same issue after removing a node and then adding a different instance with the same name (and IP address) to the cluster.
Confirmed with two nodes.
 
I am new to Proxmox and was performing a migration from ESXi, which is very easy if you follow the Proxmox guide.
During the migration of 3 nodes I had to use one temporary node in order to migrate, and the plan was then to remove the temporary node and reuse its name on the correct hardware.
Unfortunately, I experienced all the same problems as described in this thread, except that my version of Proxmox VE is 8.2.4.
Since I am new to Proxmox and had to finish the migration, my solution was to give up on reusing the name and IP of the temporary node and give everything new ones.
It is an unfortunate experience, since I like Proxmox VE, but I think it should be easier to remove nodes and add them back with the same names if needed, for clarity and continuity.
 
