Hello,
This post is more of a proposal/bug report. I just felt like sharing it in case anyone else encounters the same issues, since we solved them.
So the problem we encountered was this:
We had a 5-node cluster (7.0) about to get some new hardware. For brevity, let's call the nodes node1/2/3/4/5. We migrated everything off the last two nodes, node4 and node5, and shut them down since their life had now come to an end.
We then removed those nodes using
pvecm delnode
. All was good. We then provisioned some new hardware and installed Proxmox 7.0 on the two new machines, also called node4 and node5. As we joined this brand new machine (node4, via assisted join in the GUI) to the cluster, things got messy. First off, node4 became visible in the GUI with a red cross, and the GUI became unresponsive. It turned out that corosync.conf had been updated with the new node, but that was pretty much it; nothing else seemed to have happened.

So we removed the joined node4 from corosync.conf, following the documentation for that specific task, while node4 was shut down (because then everything worked like a charm again). We then noticed that the cluster configuration filesystem still had folders for the deleted nodes in /etc/pve/nodes/[node4/5]. And this is my first suggestion or appeal to the devs: shouldn't pvecm delnode also delete these folders? Because after removing them, joining node4 and node5 with the same names as the previously deleted ones worked like a charm. I guess what happened was that the generation of the certs and so forth failed because folders with the same names already existed?
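For reference, cleaning up those leftover folders comes down to something like the following (a rough sketch of what we did; since /etc/pve is the shared cluster configuration filesystem you only need to do it on one node, and you should check the listing first and adapt the node names to your own setup):

ls /etc/pve/nodes/
rm -r /etc/pve/nodes/node4 /etc/pve/nodes/node5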
Happy as we were about joining the two new awesome nodes, we then hit another wall: SSH keys, or host identities rather. Following some advice from different threads here in the forums, one was told to just do
pvecm updatecerts
on each affected node, in this case node4 and node5, since they couldn't be accessed. The problem we had was that when issuing pvecm updatecerts on, for instance, node4, we could then reach that node from the others. But as we did the same for node5, the host identities changed for node4. So now node5 was working but node4 was not.

The reason for this is that pvecm delnode didn't care about removing the old identities in /etc/pve/priv/known_hosts, which each host's /etc/ssh/ssh_known_hosts points to. So every time one ran pvecm updatecerts, the old identities from the old hosts were merged back in, overwriting the new ones on the other newly added nodes. So that's the second suggestion/appeal: make pvecm delnode also remove the old identities from the known_hosts file shared across all hosts.

And here's what we did to remedy this (it might even work to fiddle directly with /etc/pve/priv/known_hosts, but for some reason we did it this way, and it is confirmed to work):
On an arbitrary node, enter the /etc/ssh/ directory.
Issue the command:
ssh-keygen -f ./ssh_known_hosts -R node4
The example is for a node named node4. Tab completion works for the names inside the ssh_known_hosts file, and for other hosts as well, like those in /etc/hosts, so keep that in mind. You will probably have around three identities for each node: hostname, FQDN and IP. Remove the ones relating to the old nodes you deleted and are now rejoining with the same name. Note that after this your ssh_known_hosts file isn't a symlink anymore. Copy this file to be your new /etc/pve/priv/known_hosts. After this you can issue
pvecm updatecerts
on all nodes in the cluster (at minimum the ones affected, plus the machine on which you issued the commands mentioned above), and everything will work like a charm again. Your /etc/ssh/ssh_known_hosts file on the node you chose to run all the commands on is then once again restored to being a symlink to /etc/pve/priv/known_hosts.

I hope this helps someone encountering related issues.
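Put together, the remedy on the node you pick boils down to roughly the following. Note that node4's FQDN and IP address below are made-up placeholders (they weren't part of our real setup), so substitute your own values and repeat the removals for node5 and any other stale entries you find in the file:

cd /etc/ssh
ssh-keygen -f ./ssh_known_hosts -R node4
ssh-keygen -f ./ssh_known_hosts -R node4.example.lan
ssh-keygen -f ./ssh_known_hosts -R 192.0.2.14
cp ./ssh_known_hosts /etc/pve/priv/known_hosts
pvecm updatecerts

Then run pvecm updatecerts again on the remaining (at minimum the affected) nodes, as described above.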
Cheers
Marcus