Hey! Sorry I made it very terse; it was the minimum to include for them to take it in.

Nice.
I assume this should be run on each host.
I also assume the steps are:
Download to /usr/share/perl5/PVE/Cluster/Setup.pm
Make it executable.
Run it.
The diff command makes these files. What that means is that you can just apply the patch to the original file, like so:

# patch file-to-apply-on patchfile

It is a good idea to keep a backup of the original first (which e.g. cp Setup.pm Setup.pm.orig would do). Incidentally, if you try to apply the patch to an already patched file, it will actually offer to reverse the previously performed changes.

Then run pvecm updatecerts. Unless it crashes (if it does, just revert the file back), you basically have the patched version of the command on THAT node. (You can then proceed to patch it on the others.) Unless you have a broken known_hosts file again, you do not need to run it at all. The point of the patch is that the known_hosts file should not get broken again at all. The patched piece of code is also run when joining a blank node into the cluster, so if you were to add a new node, you would need to patch that one too and only then add it. An apt update/upgrade will eventually bring the official
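The backup-then-patch flow described above can be sketched end to end. This demo uses a throwaway directory and a dummy file standing in for the real Setup.pm, so it is safe to run anywhere (pvecm updatecerts itself only exists on a Proxmox node and is not invoked here; the "patched version" is simulated with sed):

```shell
# Demo of the diff/patch workflow on a scratch file standing in for Setup.pm.
set -e
cd "$(mktemp -d)"
printf 'line one\nline two\n' > Setup.pm
cp Setup.pm Setup.pm.orig                  # keep a pristine copy to revert to
sed 's/two/2/' Setup.pm > Setup.pm.new     # pretend this is the patched version
diff -u Setup.pm.orig Setup.pm.new > setup.patch || true  # diff exits 1 when files differ
patch Setup.pm setup.patch                 # apply the patch to the original file
grep -q 'line 2' Setup.pm && echo PATCHED
```

Re-running the patch command on the already patched file makes patch detect the applied diff and offer to reverse it, matching the behavior described above.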
fix of this later on. As long as you are happy now, you can just wait. Once the patched code is on the nodes, you would also not need the procedure we ended up using in this thread, simply because it should stop corrupting the files we all had to wipe clean and start over with.

Yes, everything has been working well again since the problem was solved.
About the only thing that remains unknown is whether we'll be able to get to a DNS method rather than hostname and IP.
They use HostKeyAlias entries as a crutch. This is an abused SSH feature (it is meant for situations where your host's IP is actually constantly changing, or you run multiple hosts behind a single address). It is going to be all confusing for an average user, because say you create a node, name it pve7, and put it on 10.0.0.7. This is all static config of the node: it is put in /etc/hosts, and it is put in some other places (corosync.conf). Then you have to worry that your DHCP does not give out that address, and you will only have the hostname "resolve" on that very node (it pulls it from the hosts file). If you manually try to e.g. ssh pve3, it will not successfully look up that node's IP anywhere. So you either connect by the IP, and the built-in tools even use the HostKeyAlias option for the SSH command (on second thought, I wonder what use the IP entry and the alias entry were at the same time). At the beginning I had thought I was missing something about this setup. I do not think that anymore. (Anyone who finds this post and wants to enlighten us all, I am happy to read it.)

That should help prevent this though; not sure how often it happens.
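For illustration, a minimal sketch of what the non-abused use of this option looks like as a plain ssh_config stanza (my own example, not from Proxmox; the host name and address are made up). The alias tells SSH which name to check the presented host key against in known_hosts, regardless of the address actually connected to:

```
Host pve7
    HostName 10.0.0.7
    # Verify the server's key against the entry named "pve7"
    # in known_hosts, not against "10.0.0.7".
    HostKeyAlias pve7
```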
Or just read here ... https://forum.proxmox.com/threads/ssh-keys-across-nodes.136437/#post-605931

They are not going to do that; if you went on to read the 3 Bugzilla reports, everything...
Sadly it's not really just that: sometimes you may end up having your host key changed. It happens in certain scenarios when your machine's host key regenerates (usually not on a regular update on physical hardware). That in turn would cause the same problem. Or simply someone ever accessing something by the same actual DNS hostname as the alias, and you would have started getting the same results. There was way more than one person running into this... https://forum.proxmox.com/threads/p...-in-cluster-after-upgrade.133030/#post-606358

It's a big mention in my notes now: do not re-use hostname and IP when rebuilding or adding new hosts.
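To make the regenerated-hostkey failure mode concrete, here is a scratch sketch of pruning the stale entry from a known_hosts file so the next connection can record the new key. The host names and keys are made up; on a real system, ssh-keygen -R <host> -f <file> does this properly instead of grep:

```shell
# Demo on a throwaway known_hosts: drop the stale entry for a node
# whose host key was regenerated.
set -e
KH="$(mktemp)"
printf 'pve7 ssh-rsa OLDKEY\npve8 ssh-rsa GOODKEY\n' > "$KH"
grep -v '^pve7 ' "$KH" > "$KH.new" && mv "$KH.new" "$KH"
cat "$KH"   # only the pve8 entry remains
```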
All good and super happy things are working again thanks to your help.
I spent some hours debugging this issue, it was driving me crazy, and I solved it on my 3-node cluster this way: https://forum.proxmox.com/threads/c...ask-error-migration-aborted.42390/post-619486
Due to the nature of the bug, it unfortunately will trigger again, one way or another, when the keys are regenerated; the certs tutorial was meant to avoid that, because they are not fixing it anytime soon.

Hope it doesn't trigger again.
As a SaaS owner myself, I know how people are when things don't work or cause them frustration. They simply leave, most of the time, without even asking for help, and then trash the service/product in forums.
Not sure why the devs of Proxmox would even take such a chance. Proxmox has a healthy community and tons of input to be used.
Makes no sense to me.
Also, I didn't even know there were alternatives. I've never looked beyond Proxmox once I found it while thinking of moving from VMware.
I use Proxmox in production, but I've also never gotten rid of VMware yet. I do/did plan on subscriptions but am a little nervous lately.
I can appreciate you not posting the others, and I'm glad you didn't.
Hope the devs will see this and someone pushes the others to do a little better.
I do see this as being a solid competitor at some point assuming they keep the costs low.
So, if anybody runs into this: I couldn't get pvecm updatecerts to add keys for reinstalled nodes to the global /etc/pve/priv/ssh_known_hosts; however, the folder /etc/pve/nodes/<nodename> contains an ssh_known_hosts file with the content you need. Copy it over and the world is good again.
The relevant pieces:

/etc/pve/nodes/<node>/ (the node's directory in the cluster filesystem)
/etc/ssh/ssh_host_rsa_key.pub (the node's RSA public host key)
/etc/pve/nodes/<node>/ssh_known_hosts (that node's known_hosts entries)

Each entry has the format:

NodeHostname ssh-rsa <the_rsa_pub_key>
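The copy-it-over fix above can be sketched as follows. The node names and keys are made up, and a temp directory stands in for /etc/pve, so nothing real is touched:

```shell
# Append a reinstalled node's per-node known_hosts entries to the
# cluster-wide file (temp dir stands in for /etc/pve).
set -e
PVE="$(mktemp -d)"
mkdir -p "$PVE/nodes/pve3" "$PVE/priv"
echo 'pve3 ssh-rsa FAKEKEY' > "$PVE/nodes/pve3/ssh_known_hosts"
echo 'pve1 ssh-rsa OTHERKEY' > "$PVE/priv/ssh_known_hosts"
cat "$PVE/nodes/pve3/ssh_known_hosts" >> "$PVE/priv/ssh_known_hosts"
cat "$PVE/priv/ssh_known_hosts"
```

On a real cluster the target would be /etc/pve/priv/ssh_known_hosts, and it is worth checking that the entry is not already present before appending, to avoid duplicates.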