[SOLVED] pvecm updatecerts -f - not working

Nice.
I assume this should be run on each host.

I also assume the steps are:
Download to /usr/share/perl5/PVE/Cluster/Setup.pm
Make executable.
Run it.
 
Hey! Sorry, I made it very terse since it was the minimum to include for them to take it in.

It's a standard patch file of the kind you'll encounter on Linux; it contains only the changes (the diff command produces these files). What that means is that you can just apply it to the original file, like so:
# patch file-to-apply-on patchfile

What it does: it takes the original file and changes only the lines that need changing (it's quite human-readable if you are curious). But because it is made in this rudimentary way, it really needs the original file it was based on, so beware that it was made on my system from the version mentioned in the Bugzilla post. Unless you have the same version, you would NOT want to apply the patch.
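For instance - just a sketch, and setup-pm.patch is only a placeholder for whatever you named the downloaded patch file - you can check which package owns the file on your node and let patch do a dry run first:

Code:
# which package ships this file (and roughly which version you are on)
dpkg -S /usr/share/perl5/PVE/Cluster/Setup.pm
pveversion -v | grep pve-cluster
# test whether the patch would apply cleanly, without changing anything yet
patch --dry-run /usr/share/perl5/PVE/Cluster/Setup.pm setup-pm.patch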

It's the most recent version as of today, also linked on GitHub. If you want to play it safe, check that it is the same file in the first place, then back up yours (just cp Setup.pm Setup.pm.orig would do). Incidentally, if you try to apply the patch to an already patched file, it will actually offer to reverse the previously applied changes.
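Put together, the apply step could look something like this (again a sketch, using the same placeholder patch file name as above):

Code:
cd /usr/share/perl5/PVE/Cluster
cp Setup.pm Setup.pm.orig        # the backup mentioned above
patch Setup.pm setup-pm.patch    # apply the changes
# patch -R Setup.pm setup-pm.patch would reverse them again if ever needed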

The files are Perl scripts, so you can leave the permissions as they are. You will be able to run-test the new code simply by running the pain-in-the-neck pvecm updatecerts. Unless it crashes (if it does, just revert the file back), you basically have the patched version of the command on THAT node. (You can then proceed to patch it on the others.) Unless you have a broken known_hosts file again, you do not need to run it at all - the point of the patch is that it should not get known_hosts broken again in the first place. The patched piece of code is also run when joining a blank node into the cluster, so if you were to add a new node, you would need to patch that one too and only then add it.
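A quick sanity check could be as simple as this (a sketch only; Setup.pm.orig is the backup from the step above):

Code:
pvecm updatecerts    # exercise the patched code path
# if it errors out, put the original back:
# cp /usr/share/perl5/PVE/Cluster/Setup.pm.orig /usr/share/perl5/PVE/Cluster/Setup.pm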

I realise it's quite tedious. Most people should really just wait - the PVE team should pull it into the next upgrade or whatever release cycle they have, and I would hope for a convenient apt update/upgrade fix of this later on. As long as you are happy now, you can just wait. Once the patched code is on the nodes, you would also not need the procedure we ended up using in this thread, simply because it should stop corrupting the files we all had to wipe clean and start over with.

And of course, the official word can only come from the PVE team, so it's your call. If you run production now and all is good, I would wait. Oh, and if anyone finds this post much later on, make sure the fix is not already included in your system - it should be, eventually. :)
 
Yes, everything has been working well again since the problem was solved.
About the only thing that remains unknown is if we'll be able to get to a DNS method rather than hostname and IP.
That should help prevent this though, not sure how often it happens.
 
Yes, everything has been working well again since the problem was solved.

This reminds me, if using this patched version, you also would not need to worry about naming a new node after a dead one - like no one ever should. :)

About the only thing that remains unknown is if we'll be able to get to a DNS method rather than hostname and IP.

They are not going to do that. If you read through the 3 Bugzilla reports, everything is going to stay as down-to-earth as possible (this is my takeaway, do not quote me on it, as I am not quoting anyone either ;)). The reasons given are rather non-descriptive (cluster stability, UDP latency on the network, etc.). Needless to say, I do not agree with any of that.

So the system is using HostKeyAlias entries as a crutch - an abused SSH feature that is meant for situations where your host's IP actually keeps changing, or where you run multiple SSH hosts on a single machine. It gets confusing for an average user: say you create a node, name it pve7 and put it on 10.0.0.7. That is all static config of the node - it is put into /etc/hosts and into some other places (corosync.conf). Then you have to make sure your DHCP does not give out that address, and the hostname will only "resolve" on that very node (it pulls it from the hosts file). If you manually try to e.g. ssh pve3, it will not successfully look up that node's IP anywhere. So you either connect by the IP, and the built-in tools even add the HostKeyAlias option to the SSH command (on second thought, I wonder what use that was - the IP entry and the alias entry at the same time). At the beginning I thought I was missing something about this setup. I do not think that anymore. (Anyone who finds this post and wants to enlighten us all - happy to read it.)
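For illustration - roughly the kind of invocation the built-in tooling ends up making for the example node above (simplified, not copied from the actual code):

Code:
# connect by IP, but record/verify the host key under the alias "pve7" instead of under the IP
ssh -o HostKeyAlias=pve7 root@10.0.0.7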

That should help prevent this though, not sure how often it happens.

It should not happen unless the host key changes or extra known_hosts entries get appended. But you do not really have all that much control over that. That's another complication with host key aliases that look like DNS hostnames - they will clash. But I would be going off on a tangent here. ;)
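Just to illustrate the clash (hypothetical known_hosts content, keys shortened): the entry recorded via the alias and an entry recorded when someone later connects to a real host of the same name end up competing for the same name:

Code:
pve7 ssh-rsa AAAAB3...key_recorded_via_HostKeyAlias
pve7 ssh-rsa AAAAB3...different_key_recorded_after_connecting_to_a_real_DNS_name_pve7

SSH then starts warning about a changed host key for pve7 even though nothing malicious happened.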

Anyhow, I am glad it was such a small thing to fix; what I do not understand is how it was not a problem for 10+ years. :rolleyes:
 
It's a big mention in my notes now: do not re-use the hostname and IP when rebuilding or adding new hosts.

All good and super happy things are working again thanks to your help.
 
Sadly it's not really just that - sometimes you may end up with a changed host key. It happens in certain scenarios when your machine's host key regenerates (usually not on a regular update on physical hardware). That in turn would cause the same problem. Or simply someone ever accessing something by the same actual DNS hostname as the alias ... and you would have started getting the same results. There was way more than one person running into this... https://forum.proxmox.com/threads/p...-in-cluster-after-upgrade.133030/#post-606358

Yep, so ... have a good one! :)
 
I tried my best: there's the tutorial, there's the overview of other options (cleaning it all up, manually removing it every time this happens) and there's the bug report with the patch. Instead of pulling the patch, it has now moved on to live a life of its own on the pve-devel list, as if nothing was a bug:
https://lists.proxmox.com/pipermail/pve-devel/2023-December/061176.html

So, speaking of a 10-year-old bug: bundle it with other stuff and call it "improving SSH handling". The issue is, it's an RFC, so it will take a while and it may still not even make it into the next major version.

I tried to point out in the tutorial thread how many times this has been appearing on the forum [1], but people do not even +1 the Bugzilla issues themselves - such is the community of PVE users. So I have to assume it's all hobbyists or risk-takers who could not afford the other alternatives.

I found some other bugs in PVE's Perl scripts, but I did not even bother posting them on Bugzilla anymore. Don't get me wrong, for a homelab it's good enough. Especially without clustering or HA it's alright; it's just that there are other open-source alternatives for that which are not as buggy either.

[1] https://forum.proxmox.com/threads/s...ass-ssh-known_hosts-bug-s.137809/#post-615906
 
I spent some hours debugging this issue, was going crazy, and I solved it on my 3-node cluster this way: https://forum.proxmox.com/threads/c...ask-error-migration-aborted.42390/post-619486

Multiply your hours by the number of people who have encountered this, add some of the staff hours spent (wrongly) advising them in the past, and it tells you something ...

Hope it doesn't trigger again.
Due to the nature of the bug, it unfortunately will, one way or another, once keys get regenerated; the certs tutorial was meant to avoid exactly that, because they are not fixing it anytime soon.
 
As a SaaS owner myself, I know how people are when things don't work or cause them frustration. They simply leave, most of the time without even asking for help, and then trash the service/product in forums.

Not sure why the devs of Proxmox would even take such a chance. Proxmox has a healthy community and tons of input to be used.

Makes no sense to me.

Also, I didn't even know there are alternatives. I've never looked beyond Proxmox once I found it thinking of moving from vmware.
I use Proxmox in production but I've also never gotten rid of vmware yet. I do/did plan on subscriptions but am a little nervous lately.
 
As a SaaS owner myself, I know how people are when things don't work or cause them frustration. They simply leave, most of the time without even asking for help, and then trash the service/product in forums.

I did not even know PVE 3 months ago. I looked around, downloaded PVE and tried to set up a cluster. Before considering it I wanted to test it thoroughly, so I went on to dismantle it, recreate it, add nodes, mimic node deaths, etc. Sure enough, in a day or two I had replications failing for no apparent reason, which then likely (?) caused a node to self-fence and occasionally reboot when it could not restart the said VM because of the missing replica. That is how I went on to discover the bug. This is what a QA department is for; if not, it gets caught by the community - which it was - and then it has to be acted on. I did not move on to use PVE for anything in production. It is usable, but it lacks QC; I will not bash it per se, but I will caution everyone who asks me about it.

Not sure why the devs of Proxmox would even take such a chance. Proxmox has a healthy community and tons of input to be used.

The only good thing I found about it all is that it was allowed to stay on the forum here so that others can find it. I say "only" because I had name-dropped everyone including the CEO, and nothing really happened in terms of pulling the patch (the low-hanging fruit). I finished my rant in an unrelated long-lived bug/issue/feature request here:
https://forum.proxmox.com/threads/vzdump-unexpected-status.125476/page-2#post-617106

Makes no sense to me.

I can only assume there are even more serious problems to deal with, but then I see git commits from translation typo hunters. Odd.

Also, I didn't even know there are alternatives. I've never looked beyond Proxmox once I found it thinking of moving from vmware.
I use Proxmox in production but I've also never gotten rid of vmware yet. I do/did plan on subscriptions but am a little nervous lately.

I did know; I came to evaluate PVE alongside other lower-cost options. PVE does not really stand up against ESXi - as in, unless it's SOHO use, admins do not really mass-migrate from THAT. But I will not be posting the other alternatives (open source too) in a public forum post, since I kind of keep some standards for myself (even though I have had nothing deleted here on the forum).
 
I can appreciate your not posting the others and I'm glad you didn't.
Hope the devs will see this and someone pushes the others to do a little better.
I do see this being a solid competitor at some point, assuming they keep the costs low.
 

You see, I have had many good interactions on the forum; the forum is probably the strongest aspect of Proxmox as a brand amongst e.g. hobbyists. I have also had staff reply to me many times in a more than helpful way, and I can only imagine they try even harder with support tickets, but ... I literally do not understand interactions where e.g. a simple docs update would save everyone a lot of confusion, e.g.:

https://forum.proxmox.com/threads/n...m-host-is-pve-8-0-4.132620/page-2#post-617051

The answer I got is in the style I have seen given to some other people several times before. At the same time, I sometimes wonder how come the devs have so much time to spend on the forum (including answering repetitive questions). The only thing I could think of is that being on the forum is literally a KPI, including for the devs, so you go and answer as many threads as possible. The issue I take with that is that it does not seem to serve any purpose (for the in-house know-how). For instance, if I encountered many inquiries about the same thing on the forum, not only would I get tired of it as a dev, I would also want to make sure that our docs, or a simple if-then-else check in a script, avoid the most common reasons people come to the forum (including with the silly questions). But that does not seem to be part of the internal culture, almost as if some concepts were untouchable.

I understand I am not the nicest person in the room, but I do not consider myself rude, just blunt. When someone bothers to answer my thread (which I am not entitled to), why not at least address why I am supposedly wrong? Even just a statement "our position is that ..." ... instead, silence. Much like after the patch appeared in the bug report. And the next day another staff member was advising someone to run straight into the pvecm updatecerts bug (do they not know about it, after all the forum threads?) ...

https://forum.proxmox.com/threads/v...1-1-november-23-iso.137502/page-2#post-616090
 
Truth is, we'll never know unless a dev responds to this.
As a user, I can only imagine that they are prioritizing, perhaps not responding to some questions in favour of the ones that make them a lot of money on the paid-support side. That's the thing that sucks when using any free-with-paid-support platform.
 
So, if anybody runs into this: I couldn't get updatecerts to add keys for reinstalled nodes to the global /etc/pve/priv/ssh_known_hosts; however, the folder /etc/pve/nodes/<nodename> contains an ssh_known_hosts file with the content you need - copy it over and the world is good again.
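A rough sketch of what that could look like, using the paths as given above (back up the global file first; <nodename> stays a placeholder for the reinstalled node):

Code:
cp /etc/pve/priv/ssh_known_hosts /root/ssh_known_hosts.bak                        # keep a copy of the current global file
cat /etc/pve/nodes/<nodename>/ssh_known_hosts >> /etc/pve/priv/ssh_known_hosts    # append the per-node entries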
 

Your post put me on the right track, and it seems I'm able to connect via the WebGUI shell from any host to any host in the cluster now.

The problem was that two of my nodes were missing the ssh_known_hosts file in
Code:
/etc/pve/nodes/<node>/
(These were the hosts that gave me the KEY CHANGED warning in the WebGUI shell.)

I logged in to both troublesome nodes via an SSH terminal and copied the SSH public key from
Code:
/etc/ssh/ssh_host_rsa_key.pub
to the
Code:
/etc/pve/nodes/<node>/ssh_known_hosts
file, and added the node hostname at the beginning of the line, before the RSA public key, like so:

Code:
NodeHostname ssh-rsa <the_rsa_pub_key>

After that I restarted the SSH service (systemctl restart sshd) on both nodes (not sure if that was necessary).

This seems to have worked.
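If it helps anyone, the steps above could be condensed into something like this, run on each affected node (a sketch only, assuming the RSA host key and the paths mentioned above):

Code:
NODE=$(hostname)                                            # the node's short hostname
KEY=$(awk '{print $1, $2}' /etc/ssh/ssh_host_rsa_key.pub)   # key type + key, without the trailing comment
echo "$NODE $KEY" >> /etc/pve/nodes/$NODE/ssh_known_hosts   # NodeHostname ssh-rsa <the_rsa_pub_key>
systemctl restart sshd                                      # possibly not needed, as noted above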
 
