[TUTORIAL] [OBSOLETE] SSH Host Key Certificates - How to bypass SSH known_hosts bug(s)

That doesn't appear to work; it only seems to respond to the ssh_known_hosts file in the nodes directory. I'll write my own script to check and update these files.

I've created this bash script to append the CA public key if it has been removed, and installed it in the /etc/cron.hourly folder of one of the nodes.

Bash:
#!/bin/bash

# Directory holding the per-node files, and the marker string to search for
base_dir="/etc/pve/nodes"
search_string="cert-authority"
# Full @cert-authority line to (re-)append, e.g. "@cert-authority * ssh-ed25519 AAAA... SSH-CA"
ca_string="your-ca-custom-string"

# Find all ssh_known_hosts files in the nodes subdirectories
find "$base_dir" -type f -name "ssh_known_hosts" | while read -r file; do
    # Re-append the CA line only if it has been removed
    if grep -q "$search_string" "$file"; then
        echo "CA public key found in $file. Doing nothing."
    else
        echo "CA public key not found in $file. Appending CA public key."
        echo "$ca_string" >> "$file"
    fi
done

Hopefully, this should resolve it for now.
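
For anyone reusing the script, here is a hedged sketch of what the ca_string placeholder is meant to hold and how a node's host key gets signed in the first place. The CA key path, certificate identity and principals below are illustrative only, not taken from any particular setup:

Bash:
# Sign an existing node host key with your SSH CA (run wherever the CA private key lives):
ssh-keygen -s /root/ssh-ca/host_ca \
    -I "node1 host certificate" \
    -h \
    -n node1,node1.example.com \
    /etc/ssh/ssh_host_ed25519_key.pub
# -> writes /etc/ssh/ssh_host_ed25519_key-cert.pub next to the public key.

# The line the cron script appends (the ca_string) is then the CA *public* key,
# prefixed with @cert-authority and a host pattern, roughly:
#   @cert-authority * ssh-ed25519 AAAAC3Nza...host_ca_public_key... SSH-HOST-CA

Any node presenting a certificate signed by that CA for a matching host pattern is then accepted without an individual key entry.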

So I did not get a chance to test-lab the new version, but at least I found that the official rollout of the new "feature" is in v8.2:
https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_8.2

It is under "Improved management for Proxmox VE clusters" and completely misrepresents the issue, claiming it was allegedly caused by "conflicting hostkeys appeared in /root/.ssh/known_hosts" - either the person compiling the notes has no idea (more likely) or the person making the fix had no idea what was to be fixed (less likely).

Most confusingly, it states that "For existing clusters, pvecm updatecerts can optionally unmerge the existing /etc/ssh/ssh_known_hosts." - I wonder which case was yours, i.e. what exactly you had an issue with and under what circumstances (existing cluster, only some nodes updated, some symlinks retained but not others, etc.).

I noticed this thread popped up soon after the new release came out:
https://forum.proxmox.com/threads/pvecm-qdevice-setup-fails.88681/page-2#post-668408

I suspect you found a new bug, but you would need to file it as such and document how to reproduce it on: 1) an upgraded install; 2) a new install.

I would leave this thread here, for anyone willing to take over, feel free to contribute.
 
So, I got hit by this from the perspective of *ADD*ing a PVE 8.2 node to a 7.x cluster, with the intention of migrating the VMs/LXCs to it while decommissioning the old host it replaces, rinse and repeat for the rest of the new & old nodes.

So, yes, the challenge is to add a node to an old cluster you are migrating away from.
 

I suppose this has absolutely no relation to having used SSH certs in the past. Even if it might appear futile, please submit a proper report [1]. From what you describe it should be a new report, but related to 4252, 4886, etc.

Someone from PVE staff should have a look at it and consider whether this was intended behaviour or not, then adapt the release notes.

If no one does, it will be buried in this unrelated thread. The quality of alpha testing is incredible once again. Cordial thanks to the involuntary beta testers...

[1] https://bugzilla.proxmox.com/
 
I wanted to pop in here and ask about my cluster and my upgrade pathway.

I had to follow this originally when I set this cluster up, and now I'm afraid to update my nodes for fear it will break something important, as this cluster "inherited" critical workloads when other systems failed...
From someone who is more knowledgeable about this, what should I be worried about after having performed this workaround?

I note that I am on 8.2.4, and I am not even sure whether the fix for this from Proxmox came before or after that release...

I would really appreciate it if someone could weigh in on this. Thanks!
 
I wanted to pop in here and ask about my cluster and my upgrade pathway.

I had to follow this originally when I set this cluster up, and now I'm afraid to update my nodes for fear it will break something important, as this cluster "inherited" critical workloads when other systems failed...
From someone who is more knowledgeable about this, what should I be worried about after having performed this workaround?

This is my guide, so I guess I have to answer. :)

If you have additional keys in your known_hosts (signed keys are simply additional keys, like any others), it does not cause any problems. The system can simply recognise remote systems by the additional keys. Having multiple keys for the same remote system does not cause any issue either (the misinformation that it does was tacitly communicated by Proxmox as well).

I note that I am on 8.2.4, and I am not even sure whether the fix for this from Proxmox came before or after that release...

I would really appreciate it if someone could weigh in on this. Thanks!

If you are wondering whether, after upgrading Proxmox, you should remove the extra keys: personally, I probably would not. I still maintain that SSH certificates are superior to juggling individual keys. But after the upgrade, your PVE nodes will not use them to recognise the other nodes; they now use a completely custom ssh command line which ignores them.

If you want to remove the extra keys anyway, you can safely do so: simply revert the steps, i.e. remove the lines you had added to your files.
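
For completeness, a minimal sketch of what "remove the lines you added" could look like if you followed the cron script from earlier in this thread (it only touches @cert-authority lines; the backup location is an arbitrary choice):

Bash:
#!/bin/bash
# Revert the workaround: strip previously appended @cert-authority lines
# from the per-node files, keeping a backup of each file under /root.
base_dir="/etc/pve/nodes"

find "$base_dir" -type f -name "ssh_known_hosts" | while read -r file; do
    node="$(basename "$(dirname "$file")")"
    cp "$file" "/root/ssh_known_hosts.${node}.bak"
    # Rewrite the file in place without rename tricks (safer on /etc/pve):
    grep -v '^@cert-authority' "$file" > "/tmp/ssh_known_hosts.${node}.$$" || true
    cat "/tmp/ssh_known_hosts.${node}.$$" > "$file"
    rm -f "/tmp/ssh_known_hosts.${node}.$$"
done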

But this is an excellent point; perhaps I should publish a guide on how to set up SSH certificates, covering both known_hosts and authorized_keys (see the sketch after the links below), because Proxmox is not doing this and they have bugs all over the place for missing out on it, e.g.:

https://bugzilla.proxmox.com/show_bug.cgi?id=4670
https://bugzilla.proxmox.com/show_bug.cgi?id=5060
https://bugzilla.proxmox.com/show_bug.cgi?id=5736
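
Until such a guide exists, here is a rough sketch of the authorized_keys half (user certificates), to complement the host-key half shown earlier; the CA paths, the principal "root" and the validity period are placeholders, and nothing here is Proxmox-specific:

Bash:
# Sign a client key with a (separate) user CA, valid for one year for principal "root":
ssh-keygen -s /root/ssh-ca/user_ca -I "alice@workstation" -n root -V +52w ~/.ssh/id_ed25519.pub
# -> writes ~/.ssh/id_ed25519-cert.pub, which the ssh client presents automatically.

# On each server, either trust the CA globally in /etc/ssh/sshd_config:
#   TrustedUserCAKeys /etc/ssh/user_ca.pub
# or accept it per account with one line in ~/.ssh/authorized_keys:
#   cert-authority,principals="root" ssh-ed25519 AAAAC3Nza...user_ca_public_key... USER-CA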
 
Also, as for disentangling what Proxmox nodes were doing on your systems (merging ssh_known_hosts), see my comment in the BZ for this (now closed) bug:
https://bugzilla.proxmox.com/show_bug.cgi?id=4252#c32

If you ask when exactly this was fixed, I would need to go and check the Proxmox Git commit log, but this is my pet peeve: their own RELEASE NOTES are useless, e.g. consider:

https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_8.2

Improved management for Proxmox VE clusters

  • Modernize handling of host keys for SSH connections between cluster nodes (issue 4886). Previously, /etc/ssh/ssh_known_hosts was a symlink to a shared file containing all node hostkeys. This could cause problems if conflicting hostkeys appeared in /root/.ssh/known_hosts, for example after re-joining a node to the cluster under its old name. Now, each node advertises its own host key over the cluster filesystem. When Proxmox VE initiates an SSH connection from one node to another, it pins the advertised host key. For existing clusters, pvecm updatecerts can optionally unmerge the existing /etc/ssh/ssh_known_hosts.

You would not even think that there was any bug from documentation like this. You would not know which exact version the fix went into (it is also not mentioned in the BZ), and you would think that if
  • conflicting hostkeys appeared in /root/.ssh/known_hosts
... this actually caused any problem.

Let this be clear to everyone one last time: there's no such thing as "conflicting" keys in SSH known_hosts. Proxmox is wrong. I told them before; they ignore me. The bug here was caused by wrongly removing valid keys. You can have multiple keys for the same machine in known_hosts, no problem.
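
To illustrate the point (the host name and keys below are made up and truncated): several entries for the same host can sit side by side, and ssh is happy as long as the key the server offers matches one of them:

Bash:
# Both of these can coexist in /etc/ssh/ssh_known_hosts (or ~/.ssh/known_hosts):
#   node2 ssh-rsa AAAAB3Nza...older_key_still_in_use...
#   node2 ssh-ed25519 AAAAC3Nza...newer_key...
# The scary "REMOTE HOST IDENTIFICATION HAS CHANGED" warning only appears when the
# offered key does NOT match what is stored - extra entries are not "conflicts".
ssh-keygen -F node2 -f /etc/ssh/ssh_known_hosts   # list every stored entry for node2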
 
So, if I am interpreting this correctly, it should be fine to run updates, and they won't cause any issues even without taking any particular steps beforehand.

I have backups of everything, so if SHTF I can redeploy all the infrastructure.

Thanks.
 
So, if I am interpreting this correctly, it should be fine to run updates, and they won't cause any issues even without taking any particular steps beforehand.

In relation to this tutorial, of course. :) Yes, that's correct.

You can imagine I do not want to vouch for anything else (e.g. disentangling known_hosts after the upgrade - see the linked BZ closing this issue), but I take the calculated risk of stating that doing that does not cause a problem either.

Basically, PVE stopped using the common known_hosts file altogether. It used to be symlinked on all nodes to the shared /etc/pve version; now they do not do that anymore. If you yourself used it for anything (e.g. your own scripts), you might get extra prompts to verify remote hosts (for your scripts). But that is in relation to the new "unmerge" command of the new pvecm.

Nothing to do with this guide.

I have backups of everything, so if SHTF I can redeploy all the infrastructure.

You always should have. :)


You're welcome.
 
BTW, Proxmox broke some things which users previously expected. I just drop it here both for my own reference and in case anyone finds it later. Since the global known_hosts files are no longer synced across nodes, say you SSH into node1 and then want to simply SSH from there into node2. Previously, you would not get a prompt; now, the first time you do this, you will. And this applies to every possible combination of connections. The first thing that comes to mind is that if someone has e.g. Ansible on some of the nodes, used to set up the others, it will sooner or later start complaining about newly added nodes.

Proxmox do not care. And I told them about it:
https://lists.proxmox.com/pipermail/pve-devel/2024-January/061329.html
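
If this bites your own scripts or Ansible, one possible workaround (nothing PVE provides, just plain OpenSSH tooling with placeholder node names) is to pre-seed the system-wide known_hosts on each node once:

Bash:
#!/bin/bash
# Append host keys for every cluster node that is not yet known locally.
# Note: ssh-keyscan is trust-on-first-use - run it over a network you trust.
nodes="node1 node2 node3"   # adjust to your cluster

for n in $nodes; do
    if ! ssh-keygen -F "$n" -f /etc/ssh/ssh_known_hosts >/dev/null 2>&1; then
        ssh-keyscan -t ed25519 "$n" >> /etc/ssh/ssh_known_hosts 2>/dev/null
    fi
done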
 

Let this be clear to everyone one last time: there's no such thing as "conflicting" keys in SSH known_hosts. Proxmox is wrong. I told them before; they ignore me. The bug here was caused by wrongly removing valid keys. You can have multiple keys for the same machine in known_hosts, no problem.
Having gotten bitten by this myself, the sequence of events that peeved me, and that I believe was the reason for this change:

have a "working" /etc/ssh/known_hosts --symlink-> /etc/pve/priv/ssh/known_hosts (can't recall the exact symlink destination)

from a node you SSH to something/somewhere else, and the /etc/ssh/known_hosts (or was it /root/.ssh/known_hosts) gets renamed, and a new LOCAL file is created with new known_hosts entries.
Add a new node, /etc/pve/priv/ssh/known_hosts gets updated, but not the local node, and... ARGH!!!

Yeah, it is a pain, and something that Proxmox tries to solve with .pem/X.509 certificates. I'm halfway with Proxmox on this: rather do the "normal" thing a sysadmin would expect than try to be clever and break things for sysadmins.
 

Would you mind adding a note on this (what you would prefer) to the - so far very unproductive - BZs? (Anyone can create an account and post a comment.)

https://bugzilla.proxmox.com/show_bug.cgi?id=5804
https://bugzilla.proxmox.com/show_bug.cgi?id=5061
https://bugzilla.proxmox.com/show_bug.cgi?id=5736
https://bugzilla.proxmox.com/show_bug.cgi?id=4560


BTW, the behaviour you described is a combination of the pruning bug (the new key getting pruned instead of the old one) and people then running ssh-keygen -R on the symlink, which works as expected, in my opinion:
https://bugzilla.proxmox.com/show_bug.cgi?id=4252
https://bugzilla.proxmox.com/show_bug.cgi?id=4886

Cheers!
 
Will that bug address overwriting any user modifications to the ssh_known_hosts file, and the ability to specify an @cert-authority for the acceptance of signed host certificates?

It depends on what your take on this is. In which location does your ssh_known_hosts get overwritten on current PVE?
 
As I have used a CA to sign each of my nodes' SSH certificates, the public certificate presented by the SSH server for each node is different from the unsigned node certificate which Proxmox copies into the /etc/pve/nodes/nodename/ssh_known_hosts file.

Also, Proxmox defaults to populating the RSA public certificate in the ssh_known_hosts file, and I may have chosen to use an ED25519 public certificate instead. Even if I retained the RSA certificate, I would need to overwrite the existing unsigned certificate with the signed one rather than keeping it in a separate file. I prefer to keep a copy of the original certificate untouched so I don't need to regenerate it from the private key or risk the signed key being overwritten by some automated process.

Adding my @cert-authority to each of the /etc/pve/nodes/nodename/ssh_known_hosts files ensures that each of the nodes is recognised as legitimate by another node.

The use of a CA to sign SSH host certificates is good practice and there are also legitimate reasons why admins may wish to use different key algorithms other than RSA.
 
As I have used a CA to sign each of my nodes' SSH certificates, the public certificate presented by the SSH server for each node is different from the unsigned node certificate which Proxmox copies into the /etc/pve/nodes/nodename/ssh_known_hosts file.

I now noticed we had this conversation here back in May, sorry. :) Yes, PVE basically initialises these (what they call pinned) snippet files from the actual host keys, I believe at boot time now. There's nothing you can do about that; it's how their tooling goes about using SSH intra-cluster (it does not even matter, since their SSH helper for constructing the command always adds custom options to avoid using the global files).
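
Not claiming this is literally what the PVE helper passes, but to illustrate the general "pinning" idea: you can point a single ssh invocation at one of those per-node snippet files and ignore the global known_hosts entirely (the host name is a placeholder):

Bash:
# Verify node2 only against its pinned snippet file, ignoring all other known_hosts:
ssh -o UserKnownHostsFile=/etc/pve/nodes/node2/ssh_known_hosts \
    -o GlobalKnownHostsFile=/dev/null \
    -o HostKeyAlias=node2 \
    root@node2 hostname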

Also, Proxmox defaults to populating the RSA public certificate in the ssh_known_hosts file, and I may have chosen to use an ED25519 public certificate instead.

So if you are asking me whether you can override this for PVE nodes connecting to each other with their own SSH tooling: not really, other than with a custom patch. I had even previously filed a BZ on this. As you can tell, at Proxmox there are always enough resources to invest in declaring a request somehow invalid or not applicable rather than doing the actual work.

The irony of this is that the long-promised SSL-API-only approach has its own BZ too - see how active that one is. :D I always created a related BZ for any issue where "we will move away from SSH" was given as a reason; in the end I am now told I am creating too much spam with BZ updates.

At this point, I do not expect any help for someone like you (i.e. a user); consider the last reply on the bad release notes for the current SSH change we are now discussing here:

> Users are confused [1] with the 8.2 release notes on fix 4886 (URL):

I can not see any confusion in the linked thread that directly stems for the release note but rather your tutorial and your replies.
Yes, this is how @t.lamprecht replies to what they caused with this change and what we now have to discuss here.

Next thing I will be accused of hijacking (my own?) thread by talking about "my" issues, I suspect.

With that out of the way (yes, I am frustrated we have to discuss it here, NOT because of you @kesawi, but because they do not do their job):

Even if I retained the RSA certificate, I would need to overwrite the existing unsigned certificate with the signed one rather than keeping it in a separate file.

Not really - not if you don't care what the PVE tooling itself uses. I.e. if you only want your own SSH keys (signed or not makes no difference), you should be using the common system files (which do not get overwritten), e.g.:

/etc/ssh/ssh_host_ed25519_key.pub for your own host key
/etc/ssh/ssh_known_hosts, which you have to populate on every system (but not on every reboot), if it is a regular file, as it was always meant to be
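
For reference, a hedged sketch of how a signed host certificate ties into those standard files; the drop-in file name is arbitrary and nothing here depends on PVE:

Bash:
# /etc/ssh/sshd_config.d/host-cert.conf (or directly in sshd_config):
#   HostKey         /etc/ssh/ssh_host_ed25519_key
#   HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub

# Client side: one @cert-authority line in /etc/ssh/ssh_known_hosts covers all signed hosts:
#   @cert-authority *.example.com ssh-ed25519 AAAAC3Nza...host_ca_public_key... SSH-HOST-CA

sshd -t && systemctl reload ssh   # validate the config, then reload sshd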

I prefer to keep a copy of the original certificate untouched so I don't need to regenerate it from the private key or risk the signed key being overwritten by some automated process.

These should not be getting overwritten, or are they, for you?

Adding my @cert-authority to each of the /etc/pve/nodes/nodename/ssh_known_hosts files ensures that each of the nodes is recognised as legitimate by another node.

If there's still any confusion after the above, let's clarify what "each of the nodes is recognised as legitimate by another node" means for you. Do you mean manually invoking SSH from the command line without extra options, as a user would? That was my context above.

The use of a CA to sign SSH host certificates is good practice and there are also legitimate reasons why admins may wish to use different key algorithms other than RSA.

For your own needs, you can; for the PVE stack alone, they already gave me enough dismissive answers as to why they would not (those answers are technically nonsensical, so either they do not understand SSH certs or they do not think their users do; whichever it is, I find it funny at the least).
 
Will that bug address overwriting any user modifications to the ssh_known_hosts file, and the ability to specify an @cert-authority for the acceptance of signed host certificates?

Circling back to your original question, and/or to troubleshoot it: if you are putting it in /etc/ssh/ssh_known_hosts (not /etc/pve), can you confirm whether your file is or is not a symlink now?

/etc/ssh/ssh_known_hosts used to be a symlink; I think on upgrades that never changes, while on new installs it is not a symlink anymore.
 
Using /etc/ssh/ssh_known_hosts works, and I can happily use SSH to connect from any Proxmox node to any host on my network that is signed by my CA, including between Proxmox nodes. Everything works as required.

Where the issue arises is that the /etc/pve/nodes/nodename/ssh_known_hosts files are used for some PVE communications between nodes. If I'm accessing the web interface from node1 and try to open a console on node2, it will produce the remote host identification error. This appears to occur because node1 will use SSH to access node2 and verify its signature using the /etc/pve/nodes/node2/ssh_known_hosts file. However, as this file doesn't match the SSH certificate being presented by node2 (I'm using a CA-signed certificate rather than the unsigned certificate), it produces the error.

To attempt to overcome this, I signed the nodes' RSA public SSH host certificates with my CA and adjusted the configuration to use them instead of my ED25519 host certificates, as I figured Proxmox would just take the RSA public certificate and propagate it accordingly. However, it didn't work.

I can happily connect to the nodes from any client; however, the command pvecm updatecerts, which is used to copy a node's /etc/ssh/ssh_host_rsa_key.pub file to its /etc/pve/nodes/nodename/ssh_known_hosts file, errors out. This is due to the script /usr/share/perl5/PVE/Cluster/Setup.pm containing several checks to ensure the ssh_host_rsa_key.pub file matches the regex pattern ^(ssh-rsa\s\S+)(\s.*)?$ (i.e. it starts with ssh-rsa). However, a signed certificate starts with ssh-rsa-cert-v01@openssh.com, so it gets rejected and the script terminates.

I could potentially amend the script so the regex pattern is ^((?:ssh-rsa|ssh-rsa-cert-v01@openssh.com)\s\S+)(\s.*)?$, which would let it accept either a signed or unsigned RSA key while keeping the capture group the original pattern provides, but I'm not sure what the knock-on consequences would be.
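
If anyone wants to try that, here is a quick hedged sanity check of the amended pattern from the shell before touching Setup.pm (assuming the standard OpenSSH file locations for the plain key and the signed certificate):

Bash:
# Print "OK: <file>" for each file the amended regex would accept:
for f in /etc/ssh/ssh_host_rsa_key.pub /etc/ssh/ssh_host_rsa_key-cert.pub; do
    perl -ne 'print "OK: $ARGV\n" if /^((?:ssh-rsa|ssh-rsa-cert-v01\@openssh\.com)\s\S+)(\s.*)?$/' "$f"
done

Keep in mind that a manual edit of Setup.pm will likely be overwritten by the next package update.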
 