DISCLAIMER: This tutorial became obsolete due to changes rolled out by the PVE team, see comment #57.
DISCLAIMER: As became apparent over some troubleshooting under this post, please note that there are also genuine cases where networking or SSH itself has been misconfigured and the error message you encountered is real, i.e. you actually are connecting to the wrong host. If you properly followed this tutorial for what you suspected was just the PVE bug and it did not resolve your issue, you will have to continue troubleshooting further. This tutorial does not cause further misconfiguration; it simply adds an alternative means of host verification that bypasses the bug.
As of PVE 8.1, there's still a bug where running pvecm updatecerts deletes all but the oldest (instead of the newest) SSH keys from the shared cluster-wide known_hosts file. This then causes issues manifesting as WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! together with Offending RSA key in /etc/ssh/ssh_known_hosts:$lineno and the suggestion to remove with: ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "$alias", which, if followed, breaks the symlink into pmxcfs and makes one dig even deeper into the troubleshooting rabbit hole.
This is a simple, streamlined process to either prevent this issue or get out of it without having to do potentially risky Perl file patching or deleting keys one might have wished to retain. It bypasses the known_hosts corruption issue by using SSH certificates for remote host authentication; it does NOT change the behaviour in relation to user authorisation (the authorized_keys file).
This assumes the cluster is otherwise healthy, with quorum and no connectivity issues, apart from the disrupted SSH connections, e.g. proxying console/shell, secure local-storage migration and replication, but also QDevice setup. See also [1].
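As a side note: if you already ran the suggested removal command and are unsure whether the shared file is still the symlink into pmxcfs, a quick check and, if needed, restore could look like the following - a sketch assuming the stock PVE layout where /etc/ssh/ssh_known_hosts points at /etc/pve/priv/known_hosts:
Code:
# readlink -f /etc/ssh/ssh_known_hosts
/etc/pve/priv/known_hosts
# ln -sf /etc/pve/priv/known_hosts /etc/ssh/ssh_known_hosts
If readlink does not show the path into /etc/pve, consider backing the plain file up first, as restoring the symlink makes the cluster-wide copy take over again.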
There's an existing Certificate Authority (CA) used in PVE - see also [2] - currently only for SSL connections, but as SSH certificates are nothing more than CA-signed SSH keys with associated IDs (principals), it is easiest to reuse this very CA (see note (i)):
In any single node's root shell, perform this once (the location is shared across all nodes in the cluster):
Code:
# openssl x509 -in /etc/pve/pve-root-ca.pem -inform pem -pubkey -noout | ssh-keygen -f /dev/stdin -i -m PKCS8 > /etc/pve/pve-root-ca.pub
# echo "@cert-authority * `cat /etc/pve/pve-root-ca.pub`" >> /etc/ssh/ssh_known_hosts
This converts the CA certificate's public key into the format SSH expects and ensures any current or future host key signed by this CA is recognised as valid by every node of the cluster, even if other, conflicting entries are present.
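To double-check that the conversion worked and the entry landed in the shared file, something along these lines can be used (purely a sanity check, not part of the setup):
Code:
# ssh-keygen -l -f /etc/pve/pve-root-ca.pub
# grep '^@cert-authority' /etc/ssh/ssh_known_hosts
The first command prints the fingerprint of the converted CA key, the second should show the @cert-authority line carrying that very key.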
On each individual node (you may want to automate this in case of a large cluster), the respective host key then needs to be signed and the certificate configured for the node:
Code:
# ssh-keygen -I `hostname` -s /etc/pve/priv/pve-root-ca.key -h -n `(hostname -s; hostname -f; hostname -I | xargs -n1) | paste -sd,` /etc/ssh/ssh_host_ed25519_key.pub
# echo "HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub" >> /etc/ssh/sshd_config.d/PVEHostCertificate.conf
The signing command above makes use of the node's Ed25519 host key; the signature itself, however, is made with the CA's RSA (albeit 4096-bit) key. If you have any specific reason, you may of course opt for any other of the host keys in /etc/ssh/ to be used here, not necessarily the Ed25519 one. See also note (ii).
Note: Remember that the sshd service needs to be restarted for the changes to take effect.
And that's it! From now on, your nodes will always be able to SSH to each other. The only annoyance is that every future node needs to have the two-liner executed once as well. Again, this is best automated, as it does not interfere with the rest of PVE's internals. There are no caveats to this; even if you do not sign a future node's key, things will still work as long as PVE manages to find an individually recognised key on record. And if you do encounter the bug in pvecm updatecerts, it will not disrupt connections to those nodes whose host keys were signed, as the buggy tool safely ignores @cert-authority entries in the known_hosts file.
One final note on how PVE makes use of the HostKeyAlias option for SSH connections. This option is always used for e.g. migrations/replications and makes the client look up that specific ID in the known_hosts file irrespective of the hostname or IP address of the node being connected to. If the IDs (principals) listed in the signed key (see note (ii)) include this alias, everything will keep working as expected, i.e. it will even work if this is your x-th time introducing a cluster node under the same name (as some dead nodes used to have), as long as its host key is signed. Any leftover keys on record are then safely ignored, as they should have been to begin with.
If you end up with multiple records present under the same name that is also an ID listed in the key signed by the CA, the signed key will take precedence, as can be checked:
Code:
# ssh -vvv -o HostKeyAlias=$alias $ipaddress
...
debug1: Found CA key in /etc/ssh/ssh_known_hosts:$lineno
debug3: check_host_key: certificate host key in use; disabling UpdateHostkeys
If, however, you failed to list the ID under which your node is recognised by PVE, you will see a failure (only in cases where it would have failed anyway due to the bug):
Code:
# ssh -vvv -o HostKeyAlias=$alias $ipaddress
...
debug1: Host '$alias' is known and matches the ED25519-CERT host certificate.
debug1: Found CA key in /etc/ssh/ssh_known_hosts:$lineno
Certificate invalid: name is not a listed principal
debug1: No matching CA found. Retry with plain key
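Should you run into this, the straightforward fix is to re-sign the host key with the complete principal list - this time making sure it includes the name PVE uses as the HostKeyAlias, which is typically the node's short hostname - and restart sshd, i.e. re-run the signing command from above on the affected node:
Code:
# ssh-keygen -I `hostname` -s /etc/pve/priv/pve-root-ca.key -h -n `(hostname -s; hostname -f; hostname -I | xargs -n1) | paste -sd,` /etc/ssh/ssh_host_ed25519_key.pub
# systemctl restart sshd
The existing ssh_host_ed25519_key-cert.pub simply gets overwritten by the newly signed one.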
TESTED ON: pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.5.11-6-pve)
Related bug reports: #4252, #4886
References:
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_role_of_ssh_in_proxmox_ve_clusters
[2] https://pve.proxmox.com/wiki/Certificate_Management
Notes:
(i) If you want to know how much validity is left on the CA, feel free to check with openssl x509 -in /etc/pve/pve-root-ca.pem -text -noout; it is nominally 10 years as generated by PVE, so CA rotation is not in scope of this tutorial either.
(ii) If you wish to double-check that all the correct IDs (principals) were included in the signed key, you can do so with ssh-keygen -L -f /etc/ssh/ssh_host_ed25519_key-cert.pub. The hostname, the FQDN as well as all IP addresses should be listed there. You can, of course, change this by editing the list passed to the -n option of ssh-keygen. Please also note there's absolutely no expiry defined for these certificates, which mimics the default behaviour of PVE regarding SSH key handling.
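If you would rather have the certificates expire, ssh-keygen supports a validity interval via its -V option; e.g. a variant of the signing command above limited to roughly one year (you would then need to re-sign and restart sshd before it runs out) - purely optional, nothing in PVE requires it:
Code:
# ssh-keygen -I `hostname` -s /etc/pve/priv/pve-root-ca.key -h -V +52w -n `(hostname -s; hostname -f; hostname -I | xargs -n1) | paste -sd,` /etc/ssh/ssh_host_ed25519_key.pub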