BUG / MISSING FEATURE: Host key verification fails after adding node to existing cluster

dmulk

Member
Jan 24, 2017
All,

I've just added a new (5th) node to my existing PVE 4.4-15 cluster, and when I attempt to live migrate a VM to the new node it fails with "Host key verification failed".

My cluster is configured with a dedicated, separate network for cluster communication. All nodes can ping each other and have access to all shared storage.

I used the following command syntax to join the node:

pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0


Have any of you encountered this?

Thanks,

<D>
 
A bit more info....

When I trigger the live migration, the syslog shows the IP address of vmbr0 on the source attempting an SSH session, and NOT the dedicated interface or address range...

I just confirmed that /etc/pve/datacenter.cfg is configured with the correct range: "migration: secure,network=<CORRECT NETWORK>"
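For reference, the relevant line looks something like this (the 192.0.2.0/24 range is just a placeholder, not my real network):

Code:
# what I have in /etc/pve/datacenter.cfg (192.0.2.0/24 stands in for my actual range)
cat /etc/pve/datacenter.cfg
migration: secure,network=192.0.2.0/24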
 

We first connect over SSH on the cluster IP and retrieve the IP of the target node on the migration network, so SSH access needs to work for both IPs. Could you post the full log?
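You can test both paths from the source node with something like this (substitute the target node's two addresses for the placeholders):

Code:
# migration needs prompt-free SSH to BOTH addresses of the target node;
# BatchMode makes ssh fail instead of asking to accept an unknown key
ssh -o BatchMode=yes root@CLUSTER-IP /bin/true
ssh -o BatchMode=yes root@MIGRATION-IP /bin/true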
 
Happy to provide this information, but I need some clarification: do you want the syslog from the machine initiating the migration, the syslog from the receiving machine, or BOTH?

<D>
 
So... if I ssh root@<CLUSTERIPOFNEWNODE> once and select YES to add the ECDSA key to the list of known hosts, I'm able to migrate from this particular existing node to the new node.

When I attempted to migrate back it failed... so I did the same in reverse (allowed the ECDSA key from the existing node to be added to the new node) and it then allowed me to migrate back.

The other 3 existing nodes are in the same state, so it appears that the SSH host keys are not being replicated across the cluster. Where should I start looking to troubleshoot?

Thanks!
<D>
 

The host keys of all hosts should be in /etc/pve/priv/known_hosts, and on every node /etc/ssh/ssh_known_hosts should be a symlink pointing to that location. The keys are added when joining a node to a cluster, so if you regenerate the keys afterwards you need to add the new ones (and remove the old ones ;))
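A quick way to check is something like the following on each node (the arrow in the comment is what a healthy node should show):

Code:
ls -l /etc/ssh/ssh_known_hosts
# expected: /etc/ssh/ssh_known_hosts -> /etc/pve/priv/known_hosts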
 
I can confirm that those symlinks do exist and the files do appear to be the same, but it's not working the way you're explaining it.

To be clear, I have not regenerated any keys...

When I joined the cluster originally I used: pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0

So there isn't any synchronization after a node is joined to the cluster?

Very confused. What simple thing am I missing here?
 
Also, when you say that if keys are regenerated they must be added and removed... do you mean on only ONE node, or manually on EACH and EVERY node?

This would seem to indicate that after a node is joined there isn't any synchronization of known hosts...
 

/etc/pve is synchronized, so if you have the symlink, all nodes have access to the known_hosts file in /etc/pve/priv and you only need to change that one. But if you manually connect and accept a host key, it will only be added on that host in /root/.ssh/known_hosts, which is not synchronized.
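To see what a node trusts locally but the cluster-wide file is missing, something along these lines on each node should show the difference:

Code:
# lines only in the first file are the host keys you accepted by hand
diff /root/.ssh/known_hosts /etc/pve/priv/known_hosts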
 
So I basically need to pick the updated /root/.ssh/known_hosts file from one of the hosts, copy it into /etc/pve/priv on that host, and that should update the rest of the cluster?

<D>
 

No, because they will be different on all nodes. You need to add the missing lines from your nodes' /root/.ssh/known_hosts to the known_hosts file in /etc/pve/priv. Like I said, we do this automatically when joining a node to a cluster, so I don't really know how you ended up in this situation.
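A rough sketch of that merge, repeated on each node that has local-only entries (since /etc/pve is replicated, the changes propagate automatically; review the merged file before relying on it):

Code:
# append the local entries to the cluster-wide file, then de-duplicate
cat /root/.ssh/known_hosts /etc/pve/priv/known_hosts | sort -u > /tmp/known_hosts.merged
cp /tmp/known_hosts.merged /etc/pve/priv/known_hosts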
 
Just circling back on this... It's still a problem, and I'm having a hard time figuring out how to fix it surgically.

I've been limping along for all of this time but I need to add 5 new nodes and I want to make sure this is solved before I do that.

Right now, the 5 nodes I have are all in various states.

Is there some way to fix this without killing the active environment? Can I regenerate keys?

It might be noteworthy to mention that I'm also running Ceph on all of these nodes (if that matters).

Any help is appreciated.

Thanks,
Dan
 
Still not sure how we got to this state...

A bit more info:

I can't seem to SSH from most nodes to the others without being prompted to accept an SSH host key. By comparison, on another PVE cluster in our environment, all nodes can SSH into all nodes with no problem.

On the "working" cluster, there is no "known hosts" file in the local ssh folder on any of the nodes.
On the "broken" cluster, there is a "known hosts" file in the local ssh folder on all of the nodes. They are all different.

Will deleting all of the various known_hosts files in the local /root/.ssh folders on all of the nodes clean this up?

Thanks,
Dan
 
More info...

On one node of the "broken" cluster I deleted the known_hosts file in the local /root/.ssh folder. I then performed a live migration of a VM from one node to this node. It succeeded. I then attempted to live migrate back to the original node, and it failed. After restoring the known_hosts file to the local /root/.ssh folder I was again able to successfully migrate the VM back.

So, in my case, it appears that live migration relies on the known_hosts file in the local /root/.ssh folder on my broken cluster.

In the working cluster there are no known_hosts files in this folder.

I'll also add that I am using a dedicated migration/cluster interface in the broken cluster, and not in the working cluster, if that matters.

How the heck do I fix this?

Thanks!
Dan
 
More info:

I've noticed that I can SSH from all nodes if I use root@HOSTNAME, but it prompts to accept a key if I use root@FQDN... DNS seems to factor in here... hmmm...
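If it helps, this is how I'm comparing what each name resolves to (NODE-HOSTNAME/NODE-FQDN are placeholders; a mismatch here would send SSH to an address that has no known_hosts entry):

Code:
getent hosts NODE-HOSTNAME
getent hosts NODE-FQDN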
 
More info:

I still don't know why this happened... but I now suspect this is related to the fact that there don't appear to be any known-host entries for the dedicated cluster network.

Running against my dedicated cluster network (default behavior) (FAILS):

Code:
# /usr/bin/ssh -v -o 'BatchMode=yes' root@192.X.X.X /bin/true
OpenSSH_6.7p1 Debian-5+deb8u4, OpenSSL 1.0.1t 3 May 2016
debug1: Reading configuration data /root/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to 192.X.X.X [192.X.X.X] port 22.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: identity file /root/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u4
debug1: Remote protocol version 2.0, remote software version OpenSSH_6.7p1 Debian-5+deb8u4
debug1: match: OpenSSH_6.7p1 Debian-5+deb8u4 pat OpenSSH* compat 0x04000000
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr umac-64-etm@openssh.com none
debug1: kex: client->server aes128-ctr umac-64-etm@openssh.com none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ECDSA BLAH BLAH BLAH
Host key verification failed.


Running against the VM Network (SUCCEEDS):

Code:
# /usr/bin/ssh -v -o 'BatchMode=yes' root@10.X.X.X /bin/true
OpenSSH_6.7p1 Debian-5+deb8u4, OpenSSL 1.0.1t 3 May 2016
debug1: Reading configuration data /root/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to 10.X.X.X [10.X.X.X] port 22.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: identity file /root/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u4
debug1: Remote protocol version 2.0, remote software version OpenSSH_6.7p1 Debian-5+deb8u4
debug1: match: OpenSSH_6.7p1 Debian-5+deb8u4 pat OpenSSH* compat 0x04000000
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr umac-64-etm@openssh.com none
debug1: kex: client->server aes128-ctr umac-64-etm@openssh.com none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: RSA BLAH BLAH BLAH
debug1: Host '10.X.X.X' is known and matches the RSA host key.
debug1: Found key in /etc/ssh/ssh_known_hosts:8
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /root/.ssh/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 279
debug1: Authentication succeeded (publickey).
Authenticated to 10.X.X.X ([10.X.X.X]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env LANG = en_US.UTF-8
debug1: Sending command: /bin/true
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0
debug1: channel 0: free: client-session, nchannels 1
Transferred: sent 3056, received 2164 bytes, in 0.0 seconds
Bytes per second: sent 694091.8, received 491496.9
debug1: Exit status 0

To be clear, if I SSH from the host I want to migrate from to the host I want to migrate to, and accept the ECDSA key entry (which adds the key to the known_hosts file in ~/.ssh/known_hosts), migration will work over the dedicated cluster network.

It seems like the ECDSA key should be added to the clustered /etc/pve/priv/known_hosts...but that doesn't seem to be happening.
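As a workaround I'm considering something like the following from any one node; note that ssh-keyscan blindly trusts whatever key the target presents, so this only makes sense on a network you control (192.X.X.X is the target node's ring0 address):

Code:
# fetch the host key as presented on the cluster network and append it
# to the cluster-wide file, which /etc/pve replicates to all nodes
ssh-keyscan -t ecdsa 192.X.X.X >> /etc/pve/priv/known_hosts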

I followed this guide to configure the dedicated cluster network:

https://pve.proxmox.com/wiki/Separate_Cluster_Network

Now what?

<D>
 
So, here's a summary of where I'm at so far:

I created the original cluster following the Proxmox instructions with a single IP range. After further research I found that Proxmox recommends a dedicated cluster network, so I followed their instructions to configure corosync on a second IP range. After this change I found that migration was borked. After some investigation I found that SSH host key verification was broken. After further research I found the easiest solution was to create a new known_hosts in /etc/pve/priv with the new IP addresses.
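For the record, the rebuild amounted to something like this (the address list stands in for my real ring0 addresses; same trust caveat as any ssh-keyscan use):

Code:
# collect the host keys for every node's cluster-network address
for ip in 192.X.X.1 192.X.X.2 192.X.X.3 192.X.X.4 192.X.X.5; do
    ssh-keyscan -t ecdsa,rsa "$ip"
done >> /etc/pve/priv/known_hosts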

It appears that Proxmox (at least at the time I created and configured my environment, about a year ago) does not have a mechanism or documentation to gracefully handle updating the clustered known_hosts when switching to a dedicated cluster OR migration network.

I'm not sure whether this has changed, but now that I fully understand the issue, I'll be adding an additional 5 nodes to my cluster and will report back if the problem persists. If it does, I'll open a ticket to report it.

Cheers,
<D>
 
Can anyone else confirm that using this command to add a new node to an existing cluster actually adds the node's cluster network address to the /etc/pve/priv/known_hosts file?

pvecm add <IP addr of a cluster member> -ring0_addr <new nodes ring addr>

In my case it does not, which breaks migration and anything else that uses SSH on this path by default...
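If you want to check your own cluster, something like this on any node right after a join should show whether the new ring0 address was recorded:

Code:
# empty output means the cluster-network address never made it into the file
grep 'IP-ADDRESS-RING0' /etc/pve/priv/known_hosts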


<D>
 
Well... I'm talking to myself at this point, but I can confirm that it does not, on any of the new nodes I'm adding... =)
 
