BUG / MISSING FEATURE: Host key verification fails after adding node to existing cluster

Final summary:

If you add a node to an existing cluster that is configured to use a dedicated, separate cluster network, as described in this document:

https://pve.proxmox.com/wiki/Separate_Cluster_Network#Adding_nodes_in_the_future

Using this command:

pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0

The process uses the initial bridge IP for the entry it adds to /etc/pve/priv/known_hosts and fails to add the RSA host key information for the dedicated cluster IP, thus breaking migration, which defaults to the cluster network.
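
To confirm this on an affected cluster, you can check which addresses actually have entries in the cluster-wide known_hosts file. A minimal sketch, run from any existing cluster node (DEDICATED-CLUSTER-IP and BRIDGE-IP are placeholders for the new node's addresses, not literal values):

# Look up the host key entry (if any) recorded for the new node's dedicated cluster IP:
ssh-keygen -F DEDICATED-CLUSTER-IP -f /etc/pve/priv/known_hosts

# Compare with the entry recorded for the new node's bridge IP:
ssh-keygen -F BRIDGE-IP -f /etc/pve/priv/known_hosts

If the first command prints nothing while the second finds an entry, you are hitting this issue.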

WORKAROUND until this is sorted:

1) From any other existing cluster node, run the following command to force an RSA-based SSH connection: ssh -o HostKeyAlgorithms=ssh-rsa root@<DEDICATED CLUSTER IP of NEWLY ADDED NODE>

2) Accept the connection

3) Copy the newly added entry from ~/.ssh/known_hosts to /etc/pve/priv/known_hosts (a consolidated sketch of these steps follows below)

Migration should work.
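
For convenience, here is the workaround as shell commands, run on any existing cluster node (a minimal sketch; DEDICATED-CLUSTER-IP is a placeholder for the newly added node's dedicated cluster IP, not a literal value):

# 1) Force an RSA-based SSH connection and accept the host key when prompted:
ssh -o HostKeyAlgorithms=ssh-rsa root@DEDICATED-CLUSTER-IP exit

# 2) Copy the entry that was just added to root's known_hosts into the
#    cluster-wide known_hosts file:
ssh-keygen -F DEDICATED-CLUSTER-IP -f /root/.ssh/known_hosts | grep -v '^#' >> /etc/pve/priv/known_hosts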


I was having the same problem; the cause was different, but that's irrelevant. In case you are still struggling, I was able to fix it. So if you would like the fix, just know you're not talking to nobody, and your trial & error eliminated a lot of my initial guessing! So thank you :D

Original Setup:
{Node_Name} | {IPv4_Address} | {node_id}
FF-Node1 | 192.168.1.13 | id=001
FF-Node2 | 192.168.1.12 | id=003

I was adding a new node (using the same name/ID scheme), FF-Node3 | 192.168.1.11, to the cluster when the issue arose. At this point you may notice that my FF-Node3 and FF-Node2 do not have the correct {node_id}, which was a longstanding issue from when I originally set up the cluster long ago. There have only been 2 nodes for quite a while, and the cluster status/node output confirmed it. However, I added the new node 'FF-Node3 | 192.168.1.11' using `pvecm add 192.168.1.13`

Output of `pvecm nodes` when connected via SSH to root@192.168.1.13
{Node_Name} | {IPv4_Address} | {node_id}
FF-Node1 | 192.168.1.13 | id=001
FF-Node2 | 192.168.1.12 | id=003
FF-Node3 | 192.168.1.13 | id=002

Now you can probably guess I'm finally going to fix the Node_Name / IPv4_Address / node_id mismatch. Using that same SSH connection to .13, I fixed /etc/pve/corosync.conf:

logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: FF-Node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: FF-Node1
  }
  node {
    name: FF-Node2
    # nodeid: 3 <- before change
    nodeid: 2 # <- after change
    quorum_votes: 1
    ring0_addr: FF-Node2
  }
  node {
    name: FF-Node3
    # nodeid: 2 <- before change
    nodeid: 3 # <- after change
    quorum_votes: 1
    # ring0_addr: 192.168.1.11 <- before change
    ring0_addr: FF-Node3 # <- after change
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: FF-Farm
  config_version: 10
  interface {
    bindnetaddr: 192.168.1.13
    ringnumber: 0
  }
}
(after update)

I restarted corosync & proceeded to access the WebUI via FF-Node1 as the host. I always use FF-Node1's IP as the host to access the cluster; this is important.
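
For reference, a minimal sketch of that restart-and-verify step, run on the node where corosync.conf was edited (note that changes to /etc/pve/corosync.conf are only picked up if config_version in the totem section has been incremented):

# Restart corosync, then check that the cluster re-forms with the corrected node IDs:
systemctl restart corosync
pvecm status
pvecm nodes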
I then attempted to move an offline CT from FF-Node1 -> FF-Node2, which is when I started receiving MITM / SSH key mismatch errors.


TO FIX THE ISSUE:


Visit each host directly, then attempt to access every other host in the cluster. Using the WebUI via .13, I could access its VMs/CTs, but not Node2's or Node3's.

Use the output of the failed task; it should contain a suggested fix along the lines of:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:p2136C081mHeeXlW08xzV4YNz51rC/y2Z+NQWcb+hxo.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending RSA key in /etc/ssh/ssh_known_hosts:4
remove with:
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.11

So I ran the suggested removal fix for both the .11 and .12 hosts on the .13 host (FF-Node1). I then accessed the other hosts directly using the JS Shell (NOT the default one). The shell prompted me to accept a new key, as if I were connecting over SSH for the first time, & BAM! Full usability was restored. I then accessed FF-Node3 via the JS Web Shell, was again prompted, accepted, and came away with restored full WebUI functionality & fixed the bad SSH keys.
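
For reference, the removal step amounts to the following, run on the host whose WebUI you are using (taken from the suggestion in the error output above, applied to both offending IPs):

# Remove the stale host keys for the two offending hosts:
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.11
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.12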

I got curious and then accessed the cluster's WebUI via the FF-Node2 host (192.168.1.12). Sure enough, I could not access FF-Node1 or the new FF-Node3 via the VNCProxy shell and had to repeat the steps above, and then again when I accessed the WebUI via the FF-Node3 host.

So when I manually changed the cluster IDs, new SSH keys were generated (which makes sense, because now Node_ID != Node_IP).

@dmulk, yes, SSH keys are copied as nodes are added to clusters, and they are propagated to each host. The problem is that once I modified the cluster's corosync config & restarted it, the SSH keys were propagated for the WebUI as mentioned in the Wiki, but the PVE hosts still had the old ID/SSH key pairs.
Visit a host (e.g. 192.168.1.13 in my example), remove all offending SSH keys with the recommended ssh-keygen fix (for 192.168.1.11 and 192.168.1.12), then open the WebUI JS Shell for both offending hosts, FF-Node2 and FF-Node3; you should receive the 'accept SSH key' prompts, and afterwards you have access again.
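
Once that has been done from every host, a quick way to confirm that all the keys line up again is a loop like this (a sketch using my IPs; BatchMode=yes makes ssh fail instead of prompting, so any remaining mismatch shows up immediately):

# Run on each node; every hostname should print without a host key warning:
for ip in 192.168.1.11 192.168.1.12 192.168.1.13; do
  ssh -o BatchMode=yes root@$ip hostname
done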

Hope this helps OP! https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_configuration
 
