BUG / MISSING FEATURE: Host key verification fails after adding node to existing cluster

Final summary:

If you add a node to an existing cluster that is configured to use a dedicated, separate cluster network, as described in this document:

https://pve.proxmox.com/wiki/Separate_Cluster_Network#Adding_nodes_in_the_future

Using this command:

pvecm add IP-ADDRESS-CLUSTER -ring0_addr IP-ADDRESS-RING0

The process uses the initial bridge IP for the entry it adds to /etc/pve/priv/known_hosts and fails to add the RSA host key information for the dedicated cluster IP, thus breaking migration, which defaults to the cluster network.
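
To confirm this on an affected cluster, you can check which addresses actually have entries in the cluster-wide known_hosts file. A minimal sketch, run from any existing cluster node (DEDICATED-CLUSTER-IP and BRIDGE-IP are placeholders for the new node's addresses, not literal values):

# Look up the host key entry (if any) recorded for the new node's dedicated cluster IP:
ssh-keygen -F DEDICATED-CLUSTER-IP -f /etc/pve/priv/known_hosts

# Compare with the entry recorded for the new node's bridge IP:
ssh-keygen -F BRIDGE-IP -f /etc/pve/priv/known_hosts

If the first command prints nothing while the second finds an entry, you are hitting this issue.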

WORKAROUND until this is sorted:

1) From any other existing cluster node, run the following command to force an RSA-based SSH connection: ssh -o HostKeyAlgorithms=ssh-rsa root@<DEDICATED CLUSTER IP of NEWLY ADDED NODE>

2) Accept the connection

3) Copy the newly added entry from ~/.ssh/known_hosts to /etc/pve/priv/known_hosts (a consolidated sketch of these steps follows below)

Migration should work.
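
For convenience, here is the workaround as shell commands, run on any existing cluster node (a minimal sketch; DEDICATED-CLUSTER-IP is a placeholder for the newly added node's dedicated cluster IP, not a literal value):

# 1) Force an RSA-based SSH connection and accept the host key when prompted:
ssh -o HostKeyAlgorithms=ssh-rsa root@DEDICATED-CLUSTER-IP exit

# 2) Copy the entry that was just added to root's known_hosts into the
#    cluster-wide known_hosts file:
ssh-keygen -F DEDICATED-CLUSTER-IP -f /root/.ssh/known_hosts | grep -v '^#' >> /etc/pve/priv/known_hosts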


I was having the same problem; the cause was different, but that's irrelevant. In case you are still struggling, I was able to fix it. So if you would like the fix, just know you're not talking to nobody, and your trial & error eliminated a lot of my initial guessing! So thank you :D

Original Setup:
{Node_Name} | {IPv4_Address} | {node_id}
FF-Node1 | 192.168.1.13 | id=001
FF-Node2 | 192.168.1.12 | id=003

I was adding a new node (using the same name/ID scheme), FF-Node3 | 192.168.1.11, to the cluster when the issue arose. At this point you may notice that my FF-Node3 and FF-Node2 do not have the correct {node_id}, which was a longstanding issue from when I originally set up the cluster long ago. There have only been 2 nodes for quite a while, and the cluster status/node output confirmed it. However, I added the new node 'FF-Node3 | 192.168.1.11' using `pvecm add 192.168.1.13`

Output of `pvecm nodes` when connected via SSH to root@192.168.1.13
{Node_Name} | {IPv4_Address} | {node_id}
FF-Node1 | 192.168.1.13 | id=001
FF-Node2 | 192.168.1.12 | id=003
FF-Node3 | 192.168.1.13 | id=002

Now you can probably guess I'm finally going to fix the Node_Name / IPv4_Address / node_id mismatch. Using that same SSH connection to .13, I fixed /etc/pve/corosync.conf:

logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: FF-Node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: FF-Node1
  }
  node {
    name: FF-Node2
    # nodeid: 3 <- before change
    nodeid: 2 # <- after change
    quorum_votes: 1
    ring0_addr: FF-Node2
  }
  node {
    name: FF-Node3
    # nodeid: 2 <- before change
    nodeid: 3 # <- after change
    quorum_votes: 1
    # ring0_addr: 192.168.1.11 <- before change
    ring0_addr: FF-Node3 # <- after change
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: FF-Farm
  config_version: 10
  interface {
    bindnetaddr: 192.168.1.13
    ringnumber: 0
  }
}
(after update)

I restarted corosync & proceeded to access the WebUI via FF-Node1 as the host. I always use FF-Node1's IP as the host to access the cluster; this is important.
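
For reference, a minimal sketch of that restart-and-verify step, run on the node where corosync.conf was edited (note that changes to /etc/pve/corosync.conf are only picked up if config_version in the totem section has been incremented):

# Restart corosync, then check that the cluster re-forms with the corrected node IDs:
systemctl restart corosync
pvecm status
pvecm nodes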
I then attempted to move an offline CT from FF-Node1 -> FF-Node2, which is when I started receiving MITM / SSH key mismatch errors.


TO FIX THE ISSUE:


Visit each host directly, then attempt to access every other host in the cluster. Using the WebUI via .13, I could access its VMs/CTs, but not Node2's or Node3's.

Use the output of the failed task; it should contain a suggested fix along the lines of:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:p2136C081mHeeXlW08xzV4YNz51rC/y2Z+NQWcb+hxo.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending RSA key in /etc/ssh/ssh_known_hosts:4
remove with:
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.11

So I ran the suggested removal fix for both the .11 and .12 hosts on the .13 host (FF-Node1). I then accessed the other hosts directly using the JS Shell (NOT the default one). The shell prompted me to accept a new key, as if I were connecting over SSH for the first time, & BAM! Full usability was restored. I then accessed FF-Node3 via the JS Web Shell, was again prompted, accepted, and came away with restored full WebUI functionality & fixed the bad SSH keys.
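
For reference, the removal step amounts to the following, run on the host whose WebUI you are using (taken from the suggestion in the error output above, applied to both offending IPs):

# Remove the stale host keys for the two offending hosts:
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.11
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.12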

I got curious and then accessed the cluster's WebUI via the FF-Node2 host (192.168.1.12). Sure enough, I could not access FF-Node1 or the new FF-Node3 via the VNCProxy shell and had to repeat the steps above, and then again when I accessed the WebUI via the FF-Node3 host.

So when I manually changed the cluster IDs, new SSH keys were generated (which makes sense, because now Node_ID != Node_IP).

@dmulk, yes, SSH keys are copied as nodes are added to clusters, and they are propagated to each host. The problem is that once I modified the cluster's corosync config & restarted it, the SSH keys were propagated for the WebUI as mentioned in the Wiki, but the PVE hosts still had the old ID/SSH key pairs.
Visit a host (e.g. 192.168.1.13 in my example), remove all offending SSH keys with the recommended ssh-keygen fix (for 192.168.1.11 and 192.168.1.12), then open the WebUI JS Shell for both offending hosts, FF-Node2 and FF-Node3; you should receive the 'accept SSH key' prompts, and afterwards you have access again.
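
Once that has been done from every host, a quick way to confirm that all the keys line up again is a loop like this (a sketch using my IPs; BatchMode=yes makes ssh fail instead of prompting, so any remaining mismatch shows up immediately):

# Run on each node; every hostname should print without a host key warning:
for ip in 192.168.1.11 192.168.1.12 192.168.1.13; do
  ssh -o BatchMode=yes root@$ip hostname
done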

Hope this helps OP! https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_configuration
 
