I was having the same problem. The cause was different, but that's irrelevant. In case you are still struggling, I was able to fix it, so here is the fix. Just know you're not talking to nobody, and your trial & error eliminated a lot of my initial guesswork, so thank you!
Original Setup:
{Node_Name} | {IPv4_Address} | {node_id}
FF-Node1 | 192.168.1.13 | id=001
FF-Node2 | 192.168.1.12 | id=003
I was adding a new node, FF-Node3 | 192.168.1.11 (using the same name/ID scheme), to the cluster when the issue arose. You may notice that FF-Node2 and FF-Node3 do not have the {node_id} you'd expect; that's a longstanding issue from when I originally set up the cluster long ago. There had only been 2 nodes for quite a while, and the cluster status/node output confirmed it. However, I added the new node 'FF-Node3 | 192.168.1.11' using `pvecm add 192.168.1.13`.
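(For anyone replaying this: `pvecm add` is run from the node that is joining, pointed at an existing cluster member. Roughly, using my IPs as the example:)

# run on the NEW node (FF-Node3, 192.168.1.11), pointing at an existing member (FF-Node1)
root@FF-Node3:~# pvecm add 192.168.1.13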
Output of `pvecm nodes` when connected via SSH to root@192.168.1.13:
{Node_Name} | {IPv4_Address} | {node_id}
FF-Node1 | 192.168.1.13 | id=001
FF-Node2 | 192.168.1.12 | id=003
FF-Node3 | 192.168.1.11 | id=002
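(The table above is my own reformatting; the raw `pvecm nodes` output looks roughly like this, so don't expect IPs in it:)

Membership information
----------------------
    Nodeid      Votes Name
         1          1 FF-Node1 (local)
         3          1 FF-Node2
         2          1 FF-Node3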
Now you can probably guess that I was finally going to fix the node name / IPv4 address / node_id mismatch. Using that same SSH connection to .13, I fixed /etc/pve/corosync.conf:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: FF-Node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: FF-Node1
  }
  node {
    name: FF-Node2
    # nodeid: 3  <- before change
    nodeid: 2  # <- after change
    quorum_votes: 1
    ring0_addr: FF-Node2
  }
  node {
    name: FF-Node3
    # nodeid: 2  <- before change
    nodeid: 3  # <- after change
    quorum_votes: 1
    # ring0_addr: 192.168.1.11  <- before change
    ring0_addr: FF-Node3  # <- after change
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: FF-Farm
  config_version: 10
  interface {
    bindnetaddr: 192.168.1.13
    ringnumber: 0
  }
}

(config shown after my update)
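For anyone following along, applying the edit looked roughly like this (a sketch; the backup path is just an example, and remember to bump config_version whenever you change corosync.conf):

cp /etc/pve/corosync.conf /root/corosync.conf.bak   # keep a backup first (example path)
nano /etc/pve/corosync.conf                         # fix the nodeid / ring0_addr lines, bump config_version
systemctl restart corosync                          # restart corosync to apply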
I restarted corosync and proceeded to access the WebUI via FF-Node1 as the host. I always use FF-Node1's IP to access the cluster; this is important.
I then attempted to move an offline CT from FF-Node1 -> FF-Node2. That's when I started receiving MITM / SSH key mismatch errors.
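(For reference, the CLI equivalent of that move would be something like the following, with 101 being a made-up CT ID:)

pct migrate 101 FF-Node2   # offline migration of CT 101 (hypothetical ID) to FF-Node2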
TO FIX THE ISSUE:
Visit each host directly, then attempt to access every other host in the cluster. Using the WebUI via .13, I could access its VMs/CTs, but not Node2's or Node3's.
Use the output of the failed task; it should contain a suggested fix along the lines of:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:2136C081mHeeXlW08xzV4YNz51rC/y2Z+NQWcb+hxo.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending RSA key in /etc/ssh/ssh_known_hosts:4
remove with:
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.11
So I ran the suggested removal fix on the .13 host (FF-Node1) for both the .11 and .12 hosts. I then accessed the other hosts directly using the JS shell (not the default console). The shell prompted me to accept a new key, as if I were connecting for the first time, and BAM! I had restored access to FF-Node2. I then accessed FF-Node3 via the JS web shell, was prompted again, accepted, and came away with full WebUI functionality restored and the bad SSH keys fixed.
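Concretely, on FF-Node1 that removal step was just the command from the error message, run once per peer:

ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.11
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.12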
I got curious and then accessed the cluster's WebUI via the FF-Node2 host (192.168.1.12). Sure enough, I could not access FF-Node1 or the new FF-Node3 via the VNC proxy shell and had to repeat the steps above. The same thing happened again when I accessed the WebUI via the FF-Node3 host.
So when I manually changed the cluster IDs, new SSH keys were generated (which makes sense, because now Node_ID != Node_IP).
@dmulk, yes, SSH keys are copied as nodes are added to a cluster, and they are propagated to each host. The problem is that once I modified the cluster's corosync.conf and restarted corosync, the SSH keys were propagated for the WebUI as mentioned in the wiki, but the PVE hosts still contained the old ID/SSH-key pairs.
Visit a host (e.g. 192.168.1.13 in my example), remove all offending SSH keys with the recommended ssh-keygen fix (for 192.168.1.11 and 192.168.1.12), then open the WebUI JS shell to both offending hosts, FF-Node2 and FF-Node3. You should get the 'accept SSH key' prompts; after accepting, I had access again!
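If it helps, here is the whole fix as a quick per-host recipe (using my IPs; swap in the other two nodes' addresses when you repeat it from each host):

# on 192.168.1.13 (FF-Node1), for example:
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.11
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 192.168.1.12
# then, from this host's WebUI, open the JS shell to FF-Node2 and FF-Node3
# and accept the new host-key prompts; repeat on .12 and .11, targeting
# the other two IPs each time.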
Hope this helps OP!
https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_configuration