Whatever is happening, as we've come to conclude, is SSH-related, but there's something else to it as well.
This started happening right after I upgraded three of the nodes from 7.1.7 to 7.2 and re-installed one of the nodes, 04.
I have nodes 02, 03, 04 and 07.
I noticed the problem when I could not migrate guests between hosts, though of course I've since lost track of which hosts to which.
The problem simply won't go away. I could list everything that works and when it doesn't, but that only seems to confuse the people trying to help me, whose help I appreciate. It's confusing to me too, because I fix one or two nodes, then another stops working; I fix that one, then another breaks.
From each node, the storage can be seen from the command line.
From node 02:
# /usr/bin/ssh -o 'BatchMode=yes' 10.0.0.72 -- /usr/sbin/pvesm status --storage nfs-iso
# /usr/bin/ssh -o 'BatchMode=yes' 10.0.0.73 -- /usr/sbin/pvesm status --storage nfs-iso
# /usr/bin/ssh -o 'BatchMode=yes' 10.0.0.76 -- /usr/sbin/pvesm status --storage nfs-iso
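For what it's worth, the same check can be run against all three targets in one pass. A quick sketch, assuming the same IPs as above:
# for ip in 10.0.0.72 10.0.0.73 10.0.0.76; do echo "== $ip =="; /usr/bin/ssh -o 'BatchMode=yes' "$ip" -- /usr/sbin/pvesm status --storage nfs-iso; done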
# /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro03' root@10.0.0.72 /bin/true
# /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro04' root@10.0.0.73 /bin/true
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
# /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro07' root@10.0.0.76 /bin/true
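As I understand it (and I may be wrong here), the HostKeyAlias option makes ssh look the target up under the alias name rather than the IP in the known-hosts file, so the stored entries can be inspected directly. On these nodes /etc/ssh/ssh_known_hosts is, I believe, symlinked to the cluster-wide /etc/pve/priv/known_hosts, and ssh-keygen -F prints whichever entry matches:
# ssh-keygen -F pro04 -f /etc/ssh/ssh_known_hosts
# ssh-keygen -F 10.0.0.73 -f /etc/ssh/ssh_known_hosts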
Then I check from node 04, since node 02 complains it cannot reach it, though it's fine going from 04 to 02.
# /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro02' root@10.0.0.71 /bin/true
So back to node 02, I remove the key as suggested.
# ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "pro04"
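In hindsight, I suspect that only removes the entry stored under the alias pro04; if there is a separate entry stored under the raw IP, it presumably needs its own removal (an assumption on my part, not something the suggestion mentioned):
# ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "10.0.0.73"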
# ssh root@10.0.0.73
It logs me straight in as root, with no prompt to accept a host key or enter a password, so I exit.
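If I understand the mechanics right, that interactive login stored the new key under 10.0.0.73, not under the alias pro04 that the failing test looks up, which would explain what happens next. Accepting the key under the alias itself would presumably take something like this, run interactively (no BatchMode) so the prompt can be answered:
# /usr/bin/ssh -o 'HostKeyAlias=pro04' root@10.0.0.73 /bin/true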
I run the tests again:
# /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro04' root@10.0.0.73 /bin/true
Host key verification failed.
# /usr/bin/ssh -o 'BatchMode=yes' 10.0.0.73 -- /usr/sbin/pvesm status --storage nfs-iso
Name       Type   Status        Total     Used   Available      %
nfs-iso     nfs   active   3861157888  4948992  3856208896  0.13%
At this point, I have one more option that I know of: I can run 'pvecm updatecerts'. I've used this as well, trying to get things synced up, and I can get everything working from the command line.
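For reference, this is what I run on each node. As far as I can tell it regenerates the node certificates and rewrites the cluster-wide known-hosts file; some posts also suggest restarting pveproxy afterwards so the GUI picks up the new certificates, though I haven't confirmed that part myself:
# pvecm updatecerts
# systemctl restart pveproxy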
Then I go to the GUI and the whole thing starts all over again.
I'm obviously missing a step or doing things in the wrong order, but so far it's been non-stop whack-a-mole: I fix things from the command line only to see the problem show up again in the GUI. This thread hasn't helped me understand how things work yet, but I'm re-reading it to see what I might have missed.
All I can think of at this point is to migrate all guests away from one host, rebuild that host from scratch on 7.2, move all guests onto it, destroy the cluster, remove any mention of the other nodes from that single host (/etc/pve/nodes/<nodename>), rebuild all the other hosts, create a new cluster, migrate the guests back, etc.
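If I do go that route, the cleanup on the surviving host would presumably look something like this. A sketch only: 'pro03' is just an example name, and from what I've read, delnode must only be run once that node is powered off for good:
# pvecm delnode pro03
# rm -r /etc/pve/nodes/pro03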
Someone who is still learning, but reluctant to post in the forums too quickly for fear of being told they aren't learning, might find something like this, among countless other ideas:
https://codingpackets.com/blog/proxmox-certificate-error-fix-after-node-replacement/
It's interesting because it coincides with my own experience: I upgraded nodes, THEN started having these problems.
I don't know, is that a good idea? There are hundreds of posts on the net on how to fix this or that, and sometimes we try the ideas we've found. Some work; some probably cause new problems we aren't aware of, because the article we read was about a problem that looked like our own but was slightly different.
That's the Internet, and that's trying to learn new things.