Hello, everyone.
Quick background story:
We had a cluster of 11 nodes on version 7, and we planned to upgrade them all to version 8 while also reorganizing things, which meant renaming all of the nodes in the cluster.
After careful consideration and research, we decided that the best approach was to simply reinstall and rejoin the nodes in the cluster, since node renaming is practically impossible. In addition, we installed and joined a few other nodes. Along the way, we found out that some servers were having problems installing version 8.0 from the ISO, and others were having trouble booting up after upgrading from version 7.4, despite those installations being pretty much vanilla.
After upgrading the cluster to version 8, we observed that a lot of the nodes could not communicate correctly with one another because of wrong SSH keys in the /etc/ssh/known_hosts file. We found a blog post that explained some steps to fix this. I ended up assembling the following to help me through the process on all nodes:
Bash:
#these commands MUST be issued on all nodes of the cluster
#remove cluster certificates and authentication keys
rm /etc/pve/pve-root-ca.pem
rm /etc/pve/priv/pve-root-ca.key
find /etc/pve/nodes -type f -name 'pve-ssl.key' -exec rm {} \;
rm /etc/pve/authkey.pub
rm /etc/pve/priv/authkey.key
rm /etc/pve/priv/authorized_keys
#recreate HTTPS certificates
pvecm updatecerts -f
#restart "pvedaemon" and "pveproxy" services
systemctl restart pvedaemon pveproxy
#remove SSH keys
rm /root/.ssh/known_hosts
rm /etc/ssh/ssh_known_hosts
#SSH from each node into all other nodes to ensure you have SSH access
#reboot
reboot
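For completeness, the sanity checks after the reboot would be something along these lines (a sketch only; these are the standard Proxmox VE paths, as far as I understand them):
Bash:
# quick post-reboot checks (sketch; standard Proxmox VE locations assumed)
pvecm status                                                  # quorum and membership still OK?
ls -l /etc/pve/pve-root-ca.pem /etc/pve/nodes/*/pve-ssl.pem   # certificates regenerated?
ls -l /root/.ssh/authorized_keys                              # normally a symlink into /etc/pve/priv/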
I also put together a script to test SSH connectivity between all the nodes:
Bash:
#!/bin/bash
# List of IP addresses of all cluster nodes
ip_list=([IP address list])

# Loop over the IPs
for ip in "${ip_list[@]}"; do
    # Connect with "StrictHostKeyChecking=accept-new" so unknown host keys are
    # recorded automatically; -N opens the connection without running a command
    ssh -o StrictHostKeyChecking=accept-new -N "$ip" &
done

# Give the connections a moment to be established, then close them again
sleep 5
kill $(jobs -p) 2>/dev/null
After applying those commands on all nodes, the cluster became manageable again, but when I tried to migrate some VMs around, I got the following error:
Code:
2023-09-12 14:38:19 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=cph-2950-27' root@[IP address] /bin/true
2023-09-12 14:38:19 Host key verification failed.
2023-09-12 14:38:19 ERROR: migration aborted (duration 00:00:00): Can't connect to destination address using public key
TASK ERROR: migration aborted
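In case it helps with the diagnosis, this is how I understand the failing check could be reproduced by hand on the source node (the node name cph-2950-27 comes from the task log above; the IP stays redacted), together with a look at the known_hosts files that might be involved; the locations are the standard ones as far as I understand them:
Bash:
# re-run the exact check the migration performs, with verbose output to see
# which known_hosts file the HostKeyAlias is checked against
ssh -v -e none -o BatchMode=yes -o HostKeyAlias=cph-2950-27 root@[IP address] /bin/true

# host key files that may be involved (standard locations, as far as I understand them)
ls -l /root/.ssh/known_hosts /etc/ssh/ssh_known_hosts /etc/pve/priv/known_hosts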
I found this post as a possible solution for this issue, but I must admit I did not fully understand it: people only talked about the id_rsa key file, without explaining how to generate it or how to apply it to all the nodes in the cluster. The latter is my main point of doubt, because I want to be sure how this is supposed to work in a cluster setting. I don't want to risk leaving the cluster non-operational because of a mistake in generating and copying around the id_rsa key.
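If I understand the thread correctly, the idea would be roughly the sketch below, but I would like confirmation before touching the cluster. The paths are the standard Proxmox ones as far as I know, and I am assuming /root/.ssh/authorized_keys is still the usual symlink into /etc/pve/priv/:
Bash:
# sketch only, not applied yet: on a node missing /root/.ssh/id_rsa,
# generate a new RSA key pair for root, without a passphrase
ssh-keygen -t rsa -b 4096 -N '' -f /root/.ssh/id_rsa

# /etc/pve/priv/authorized_keys lives on the clustered filesystem (shared by all
# nodes), and /root/.ssh/authorized_keys normally symlinks to it, so appending the
# new public key there should make it accepted cluster-wide
cat /root/.ssh/id_rsa.pub >> /etc/pve/priv/authorized_keys
Is this roughly what is meant, and does the key generation need to be repeated on every node, or only on the ones where the file is missing?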
I apologize in advance for any mistakes, and would appreciate any help pointing me in the right direction.
Thanks