Cannot migrate VM/CT due to SSH key error

djardim

New Member
Aug 14, 2023
Brazil
Hello, everyone.

Quick background story:
We had a cluster of 11 nodes on version 7, and we planned to upgrade them all to version 8 while also reorganizing a few things, which meant renaming all the nodes of the cluster.

After careful consideration and research, we decided that the best approach was to simply reinstall and rejoin the nodes in the cluster, since node renaming is practically impossible. In addition, we installed and joined a few other nodes. Along the way, we found out that some servers were having problems installing version 8.0 from the ISO, and others were having trouble booting up after upgrading from version 7.4, despite those installations being pretty much vanilla.

After upgrading the cluster to version 8, we observed that a lot of the nodes could not communicate correctly with one another because of wrong SSH keys in the /etc/ssh/ssh_known_hosts file. We found a blog post explaining some steps to fix this, and I ended up assembling the following to help me through the process on all nodes:

Bash:
# these commands MUST be issued on all nodes of the cluster

# remove the cluster root CA and the per-node SSL keys
rm /etc/pve/pve-root-ca.pem
rm /etc/pve/priv/pve-root-ca.key
find /etc/pve/nodes -type f -name 'pve-ssl.key' -exec rm {} \;

# remove the authentication keys and the shared authorized_keys file
rm /etc/pve/authkey.pub
rm /etc/pve/priv/authkey.key
rm /etc/pve/priv/authorized_keys

# recreate the certificates and keys
pvecm updatecerts -f

# restart the "pvedaemon" and "pveproxy" services
systemctl restart pvedaemon pveproxy

# remove the stale SSH known_hosts files
rm /root/.ssh/known_hosts
rm /etc/ssh/ssh_known_hosts

# SSH from each node into every other node to ensure you have SSH access
# (the test script below automates this)


#reboot
reboot

I even put together a script to test SSH connectivity between all the nodes:

Bash:
#!/bin/bash

# List of IP addresses of all cluster nodes
ip_list=([IP address list])

# Loop over the IPs
for ip in "${ip_list[@]}"; do
    # Connect via SSH with "StrictHostKeyChecking=accept-new" so any new host key
    # gets recorded; -N means no remote command is executed
    ssh -o StrictHostKeyChecking=accept-new -N "$ip" &
done

# Give the SSH connections some time to be established
sleep 5

# Clean up the background SSH connections before exiting
kill $(jobs -p) 2>/dev/null

After applying those commands on all nodes, the cluster became manageable again, but when I tried to migrate some VMs around, I got the following message:

Code:
2023-09-12 14:38:19 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=cph-2950-27' root@[IP address] /bin/true
2023-09-12 14:38:19 Host key verification failed.
2023-09-12 14:38:19 ERROR: migration aborted (duration 00:00:00): Can't connect to destination address using public key
TASK ERROR: migration aborted
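
If I read the log right, the host key lookup happens under the HostKeyAlias (the target node's name) rather than under the IP, so the failing check can be reproduced by hand with the exact command from the log:

Bash:
# the command PVE runs before migration, copied from the log above;
# the host key is looked up in known_hosts under the alias "cph-2950-27", not under the IP
/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=cph-2950-27' root@[IP address] /bin/true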

I found this post as a possible solution for this issue, but I must admit I did not fully understand it: people only talked about the id_rsa key file, but did not explain how to generate it or how to apply it to all the nodes in the cluster. The latter is a specific point of doubt for me, because I want to be sure how this is supposed to work in a cluster setting. I don't want to risk making the cluster non-operational because of a mistake in setting up and copying around the id_rsa key.
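
From what I've gathered so far, the rough shape of it would be something like the sketch below; this is only my assumption based on the layout where /etc/pve/priv/authorized_keys is the cluster-wide shared file, so please correct me if it's wrong:

Bash:
# generate a new RSA key pair for root on the node that is missing it
ssh-keygen -t rsa -b 4096 -f /root/.ssh/id_rsa -N ''

# append the public key to the cluster-wide authorized_keys
# (/etc/pve is shared, so this should be visible on all nodes)
cat /root/.ssh/id_rsa.pub >> /etc/pve/priv/authorized_keys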

I apologize in advance for any mistakes, and would like to ask for help in pointing me in the right direction.

Thanks
 
SSH known_hosts entries can be stored by IP as well as by hostname.
If you changed the hostnames, you also need to update the /etc/hosts file.
Make sure the IP addresses correctly match the hostnames of the remote machines.
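
For example, something along these lines on the node that fails to connect (the hostname and IP below are placeholders):

Bash:
# verify the remote node's hostname resolves to the expected IP (via /etc/hosts or DNS)
getent hosts other-node-name

# drop a stale host key entry for the reused IP from root's known_hosts
ssh-keygen -R 192.0.2.27 -f /root/.ssh/known_hosts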
 
Have you followed the instructions for deleting a node in the manual, especially the part about removing a node from the cluster?

Can you give me the output of journalctl --since '2023-09-10' > $(hostname)-journal.txt from both a cluster node and the PVE you attempted to join?
 
If you changed the hostnames, you also need to update the /etc/hosts file.

But the node was reinstalled precisely because renaming a node is practically impossible: you would have to change the hostname in several places, on several nodes, and even some of the configs, like the SSL certificates, would have to be regenerated with specific commands, so it is far from trivial.
 
Have you followed the instructions for deleting a node in the manual, especially the part about removing a node from the cluster?

Yes, I did. I used pvecm delnode [node name] on a second node to remove the target node from the cluster, and then I reinstalled the node from scratch. But I kept the IP address, which is why some servers have trouble recognizing the new server on the same IP.

This makes me think that the delnode procedure needs some work with regard to removing all info about the removed node, not only from the cluster system but also from the operating system of each individual node (public keys, SSH keys, etc.), so that it becomes easier to join a new node under the same IP address.
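
For reference, the removal step itself was nothing more than the following (node name is a placeholder; the pvecm status afterwards is just to confirm the remaining membership):

Bash:
# on one of the remaining nodes: remove the target node from the cluster
pvecm delnode [node name]

# confirm the remaining cluster membership
pvecm status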

Can you give me the output of journalctl --since '2023-09-10' > $(hostname)-journal.txt from both a cluster node and the PVE you attempted to join?

Actually, we managed to solve this by issuing pvecm updatecerts -f on all nodes at the same time yesterday. But now I have another issue: some of the nodes complain about an unreadable .srl file and cannot generate new SSL certificates because of that. I'm attaching the journalctl output from a working node and from a faulty node so you can check it.

The logs go back 5 days, so I apologize in advance for their size.

Also, due to the size of the logs from pve-r430-35, I had to split the zip file into two chunks at compression time and add a .zip extension so the forum server would accept the upload. Sorry for the inconvenience.
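
Regarding the .srl complaint, a first sanity check would be whether the CA serial file is present and readable before re-running the certificate generation (the exact path below is an assumption on my part; the error message should name the real one):

Bash:
# check the CA serial file's presence and permissions (path is assumed)
ls -l /etc/pve/priv/pve-root-ca.srl

# then try regenerating the certificates again
pvecm updatecerts -f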
 


Seems like the issue resolved itself during a power outage on the weekend. I've been able to move VMs around normally since.
Let's see... I'll reply here if something goes bad again...

Thanks everyone!
 
Yep, putting the name of the machine in front of the key in ssh_known_hosts is critical. It doesn't get put there automatically when you connect.

Can someone please explain how the ssh certs are set up on a proxmox cluster machine? I see them in multiple locations:
/root/.ssh, /etc/ssh, and /etc/pve/priv. Putting the machine name in front of the key in the file worked, but I now have a mess of files with backups in multiple folders and some symbolic links. Would it be wise to do the steps in this post and start over? In either case, I would like to learn how these keys work and are interconnected. Thanks!
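
What that entry looks like, roughly, with the node name as the first field (the name is a placeholder and the key material is shortened):

Code:
nodename ssh-rsa AAAAB3NzaC1yc2EAAAADAQAB...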
 
Yep, putting the name of the machine in front of the key in ssh_known_hosts is critical. It doesn't get put there automatically when you connect.

Can someone please explain how the ssh certs are set up on a proxmox cluster machine? I see them in multiple locations:
/root/.ssh, /etc/ssh, and /etc/pve/priv. Putting the machine name in front of the key in the file worked, but I now have a mess of files with backups in multiple folders and some symbolic links. Would it be wise to do the steps in this post and start over? In either case, I would like to learn how these keys work and are interconnected. Thanks!

There's a bug in pvecm updatecerts (this code also gets executed on e.g. cluster join) in relation to the SSH keys; a patch has been posted here [b1].

A possible workaround (which indeed uses SSH certs) is here [1].

One of the many threads still dealing with this at the end of 2023 (the bug has been around for 10 years):
https://forum.proxmox.com/threads/pvecm-updatecert-f-not-working.135812/

If no one +1s the bug report(s), there is unfortunately little incentive for the PVE team to do anything about it, apparently.

Short explanation:

By now, SSL certificates are used for most of the communication in the cluster, while SSH keys (not certs) are used for proxying the shell, migration, replication, and QDevice setup. The PVE setup has been subpar all along (it used to rely much more on SSH than on SSL), because there are two parts to every SSH connection:

- Host authentication (of the node being connected to): there is a single shared file, /etc/pve/known_hosts (this is the file that gets corrupted by the PVE tool), to which /etc/ssh/ssh_known_hosts on each individual node is symlinked. This is an unsupported setup, as it breaks how regular tools such as ssh-keygen work - see another bug [b2].

- User authorization (of the connecting side): this depends on /etc/pve/priv/authorized_keys, to which every single node's /root/.ssh/authorized_keys is symlinked, which then results in yet another bug [b3] (a quick way to see this wiring is shown in the sketch below).
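
To see that wiring on any node (this only lists the links, it changes nothing):

Bash:
# both files are symlinks into the shared /etc/pve filesystem (pmxcfs),
# so whatever one node writes there is what every node sees
ls -l /etc/ssh/ssh_known_hosts
ls -l /root/.ssh/authorized_keys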

I am trying to be as nice as possible here, because Proxmox staff still reply to me kindly about other topics, but when it comes to the SSH keys there is simply no interest left in fixing this [2].

Every week there's a post on this forum from someone who has run into issues with this, but apparently this has been acceptable for 10 years. The most elegant way for you to bypass it is to use SSH certs [1], an approach which was turned down as well [b4].
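
For the curious, the gist of the SSH-cert approach looks roughly like the sketch below; treat it as an illustration of the mechanism rather than the exact steps from [1] (the node name, file names, and paths are placeholders):

Bash:
# 1) create a host CA once, on any secure machine
ssh-keygen -t ed25519 -f ssh_host_ca -C "cluster-host-ca"

# 2) sign each node's host key with that CA (principal = the node's name)
ssh-keygen -s ssh_host_ca -h -I nodename -n nodename /etc/ssh/ssh_host_ed25519_key.pub

# 3) make sshd on each node present the resulting certificate
echo "HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub" > /etc/ssh/sshd_config.d/host-cert.conf
systemctl reload ssh

# 4) on every connecting node, trust the CA instead of individual host keys
echo "@cert-authority * $(cat ssh_host_ca.pub)" >> /root/.ssh/known_hosts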

[1] https://forum.proxmox.com/threads/s...ass-ssh-known_hosts-bug-s.137809/#post-614017

[2] https://forum.proxmox.com/threads/ssh-keys-across-nodes.136437/#post-605931

[b1] https://bugzilla.proxmox.com/show_bug.cgi?id=4886#c27

[b2] https://bugzilla.proxmox.com/show_bug.cgi?id=4252#c21

[b3] https://bugzilla.proxmox.com/show_bug.cgi?id=4670#c4

[b4] https://bugzilla.proxmox.com/show_bug.cgi?id=5053#c3
 