SSH Keys in a Proxmox Cluster: resolving replication "Host key verification failed" errors

Pyromancer

Outline:

I'm wondering if there's any detailed documentation anywhere about exactly how the various SSH key files are used within a Proxmox cluster?

In particular, what is the relationship between /root/.ssh/known_hosts, /etc/ssh/ssh_known_hosts, and /etc/pve/priv/known_hosts, and if replacing keys, does anything need to be restarted before the change will "take"?
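(For anyone checking their own setup, a plain listing of the three files at least shows which of them are regular files and which are symlinks:)

ls -l /root/.ssh/known_hosts /etc/ssh/ssh_known_hosts /etc/pve/priv/known_hosts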

Also, when a replication command such as
/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=nodename' root@x.y.z.o -- pvesr prepare-local-job 206-1 --scan tank1 tank1:vm-206-disk-0 --last_sync 1701859501
is run, exactly which keys and files are referenced?
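(For what it's worth, the host-key check that job performs can be reproduced in isolation by running the same SSH invocation with a harmless command in place of the pvesr call; node name and address are placeholders, as above:)

/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=nodename' root@x.y.z.o -- /bin/true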

Update:

I think I resolved the problem as follows:

I assumed that the public keys in each server's /etc/ssh/ssh_host_rsa_key.pub file (and the matching private keys) are the host keys that actually matter here.

I copied the public key from each of the other nodes into both /etc/ssh/ssh_known_hosts on all three servers and /etc/pve/priv/known_hosts, which is shared across all three.
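A rough sketch of the kind of entry involved, assuming a node called node2 and that the RSA host key is the one in use (known_hosts lines are keyed by the name SSH looks up, which here is the node name used as HostKeyAlias; the cut just drops the trailing comment from the .pub file):

# run on node2 (placeholder name) to print its known_hosts line
echo "node2 $(cut -d' ' -f1,2 /etc/ssh/ssh_host_rsa_key.pub)"
# that line is what gets appended to /etc/ssh/ssh_known_hosts on every node
# and to the shared /etc/pve/priv/known_hosts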

Once that was done, the replication errors went away and the cluster appears to be operating normally.

I've also successfully tested rebooting a node after the Qdevice had been removed, and both other nodes remained up and running, so that problem has also been resolved.

Is there anything I need to check or verify to be sure all is well?


Details of the issue I was trying to resolve:

I have a three-node cluster, two large machines and one smaller one, which started as 2 nodes + Qdevice, then became 3 nodes + Qdevice, and is now just 3 nodes.

A while back one of the nodes had its single boot disk fail, so it was rebuilt using a hardware RAID1 pair and re-added to the cluster.

Since the other large machine had a boot disk of the same age, we decided to rebuild that too, and to remove the Qdevice (as previously advised here) while we were at it. I decided to do the rebuild first and then remove the Qdevice; with hindsight this was the wrong order of operations.

Having already migrated all running VMs off it, I shut down the second node. This caused all nodes to reboot, presumably due to the Qdevice, though other than a few minutes' downtime for the VMs it wasn't disastrous, as both the other nodes rebooted OK. I'm guessing that had I removed the Qdevice before shutting the node down, this wouldn't have happened?

Next I replaced the relevant disks in the shut-down server and then used the pvecm delnode command to remove it from the cluster, as per the documentation. This removed it from pvecm status but not from the GUI, so I recursively moved /etc/pve/nodes/<name-of-node> to a backup directory and reloaded the GUI in the browser; the former node then disappeared as expected.
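Roughly, for anyone following along (node-2 and the backup path are placeholders, not the real names):

pvecm delnode node-2
# the node still showed in the GUI, so move its config directory aside rather than deleting it
mv /etc/pve/nodes/node-2 /root/pve-node-2-backup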

With the new disks in place, I reinstalled Proxmox VE 7.4 from a USB stick. The installation went well and I was able to re-import the ZFS pools via zpool import -m -f <poolname>. The machine was rebuilt with the same name and IP address it previously had.

I used the GUI to get the cluster join information from node-1 and used it to join the rebuilt node-2 to the cluster, which worked perfectly.

On the command line of node-1, I then ran pvecm qdevice remove, which worked but gave an SSH error about wrong keys. It helpfully included a suggested command to remove the errant key, which I ran - and ever since, the replication commands have failed with key verification errors.
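I won't reproduce the exact message, but the suggestion OpenSSH prints in that situation is normally of this form (the path and name depend on which known_hosts file and entry it objected to; node-2 is a placeholder):

ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "node-2"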

Now, as node-1 was previously rebuilt, its key changed then, and node-2's key changed with today's rebuild, but I'd not previously seen the errors that are now appearing.

As detailed above, the solution appears to have been to copy each server's public host key from /etc/ssh/ssh_host_rsa_key.pub into both the local known_hosts file and the known_hosts file on the shared pve filesystem.
 
Pyromancer said:
Though I did search the forum for answers, I'd not seen your article; I will study it in detail.

There have been several threads related to this issue over time; the most recent ones include:
https://forum.proxmox.com/threads/pvecm-updatecert-f-not-working.135812/page-3#post-606420
https://forum.proxmox.com/threads/ssh-keys-across-nodes.136437/#post-605077

Pyromancer said:
I'd been expecting SSH issues, as I'd completely replaced the main servers with all-new installs but retained their existing names and IP addresses, so I wasn't particularly surprised, but I couldn't find any detailed documentation on exactly how the keys system works in Proxmox.

So theoretically, when you do a fresh install and at no point have both the old and the new node present with the same name/IP, you should be fine, but I suppose you were replacing them one by one. What happens is that the new node's host key (under the same alias for which there is already an entry) should get added to the known_hosts file (the shared one, via the symlink), but because of the bug the tool already finds an entry for that alias (the old key) and does not add the new one (as it is attempting to prune the file). You end up in an endless loop, wondering why the key is "wrong" and why the correct one does not stay there even after you manually add it.
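A quick way to see the mismatch this leaves behind is to compare the fingerprint stored under the node's alias with the fingerprint of the key the node is actually offering (a sketch; node2 is a placeholder):

# on any cluster member: what the shared file currently holds under that alias
ssh-keygen -l -F node2 -f /etc/pve/priv/known_hosts
# on node2 itself: the fingerprint of its real host key
ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub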

Pyromancer said:
I see references to symlinks in your article; does this mean I've now got a non-standard system, with the keys being separate files but duplicated? Is this likely to cause any problems moving forward?

The symlink typically gets disrupted by running ssh-keygen -R (see bug 4252). If that happened, the pvecm updatecerts command (which is also run when a newly installed node joins the cluster) will reinstate it, but that command also contains the bug mentioned above. The tutorial is a workaround, or you can patch the Perl file with what is available as an attachment to bug 4886.
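To check whether a given node still has the standard layout, it's enough to see whether the file is a symlink and where it points:

ls -l /etc/ssh/ssh_known_hosts
# on an untouched setup this is a symlink to /etc/pve/priv/known_hosts;
# a plain file here means the link was replaced (e.g. by ssh-keygen -R as above)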

It's not going to cause problems going forward if you get the known_hosts cleaned up, whether with the patched tool, with SSH certificates, or manually (e.g. wiping it out and then running pvecm updatecerts on each and every node, one by one).
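A minimal sketch of that manual variant, assuming you keep a backup and then visit the nodes one at a time:

cp /etc/pve/priv/known_hosts /root/known_hosts.bak   # keep a copy, just in case
: > /etc/pve/priv/known_hosts                        # empty the shared file once
pvecm updatecerts                                    # then run this on each node in turn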

Pyromancer said:
Now that the boot disks on each main machine are RAID1 pairs, I'm hoping to avoid having to do total reinstalls in the future. The next major change will be to upgrade all three nodes from PVE7 to PVE8, which we'll do in the new year, as we're now under a changes lockout over the holidays on the production systems.

There's insufficient documentation on this, but the inference is that PVE, rather sillily in my opinion, requires you never to reuse the name or even the IP address of a node. So, as absurd as it may sound, if you reinstall a node that was e.g. pve71 sitting on 10.x.y.71, you may as well put the new one on 10.x.y.81 and name it anything but what the old one was, e.g. pve81. Not my design decision, but that's what you get told here, because bugfixes are too much of a hassle (my observation).
 
