I'm wondering if there's any detailed documentation anywhere about exactly how the various SSH key files are used within a Proxmox cluster?
In particular, what is the relationship between /root/.ssh/known_hosts, /etc/ssh/ssh_known_hosts, and /etc/pve/priv/known_hosts? And if replacing keys, does anything need to be restarted before the change will "take"?
And also, when the replication commands such as
/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=nodename' root@x.y.z.o -- pvesr prepare-local-job 206-1 --scan tank1 tank1:vm-206-disk-0 --last_sync 1701859501
are run, exactly which keys and files are referenced?
Update:
I think I resolved the problem as follows:
I assumed that the public keys in each server's /etc/ssh/ssh_host_rsa_key.pub files (and the matching private keys) are the actual master keys.
I copied the public keys from each other node into both /etc/ssh/ssh_known_hosts on all three servers, and /etc/pve/priv/known_hosts, which is shared across all three.
Once that was done, the replication errors went away and the cluster appears to be operating normally.
I've also successfully tested rebooting a node after the Qdevice had been removed, and both other nodes remained up and running, so that problem has also been resolved.
Is there anything I need to check or verify to be sure all is well?
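One check I can think of is to confirm that every known_hosts file really does hold the current key for each node. Below is a minimal sketch of that comparison, not an official Proxmox tool. The function name known_hosts_has_key is my own invention, and it assumes plain (unhashed) known_hosts entries keyed by node name, as the HostKeyAlias option in the replication command suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox or OpenSSH).
# Succeeds if the known_hosts file $3 contains an entry for host name $1
# whose key type and base64 key match the public key file $2.
# Assumption: entries are unhashed, i.e. "name[,name...] keytype base64key".
known_hosts_has_key() {
    name=$1
    pubfile=$2
    known=$3
    # The first two fields of a .pub file are the key type and the key itself.
    want=$(awk 'NR == 1 { print $1 " " $2 }' "$pubfile")
    awk -v a="$name" -v want="$want" '
        /^#/ || NF < 3 { next }          # skip comments and short lines
        {
            n = split($1, names, ",")    # one entry may list several names
            for (i = 1; i <= n; i++)
                if (names[i] == a && ($2 " " $3) == want) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$known"
}

# Example call, with a copy of node-2's public host key saved locally
# (the /tmp path is only an illustration):
#   known_hosts_has_key node-2 /tmp/node-2_rsa.pub /etc/pve/priv/known_hosts \
#       || echo "node-2 key missing or stale"
```

Running it for each node name against both /etc/ssh/ssh_known_hosts and /etc/pve/priv/known_hosts should flag any file that still holds a stale key.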
Details of issue I was trying to resolve:
I have a three node cluster, two large machines and one smaller one, which started as a 2+Qdevice, then was 3+Qdevice, and is now just 3 nodes.
A while back one of the nodes had its single boot disk fail, so it was rebuilt using a hardware RAID1 pair and re-added to the cluster.
Given the other large machine had a boot disk of the same age, we decided to rebuild that too, and remove the Qdevice (as previously advised here) while about it. I decided to do the rebuild first and then remove the Qdevice; with hindsight this was the wrong order of operations.
Having already migrated all running VMs off it, I shut down the second node. This caused all nodes to reboot, presumably due to the Qdevice, though other than a few minutes downtime for the VMs this wasn't disastrous as both the other nodes rebooted OK. I'm guessing had I removed the Qdevice before shutting the node down, this wouldn't have happened?
Next I replaced the relevant disks on the shut down server, and then used the pvecm delnode command to remove it from the cluster, as per the documentation. This removed it from pvecm status but not the GUI. I then recursively moved /etc/pve/nodes/<name-of-node> to a backup directory and reloaded the GUI in the browser; the former node disappeared as expected.
With the disks replaced, I reinstalled Proxmox 7.4 from a USB stick. Installation went well and I was able to re-import the ZFS pools via zpool import -m -f <poolname>. The machine was rebuilt with the same name and IP address it previously had.
I used the GUI to get the cluster information from node-1, and used it to join the rebuilt node-2 to the cluster, which worked perfectly.
On the command line of node-1, I then ran pvecm qdevice remove, which worked but gave an SSH error about wrong keys. It helpfully included a suggested command to remove the errant key, which I ran, and ever since the replication commands have failed with key verification errors.
Now as node-1 was previously rebuilt, its key changed then, and node-2's key changed on today's rebuild, but I'd not previously seen the errors that are now appearing.
As detailed above, the solution appears to have been to copy each server's public host key from /etc/ssh/ssh_host_rsa_key.pub to both the local known hosts and the pve shared file system known_hosts files.
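For reference, the copy step can be sketched roughly as below. This is only an illustration of what I did by hand: the helper name known_hosts_line is made up, and it assumes entries are keyed by node name with no hashing. It only prints the line; removing any stale entry for the same name and appending the new line to the known_hosts files is left as a manual step.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox or OpenSSH).
# Prints a known_hosts line for host name $1 built from the public host key
# file $2, whose format is "keytype base64key [comment]".
known_hosts_line() {
    awk -v name="$1" 'NR == 1 { print name " " $1 " " $2 }' "$2"
}

# Example, run on the node that owns the key:
#   known_hosts_line node-2 /etc/ssh/ssh_host_rsa_key.pub
```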