(Proxmox 5) Storage Replication issue (Host key verification failed)

brwainer

I've upgraded my cluster to Proxmox 5 from 4.4, following the upgrade instructions verbatim, and encountered no errors during the upgrade. Before the upgrade, my cluster was fully working with no issues, including the use of pve-zsync. Hostnames and IPs on the cluster members have never changed since Proxmox was installed on them. After upgrading I tried to set up Storage Replication, and am running into the following error whenever a job tries to run (full text from the replication log):

2017-07-05 03:36:01 100-0: start replication job
2017-07-05 03:36:01 100-0: guest => VM 100, running => 0
2017-07-05 03:36:01 100-0: volumes => SSDs:vm-100-disk-1
2017-07-05 03:36:01 100-0: (remote_prepare_local_job) Host key verification failed.
2017-07-05 03:36:01 100-0: (remote_prepare_local_job)
2017-07-05 03:36:01 100-0: end replication job with error: command '/usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=813MTQ' root@192.168.10.14 -- pvesr prepare-local-job 100-0 SSDs:vm-100-disk-1 --last_sync 0' failed: exit code 255

This is on a replication job from "KGPE-D16" (host A, 192.168.10.27) to "813MTQ" (host B, 192.168.10.14). Any replication jobs in the opposite direction have the same error. If I log into either host (KGPE-D16 in this example) and run the command manually, I get:
root@KGPE-D16:~# /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=813MTQ' root@192.168.10.14
Host key verification failed.
But if I run it without the HostKeyAlias option I get:
root@KGPE-D16:~# /usr/bin/ssh -o 'BatchMode=yes' root@192.168.10.14
Linux 813MTQ 4.10.15-1-pve #1 SMP PVE 4.10.15-15 (Fri, 23 Jun 2017 08:57:55 +0200) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Wed Jul 5 02:12:53 2017 from 192.168.10.199
root@813MTQ:~#
I have looked up the "Host key verification failed" message and have run a few commands that came up in the results, including "ssh-keygen -R 192.168.10.14" and adding a Host section (including HostKeyAlias) to /root/.ssh/config (roughly the stanza sketched after the apt output below) - nothing has helped. Here is my apt output (using the no-subscription repo; this is the test cluster):
root@813MTQ:~# apt update
Ign:1 http://ftp.us.debian.org/debian stretch InRelease
Hit:2 http://security.debian.org stretch/updates InRelease
Hit:3 http://ftp.us.debian.org/debian stretch Release
Hit:5 http://download.proxmox.com/debian stretch InRelease
Reading package lists... Done
Building dependency tree
Reading state information... Done
All packages are up to date.
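For reference, the Host section I added to /root/.ssh/config looked roughly like this (a reconstruction, not a verbatim copy; the alias and IP are the ones from this thread):
Code:
Host 813MTQ 192.168.10.14
    HostName 192.168.10.14
    HostKeyAlias 813MTQ
    User root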
This seems to be either a bug with the ssh command that Storage Replication is using (HostKeyAlias), or possibly with the upgrade from 4.4 to 5.0, but if there is anything I can do to fix this locally I'm willing to try.
 
This seems to be either a bug with the ssh command that Storage Replication is using (HostKeyAlias), or possibly with the upgrade from 4.4 to 5.0, but if there is anything I can do to fix this locally I'm willing to try.

The problem is that the SSH known_hosts file is hashed with the default Debian SSH package, which we use here. Hashing hides which servers a user connected to in the past, which is normally a good thing, but it causes some issues here.

CORRECTION EDIT: after a deeper look at the code, this has to work already. AFAIK, only a node rename could be problematic. If it does not work, open a bug report at https://bugzilla.proxmox.com with all the details needed to reproduce it. Also use
Code:
pvecm updatecerts
to add the host keys again, after a node rename.

--- old post below --

One option is to add the node entries of a cluster to /etc/hosts and use the node name directly when adding a node to the cluster.

In your case, where the cluster is already established, I suggest still adding the nodes via their node names and (any) reachable IP to /etc/hosts on all nodes, and then using ssh-keyscan to gather the keys.

In my example I have two nodes, andromeda and pegasus; I'd add an entry for each in every node's /etc/hosts file (this entry doesn't necessarily have to point to the migration network IP, it can be any IP over which both nodes can reach each other).
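For example, the entries could look like this (the IPs below are placeholders for this sketch; use addresses your nodes can actually reach):
Code:
# /etc/hosts on every cluster node (placeholder IPs)
192.168.1.10    andromeda
192.168.1.11    pegasus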

Then, use the following command once on any node (no need to do it on multiple nodes) to add the ssh keys
Code:
ssh-keyscan -t rsa andromeda pegasus >> /etc/pve/priv/known_hosts
replace "andromeda pegasus" with the list of your clusters node names, you may add all at once.

You should see output like:
Code:
# andromeda:22 SSH-2.0-OpenSSH_7.4p1 Debian-10
# pegasus:22 SSH-2.0-OpenSSH_7.4p1 Debian-10
...

Sorry for any inconvenience caused, we're looking for a transparent solution here.
 
FYI, I looked a bit into this and it should work as-is already... I remember that I could trigger such a situation some time ago, but I cannot do so currently, and after a deep look at the code it also seems that we act correctly.
We hash the entries, but we add both the management IP and the node name as entries, so using "HostKeyAlias=nodename" has to work.

The only thing I could imagine is that the hostname (= node name) of a node was changed?

If so use:
Code:
pvecm updatecerts

on the node where the name was changed. Do this instead of my rather complicated keyscan from above; it should rewrite the public key if it is not already there and fix the issue.
 
Thanks for the input. Hostnames have never changed - at least not within this test cluster, which is only a few weeks old. I added all the cluster nodes to each other's /etc/hosts files to rule out any name resolution issues, then ran "pvecm updatecerts" as requested; it did not make a difference. I then ran "ssh-keyscan -t rsa KGPE-D16 813MTQ >> /etc/pve/priv/known_hosts", which you had initially recommended but later corrected, and this worked. I checked /etc/pve/priv/known_hosts to see what the difference was, and I found this (SSH keys shortened for brevity):

Code:
root@813MTQ:~/.ssh# cat /etc/pve/priv/known_hosts
|1|QsV66Fb46CUDz2Pl4EJPjSiMPKM=|lIS/ua4VLcP+nRnECc43UA10pyg= ssh-rsa AAAAB3NzaC1........le2dMrDQUz
|1|tRDO5kakbzoibY5AfPOh5py0nrc=|N4iXDKp87PBA65ED/nghfzOqGNk= ssh-rsa AAAAB3NzaC1........le2dMrDQUz
|1|6KtxRtjau10DPMf70YkfaJGqd/E=|xHi1DWP8FKoivZb+XQo4x1HAu1A= ssh-rsa AAAAB3NzaC1........E0SEnQpQbR
|1|3EY2HK3YCtBGj59bi5QjSme4FCs=|YkMZJ8GRXepSQo9XXU0Bi0l+wPo= ssh-rsa AAAAB3NzaC1........E0SEnQpQbR
813MTQ ssh-rsa AAAAB3NzaC1........le2dMrDQUz
KGPE-D16 ssh-rsa AAAAB3NzaC1........E0SEnQpQbR

If you compare the SSH keys, it is pretty clear to see that the two entries added to the end are not hashed, and those two work, but the prior entries are hashed. I hope this can point you towards finding and resolving the issue.
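In case it helps anyone else debugging this, ssh-keygen can look up which entries in that file match a given name (the node name and path below are just the ones from my cluster):
Code:
ssh-keygen -F 813MTQ -f /etc/pve/priv/known_hosts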
 
If you compare the SSH keys, it is pretty clear to see that the two entries added to the end are not hashed, and those two work, but the prior entries are hashed. I hope this can point you towards finding and resolving the issue.

Thanks a lot, this may help, I'll give it a look.
then ran "pvecm updatecerts" as requested, it did not make a difference. I

OK, maybe our hashing algorithm has a problem with some node names.

This is not a solution. We use purchased certs.

Understandable, please use the ssh-keyscan workaround from above for now.
Was yours a new cluster or an upgraded one?
 
Found the reason for the problem...

We preserve the case of the node name before hashing, while OpenSSH lowercases everything before comparison.
With unhashed entries this still works, as ssh can lowercase both values before comparing. But if an entry is hashed, ssh has no access to the original value (duh) and relies on the convention that it was already lowercased before hashing.
Thus you get the "Host key verification failed" error if a node name contains uppercase letters and a separate migration network is set. Ugh, I'll send a patch fixing this soon... Thanks again for the report.
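To illustrate the effect (a rough sketch with a throwaway text salt, not our actual code and not a real binary salt): hashed known_hosts entries are essentially HMAC-SHA1(salt, hostname), so a name hashed with uppercase letters can never match a lookup done with the lowercased name.
Code:
salt="demosalt"
# digest of the name as stored vs. the lowercased name the client compares against
printf '%s' 813MTQ | openssl dgst -sha1 -hmac "$salt"
printf '%s' 813mtq | openssl dgst -sha1 -hmac "$salt"
# the two digests differ, so the stored (uppercase-hashed) entry is never found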
 
This issue was addressed in pve-cluster 5.0-12 which is available in the pvetest repo and should soonish trickle down to the other repositories.
The commit which includes the main fix can be found at: https://git.proxmox.com/?p=pve-cluster.git;a=commitdiff;h=e4f92a201dc457483da285b1055a2fe665750fd4
Summarizing note: only clusters with uppercase node names were affected. Further, this package solves the issue for newly added nodes or newly created clusters; for existing setups, please run `pvecm updatecerts` if possible (i.e. no custom SSL certs), or add the keys to the known_hosts file with ssh-keyscan as described in my post above.
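To check whether the fixed package is already installed, something like the following should do; the second command only applies to clusters without custom SSL certs:
Code:
pveversion -v | grep pve-cluster   # shows the installed pve-cluster version
pvecm updatecerts                  # re-adds the node host keys (skip if you use custom certs)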
 
