[TUTORIAL] SSH Host Key Certificates - How to bypass SSH known_hosts bug(s)

So something is off here already, even though I do not see the command itself in the first listing:

Here is the log with success:
2024-03-04 15:07:22 use dedicated network address for sending migration traffic (10.10.33.70)
2024-03-04 15:07:23 starting migration of VM 101 to node 'pve01' (10.10.33.70)
2024-03-04 15:07:23 found local disk 'local-zfs:vm-101-disk-0' (attached)
2024-03-04 15:07:23 copying local disk images
2024-03-04 15:07:23 full send of rpool/data/vm-101-disk-0@__migration__ estimated size is 28.8K
2024-03-04 15:07:23 total estimated size is 28.8K
2024-03-04 15:07:23 TIME SENT SNAPSHOT rpool/data/vm-101-disk-0@__migration__
2024-03-04 15:07:24 successfully imported 'local-zfs:vm-101-disk-0'
2024-03-04 15:07:24 volume 'local-zfs:vm-101-disk-0' is 'local-zfs:vm-101-disk-0' on the target
2024-03-04 15:07:24 migration finished successfully (duration 00:00:02)
TASK OK

The above said it was connecting to pve01. The one below was clearly trying to migrate to what it believed was pve03.

And with error:
2024-03-04 15:09:52 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve03' root@10.10.10.0 /bin/true
2024-03-04 15:09:52 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
2024-03-04 15:09:52 @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
2024-03-04 15:09:52 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
2024-03-04 15:09:52 IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
2024-03-04 15:09:52 Someone could be eavesdropping on you right now (man-in-the-middle attack)!
2024-03-04 15:09:52 It is also possible that a host key has just been changed.
2024-03-04 15:09:52 The fingerprint for the RSA key sent by the remote host is
2024-03-04 15:09:52 SHA256:hKjSLXqdWwDIZ/zXDcr8Z1QenvI3+wMjwSrm8iMKePk.
2024-03-04 15:09:52 Please contact your system administrator.
2024-03-04 15:09:52 Add correct host key in /root/.ssh/known_hosts to get rid of this message.
2024-03-04 15:09:52 Offending RSA key in /etc/ssh/ssh_known_hosts:6
2024-03-04 15:09:52 remove with:
2024-03-04 15:09:52 ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "pve03"
2024-03-04 15:09:52 Host key for pve03 has changed and you have requested strict checking.
2024-03-04 15:09:52 Host key verification failed.
2024-03-04 15:09:52 ERROR: migration aborted (duration 00:00:00): Can't connect to destination address using public key
TASK ERROR: migration aborted

Are these excerpts above from the same node to the same node, before and after changing the migration network?

The other issue: I hope the 10.10.10.0 is some sort of typo?
 
So something is off here already, even though I do not see the command itself in the first listing:



The above said it was connecting to pve01. The one below was clearly trying to migrate to what it believed was pve03.



Are these excerpts above from the same node to the same node, before and after changing the migration network?

The other issue: I hope the 10.10.10.0 is some sort of typo?
Apologies for the confusion. I had simply migrated one way and then changed the network and migrated back.
I think you may have identified the real problem though. I have a second bridge (vmbr1) for the 10G adapter that I set up with 10.10.10.0/24.
This is the addressing that I see in the dropdown in the migration settings dialog. Did I set up vmbr1 incorrectly?
 
I have a second bridge (vmbr1) for the 10G adapter that I set up with 10.10.10.0/24

This is the addressing that I see in the dropdown in the migration settings dialog.

You can't use .0 with a /24 mask as a node's address; that's your network address. How did you set up the migration network exactly?
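For illustration, a quick way to see the special addresses of that subnet (assuming python3 is available on the node, which it normally is on PVE):

Code:
# the network and broadcast addresses are reserved; only .1 - .254 are usable host addresses
python3 -c 'import ipaddress; n = ipaddress.ip_network("10.10.10.0/24"); print("network:", n.network_address, "broadcast:", n.broadcast_address, "usable:", list(n.hosts())[0], "-", list(n.hosts())[-1])'
# network: 10.10.10.0 broadcast: 10.10.10.255 usable: 10.10.10.1 - 10.10.10.254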

Did I set up vmbr1 incorrectly?

How did you set up the bridge?
 
vmbr1 was initially created so that some VM/CT could utilize the 10G adapter. In the bridge creation dialog I simply added the network address 10.10.10.0/24 and the bridge port of my adapter. This was successful for getting the VM/CT access I needed.

Today was a new project where I wanted to get migrations to also utilize the 10G adapter. In the Datacenter > Options > Migration settings dialog I see two options:

10.10.33.70/24 vmbr0
10.10.10.0/24 vmbr1

I am assuming now that setting up vmbr1 for VM/CT use is different from what is necessary for migration use?
 
vmbr1 was initially created so that some VM/CT could utilize the 10G adapter.

Yes :) But is the /etc/network/interfaces secret? :)

In the bridge creation dialog I simply added the network address 10.10.10.0/24 and the bridge port of my adapter. This was successful for getting the VM/CT access I needed.

I suspect you have iface vmbr1 without any address assigned to it on the node. If you do not plan to use the 10G adapter for VMs, you can also dismantle the bridge. Though you do not have to.

Today was a new project where I wanted to get migrations to also utilize the 10G adapter. In the Datacenter > Options > Migration settings dialog I see two options:

10.10.33.70/24 vmbr0
10.10.10.0/24 vmbr1

These are fine; that just tells it which network to use. But what about your ip a output now?

I am assuming now that setting up vmbr1 for VM/CT use is different from what is necessary for migration use?

To be frank, I have no idea what the GUI does there, but we can fix it if you provide the two outputs above.
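That is, from a shell on the node:

Code:
cat /etc/network/interfaces
ip a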
 
Not a secret :)

Code:
auto lo
iface lo inet loopback

iface enp87s0 inet manual

iface enp90s0 inet manual

iface enp2s0f0 inet manual

iface enp2s0f1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 10.10.33.70/24
        gateway 10.10.33.1
        bridge-ports enp87s0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

iface wlp91s0 inet manual

auto vmbr1
iface vmbr1 inet static
        address 10.10.10.0/24
        bridge-ports enp2s0f1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

ip a result:

Code:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: enp87s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
    link/ether 58:47:ca:74:f0:3f brd ff:ff:ff:ff:ff:ff
3: enp90s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 58:47:ca:74:f0:40 brd ff:ff:ff:ff:ff:ff
4: enp2s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 58:47:ca:74:f0:3d brd ff:ff:ff:ff:ff:ff
5: enp2s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP group default qlen 1000
    link/ether 58:47:ca:74:f0:3e brd ff:ff:ff:ff:ff:ff
6: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 58:47:ca:74:f0:3f brd ff:ff:ff:ff:ff:ff
    inet 10.10.33.70/24 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::5a47:caff:fe74:f03f/64 scope link
       valid_lft forever preferred_lft forever
7: wlp91s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 4c:50:dd:6b:a7:db brd ff:ff:ff:ff:ff:ff
17: veth113i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:3d:79:b4:7e:33 brd ff:ff:ff:ff:ff:ff link-netnsid 0
18: veth117i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:4f:ba:60:d5:14 brd ff:ff:ff:ff:ff:ff link-netnsid 1
21: veth102i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:89:25:5a:23:a5 brd ff:ff:ff:ff:ff:ff link-netnsid 3
22: veth114i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:3c:70:74:e3:92 brd ff:ff:ff:ff:ff:ff link-netnsid 4
23: veth115i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:19:9e:57:bb:86 brd ff:ff:ff:ff:ff:ff link-netnsid 5
24: veth119i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:95:0b:bb:a3:46 brd ff:ff:ff:ff:ff:ff link-netnsid 6
25: veth118i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:7f:f6:e5:4d:b6 brd ff:ff:ff:ff:ff:ff link-netnsid 7
27: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 58:47:ca:74:f0:3e brd ff:ff:ff:ff:ff:ff
    inet 10.10.10.0/24 scope global vmbr1
       valid_lft forever preferred_lft forever
    inet6 fe80::5a47:caff:fe74:f03e/64 scope link
       valid_lft forever preferred_lft forever
33: veth104i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr1 state UP group default qlen 1000
    link/ether fe:cb:7f:a8:a2:0c brd ff:ff:ff:ff:ff:ff link-netnsid 2
 
Not a secret :)

Code:
auto lo
iface lo inet loopback

iface enp87s0 inet manual

iface enp90s0 inet manual

iface enp2s0f0 inet manual

iface enp2s0f1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 10.10.33.70/24
        gateway 10.10.33.1
        bridge-ports enp87s0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

iface wlp91s0 inet manual

auto vmbr1
iface vmbr1 inet static
        address 10.10.10.0/24
        bridge-ports enp2s0f1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

Here, you need to give the node itself a proper address, not address 10.10.10.0/24, and also give it a gateway, the very same way you did for the original bridge. This needs to be done on both nodes, each with its own address. Afterwards run systemctl restart networking, then check the same output as below:

ip a result:

Code:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: enp87s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
    link/ether 58:47:ca:74:f0:3f brd ff:ff:ff:ff:ff:ff
3: enp90s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 58:47:ca:74:f0:40 brd ff:ff:ff:ff:ff:ff
4: enp2s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 58:47:ca:74:f0:3d brd ff:ff:ff:ff:ff:ff
5: enp2s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP group default qlen 1000
    link/ether 58:47:ca:74:f0:3e brd ff:ff:ff:ff:ff:ff
6: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 58:47:ca:74:f0:3f brd ff:ff:ff:ff:ff:ff
    inet 10.10.33.70/24 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::5a47:caff:fe74:f03f/64 scope link
       valid_lft forever preferred_lft forever
7: wlp91s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 4c:50:dd:6b:a7:db brd ff:ff:ff:ff:ff:ff
17: veth113i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:3d:79:b4:7e:33 brd ff:ff:ff:ff:ff:ff link-netnsid 0
18: veth117i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:4f:ba:60:d5:14 brd ff:ff:ff:ff:ff:ff link-netnsid 1
21: veth102i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:89:25:5a:23:a5 brd ff:ff:ff:ff:ff:ff link-netnsid 3
22: veth114i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:3c:70:74:e3:92 brd ff:ff:ff:ff:ff:ff link-netnsid 4
23: veth115i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:19:9e:57:bb:86 brd ff:ff:ff:ff:ff:ff link-netnsid 5
24: veth119i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:95:0b:bb:a3:46 brd ff:ff:ff:ff:ff:ff link-netnsid 6
25: veth118i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether fe:7f:f6:e5:4d:b6 brd ff:ff:ff:ff:ff:ff link-netnsid 7
27: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 58:47:ca:74:f0:3e brd ff:ff:ff:ff:ff:ff
    inet 10.10.10.0/24 scope global vmbr1
       valid_lft forever preferred_lft forever
    inet6 fe80::5a47:caff:fe74:f03e/64 scope link
       valid_lft forever preferred_lft forever
33: veth104i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr1 state UP group default qlen 1000
    link/ether fe:cb:7f:a8:a2:0c brd ff:ff:ff:ff:ff:ff link-netnsid 2

And you should then see a proper address, inet 10.10.10.x/24, under item 27.
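For illustration only, the vmbr1 stanza could then end up looking something like this (the .70 host part is just a placeholder I picked to mirror your vmbr0 address; each node needs its own unique address in the subnet):

Code:
auto vmbr1
iface vmbr1 inet static
        address 10.10.10.70/24
        # gateway 10.10.10.1   <- only if that subnet actually has a router you need to reach
        bridge-ports enp2s0f1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094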
 
Thank you !! All working now.
Initially, it was not clear to me that this bridge was going to include/define the node IP.
I was able to create the bridge in the GUI and add the gateway via the CLI in each of the 3 nodes.

All seems to be working now. Your help was invaluable!

Jim
 
Thank you !! All working now.
Initially, it was not clear to me that this bridge was going to include/define the node IP.
I was able to create the bridge in the GUI and add the gateway via the CLI in each of the 3 nodes.

All seems to be working now. Your help was invaluable!

Jim

I recommend reading up more on bridging. For a typical use case, you want to bridge a NIC to the VMs but not to have an IP on it for the host. This can even be done with VLANs. In your case, you now have the migration network shared with (some of) your VMs, but that also means your hosts are reachable via SSH on that same network segment. When you consider that by default PVE uses plain password authentication and allows root login, should any VM of yours get compromised, the next target would be that host's root password. So yeah, maybe reconsider the setup.
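As a rough sketch of the VLAN variant (everything here is illustrative: the VLAN ID 50, the 10.10.50.0/24 subnet, and the assumption that your switch carries that VLAN tagged on the 10G port):

Code:
auto vmbr1
iface vmbr1 inet manual
        # no host IP on the bridge itself: it only carries VM/CT traffic
        bridge-ports enp2s0f1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vmbr1.50
iface vmbr1.50 inet static
        # host-only VLAN interface used as the migration network
        address 10.10.50.70/24

Each node would get its own address in that subnet, and you would then select it as the migration network in Datacenter > Options.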
 
Saw this issue today. Had removed one cluster node, reinstalled it and added it back with the same IPs but a different hostname. Almost everything worked OK, like migrations with both local and Ceph storage, VM noVNC/SPICE consoles, PVE host consoles (xterm)... but LXC consoles didn't. There's a single container among 250 VMs, so it took me a while to notice.

I got this solved by manually editing /etc/pve/priv/known_hosts and removing the line indicated by the error message. Then I used this on one node of the cluster to add the ssh-rsa key back to /etc/pve/priv/known_hosts:

Code:
/usr/bin/ssh-keyscan <HOST_IP> | grep ssh-rsa >> /etc/pve/priv/known_hosts

where <HOST_IP> is the IP address of the reinstalled host, as present in that host's /etc/hosts file (that is, the one used by the other hosts in the cluster to communicate with it).

So far so good.
 
Saw this issue today. Had removed one cluster node, reinstalled it and added it back with the same IPs but a different hostname. Almost everything worked OK, like migrations with both local and Ceph storage, VM noVNC/SPICE consoles, PVE host consoles (xterm)... but LXC consoles didn't. There's a single container among 250 VMs, so it took me a while to notice.

That's strange, it should depend on whether your host is relaying the session of the LXC, not on a particular one.

I got this solved by manually editing /etc/pve/priv/known_hosts and removing the line indicated by the error message. Then I used this on one node of the cluster to add the ssh-rsa key back to /etc/pve/priv/known_hosts:

Code:
/usr/bin/ssh-keyscan <HOST_IP> | grep ssh-rsa >> /etc/pve/priv/known_hosts

where <HOST_IP> is the IP address of the reinstalled host, as present in that host's /etc/hosts file (that is, the one used by the other hosts in the cluster to communicate with it).

SSH does not have any problem with stale entries, i.e. it is perfectly fine to have duplicates in a known_hosts file; it will work as long as any one of the entries present matches. The reason you get "Host key verification failed" is that the pvecm updatecerts codepath (it is also run automatically on some cluster operations) contains the bug: in an effort to "merge" the old file with the keys it is aware of, it drops the newly added key.
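If you want to see that behaviour for yourself, one way is to re-run the merge and then the same probe the migration task uses (the node name and address below are just placeholders):

Code:
pvecm updatecerts
/usr/bin/ssh -e none -o BatchMode=yes -o HostKeyAlias=<nodename> root@<node_ip> /bin/true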

What you did was remove the old key manually and then manually add only the new key, but keyed by the IP address, not the alias. You also grew your file by a number of void comment lines.

So far so good.

You can test which key is being selected, and how, by running what the PVE internals do, i.e.
Code:
/usr/bin/ssh -vvv -o "HostKeyAlias=$alias" root@$ipaddress -- /bin/true

Otherwise one can only speculate about what happens later on and why. Of course this case gets more complicated when one has more leftover skeleton entries in the closet across multiple nodes.
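For completeness, a hedged sketch of how one could store the key under the node name that the HostKeyAlias lookups use, instead of under the IP (pve03 and <HOST_IP> are placeholders; adjust to your cluster):

Code:
# scan the reinstalled node's RSA key by IP, but record the entry under its node name
/usr/bin/ssh-keyscan -t rsa <HOST_IP> | sed 's/^[^ ]*/pve03/' >> /etc/pve/priv/known_hosts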
 
That's strange, it should depend on whether your host is relaying the session of the LXC, not on a particular one.
There's only one LXC in this cluster.

SSH does not have any problem with stale entries, i.e. it is perfectly fine to have duplicates in a known_hosts file; it will work as long as any one of the entries present matches.
I know, but my OCD does :), so I removed stale entries.

You also grew your file by a number of void comment lines.
The comment lines output by /usr/bin/ssh-keyscan go to stderr, which is not redirected to /etc/pve/priv/known_hosts.

You can test which key and how is being selected by running what PVE internals do, i.e.
Tested, and it does pick the just-added key.
 
The comment lines output by /usr/bin/ssh-keyscan go to stderr, which is not redirected to /etc/pve/priv/known_hosts.

Fair enough. :)

Tested, and it does pick the just-added key.

I was getting at the fact that I do not know what happens, e.g., the next time you call pvecm updatecerts. It appears you added the key by IP address, while the PVE tooling uses aliases, which you have removed. Do you now have an entry only by IP address for that node?
 
I was getting at the fact that I do not know what happens, e.g., the next time you call pvecm updatecerts. It appears you added the key by IP address, while the PVE tooling uses aliases, which you have removed. Do you now have an entry only by IP address for that node?
I get your point now. Currently, /etc/pve/priv/known_hosts has 2 entries for the renamed host, one for the name ($alias) and another for the IP. Both have the same value and both work OK via ssh and PVE (live migration, storage migration, consoles, etc.).

I've just checked a backup of the known_hosts file and the entry for the hostname was already there with the right value before I added the entry for the IP. Makes sense: I just reused the IP, not the name, so when I re-added the host to the cluster, PVE created the right entry.
 
I get your point now. Currently, /etc/pve/priv/known_hosts has 2 entries for the renamed host, one for the name ($alias) and another for the IP. Both have the same value and both work OK via ssh and PVE (live migration, storage migration, consoles, etc.).

And you are saying that when you remove the one by the IP, the LXC console does not work? From the same node where the LXC is running, or from a different node?

I've just checked a backup of the known_hosts file and the entry for the hostname was already there with the right value before I added the entry for the IP.

This is very strange then and, in my opinion, worth troubleshooting (when the IP entry is missing). All cluster ops use, as far as I know, HostKeyAlias in the command invocation, so they should be completely ignoring your newly added entry by IP (which gets used when you invoke ssh manually without the option).

Makes sense: I just reused the IP, not the name, so when I re-added the host to the cluster, PVE created the right entry.

To me it does not make sense; not saying you are wrong, but it contradicts what I know about PVE internals. If the alias entry was there from the beginning, the IP entry (for the LXC console) is completely superfluous (though harmless).
 
