[SOLVED] pvecm updatecert -f - not working

Proximate

I'm able to ssh between hosts but using the cluster, I cannot move vms around without getting key related errors.
I've run this set of commands, first on one host and then on all of them, with no change.

Code:
pvecm updatecert -f
systemctl stop pve-cluster
systemctl start pve-cluster

I have read everything I can find and nothing seems to work. At this point, I fear I'm going to break the cluster and very much need a little help.
Does anyone know how I can re-generate the keys without breaking anything? There are 4 hosts in the cluster.
 
Can you show the error messages you receive when trying to move or migrate VMs in the cluster?
 
Hi, certainly. Keep in mind I've tried the suggestion shown in the GUI notices as well.
I don't recall ever manually setting what the message refers to as "you have requested strict checking".

Code:
2023-11-02 08:48:36 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pro07' root@10.0.0.76 /bin/true
2023-11-02 08:48:36 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
2023-11-02 08:48:36 @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
2023-11-02 08:48:36 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
2023-11-02 08:48:36 IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
2023-11-02 08:48:36 Someone could be eavesdropping on you right now (man-in-the-middle attack)!
2023-11-02 08:48:36 It is also possible that a host key has just been changed.
2023-11-02 08:48:36 The fingerprint for the RSA key sent by the remote host is
2023-11-02 08:48:36 SHA256:jWteCyets35Lx0oRQqfj07fvI4BiCtvqWRLKWeG54pU.
2023-11-02 08:48:36 Please contact your system administrator.
2023-11-02 08:48:36 Add correct host key in /root/.ssh/known_hosts to get rid of this message.
2023-11-02 08:48:36 Offending RSA key in /etc/ssh/ssh_known_hosts:1
2023-11-02 08:48:36   remove with:
2023-11-02 08:48:36   ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "pro07"
2023-11-02 08:48:36 Host key for pro07 has changed and you have requested strict checking.
2023-11-02 08:48:36 Host key verification failed.
2023-11-02 08:48:36 ERROR: migration aborted (duration 00:00:01): Can't connect to destination address using public key
TASK ERROR: migration aborted
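
One detail in the log above may explain why a plain `ssh 10.0.0.76` can behave differently from a migration: the task passes `-o 'HostKeyAlias=pro07'`, so OpenSSH looks the host key up under the alias `pro07` rather than under the IP address. A minimal sketch for rebuilding that same command by hand follows. `migration_ssh_cmd` is a hypothetical helper name, and the options are copied from the log, not from Proxmox documentation.

```shell
#!/bin/sh
# Hypothetical helper: print the ssh command the migration task logged,
# for a given node name (used as HostKeyAlias) and IP address.
# The options are copied from the task log above, not from Proxmox docs.
migration_ssh_cmd() {
    alias_name=$1
    ip=$2
    printf "/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=%s' root@%s /bin/true\n" \
        "$alias_name" "$ip"
}

# Print the command for one node; review it, then paste it to run it.
migration_ssh_cmd pro07 10.0.0.76
```

If the printed command fails with the host key warning while a plain `ssh` to the IP succeeds, the stale entry is probably stored under the node name rather than the IP.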
 
Does anyone have any thoughts on this? I've been stuck with this for well over a week. Not sure how to get things working again.
 
SSH into the machines, look at your known_hosts file, check whether the keys match what you find in /etc/ssh on the various nodes, and go from there. You can use this command to probe the public key of host B while connected to host A:
Bash:
ssh-keyscan -t rsa <host> | sed "s/^[^ ]* //"
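
To compare that scanned key with what a node has stored, the stored entry can be pulled out of a known_hosts-style file in the same "keytype key" form. This is only a sketch: `known_keys_for` is a hypothetical helper name, and it assumes plain (unhashed) host names in the first field.

```shell
#!/bin/sh
# Hypothetical helper: print "keytype key" for every line of a
# known_hosts-style file whose host field lists the given name.
# Hashed host entries (lines starting with |1|) are not handled.
known_keys_for() {
    host=$1
    file=$2
    awk -v h="$host" '
        /^#/ || NF < 3 { next }
        {
            n = split($1, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == h) { print $2, $3; break }
        }' "$file"
}

# Example (run on a node): compare what the file stores for pro07
# with what the host actually presents.
#   known_keys_for pro07 /etc/ssh/ssh_known_hosts
#   ssh-keyscan -t rsa 10.0.0.76 | sed "s/^[^ ]* //"
```

Checking both the node name and the IP address is worthwhile, since each can have its own entry.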
 
The keys are definitely different, but that's what I don't understand. I can ssh between hosts without passwords thanks to the ssh keys; only the GUI is not allowing interconnections between hosts.
 
Hi, I hope I can respond to all of this.

>Can you confirm you can log into this specific machine that has this in the log and from this very machine you can
>manually SSH into pro06 without any issue from the CLI?

Yes, I can ssh into all of the hosts with only one error as seen below.

First, I tested again ssh'ing from host to host. I'm doing this from pro07. There are five hosts, pro01, pro02, pro03, pro04 and pro07.
Code:
root@pro07:~# ssh 10.0.0.70
Linux pro01 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Nov  3 07:02:52 2023 from 10.10.10.10
root@pro01:~# exit
logout
Connection to 10.0.0.70 closed.

root@pro07:~# ssh 10.0.0.71
Linux pro02 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Nov  3 09:03:28 2023 from 10.0.0.70
root@pro02:~# exit
logout
Connection to 10.0.0.71 closed.

root@pro07:~# ssh 10.0.0.72
Linux pro03 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Nov  3 09:03:39 2023 from 10.0.0.70
root@pro03:~# exit
logout
Connection to 10.0.0.72 closed.

root@pro07:~# ssh 10.0.0.73
Linux pro04 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Nov  3 09:03:51 2023 from 10.0.0.70
root@pro04:~# exit
logout
Connection to 10.0.0.73 closed.

root@pro07:~# ssh 10.0.0.76
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:jWteCyets35Lx0oRQqfj07fvI4BiCtvqWRLKWeG54pU.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending RSA key in /etc/ssh/ssh_known_hosts:2
  remove with:
  ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "10.0.0.76"
Host key for 10.0.0.76 has changed and you have requested strict checking.
Host key verification failed.

I did the same from pro01 and didn't get the above error when ssh'ing back to the localhost.

>In any case, after running, as suggested, on this machine the command ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "pro07", the
>issue still persists?

Correct, but only from the GUI.

>When you say "GUI notices" does this mean you have multiple notices from multiple nodes like this with different hosts or
>is this all just the pro07 node showing up in the notices?

From all nodes. No matter which I try to migrate something from, even if connected to that specific host, wanting to migrate to another host, I get the error.

>If so, can you check the /etc/hosts and check for pro07 if it has the right IP address listed? Have you, at any point, reinstalled
>the node giving it the same name before which you had not manually removed it from the cluster? How many nodes
>are in this cluster?

Yes, I re-installed some nodes but had removed nodes from the cluster when I did so and re-joined them later.

>Can you also run ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "$IP_address" with the actual IP address of this node,
>presumably 10.0.0.76?

Code:
root@pro07:~# ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 10.0.0.76
# Host 10.0.0.76 found: line 2
/etc/ssh/ssh_known_hosts updated.
Original contents retained as /etc/ssh/ssh_known_hosts.old
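
The output above shows the file being rewritten with the original kept as `.old`. On Proxmox VE versions where `/etc/ssh/ssh_known_hosts` is a symlink into the cluster filesystem (commonly `/etc/pve/priv/known_hosts`), a rewrite like this may leave a plain local file in its place. Both points are assumptions to verify on your own version. A minimal sketch for checking what each path currently is (`describe_path` is a hypothetical helper name):

```shell
#!/bin/sh
# Hypothetical helper: report whether a path is a symlink (and to what),
# a regular file, or missing.
describe_path() {
    path=$1
    if [ -L "$path" ]; then
        printf '%s: symlink -> %s\n' "$path" "$(readlink "$path")"
    elif [ -f "$path" ]; then
        printf '%s: regular file\n' "$path"
    else
        printf '%s: missing\n' "$path"
    fi
}

# Paths named in the warnings above.
for p in /etc/ssh/ssh_known_hosts /root/.ssh/known_hosts; do
    describe_path "$p"
done
```

Comparing the result across all nodes can show whether one node is reading a different file than the others.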

The problem happens no matter which node I use. However, ssh'ing back to the localhost did show an ssh key error so maybe that is related in some way.

>The migrations don't work either way? Do you have any dedicated migrations network set in the Datacentre options?

Correct, migrations aren't working, but I do have access to all of the nodes from any one node's GUI. I just cannot migrate or do anything that involves intercommunication between nodes, I suppose.

In the Datacenter options, I only have default settings. I've been trying to find time to learn more about proxmox so mostly default settings for now.
The only Migrations setting I find is 'Migration Settings' which is set to Default.
 
Sorry for being quiet, I didn't get any notice of these replies. I'll check them now and try to respond to everything, or solve the problem :).
Maybe I just need to migrate vms from the .76 host and then re-install to make sure everything is clean again, if you think it's only this host causing the problems.

Will read now.
 
You mention multiple places for ssh-keys which is what I've come up against. It seems once in this kind of situation, it's easier to rebuild everything and start over, especially since I don't have the knowledge yet.

However, that would mean moving vms, rebuilding hosts, creating a new cluster. Seems like a lot of work. There must be a simpler way.

BTW, yes, I've re-installed nodes using the same previous node names. You've probably hit the nail on the head.
The rebuilt nodes used the same IPs and host names to keep track of machines on the network.
If I have to use new IPs and names, that will mess things up but not that huge of a problem.
 
I was actually about to ask Proxmox staff whether they plan, e.g., to move migrations over to the SSL API instead of SSH. It's a mess for me too, but basically I got into the habit of never reusing a machine name unless it's a backup restore with everything original (including SSH keys). It does not matter to me even if I end up with a cluster like pve5, pve12 and pve23. It's also good when looking into the logs later on (if you collect them by syslog), when you see that the since-dead pve3 used to have an issue that you are now getting with its reincarnation. :)

I wish they made use of DNS names; that way I would not worry about not reusing IPs. When you are "migrating", you basically separate a node (ideally you separate it before it goes off forever) and then, miraculously, a new one later joins the cluster. Yes, you have to migrate the VMs away while you do it, but what better way to try out how it all works? Or maybe just keep replicas on the other node all of the time. I basically want to be able to do this anytime in the future when a node dies. I hope to automate this myself with PXE boot of the node, so "reinstalling" is never an issue. The install should have no value; the cluster database and the VMs do.
 
>you basically bring down a node (ideally you remove it before it goes off forever)

I think this is wrong. As stated in [1], you should shut down the node and then remove it from the cluster:

>As mentioned above, it is critical to power off the node before removal, and make sure that it will not power on again (in the existing cluster network) with its current configuration. If you power on the node as it is, the cluster could end up broken, and it could be difficult to restore it to a functioning state.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
 
Yes, but no. At least within the context we were discussing, I did not use the best vocabulary, but this was the situation covered here:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_separate_node_without_reinstall

But thanks for keeping an eye on me possibly writing out something wrong, I will change the word in my original post.

EDIT: The reason why it's not so obvious to anyone but the two of us is that the whole part about "removing" but not removing, rather separating, was in a parallel thread:
https://forum.proxmox.com/threads/removing-node-permission-denied.136033/#post-602873
 
This I need to make a note of.

>it is critical to power off the node before removal, and make sure that it will not power on again (in the existing cluster network)
>with its current configuration.

I feel like I need to rebuild everything but since I cannot migrate, that's a problem. I've gotten caught up in some other emergencies preventing me from spending time on this.

I wish the developers would find a way to fix this. Once you're logged into a host and have access to the cluster, there should be a simple way to re-sync all of the nodes.
 
I'm having a hard time following the thread at this point. I seem to have only one problem but multiple places to have to look to fix it.
I can ssh fine without passwords using the command line but I cannot migrate between hosts from the GUI.

This means the GUI is using different keys than ssh but there are multiple places to look so could easily break things worse.
I wish there was a series of steps I could take from each host command line and get this working but it's not clear what.

Since I cannot migrate, could I rebuild and restore using the daily proxmox backups I've been running? I suppose so long as the new proxmox hosts got connected back to the storage after being rebuilt that this could be the answer.
 
>This I need to make a note of.
>
>>it is critical to power off the node before removal, and make sure that it will not power on again (in the existing cluster network)
>>with its current configuration.

When you think of it, when you separate a node from the cluster first, it should not be giving any sign of heartbeat to the rest of the nodes, i.e. for all purposes they can consider it dead, and never "powering on". So something else must have been wrong if you did separate it before removal. Just to make my use of terms clear, by separating I mean following this procedure on the node to be orphaned:
https://forum.proxmox.com/threads/separate-a-node-without-reinstalling.119095/
By removing I mean this part of guide:
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node

Please note that even in that case you will have skeletons in the wardrobe in the form of stale/leftover replication jobs, as mentioned in the latter. As you cannot plan for node failures, the way to remove the jobs after they can no longer be performed is to wipe them out of /etc/pve/replication.cfg. You may want to try that as one last thing (in your case it is last; for those planning to take a node down it should be first, at least according to the recommendations, which are not necessarily to my taste either).

>I feel like I need to rebuild everything but since I cannot migrate, that's a problem. I've gotten caught up in some other emergencies preventing me from spending time on this.

Could you try to clean up (the relevant parts or all of it if you do not care) in the replication.cfg as one last thing? Although I admit it should be unrelated to your symptoms.

>I wish the developers would find a way to fix this. Once you're logged into a host and have access to the cluster, there should be a simple way to re-sync all of the nodes.

The biggest challenge is whether the remainder of your cluster actually has quorum. Say you have 7 nodes and end up with a network failure that leaves 3+3+1 node groups, all isolated from each other. You can't reliably change anything from anywhere unless you manually override some of the quorum checks, which in turn makes it big fun to try to recover the cluster into some sort of common understanding later on.

So I do not think they have an easy way to make "just one script" either. It's very dependent on circumstances.
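
The 3+3+1 example can be checked with a little arithmetic. The sketch below assumes one vote per node and a simple majority rule (more than half of all votes); the function names are made up for illustration.

```shell
#!/bin/sh
# Votes needed for quorum with one vote per node: floor(n / 2) + 1.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Report whether a partition of a given size is quorate.
is_quorate() {
    total=$1
    partition=$2
    if [ "$partition" -ge "$(quorum_needed "$total")" ]; then
        echo yes
    else
        echo no
    fi
}

# 7 nodes split into groups of 3, 3 and 1: quorum is 4, so none qualifies.
for size in 3 3 1; do
    echo "partition of $size out of 7: quorate=$(is_quorate 7 "$size")"
done
```

With 7 nodes the threshold is 4 votes, so none of the three groups can make changes on its own.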
 
>I'm having a hard time following the thread at this point. I seem to have only one problem but multiple places to have to look to fix it.
>I can ssh fine without passwords using the command line but I cannot migrate between hosts from the GUI.
Now after all the changes, can you show us the state of the cluster from one of the nodes you believe is running well?
# pvecm nodes

>This means the GUI is using different keys than ssh but there are multiple places to look so could easily break things worse.
>I wish there was a series of steps I could take from each host command line and get this working but it's not clear what.
That should have been the pvecm updatecert -f but the title of your thread says it all. :D

>Since I cannot migrate, could I rebuild and restore using the daily proxmox backups I've been running? I suppose so long as the new proxmox hosts got connected back to the storage after being rebuilt that this could be the answer.

Are you just experimenting or is this going to be your final setup? If you are just playing around, you can make the migration network insecure by setting migration=insecure datacentre-wide, see:
https://pve.proxmox.com/wiki/Manual:_datacenter.cfg

EDIT: This is also in the GUI under the datacentre options.

I use it that way myself, not because anything is broken but simply because I do not need encryption on a secure network.
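
To see which migration type a node is currently configured with, the setting can be read from the datacenter configuration file. This is a rough sketch: `migration_type` is a hypothetical helper, it assumes a `migration: type=insecure` (or `migration: insecure`) line format as described in the linked manual page, and it assumes that a missing line means the default, secure.

```shell
#!/bin/sh
# Hypothetical helper: print the migration type from a datacenter.cfg-style
# file, or "secure" when no migration line is present (assumed default).
migration_type() {
    file=$1
    value=$(sed -n 's/^migration:[[:space:]]*//p' "$file" | head -n 1)
    case $value in
        *insecure*) echo insecure ;;
        *)          echo secure ;;
    esac
}

# On a node the file is expected at /etc/pve/datacenter.cfg:
#   migration_type /etc/pve/datacenter.cfg
```

Reading the value first makes it easier to put the setting back once the SSH key problem is fixed.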
 
>could I rebuild and restore using the daily proxmox backups I've been running?

Well, this should be the cleanest way of recovering from a completely broken cluster in any case. But the question is: where are the backups stored? Are they off the cluster nodes?
 
I checked all five nodes. All have nothing in cat /etc/pve/replication.cfg.
0 byte file on all nodes.
 
I logged into all five nodes so have five ssh sessions open.
The only thing I noticed is that ssh'ing to the localhost shows verification failed which is normal.

I also opened 5 separate tabs, each connected to the separate nodes.
I get to and from any node using ssh without issues but from the GUI, get errors.
In fact, I even get communications and other errors now and then from the GUI.


Code:
root@pro07:~# pvecm status
Cluster information
-------------------
Name:             proclust
Config Version:   15
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Nov  9 19:37:23 2023
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000001
Ring ID:          1.cb
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.0.76 (local)
0x00000002          1 10.0.0.73
0x00000003          1 10.0.0.72
0x00000004          1 10.0.0.71
0x00000005          1 10.0.0.70

root@pro07:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pro07 (local)
         2          1 pro04
         3          1 pro03
         4          1 pro02
         5          1 pro01
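
The node list above can be turned into plain names for scripting a per-node check. A minimal sketch, assuming the three-column layout shown in this output; `node_names` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `pvecm nodes` output on stdin and print one
# node name per line. Assumes the three-column layout shown above
# (numeric node ID, votes, name, optional "(local)").
node_names() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $3 }'
}

# On a node:
#   pvecm nodes | node_names
```

The names can then be fed one by one into whatever host key check is being run, so that no node is skipped.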
 
I just tested again.
From the GUI, I can reach all nodes, use most of the commands but cannot use noVNC between hosts.
If I want to access vm on pro04 for example, I have to use the tab I have open to that host otherwise I get 'Failed to connect to server'.
So in other words, I have cluster access to everything but cannot do anything when it comes to vms on any host unless I'm connected directly to that host's GUI.
 
