Feedback on admin guide - removing node from cluster

iGadget

Member
Apr 9, 2020
In the current version of the admin guide, the instructions on removing a node from a cluster (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node) are missing a vital piece of information: existing replication jobs need to be dealt with first (and perhaps more, but at least this).

If you fail to disable or remove the replication jobs to/from the node you are about to remove, they can interfere with the removal, resulting in errors like this:
Code:
# pvecm delnode node2
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 2
error during cfs-locked 'file-corosync_conf' operation: command 'corosync-cfgtool -k 2' failed: exit code 1
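
For reference, here's roughly what I think the missing step should look like, using the pvesr tool. This is just a sketch based on the pvesr man page; the job ID '100-0' is a placeholder for whatever `pvesr list` shows on your cluster:
Code:
# list all replication jobs and note the ones involving the node to be removed
pvesr list
# disable or delete each job replicating to/from that node (job ID is a placeholder)
pvesr disable 100-0
pvesr delete 100-0
# only then remove the node from the cluster
pvecm delnode node2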

Also, the jobs you failed to remove remain present on all involved nodes even after the node has been removed. When that happens, you cannot remove them anymore, since removal requires a sync with the node you have just removed from the cluster, leaving those jobs in a never-ending 'pending removal' state.
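For anyone debugging this state: the job definitions live cluster-wide in /etc/pve/replication.cfg, so you can at least inspect the stuck entries there (read-only checks, assuming default paths):
Code:
# show the cluster-wide replication job definitions
cat /etc/pve/replication.cfg
# show the current state of each job, including ones stuck pending removal
pvesr status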

I found out about this issue the hard way, and the only(?) way to fix it was to re-install the node with the same hostname / IP, re-add it to the cluster, and then trigger a replication job (e.g. by migrating the VM / container involved in the affected replication jobs).
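In other words, roughly this sequence (the IP, job ID and VMID are placeholders for my setup; command names as per the pvecm / pvesr / qm man pages, so double-check before relying on this):
Code:
# on the freshly re-installed node2, join the existing cluster via a remaining node's IP
pvecm add 192.168.1.1
# then trigger the affected replication job immediately...
pvesr schedule-now 100-0
# ...or, from the node currently hosting the guest, migrate it to force a sync
qm migrate 100 node2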

Who should I ping to have this (and perhaps additional) info added to the admin guide?
 
Thank you for reporting this!
Who should I ping to have this (and perhaps additional) info added to the admin guide?
It would be great if you could post the problem to the PVE Bugzilla.
 
@Dominic - A few weeks 'wiser' (and another involuntary removal of a node from the cluster later), I'm starting to wonder if this isn't actually a bug rather than just a missing piece of documentation.
I'm asking because my latest encounter with node removal involved a node that would no longer boot, so there would have been no way for me to perform the additional steps, even if they *were* documented.

Shouldn't Proxmox be able to handle the (I'm assuming not-so-uncommon) scenario of removing unrecoverable nodes from a cluster more gracefully?

Another, perhaps related, issue I ran into today: SSH fingerprints & keys of removed nodes linger in several config files. Am I correct that this data *should* all have been removed when executing the pvecm delnode command?
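
In case it helps others, a minimal cleanup sketch for a stale entry of a host called node2 - assuming, as on my nodes, /etc/ssh/ssh_known_hosts is a symlink to the cluster-wide /etc/pve/priv/known_hosts:
Code:
# remove the stale host key from the cluster-wide known_hosts
ssh-keygen -R node2 -f /etc/pve/priv/known_hosts
# also check root's per-user file, which ssh consults as well
ssh-keygen -R node2 -f /root/.ssh/known_hosts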

The reason I came across this issue was that node1 in my micro-cluster issued a 'WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!' when I tried connecting from node1 to node2 with SSH.
The weird thing is, even though /etc/ssh/ssh_known_hosts is identical on all 3 nodes, only node1 throws this warning while node3 just connects to node2 without any issue.
What am I missing?
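
A debugging sketch for this: the warning itself names the 'Offending key' file and line, and verbose mode shows every known_hosts file ssh consults - so a per-user /root/.ssh/known_hosts could differ between nodes even when the shared file is identical:
Code:
# show which known_hosts files are consulted and which entry matches/offends
ssh -v root@node2 2>&1 | grep -iE 'known_hosts|offending'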

[UPDATE] - as of this evening, node3 also refuses to connect to node2. So apparently, there was a moment when /etc/ssh/ssh_known_hosts was out of sync between at least nodes 1 and 3.
And if that's the case, I wonder what caused the (hopefully merely outdated) version of /etc/ssh/ssh_known_hosts to a) appear and b) be replicated across the cluster?

[UPDATE2] - what's also interesting is that when I connect from my workstation or laptop to node2 via SSH, I get no warning at all. Could this be because they're both using ECDSA fingerprints instead of RSA? And if so, how come those have not changed? Also - why isn't Proxmox using ECDSA?
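
One way to test the key-type theory (read-only probes; node2 is again the placeholder host): scan which host keys node2 currently offers and compare them with what each client has pinned.
Code:
# fetch the host keys node2 offers, per key type
ssh-keyscan -t rsa,ecdsa,ed25519 node2
# look up what this client has stored for node2, per known_hosts file
ssh-keygen -F node2 -f /etc/ssh/ssh_known_hosts
ssh-keygen -F node2 -f ~/.ssh/known_hosts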
 