Feedback on admin guide - removing node from cluster

iGadget

Member
Apr 9, 2020
In the current version of the admin guide, the instructions on removing a node from a cluster (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node) are missing a vital piece of information: what to do about existing replication jobs (and perhaps more, but at least this).

If you fail to disable / remove replication jobs to/from the node you are about to remove, these replication jobs can interfere with the removal, resulting in errors like this:
Code:
# pvecm delnode node2
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 2
error during cfs-locked 'file-corosync_conf' operation: command 'corosync-cfgtool -k 2' failed: exit code 1

Also, the jobs you failed to remove remain present on all involved nodes even after the node has been removed. When that happens, you can no longer remove them, since that requires a sync with the node you have just removed from the cluster. The result is a never-ending 'pending removal' status for those jobs.
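To make the check described above concrete, here is a minimal sketch of how one might list the replication jobs that target a given node before removing it. It assumes the jobs are stored in /etc/pve/replication.cfg as sections of the form "local: <job-id>" followed by indented "target <node>" lines; the function name jobs_targeting_node is hypothetical and not part of Proxmox. It only finds jobs replicating *to* the node; jobs replicating *from* guests on that node would still need to be checked separately (for example with pvesr list).

```shell
#!/bin/sh
# Hypothetical helper: print the IDs of replication jobs whose target is NODE.
# Assumption: the config uses "local: <job-id>" section headers followed by
# indented "target <node>" property lines.
jobs_targeting_node() {
    node="$1"
    cfg="${2:-/etc/pve/replication.cfg}"
    awk -v node="$node" '
        /^[a-z]+:/ { id = $2; next }              # section header, remember job ID
        $1 == "target" && $2 == node { print id } # property line naming the node
    ' "$cfg"
}

# Possible usage on a remaining cluster node, before running pvecm delnode
# (not executed here; review the list before deleting anything):
#   for id in $(jobs_targeting_node node2); do pvesr delete "$id"; done
```

Printing the IDs first, rather than deleting directly, leaves room to review the list before any job is removed.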

I found out about this issue the hard way, and the only(?) way to fix it was to re-install the node with the same hostname / IP, re-add the node to the cluster and then trigger a replication job (i.e. by migrating the VM / container involved in the affected replication jobs).

Who should I ping to have this (and perhaps additional) info added to the admin guide?
 
Thank you for reporting this!
"Who should I ping to have this (and perhaps additional) info added to the admin guide?"
It would be great if you could post the problem to the PVE Bugzilla.
 
@Dominic - A few weeks 'wiser' (and another involuntary removal of a node from the cluster later), I'm starting to wonder if this isn't actually a bug rather than just a missing piece of documentation?
I'm asking because my latest encounter with the removal of a node happened with a node that would no longer boot, so there would have been no way for me to perform the additional steps required, even if they *were* documented.

Shouldn't Proxmox be able to handle the (I'm assuming not-so-uncommon) scenario of removing unrecoverable nodes from a cluster more gracefully?

Another, perhaps related, issue I ran into today is that SSH fingerprints & keys of removed nodes remain spread across several config files. Am I correct that this data *should* all have been removed when executing the pvecm delnode command?

The reason I came across this issue was because node1 in my micro-cluster just issued a 'WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!' when I tried connecting from node1 to node2 with SSH.
The weird thing is, even though /etc/ssh/ssh_known_hosts is identical on all 3 nodes, only node1 throws this warning while node3 just connects to node2 without any issue.
What am I missing?

[UPDATE] - as of this evening, node3 also refuses to connect to node2. So apparently, there was a moment when /etc/ssh/ssh_known_hosts was out of sync between at least nodes 1 and 3.
And if that's the case, I wonder what caused the (as I hope it to be) wrong / outdated /etc/ssh/ssh_known_hosts version to a) appear and b) be replicated across the cluster?

[UPDATE2] - what's also interesting is that when I connect from my workstation or laptop to node2 via SSH, I get no warning at all. Could this be because they're both using ECDSA fingerprints instead of RSA? And if so, how come those have not changed? Also - why isn't Proxmox using ECDSA?
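One way to look into the RSA vs. ECDSA question is to check which host key types a known_hosts file actually stores for a node. The following is only a sketch: it assumes unhashed entries in the usual "host[,host...] keytype base64key" layout without leading markers such as @cert-authority, and the function name known_host_key_types is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the key type of every known_hosts entry that
# names HOST. Assumption: entries are unhashed and carry no leading marker.
known_host_key_types() {
    host="$1"
    file="${2:-/etc/ssh/ssh_known_hosts}"
    awk -v host="$host" '
        /^[[:space:]]*(#|$)/ { next }      # skip comments and blank lines
        {
            n = split($1, names, ",")      # first field may list several names
            for (i = 1; i <= n; i++) {
                if (names[i] == host) { print $2; break }
            }
        }
    ' "$file"
}

# Possible usage (not executed here):
#   known_host_key_types node2                      # cluster-wide file
#   known_host_key_types node2 ~/.ssh/known_hosts   # per-user file
```

Running it against both the cluster-wide file and a workstation's own known_hosts would show whether the two are in fact comparing different key types for node2.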