In the current version of the admin guide, the instructions for removing a node from a cluster (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node) are missing a vital piece of information: existing replication jobs have to be dealt with first (there may be more missing, but at least this).
If you do not disable / remove the replication jobs to/from the node you are about to remove, those jobs can interfere with the removal, resulting in errors like this:
Code:
# pvecm delnode node2
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 2
error during cfs-locked 'file-corosync_conf' operation: command 'corosync-cfgtool -k 2' failed: exit code 1
Also, the jobs you failed to remove remain present on all involved nodes even after the node has been removed. At that point you can no longer remove them, since removal requires a sync with the node you have just removed from the cluster, so those jobs stay in a never-ending 'pending removal' status.
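For reference, the step I would have expected in the guide is roughly the following, run before `pvecm delnode`. This is only a sketch based on my understanding of the `pvesr` CLI (`pvesr list` to see the jobs, `pvesr delete <id>` to remove one), not an official procedure. The job IDs `100-0` and `101-0` are placeholders, and the `JOBS` / `DRY_RUN` variables and the helper function are names I made up for illustration.

```shell
#!/bin/sh
# Sketch only: run on a cluster node BEFORE `pvecm delnode`.
# First check which replication jobs involve the node, e.g. with:
#   pvesr list
#
# JOBS is a hand-filled, space-separated list of the job IDs that have
# the node as source or target (placeholder IDs below, not real ones).
JOBS="${JOBS:-100-0 101-0}"
# DRY_RUN=1 (the default in this sketch) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or run one `pvesr delete` per job ID.
remove_replication_jobs() {
    for job in $JOBS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "pvesr delete $job"
        else
            pvesr delete "$job"
        fi
    done
}

remove_replication_jobs
```

With the defaults this only prints the `pvesr delete` lines so you can review them. Setting `DRY_RUN=0` would actually run them, after which I would wait until the jobs no longer show up in `pvesr list` before removing the node.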
I found out about this issue the hard way, and the only(?) way to fix it was to re-install the node with the same hostname / IP, re-add it to the cluster and then trigger a replication job (e.g. by migrating the VM / container involved in the affected replication jobs).
Who should I ping to have this (and perhaps additional) info added to the admin guide?