[SOLVED] Unable to properly remove node from cluster

Dragonn

Member
May 23, 2020
Hello,

I am struggling to properly remove a single Proxmox node from the cluster. I am following the guide in the docs https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node and it looks like the node is only partially removed.

Basically, I did something like:
Code:
# ensured no VMs are on node
systemctl stop pve-ha-lrm pve-ha-crm corosync pve-cluster pvedaemon pveproxy
dd if=/dev/urandom of=/dev/sda
shutdown now
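
For reference, my reading of the documented sequence boils down to roughly this (just a sketch, run from one of the remaining nodes, with virt98 as the node being removed):
Code:
# sketch of the documented procedure, not what I actually ran:
pvecm nodes                # note the name/ID of the node while it is still listed
# power off the node to be removed (virt98), then on a remaining node:
pvecm delnode virt98       # remove it from the corosync configuration
pvecm status               # verify quorum and membership afterwards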

Then I tried to remove the node from the cluster and got an error:
Code:
P virt1[root](15:01:30)-(~)
-> pvecm delnode virt98
Killing node 98
Could not kill node (error = CS_ERR_NOT_EXIST)
error during cfs-locked 'file-corosync_conf' operation: command 'corosync-cfgtool -k 98' failed: exit code 1

In many places I cannot find any trace of it:
Code:
P virt1[root](15:01:38)-(~)
-> grep virt98 /etc/pve/.members 

P virt1[root](15:01:43)-(~)
-> grep 98 /etc/corosync/corosync.conf 

P virt1[root](15:02:15)-(~)
-> pvecm delnode virt98
error during cfs-locked 'file-corosync_conf' operation: Node/IP: virt98 is not a known host of the cluster.

But I can still see it in the GUI and in many places:
Code:
P virt1[root](15:44:28)-(~)
-> jq .node_status.virt98 /etc/pve/ha/manager_status
"gone"

P virt1[root](15:44:30)-(~)
-> ls -l /etc/pve/nodes/virt98
total 2
-rw-r----- 1 root www-data   84 Feb 19 14:58 lrm_status
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 lxc
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 openvz
drwx------ 2 root www-data    0 Feb  1 16:18 priv
-rw-r----- 1 root www-data 1675 Feb  1 16:18 pve-ssl.key
-rw-r----- 1 root www-data 1712 Feb  1 16:18 pve-ssl.pem
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 qemu-server

Also, I was unable to find any reason why removing the node from corosync would have failed. The syslog looks as expected to me:
Code:
Feb 19 15:01:39 virt1 pvecm[30430]: <root@pam> deleting node virt98 from cluster
Feb 19 15:01:39 virt1 pmxcfs[34727]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 21)
Feb 19 15:01:39 virt1 corosync[6398]:   [CFG   ] Config reload requested by node 1
Feb 19 15:01:39 virt1 corosync[6398]:   [TOTEM ] Configuring link 0
Feb 19 15:01:39 virt1 corosync[6398]:   [TOTEM ] Configured link number 0: local addr: 192.168.248.76, port=5405
Feb 19 15:01:39 virt1 corosync[6398]:   [TOTEM ] Configuring link 1
Feb 19 15:01:39 virt1 corosync[6398]:   [TOTEM ] Configured link number 1: local addr: 192.168.232.60, port=5406
Feb 19 15:01:39 virt1 corosync[6398]:   [KNET  ] host: host: 98 (passive) best link: 0 (pri: 0)
Feb 19 15:01:39 virt1 corosync[6398]:   [KNET  ] host: host: 98 has no active links
Feb 19 15:01:39 virt1 corosync[6398]:   [KNET  ] host: host: 98 (passive) best link: 0 (pri: 0)
Feb 19 15:01:39 virt1 corosync[6398]:   [KNET  ] host: host: 98 has no active links
Feb 19 15:01:39 virt1 pmxcfs[34727]: [status] notice: update cluster info (cluster name  virt, version = 21)

Do you have any idea how to delete it properly and (most importantly) what I have done wrong?

Thanks for your time.
 
Hi,

Could you post the output of pvecm status and cat /etc/pve/corosync.conf? Is the cluster still behaving normally otherwise?
systemctl stop pve-ha-lrm pve-ha-crm corosync pve-cluster pvedaemon pveproxy
dd if=/dev/urandom of=/dev/sda
shutdown now
Which drive did you overwrite here?

Also, I was unable to find any reason why removing the node from corosync would have failed. The syslog looks as expected to me:
From the first line of this log, it seems as though it got removed.

Do you have any idea how to delete it properly and (most importantly) what I have done wrong?
I can't say for certain what you did wrong, but it does seem that you overcomplicated things a bit, and perhaps mixed the two node removal methods (i.e., normal separation and separation without reinstalling).
 
Sure, no problem. But I can see no trace of virt98 in the corosync configuration or runtime state.

Code:
P virt3[root](15:20:22)-(~)
-> pvecm status
Cluster information
-------------------
Name:             virt
Config Version:   21
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Feb 23 13:01:12 2021
Quorum provider:  corosync_votequorum
Nodes:            19
Node ID:          0x00000003
Ring ID:          1.7e1
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   19
Highest expected: 19
Total votes:      19
Quorum:           10  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.251.48
0x00000002          1 192.168.248.232
0x00000003          1 192.168.249.14 (local)
0x00000004          1 192.168.249.55
0x00000005          1 192.168.249.72
0x00000006          1 192.168.248.253
0x00000007          1 192.168.249.56
0x00000008          1 192.168.249.21
0x00000009          1 192.168.249.74
0x0000000a          1 192.168.248.76
0x0000000b          1 192.168.249.59
0x0000000c          1 192.168.249.58
0x0000000d          1 192.168.249.19
0x0000000e          1 192.168.249.91
0x0000000f          1 192.168.249.31
0x00000010          1 192.168.249.73
0x00000012          1 192.168.249.191
0x00000061          1 192.168.251.52
0x00000063          1 192.168.251.45

Code:
P virt3[root](13:01:12)-(~)
-> cat /etc/pve/corosync.conf 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: virt1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.251.48
    ring1_addr: 192.168.232.51
  }
  node {
    name: virt10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 192.168.248.76
    ring1_addr: 192.168.232.60
  }
  node {
    name: virt11
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 192.168.249.59
    ring1_addr: 192.168.232.61
  }
  node {
    name: virt12
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 192.168.249.58
    ring1_addr: 192.168.232.62
  }
  node {
    name: virt13
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 192.168.249.19
    ring1_addr: 192.168.232.63
  }
  node {
    name: virt14
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 192.168.249.91
    ring1_addr: 192.168.232.64
  }
  node {
    name: virt15
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 192.168.249.31
    ring1_addr: 192.168.232.65
  }
  node {
    name: virt16
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 192.168.249.73
    ring1_addr: 192.168.232.66
  }
  node {
    name: virt18
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 192.168.249.191
    ring1_addr: 192.168.232.68
  }
  node {
    name: virt2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.248.232
    ring1_addr: 192.168.232.52
  }
  node {
    name: virt3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.249.14
    ring1_addr: 192.168.232.53
  }
  node {
    name: virt4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.249.55
    ring1_addr: 192.168.232.54
  }
  node {
    name: virt5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.249.72
    ring1_addr: 192.168.232.55
  }
  node {
    name: virt6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.248.253
    ring1_addr: 192.168.232.56
  }
  node {
    name: virt7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.249.56
    ring1_addr: 192.168.232.57
  }
  node {
    name: virt8
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.249.21
    ring1_addr: 192.168.232.58
  }
  node {
    name: virt9
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 192.168.249.74
    ring1_addr: 192.168.232.59
  }
  node {
    name: virt97
    nodeid: 97
    quorum_votes: 1
    ring0_addr: 192.168.251.52
    ring1_addr: 192.168.232.47
  }
  node {
    name: virt99
    nodeid: 99
    quorum_votes: 1
    ring0_addr: 192.168.251.45
    ring1_addr: 192.168.232.49
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: virt
  config_version: 21
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Which drive did you overwrite here?
I just wiped the system drive of the removed hypervisor virt98. This is one of the easiest ways for me to recycle hardware servers, because after this I can safely and simply reboot the server into a PXE live image, wipe it, and reinstall it with whatever I need.

I am not sure what I can do now. Can I simply run rm -rf /etc/pve/nodes/virt98 and rm -rf /etc/pve/priv/lock/ha_agent_virt98_lock/ and consider it done?

Code:
P virt3[root](13:10:45)-(/etc/pve)
-> find /etc/pve -name \*virt98\*
/etc/pve/nodes/virt98
/etc/pve/priv/lock/ha_agent_virt98_lock

P virt3[root](13:11:00)-(/etc/pve)
-> ls -lh /etc/pve/nodes/virt98
total 1.5K
-rw-r----- 1 root www-data   84 Feb 19 14:58 lrm_status
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 lxc
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 openvz
drwx------ 2 root www-data    0 Feb  1 16:18 priv
-rw-r----- 1 root www-data 1.7K Feb  1 16:18 pve-ssl.key
-rw-r----- 1 root www-data 1.7K Feb  1 16:18 pve-ssl.pem
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 qemu-server

P virt3[root](13:11:01)-(/etc/pve)
-> ls -lh /etc/pve/priv/lock/ha_agent_virt98_lock/
total 0
 
Okay, it seems that it must have been deleted by one of the steps you took, but I really can't say what happened here.

I am not sure what I can do now. Can I simply run rm -rf /etc/pve/nodes/virt98 and rm -rf /etc/pve/priv/lock/ha_agent_virt98_lock/ and consider it done?
Seeing as the node has been wiped clean and the cluster is reporting a healthy, quorate status without mention of it, I would say those directories can be simply removed without issue.
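
For completeness, the cleanup would look roughly like this (a sketch using the paths you found above; adjust the node name as needed):
Code:
# run on any remaining cluster node; /etc/pve is the cluster-wide pmxcfs mount
rm -rf /etc/pve/nodes/virt98
rm -rf /etc/pve/priv/lock/ha_agent_virt98_lock
find /etc/pve -name '*virt98*'   # should now return nothing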
 
@dylanw just wanted to say that I had the exact same case as OP, but without doing anything special. So basically I shut down the node to be removed for good and executed `pvecm delnode node4` on a remaining node. The cluster is much smaller than OP's (3 nodes after the removal), but the error and remaining folders were exactly the same. Just like OP, I removed what was left manually, and thanks to this thread I assume the cluster is healthy and everything is fine. But this is confusing, and maybe reproducible, given that I hit it as well. So maybe it would be worth investigating more.
 
Same sequence of events here. Migrated containers off the node to be removed, shut it down and then:

Code:
$ pvecm delnode silencio
Killing node 3
Could not kill node (error = CS_ERR_NOT_EXIST)
command 'corosync-cfgtool -k 3' failed: exit code 1

$ pvecm delnode silencio
Node/IP: silencio is not a known host of the cluster.
 
Same sequence of events here. Migrated containers off the node to be removed, shut it down and then
I guess something changed in 7.0 (maybe in the latest builds of 6.x); I got the same issue. Because it's a homelab and I was lazy, I turned the node off for about half a day, then decided to finally remove it from corosync and ran into the same problem. Manual removal of the folders worked.
 
I guess something changed in 7.0 (maybe in the latest builds of 6.x); I got the same issue. Because it's a homelab and I was lazy, I turned the node off for about half a day, then decided to finally remove it from corosync and ran into the same problem. Manual removal of the folders worked.

Forgot to mention that I'm running 6.4-13 here.
 
Just to clarify, the error message seen here isn't necessarily an issue (command 'corosync-cfgtool -k 3' failed: exit code 1). It simply means that corosync could not kill the node, because if you follow the documentation correctly, the node will already be offline. The steps to remove the node are still carried out.
Note also that it is the intended behavior that a node's configuration remains in the cluster's /etc/pve/nodes directory, as this directory contains important configuration information that you may require at a later point. The presence of this directory shouldn't cause any issues.
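
If you want to double-check that the removal actually went through despite that error, something along these lines should suffice (a sketch; run on any remaining node, substituting the removed node's name, e.g. silencio from the post above):
Code:
# quick sanity check after pvecm delnode
pvecm nodes                                  # the removed node should no longer appear
grep silencio /etc/corosync/corosync.conf    # should return nothing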

Are you seeing the removed nodes as offline in the GUI after removing them and refreshing the browser? If so, do you have any remaining config files on the server, such as VM configs? I am aware of other such cases that can cause more problematic issues here [1], and will update the documentation to reflect this soon. While doing so, I will also mention the points raised here.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=3375
 
Are you seeing the removed nodes as offline in the GUI after removing them and refreshing the browser? If so, do you have any remaining config files on the server, such as VM configs? I am aware of other such cases that can cause more problematic issues here [1], and will update the documentation to reflect this soon. While doing so, I will also mention the points raised here.

No issues so far. So we can take it as an overzealous error message. :)
 
