[SOLVED] Unable to properly remove node from cluster

Dragonn

Member
May 23, 2020
Hello,

I am struggling to remove a single Proxmox node from a cluster properly. I followed the guide in the docs https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node and it looks like the node has only been partially removed.

Basically, I did something like this:
Code:
# ensured no VMs are on node
systemctl stop pve-ha-lrm pve-ha-crm corosync pve-cluster pvedaemon pveproxy
dd if=/dev/urandom of=/dev/sda
shutdown now
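
For comparison, my understanding of the relevant part of the admin guide is roughly the following (virt98 being the node I wanted to remove):
Code:
# on any remaining cluster node
pvecm nodes          # note the name/ID of the node to be removed
# power off the node to be removed (it should not come back online with the same configuration)
pvecm delnode virt98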

Then I tried to remove the node from the cluster and got an error:
Code:
P virt1[root](15:01:30)-(~)
-> pvecm delnode virt98
Killing node 98
Could not kill node (error = CS_ERR_NOT_EXIST)
error during cfs-locked 'file-corosync_conf' operation: command 'corosync-cfgtool -k 98' failed: exit code 1

In many places I cannot find any trace of it:
Code:
P virt1[root](15:01:38)-(~)
-> grep virt98 /etc/pve/.members 

P virt1[root](15:01:43)-(~)
-> grep 98 /etc/corosync/corosync.conf 

P virt1[root](15:02:15)-(~)
-> pvecm delnode virt98
error during cfs-locked 'file-corosync_conf' operation: Node/IP: virt98 is not a known host of the cluster.

But I can still see it in the GUI and in several other places:
Code:
P virt1[root](15:44:28)-(~)
-> jq .node_status.virt98 /etc/pve/ha/manager_status
"gone"

P virt1[root](15:44:30)-(~)
-> ls -l /etc/pve/nodes/virt98
total 2
-rw-r----- 1 root www-data   84 Feb 19 14:58 lrm_status
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 lxc
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 openvz
drwx------ 2 root www-data    0 Feb  1 16:18 priv
-rw-r----- 1 root www-data 1675 Feb  1 16:18 pve-ssl.key
-rw-r----- 1 root www-data 1712 Feb  1 16:18 pve-ssl.pem
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 qemu-server

Also, I was unable to find any reason why removing the node from corosync would have failed. The syslog looks as expected to me:
Code:
Feb 19 15:01:39 virt1 pvecm[30430]: <root@pam> deleting node virt98 from cluster
Feb 19 15:01:39 virt1 pmxcfs[34727]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 21)
Feb 19 15:01:39 virt1 corosync[6398]:   [CFG   ] Config reload requested by node 1
Feb 19 15:01:39 virt1 corosync[6398]:   [TOTEM ] Configuring link 0
Feb 19 15:01:39 virt1 corosync[6398]:   [TOTEM ] Configured link number 0: local addr: 192.168.248.76, port=5405
Feb 19 15:01:39 virt1 corosync[6398]:   [TOTEM ] Configuring link 1
Feb 19 15:01:39 virt1 corosync[6398]:   [TOTEM ] Configured link number 1: local addr: 192.168.232.60, port=5406
Feb 19 15:01:39 virt1 corosync[6398]:   [KNET  ] host: host: 98 (passive) best link: 0 (pri: 0)
Feb 19 15:01:39 virt1 corosync[6398]:   [KNET  ] host: host: 98 has no active links
Feb 19 15:01:39 virt1 corosync[6398]:   [KNET  ] host: host: 98 (passive) best link: 0 (pri: 0)
Feb 19 15:01:39 virt1 corosync[6398]:   [KNET  ] host: host: 98 has no active links
Feb 19 15:01:39 virt1 pmxcfs[34727]: [status] notice: update cluster info (cluster name  virt, version = 21)

Do you have any idea how to delete it properly and (most importantly) what I have done wrong?

Thanks for your time.
 
Hi,

Could you post the output of pvecm status and cat /etc/pve/corosync.conf? Is the cluster still behaving normally otherwise?
systemctl stop pve-ha-lrm pve-ha-crm corosync pve-cluster pvedaemon pveproxy
dd if=/dev/urandom of=/dev/sda
shutdown now
Which drive did you overwrite here?

Also, I was unable to find any reason why removing the node from corosync would have failed. The syslog looks as expected to me:
From the first line of this log, it seems as though it got removed.

Do you have any idea how to delete it properly and (most importantly) what I have done wrong?
I can't say for certain what you did wrong, but it does seem that you overcomplicated things a bit, and perhaps mixed the two node removal methods (i.e. normal separation and separation without reinstalling).
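
For reference, the "separation without reinstalling" variant is quite different and is run on the node being removed itself; from memory of the admin guide it looks roughly like this (please double-check against the current documentation before using it):
Code:
# run on the node being separated, NOT on a remaining cluster member
systemctl stop pve-cluster corosync
pmxcfs -l                      # restart the cluster filesystem in local mode
rm /etc/pve/corosync.conf
rm -rf /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster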
 
Sure, no problem. But I can see no trace of virt98 in the corosync configuration or at runtime.

Code:
P virt3[root](15:20:22)-(~)
-> pvecm status
Cluster information
-------------------
Name:             virt
Config Version:   21
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Feb 23 13:01:12 2021
Quorum provider:  corosync_votequorum
Nodes:            19
Node ID:          0x00000003
Ring ID:          1.7e1
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   19
Highest expected: 19
Total votes:      19
Quorum:           10  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.251.48
0x00000002          1 192.168.248.232
0x00000003          1 192.168.249.14 (local)
0x00000004          1 192.168.249.55
0x00000005          1 192.168.249.72
0x00000006          1 192.168.248.253
0x00000007          1 192.168.249.56
0x00000008          1 192.168.249.21
0x00000009          1 192.168.249.74
0x0000000a          1 192.168.248.76
0x0000000b          1 192.168.249.59
0x0000000c          1 192.168.249.58
0x0000000d          1 192.168.249.19
0x0000000e          1 192.168.249.91
0x0000000f          1 192.168.249.31
0x00000010          1 192.168.249.73
0x00000012          1 192.168.249.191
0x00000061          1 192.168.251.52
0x00000063          1 192.168.251.45

Code:
P virt3[root](13:01:12)-(~)
-> cat /etc/pve/corosync.conf 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: virt1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.251.48
    ring1_addr: 192.168.232.51
  }
  node {
    name: virt10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 192.168.248.76
    ring1_addr: 192.168.232.60
  }
  node {
    name: virt11
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 192.168.249.59
    ring1_addr: 192.168.232.61
  }
  node {
    name: virt12
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 192.168.249.58
    ring1_addr: 192.168.232.62
  }
  node {
    name: virt13
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 192.168.249.19
    ring1_addr: 192.168.232.63
  }
  node {
    name: virt14
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 192.168.249.91
    ring1_addr: 192.168.232.64
  }
  node {
    name: virt15
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 192.168.249.31
    ring1_addr: 192.168.232.65
  }
  node {
    name: virt16
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 192.168.249.73
    ring1_addr: 192.168.232.66
  }
  node {
    name: virt18
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 192.168.249.191
    ring1_addr: 192.168.232.68
  }
  node {
    name: virt2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.248.232
    ring1_addr: 192.168.232.52
  }
  node {
    name: virt3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.249.14
    ring1_addr: 192.168.232.53
  }
  node {
    name: virt4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.249.55
    ring1_addr: 192.168.232.54
  }
  node {
    name: virt5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.249.72
    ring1_addr: 192.168.232.55
  }
  node {
    name: virt6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.248.253
    ring1_addr: 192.168.232.56
  }
  node {
    name: virt7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.249.56
    ring1_addr: 192.168.232.57
  }
  node {
    name: virt8
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.249.21
    ring1_addr: 192.168.232.58
  }
  node {
    name: virt9
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 192.168.249.74
    ring1_addr: 192.168.232.59
  }
  node {
    name: virt97
    nodeid: 97
    quorum_votes: 1
    ring0_addr: 192.168.251.52
    ring1_addr: 192.168.232.47
  }
  node {
    name: virt99
    nodeid: 99
    quorum_votes: 1
    ring0_addr: 192.168.251.45
    ring1_addr: 192.168.232.49
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: virt
  config_version: 21
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Which drive did you overwrite here?
I just wiped the system drive of the removed hypervisor virt98. This is one of the easiest ways for me to recycle hardware servers, because afterwards I can safely and simply reboot the server into a PXE live image, wipe it, and reinstall it as whatever I need.

I am not sure what I can do now. Can I simply run rm -rf /etc/pve/nodes/virt98 and rm -rf /etc/pve/priv/lock/ha_agent_virt98_lock/ and consider it done?

Code:
P virt3[root](13:10:45)-(/etc/pve)
-> find /etc/pve -name \*virt98\*
/etc/pve/nodes/virt98
/etc/pve/priv/lock/ha_agent_virt98_lock

P virt3[root](13:11:00)-(/etc/pve)
-> ls -lh /etc/pve/nodes/virt98
total 1.5K
-rw-r----- 1 root www-data   84 Feb 19 14:58 lrm_status
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 lxc
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 openvz
drwx------ 2 root www-data    0 Feb  1 16:18 priv
-rw-r----- 1 root www-data 1.7K Feb  1 16:18 pve-ssl.key
-rw-r----- 1 root www-data 1.7K Feb  1 16:18 pve-ssl.pem
drwxr-xr-x 2 root www-data    0 Feb  1 16:18 qemu-server

P virt3[root](13:11:01)-(/etc/pve)
-> ls -lh /etc/pve/priv/lock/ha_agent_virt98_lock/
total 0
 
Okay, it seems that it must have been deleted by one of the steps you took, but I really can't say what happened here.

I am not sure what I can do now. Can I simply run rm -rf /etc/pve/nodes/virt98 and rm -rf /etc/pve/priv/lock/ha_agent_virt98_lock/ and consider it done?
Seeing as the node has been wiped clean and the cluster is reporting a healthy, quorate status without mention of it, I would say those directories can be simply removed without issue.
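
As a sanity check, the cleanup based on the paths you found above would look something like this:
Code:
# on any remaining cluster node, using the paths from your find output
rm -rf /etc/pve/nodes/virt98
rm -rf /etc/pve/priv/lock/ha_agent_virt98_lock/
# verify nothing is left over and the cluster is still quorate
find /etc/pve -name '*virt98*'
pvecm status | grep Quorate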
 
@dylanw just wanted to say that I had exactly the same case as the OP, but without doing anything special. Basically, I shut down the node I wanted to remove for good and executed `pvecm delnode node4` on a remaining node. The cluster is much smaller than the OP's (3 nodes after the removal), but the error and the leftover folders were exactly the same. Just like the OP, I removed what was left manually, and thanks to this thread I assume the cluster is healthy and everything is fine. But this is confusing, and apparently reproducible given that I hit it too, so it might be worth investigating further.
 
Same sequence of events here. Migrated containers off the node to be removed, shut it down and then:

Code:
$ pvecm delnode silencio
Killing node 3
Could not kill node (error = CS_ERR_NOT_EXIST)
command 'corosync-cfgtool -k 3' failed: exit code 1

$ pvecm delnode silencio
Node/IP: silencio is not a known host of the cluster.
 
Same sequence of events here. Migrated containers off the node to be removed, shut it down and then
I guess something has changed in 7.0 (maybe in the latest builds of 6.x); I got the same issue. Because it's a homelab and I was lazy, I turned the node off for about half a day, then decided to finally remove it from corosync and hit the same problem. Manual removal of the folders worked.
 
I guess something has changed in 7.0 (maybe in the latest builds of 6.x); I got the same issue. Because it's a homelab and I was lazy, I turned the node off for about half a day, then decided to finally remove it from corosync and hit the same problem. Manual removal of the folders worked.

Forgot to mention that I'm running 6.4-13 here.
 
Just to clarify, the error message seen here isn't necessarily an issue (command 'corosync-cfgtool -k 3' failed: exit code 1). It simply means that corosync could not kill the node, because if you follow the documentation correctly, the node will already be offline. The steps to remove the node are still carried out.
Note also that it is the intended behavior that a node's configuration remains in the cluster's /etc/pve/nodes directory, as this directory contains important configuration information that you may require at a later point. The presence of this directory shouldn't cause any issues.
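
If you want to confirm that the removal went through despite that message, a quick check like the following should be sufficient:
Code:
# on any remaining cluster node
pvecm nodes                           # the removed node should no longer be listed
grep 'name:' /etc/pve/corosync.conf   # the nodelist should not contain the removed node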

Are you seeing the removed nodes as offline in the GUI after removing them and refreshing the browser? If so, do you have any remaining config files on the server, such as VM configs? I am aware of other issues that can be more problematic here [1], and will update the documentation to reflect this soon. While doing so, I will also mention the points raised here.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=3375
 
Are you seeing the removed nodes as offline in the GUI after removing them and refreshing the browser? If so, do you have any remaining config files on the server, such as VM configs? I am aware of other issues that can be more problematic here [1], and will update the documentation to reflect this soon. While doing so, I will also mention the points raised here.

No issues so far. So we can take it as an overzealous error message. :)
 
