Network change caused the node to be down

Dec 28, 2019
Hi,

By mistake, we made a network change to one of our Proxmox nodes and it is now completely down. After reading a few articles, I learned that network changes made after the cluster is set up can break the connection, and that we need to remove the node and set it up again from scratch. The cluster has 7 nodes; the 7th one is the one with the issue.

I referred to the article (https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_remove_a_cluster_node) for the instructions to remove the node from the existing system. To get the ID of the node for removal, I executed the command 'pvecm nodes' but it doesn't list the problematic node.

root@satapx-sg1-n1:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
1 1 satapx-xxx-n1 (local)
2 1 satapx-xxx-n3
3 1 satapx-xxx-n4
4 1 satapx-xxx-n2
5 1 satapx-xxx-n5
6 1 satapx-xxx-n6


Below is the result of 'pvecm status'


# pvecm status
Cluster information
-------------------
Name: satapx-cluster
Config Version: 7
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Mar 30 14:16:00 2020
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000001
Ring ID: 1.149
Quorate: Yes

Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 117.xxx.x.x (local)
0x00000002 1 117.xxx.x.x
0x00000003 1 117.xxx.x.x
0x00000004 1 117.xxx.x.x
0x00000005 1 117.xxx.x.x
0x00000006 1 117.xxx.x.x

In the web GUI, the 7th node is shown with a red cross mark.

In this case, how can I safely remove this node from the existing proxmox installation? I need to remove node 7 and add it to the same cluster after installing a fresh proxmox.

Thanks in advance.
 

The command only lists running and connected nodes.

If you know the ID of the failed node 7 you can continue with the procedure. If you are unsure about the node ID you can check the contents of the corosync config file with cat /etc/pve/corosync.conf. In the nodelist section you should see the names of each node.
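As a rough sketch of that lookup, the helper below prints one "nodeid name" pair per node from a corosync.conf-style file. The function name list_corosync_nodes is made up for illustration, and it assumes every node { } block in the nodelist section carries one name: line and one nodeid: line.

```shell
# Hypothetical helper: print "nodeid name" for each node { } block.
# Defaults to /etc/pve/corosync.conf when no file is given.
list_corosync_nodes() {
    conf="${1:-/etc/pve/corosync.conf}"
    awk '
        # A line starting with "node {" opens a node block ("nodelist {" does not match).
        /^[[:space:]]*node[[:space:]]*[{]/ { in_node = 1; name = ""; id = "" }
        in_node && $1 == "name:"   { name = $2 }
        in_node && $1 == "nodeid:" { id = $2 }
        # The closing brace ends the block; emit what was collected.
        in_node && /[}]/ { print id, name; in_node = 0 }
    ' "$conf"
}
```

Running list_corosync_nodes on any cluster member should then also show nodes that are currently offline, since it reads the config rather than the live membership.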

What did you change in the network config that cannot be changed back to its original state so the node can connect to the cluster again?
 
Hi @aaron , Thank you for your response.

I know the ID is 7 and the name is 'satapx-xxx-n7'. So, will executing the command "pvecm delnode satapx-xxx-n7" (after powering down node N7) completely remove the node from this Proxmox cluster? Is there anything else I should do to keep the other nodes safe?
 
Don't forget about the Note in the docs, and clean up the known_hosts files in /root/.ssh and /etc/pve/priv/.
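One possible way to do that cleanup is sketched below. The function name remove_known_host is hypothetical, and it assumes the entries are unhashed, i.e. the first field of each line is a plain comma-separated list of host names or IPs. It rewrites the file in place instead of replacing it, so the file itself (or a symlink pointing at it) is left where it is.

```shell
# Hypothetical helper: drop every known_hosts line whose host field
# lists exactly the given name or IP. Assumes unhashed entries.
remove_known_host() {
    file="$1"
    host="$2"
    tmp=$(mktemp) || return 1
    if awk -v h="$host" '
        {
            n = split($1, hosts, ",")
            keep = 1
            for (i = 1; i <= n; i++) if (hosts[i] == h) keep = 0
            if (keep) print
        }
    ' "$file" > "$tmp"; then
        # Write the filtered content back into the existing file.
        cat "$tmp" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Example calls (run once per name and once per IP of the removed node):
# remove_known_host /root/.ssh/known_hosts satapx-xxx-n7
# remove_known_host /etc/pve/priv/known_hosts satapx-xxx-n7
```

Keeping a backup copy of each file before running this is a sensible precaution.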

I really would like to know what you changed in the network config that cannot be reverted but needs a reinstall :)
 
Hi @aaron , we have managed to revert the network changes. The changes were made from the Proxmox GUI, but the interfaces file on the node somehow ended up empty. We copied an interfaces file from another node and adapted it to work for N7. N7 is back online now. The earlier change concerned a network bridge that was attached to the wrong interface.

Now we do need to remove nodes from a cluster (a different PVE cluster, INPX) to replace the SATA SSDs with NVMe, so I would like to double-check the removal process.

To remove Node 3,

1) Log in to another node (e.g. Node 1)
2) Issue a 'pvecm nodes' command to identify the ID and name of the node to remove
3) Power off N3
4) Execute the command 'pvecm delnode INPX-XXX-N3'
5) Check the node list again with 'pvecm nodes' or 'pvecm status' to confirm the node is removed
6) Remove the SSH fingerprint from the known_hosts files
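
The checks around steps 3 to 5 could be sketched roughly as follows. All three function names are made up for illustration. The sketch assumes the text formats shown earlier in this thread: a "Quorate: Yes" line in the 'pvecm status' output and the node name in the third column of the 'pvecm nodes' output. It only prints the delnode command instead of running it.

```shell
# Succeeds only if status text on stdin contains "Quorate: Yes".
is_quorate() {
    awk '$1 == "Quorate:" { found = ($2 == "Yes") } END { exit !found }'
}

# Succeeds only if the node name is absent from the membership list on stdin
# (pvecm nodes lists running and connected nodes only).
node_is_offline() {
    awk -v n="$1" '$3 == n { found = 1 } END { exit found }'
}

# Hypothetical wrapper: print (not run) the delnode command once both checks pass.
plan_delnode() {
    node="$1"
    status_text="$2"
    nodes_text="$3"
    printf '%s\n' "$status_text" | is_quorate \
        || { echo "cluster is not quorate" >&2; return 1; }
    printf '%s\n' "$nodes_text" | node_is_offline "$node" \
        || { echo "$node is still listed as online" >&2; return 1; }
    echo "pvecm delnode $node"
}

# Example call on a remaining node, after powering off the node to remove:
# plan_delnode INPX-XXX-N3 "$(pvecm status)" "$(pvecm nodes)"
```

Printing the command first keeps the destructive step a deliberate, manual action.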

Once it is removed, we will replace the SSD, do a fresh Proxmox installation, and join the node to the same cluster.

Can I follow these same instructions to remove the Monitor nodes also?
 
Yes, if you follow the guide in the docs and clean up the SSH fingerprints, everything should be okay for the node to be re-added after a fresh install.
 
Hi @aaron , I have one question,
The node I am going to remove is one of the Monitors. Should I follow additional steps to remove the monitor first and add another node to the monitor list (before completely removing the problematic node from the cluster)?
 
Hi,

I removed node N1 from the cluster by following the instructions at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node

Before removal, I removed the node from Monitors and Managers.

I executed 'pvecm delnode node_name'

The node was removed from the GUI, but an entry for it still exists under /etc/pve/nodes

Also, the OSDs have not been removed. Since the node has already been removed, I am unable to remove the OSDs from the GUI.

How can I remove the dead OSDs from GUI?
 
