[SOLVED] Removal of node from ceph failed

Dec 28, 2019
I recently removed one of the three nodes from my Proxmox cluster, following the instructions at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node. After I removed the node with "pvecm delnode nodename", it disappeared from the Proxmox GUI. However, entries for it still exist under OSD and in the CRUSH map (that is what I can see so far; there may be more).

Has anyone faced the same issue? I want to re-join the same node after upgrading it to an NVMe SSD, but I am afraid that won't work because the entries of the removed node haven't been removed completely.
 
There are a few more steps to follow than what Proxmox describes here: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node
Ceph is a separate step. It is not part of the Proxmox VE cluster itself and hence is not covered in that part of the documentation.

It is not complete. It won't remove the node from places like the CRUSH map, /etc/pve/nodes, etc.
The subfolder in /etc/pve/nodes is left intentionally. It may contain custom (e.g. user-provided) files of importance. Removing the empty bucket from Ceph's CRUSH map is currently not implemented; I opened a feature request for it.
https://bugzilla.proxmox.com/show_bug.cgi?id=2695
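Until that is implemented, the empty host bucket has to be found and removed by hand. A minimal Python sketch of how you could spot empty host buckets, assuming you feed it the text printed by "ceph osd tree --format json" (the node names below are hypothetical):

```python
import json
from typing import List


def empty_host_buckets(osd_tree_json: str) -> List[str]:
    """Return the names of CRUSH host buckets that no longer contain any OSDs.

    Assumes the input is the output of `ceph osd tree --format json`.
    """
    tree = json.loads(osd_tree_json)
    empty = []
    for node in tree.get("nodes", []):
        # Only host buckets matter here; an empty (or missing) "children"
        # list means the bucket holds no OSDs any more.
        if node.get("type") == "host" and not node.get("children"):
            empty.append(node["name"])
    return empty


# Hypothetical sample: host "pve2" has lost all its OSDs.
sample = json.dumps({
    "nodes": [
        {"id": -1, "name": "default", "type": "root", "children": [-3, -5]},
        {"id": -3, "name": "pve1", "type": "host", "children": [0]},
        {"id": -5, "name": "pve2", "type": "host", "children": []},
        {"id": 0, "name": "osd.0", "type": "osd"},
    ],
    "stray": [],
})
print(empty_host_buckets(sample))  # ['pve2']
```

Each name it returns is a candidate for "ceph osd crush rm <nodename>"; double-check it really is the removed node before deleting anything.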
 
Okay, if that is the case, how can we confirm that a node was removed from the cluster successfully? Apart from the Ceph entries, the /etc/pve/nodes/nodename folder and the CRUSH map, is there anything else I have to remove manually? Thanks.
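One way to check is to look for the node name in the places mentioned in this thread. A rough Python sketch, assuming you pass it the output of "ceph osd tree --format json" and "ceph mon dump --format json"; it also assumes the monitor ID equals the node name, which may not hold on every setup:

```python
import json
import os
from typing import List


def leftover_references(node: str, osd_tree_json: str, mon_dump_json: str,
                        pve_nodes_dir: str = "/etc/pve/nodes") -> List[str]:
    """List places where `node` still shows up (empty list = nothing found)."""
    problems = []
    # CRUSH map: a host bucket still named after the removed node.
    for item in json.loads(osd_tree_json).get("nodes", []):
        if item.get("type") == "host" and item.get("name") == node:
            problems.append(f"CRUSH host bucket {node!r} still exists")
    # Monitor map; assumption: the monitor was named after the node.
    for mon in json.loads(mon_dump_json).get("mons", []):
        if mon.get("name") == node:
            problems.append(f"monitor {node!r} is still in the monmap")
    # The per-node folder that Proxmox intentionally leaves behind.
    if os.path.isdir(os.path.join(pve_nodes_dir, node)):
        problems.append(f"{pve_nodes_dir}/{node} still exists")
    return problems
```

This only covers the CRUSH map, the monitor map and /etc/pve/nodes; it is not an exhaustive check.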
 
Thank you. By the way, one more question: is there a master concept among PVE nodes? That is, does one node hold the master config while the others just sync from it? If so, does removing that (master) node need additional steps?

I'm asking because in the Ceph config I can see that cluster_network and public_network are defined using the IP address of Node 1, the first node created in the cluster. I also saw the term "master copy of ceph" in an article.
 
Please do the following after completing the standard procedure for removing a server from the Proxmox cluster:

1. Mark the OSDs of the node to be removed as out. If the node has already failed, this step is not required.
2. Log in to the shell and do the following:

a. ceph osd crush remove osd.{id} ==> Use the correct OSD ID here and repeat for every OSD to be removed; don't type the {} in the command.
b. ceph osd crush rm <nodename> ==> Here, give the name of the node to be removed.

These two commands remove the entries from the CRUSH map, but you also need to remove each OSD from the OSD map and delete its auth key:

ceph osd rm osd.{id}
ceph auth del osd.{id}
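Since these commands have to be repeated for every OSD, it can help to generate the full list first and review it. A minimal Python sketch that only builds the command strings (it does not run anything); the OSD IDs and node name in the example are hypothetical:

```python
from typing import Iterable, List


def ceph_removal_commands(osd_ids: Iterable[int], node: str) -> List[str]:
    """Build the cleanup commands described above, in the same order."""
    ids = list(osd_ids)
    # a. remove every OSD of the node from the CRUSH map
    cmds = [f"ceph osd crush remove osd.{i}" for i in ids]
    # b. remove the (now empty) host bucket of the node
    cmds.append(f"ceph osd crush rm {node}")
    for i in ids:
        cmds.append(f"ceph osd rm osd.{i}")    # drop the OSD from the OSD map
        cmds.append(f"ceph auth del osd.{i}")  # delete the OSD's auth key
    return cmds


for cmd in ceph_removal_commands([3, 4], "pve2"):
    print(cmd)
```

Printing instead of executing is deliberate: these commands are destructive, so check the IDs against "ceph osd tree" before pasting them into a shell.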


If your node has not crashed, first do the following before the steps above:

1. Take the OSDs out in the GUI
2. Stop the Ceph services: systemctl stop ceph.target
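For a node that is still running, the same preparation can be written out as commands. A small Python sketch, assuming the GUI "Out" button corresponds to "ceph osd out <id>"; the OSD IDs are hypothetical:

```python
from typing import Iterable, List


def prepare_live_node(osd_ids: Iterable[int]) -> List[str]:
    """Commands to run before removing a node that is still up."""
    # Mark each OSD of the node as out so Ceph moves its data elsewhere.
    cmds = [f"ceph osd out {i}" for i in osd_ids]
    # Then, on the node being removed, stop all Ceph services.
    cmds.append("systemctl stop ceph.target")
    return cmds


for cmd in prepare_live_node([3, 4]):
    print(cmd)
```

Presumably you would also wait until the rebalance triggered by the "out" step has finished (cluster healthy again) before stopping the services.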
 
Thank you. By the way, one more question: is there a master concept among PVE nodes? That is, does one node hold the master config while the others just sync from it? If so, does removing that (master) node need additional steps?
Proxmox VE is a multi-master system. No additional steps, only what is described in the documentation.

I'm asking because in the Ceph config I can see that cluster_network and public_network are defined using the IP address of Node 1, the first node created in the cluster. I also saw the term "master copy of ceph" in an article.
This is just the first IP of the node where the config was initialized. Ceph calculates the network address from it. There is no master in that regard.
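To illustrate how a network address follows from a host IP, here is a small sketch with Python's standard ipaddress module. The /24 prefix is an assumption (the thread does not show it), and this only shows the arithmetic, not Ceph's own code:

```python
import ipaddress


def network_of(host_cidr: str) -> ipaddress.IPv4Network:
    """Derive the network from a host address with prefix, e.g. 192.168.11.2/24."""
    # ip_interface keeps the host bits; .network masks them off.
    return ipaddress.ip_interface(host_cidr).network


net = network_of("192.168.11.2/24")  # assumed /24
print(net)  # 192.168.11.0/24

# Every node whose address lies in that network is covered, not just Node 1.
for ip in ["192.168.11.2", "192.168.11.3", "192.168.11.4"]:
    print(ip, ipaddress.ip_address(ip) in net)
```

So removing Node 1 does not invalidate the setting: the remaining nodes' addresses still fall inside the derived network.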
 
Thank you for the instructions.

I removed a dead node from one of our pve cluster.

* Removed the node using "pvecm delnode".
* Removed the OSDs (from the CRUSH map and the OSD map, and removed their auth keys)
* Removed the empty OSD bucket
* Removed the entry from /etc/pve/nodes

The dead node was both a monitor and a manager. It was removed from the managers but NOT from the monitors.

In the Monitor section, this node still shows up as an 'Unknown' host. How can I remove this entry safely?
 

Attachment: monitor-unknown.png
By the way, the monitor was not removed from ceph.conf even after I ran 'ceph mon remove'.

ceph.conf still contains the dead node's IP, 192.168.11.2:

mon_host = 192.168.11.2 192.168.11.4 192.168.11.3

Can I edit this file manually? If so, do I need to restart any services afterwards to load the new ceph.conf?
 
* Removed the node using "pvecm delnode".
* Removed the OSDs (from the CRUSH map and the OSD map, and removed their auth keys)
These two steps should be done in the opposite order, since the Ceph removal needs to write to /etc/pve/ceph.conf.

ceph.conf still contains the dead node's IP, 192.168.11.2
This happened because of the above.

Can I edit this file manually? If so, do I need to restart any services afterwards to load the new ceph.conf?
Yes you can. And in this case, no service restart is necessary.
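The manual edit boils down to dropping one address from the mon_host line. A minimal Python sketch of that text change, assuming mon_host holds plain IPs separated by spaces or commas, as in the line quoted above (bracketed v2/v1 address vectors are not handled):

```python
import re

# Matches "mon_host = ..." or "mon host = ...", keeping any indentation.
_MON_HOST = re.compile(r"^(\s*mon[ _]host\s*=\s*)(.*)$")


def drop_mon_host(conf_text: str, dead_ip: str) -> str:
    """Return conf_text with dead_ip removed from the mon_host line."""
    lines = conf_text.split("\n")
    for idx, line in enumerate(lines):
        m = _MON_HOST.match(line)
        if m:
            addrs = [a for a in re.split(r"[,\s]+", m.group(2).strip()) if a]
            kept = [a for a in addrs if a != dead_ip]
            lines[idx] = m.group(1) + " ".join(kept)
    return "\n".join(lines)


conf = "[global]\n\t mon_host = 192.168.11.2 192.168.11.4 192.168.11.3\n"
print(drop_mon_host(conf, "192.168.11.2"))
```

The function works on text only; reading /etc/pve/ceph.conf and writing the result back (or simply making the same change in an editor) is left to you.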
 
