Hi
We have now started the process of replacing some nodes in our cluster. The cluster runs Ceph + HA.
So far, we have marked each OSD "out" and let the cluster rebalance after each one. We then stopped the OSDs and removed the MGR, MON and MDS roles from the node being removed.
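For reference, the per-OSD drain described above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not necessarily the exact commands used here. The node name and OSD IDs are hypothetical placeholders. The MGR, MON and MDS IDs are assumed to match the node name, which is the pveceph default. `pveceph osd destroy` is assumed as the final OSD removal step, since no OSDs remain on the node. By default the script only prints the commands. Set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch only: print (or, with DRY_RUN=0, run) the drain steps for one node.
# The node name and OSD IDs passed to drain_node are hypothetical placeholders.

run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    fi
}

# drain_node NODE OSD_ID...
drain_node() {
    node=$1
    shift
    for id in "$@"; do
        # Mark the OSD out so Ceph moves its placement groups to other OSDs.
        run ceph osd out "$id"
        # Real runs only: wait until Ceph reports the OSD is safe to destroy.
        if [ "${DRY_RUN:-1}" = 0 ]; then
            while ! ceph osd safe-to-destroy "osd.$id" >/dev/null 2>&1; do
                sleep 60
            done
        fi
        run systemctl stop "ceph-osd@$id"
        run pveceph osd destroy "$id"
    done
    # Remove the Ceph daemons on the node (IDs assumed to equal the node name).
    run pveceph mgr destroy "$node"
    run pveceph mon destroy "$node"
    run pveceph mds destroy "$node"
}
```

Usage: `drain_node nodeXXXX 10 11` prints the plan, and `DRY_RUN=0 drain_node nodeXXXX 10 11` would execute it one OSD at a time, waiting for the rebalance in between.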
Next, following the documentation for removing a node, we rebooted the node into a state where PVE was no longer running. We then removed the node from the HA group we had created and finally ran "pvecm delnode <node name>" from one of the remaining nodes.
At this point, the node is gone from the server list in the GUI, but it is still shown as a node in the Ceph OSD list, with no OSDs defined on it.
Under HA, it is shown like this:
Code:
lrm nodeXXXX (old timestamp - dead?, Tue Oct 19 16:05:31 2021)
We have tried restarting pve-ha-crm.service on all remaining nodes, but this made no difference.
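To check whether the stale LRM entry is still present, the `ha-manager status` output can be filtered for the "old timestamp - dead?" marker. This is a minimal sketch that assumes the line format quoted above. `stale_lrm_nodes` and `wait_for_lrm_cleanup` are hypothetical helper names, not part of Proxmox VE.

```shell
#!/bin/sh
# Read `ha-manager status` output on stdin and print the names of nodes whose
# LRM line is flagged "old timestamp - dead?". Assumes the line format
# "lrm <node> (old timestamp - dead?, <date>)".
stale_lrm_nodes() {
    sed -n 's/^lrm \([^ ]*\) (old timestamp - dead?.*/\1/p'
}

# Poll once a minute until the given node no longer has a stale LRM line
# (either the line changes or it disappears from the status output entirely).
wait_for_lrm_cleanup() {
    while ha-manager status | stale_lrm_nodes | grep -qxF "$1"; do
        sleep 60
    done
}
```

Usage: `ha-manager status | stale_lrm_nodes` lists any dead LRM entries, and `wait_for_lrm_cleanup nodeXXXX` blocks until the entry for that node is gone.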
Will the node drop out by itself from here, or should it have been removed from the HA group before rebooting it into an unusable state? If it should have been removed first, how can we fix it now?
UPDATE: It seems it just took some time before HA removed the node, but it is still shown in the Ceph OSD list.
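The leftover entry in the Ceph OSD list is presumably an empty host bucket in the CRUSH map. The Ceph documentation describes `ceph osd crush remove <bucket-name>` for removing a bucket. This is a minimal sketch that assumes the bucket is named after the node (`nodeXXXX` here is a placeholder) and that the Ceph release provides `ceph osd ls-tree`. It checks that no OSDs are listed under the bucket first. By default it only prints the command. Set `DRY_RUN=0` to execute it.

```shell
#!/bin/sh
# Sketch: remove the CRUSH host bucket left behind by a removed node.
# DRY_RUN defaults to 1, which only prints the command instead of running it.

run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    fi
}

# remove_empty_host_bucket HOST
remove_empty_host_bucket() {
    host=$1
    # Real runs only: refuse to continue if any OSD IDs are listed under the bucket.
    if [ "${DRY_RUN:-1}" = 0 ] && [ -n "$(ceph osd ls-tree "$host")" ]; then
        echo "CRUSH bucket $host still contains OSDs; not removing" >&2
        return 1
    fi
    run ceph osd crush remove "$host"
}
```

Running `ceph osd tree` before and after should show whether the host entry is gone.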