Proper OSD replacement procedure

Dec 30, 2024
In our cluster there are currently three hosts with four 3.84TB RI OSDs each.
I want to replace the four OSDs with 3.2TB MU SSDs, and eventually add a fifth OSD.

Currently there is only around 2.2TB used per OSD so this should work.
CEPH version is 19.2.3-pve2, pve-manager/9.1.1/42db4a6cf33dac83 (running kernel: 6.17.2-2-pve)
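A quick back-of-the-envelope check on the capacity (my own numbers, not from any docs): ~2.2 TB used on a 3.2 TB OSD is roughly 69 % utilization, which is still below the default nearfull ratio of 85 %, so the smaller drives should fit as long as usage doesn't grow much during the migration.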

There is a procedure from the CEPH documentation which I'd like to follow (Replace OSD), see https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/

Unfortunately, the instructions seem to be somewhat incomplete.

This command always returns "EBUSY":
ceph osd safe-to-destroy osd.10

Which is to be expected since the OSD is still "up" and "in".

I feel like these instructions are missing the step of taking the OSD down.
Obviously I'd like to set some cluster flags like noout, norebalance, nobackfill to avoid unnecessary load.
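For context, here is how I understand it (osd.10 as an example ID; my interpretation, not the official wording): safe-to-destroy only stops returning EBUSY once no PG depends on that OSD any more, which is the case after draining it:

Code:
# sketch: mark the OSD out and let its PGs backfill to the remaining OSDs
ceph osd out 10

# once backfill has finished, the check should return OK instead of EBUSY
ceph osd safe-to-destroy osd.10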

Searching the forum I can find some posts about this issue, but it seems none of them follow the "official" documented method.

Any hints from a CEPH "white-beard" are greatly appreciated!
I don't want to CRUSH crash my storage cluster ;-)

Thanks!
 
I just did this yesterday, using the Proxmox GUI. YMMV I suppose, but since I added the new drives first, what I did was:
  • Out the OSD. This will start to rebalance in the background but all PGs remain blue or green. (blue I believe was due to the additional OSDs added)
  • Down the OSD which will rebalance about 10x faster but the cluster changes status to WARN and some PGs are yellow until recopied.
  • Notably if I set norebalance it wouldn't leave the WARN state. (but in my case I'd already installed the new OSD)
  • Now that the OSD is Down, Destroy the OSD and clean the disk.
  • [Here, you could install the new OSD and enable rebalance, I suppose]
  • Wait for cluster to recover to Healthy.
  • Repeat.
I recommend one OSD at a time and definitely not two on different hosts (you don't want to drop two copies of the same PG, or overload the network/cluster with rebalancing). You could try disabling the flags, but I'd personally want the cluster healthy (at least green and blue, not yellow/WARN) before removing a second OSD.
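For anyone who prefers the shell, here is a rough CLI sketch of the same GUI steps (my own translation, assuming a classic systemd ceph-osd deployment; N and /dev/sdX are placeholders):

Code:
# "Out" - PGs start to remap to the remaining OSDs
ceph osd out N

# "Down" - stop the daemon on the host carrying the OSD
systemctl stop ceph-osd@N

# "Destroy" - here done as a purge, which also drops the CRUSH entry and auth key
ceph osd purge N --yes-i-really-mean-it

# clean the old disk (removes the LVM volumes ceph-volume created on it)
ceph-volume lvm zap /dev/sdX --destroy

# wait for HEALTH_OK before moving on to the next OSD
ceph -s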
 
Thanks for your feedback!

The documentation of CEPH states:

Replacing an OSD differs from Removing the OSD in that the replaced OSD’s ID and CRUSH map entry must be kept intact after the OSD is destroyed for replacement.

As far as I understand, your method will create a new OSD and not keep the previous ID and CRUSH map entry, right?

This is why I created this post: I'd like to follow the official procedure, understand WHY it's recommended that way and what the pros and cons are... and perhaps find out why everyone else is doing it differently :)

Examples:

https://forum.proxmox.com/threads/ceph-osd-disk-replacement.54591/
1. down (stop) and out an osd (will probably already be in this state for a failed drive)
2. remove it from the tree and crush map ("destroy" in the gui)
3. replace disk
4. create new osd
5. profit

https://docs.redhat.com/en/document...ml/administration_guide/changing_an_osd_drive

https://www.ibm.com/docs/en/storage-ceph/8.1.0?topic=osds-replacing-osd-drive

...The procedure includes these steps:
Removing an OSD from the Ceph cluster
Replacing the physical drive
Adding an OSD to the Ceph cluster
 
Oh I see what you're asking. Interesting, because if you just destroy an OSD in the GUI the next one created gets the first available number starting at 0. I haven't changed the CRUSH map so haven't been concerned about change retention. I did check that "primary affinity" is (by default) enabled for the new OSD.

I think the question is, does Proxmox use ceph osd destroy or ceph osd purge when doing a Destroy? Without looking, I would think the former from the labeling but perhaps purge should also be an option.

Purge "removes the OSD from the CRUSH map, removes the OSD’s authentication key, and removes the OSD from the OSD map."

I'm not sure I see much benefit either way, unless one has manually modified the CRUSH rules and wants to re-use them? Though in that case, users may not realize they are inheriting the old rules when using the Proxmox GUI.
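To make the difference concrete, a small side-by-side sketch with osd.8 as an example (my summary of the docs, so treat it as such):

Code:
# option A: destroy keeps the ID and CRUSH entry, so a replacement disk can
# later be created with the same ID (this is what the "replace" docs describe)
ceph osd destroy 8 --yes-i-really-mean-it

# option B: purge removes the CRUSH entry, the auth key and the OSD map entry,
# freeing the ID for reuse
ceph osd purge 8 --yes-i-really-mean-it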
 
In my case I'd like to perform all operations on the CLI, since it's a bit more transparent about what is actually happening.

From what I've learned so far, the following steps seem to match the CEPH docs:

Code:
# Add flags to avoid recovery
ceph osd set noout
ceph osd set norecover
ceph osd set nobackfill
ceph osd set norebalance

# stop OSD process
systemctl stop ceph-osd@8

# according to CEPH docs
ceph osd destroy 8 --yes-i-really-mean-it
# clear old drive
ceph-volume lvm zap /dev/sdX

# now change disks or use ones which have been installed in the chassis already
ceph-volume lvm create --osd-id 8 --data /dev/sdX
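# (note: as far as I know, "lvm create" does prepare + activate in one step,
#  so ceph-volume will normally start the new OSD service by itself; the
#  explicit start below is then just a harmless no-op)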

# start OSD process
systemctl start ceph-osd@8

# unset flags
ceph osd unset noout
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance

# now wait for ceph to backfill data to the SSD
# watch progress

ceph -s
ceph osd tree

In that example I just replace the physical drive for OSD no. 8.

Eventually CEPH should be happy again, in healthy state, ready to get the next drive replaced.
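One way (just a sketch) to check that before moving on to the next drive:

Code:
ceph -s            # expect HEALTH_OK
ceph pg stat       # all PGs should be active+clean
ceph osd df tree   # the new osd.8 should be filling up with the expected weight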

The time between setting the flags and unsetting them should be as short as possible.
Are those steps reasonable to run on a production cluster? ;)
 
weight should be directly related to the physical capacity of the storage medium
Right, but it's possible to set it, and I did once (reasons not relevant here). My point is that the old setting wasn't kept. FWIW, so far I'm not seeing any indication that the OSDs I have destroyed in Proxmox over time have left anything behind...

Anyway, skimming through the commands you listed, they look reasonable to me.