Starting an OSD after unplugging and replugging a drive without rebooting the server

labynko

May 21, 2021
Hello.
If I accidentally remove a drive from the server and then plug it back in, what is the procedure to get the OSD working correctly again without rebooting the server?
 
I managed to find a solution:

1. Determine which OSD failed after the disk was removed (the STATUS column below shows that osd.2 is down):
ceph osd tree
ID CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       0.09357 root default
-7       0.03119     host proxmox1
 4   ssd 0.01559         osd.4          up  1.00000 1.00000
 5   ssd 0.01559         osd.5          up  1.00000 1.00000
-5       0.03119     host proxmox2
 2   ssd 0.01559         osd.2        down  1.00000 1.00000
 3   ssd 0.01559         osd.3          up  1.00000 1.00000
-3       0.03119     host proxmox3
 0   ssd 0.01559         osd.0          up  1.00000 1.00000
 1   ssd 0.01559         osd.1          up  1.00000 1.00000
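
Optionally, you can also double-check on the affected host that it really was the OSD daemon that died when the disk disappeared. Something along these lines should work (ceph-osd@2 is the systemd unit for osd.2 in this example):

systemctl status ceph-osd@2    # should show the unit as failed or inactive
journalctl -u ceph-osd@2 -n 50 # the last log lines usually show the I/O errors from the missing disk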

2. On the proxmox2 host, list the LVM volumes used by the OSDs and find the logical volume and device backing osd.2:
ceph-volume lvm list
====== osd.2 =======

  [block]       /dev/ceph-e6316ef4-99bf-40d7-ad9d-673bd450ed97/osd-block-c8e1aae8-707b-416f-b392-c6786ddd9a7a

      block device              /dev/ceph-e6316ef4-99bf-40d7-ad9d-673bd450ed97/osd-block-c8e1aae8-707b-416f-b392-c6786ddd9a7a
      block uuid                mwh3Wx-ayW8-iv9u-UGNM-lMZ2-E8Gp-1dgsh6
      cephx lockbox secret
      cluster fsid              fd69522b-8c80-4efe-b900-f2d8c7c00e43
      cluster name              ceph
      crush device class        None
      encrypted                 0
      osd fsid                  c8e1aae8-707b-416f-b392-c6786ddd9a7a
      osd id                    2
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdb

====== osd.3 =======

  [block]       /dev/ceph-59a2d958-3256-4d6c-94fe-b7bc71f227fa/osd-block-93ad57d6-e882-4e54-84b1-a16ba7b56629

      block device              /dev/ceph-59a2d958-3256-4d6c-94fe-b7bc71f227fa/osd-block-93ad57d6-e882-4e54-84b1-a16ba7b56629
      block uuid                cItkqn-kLfZ-mLVs-XZQ2-Zd4e-RVsX-NstKsB
      cephx lockbox secret
      cluster fsid              fd69522b-8c80-4efe-b900-f2d8c7c00e43
      cluster name              ceph
      crush device class        None
      encrypted                 0
      osd fsid                  93ad57d6-e882-4e54-84b1-a16ba7b56629
      osd id                    3
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdc
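
As a side note, ceph-volume lvm list also accepts a device path, so if you already know which device the replugged disk came back as, you can narrow the output down to it (here I assume it reappeared as /dev/sdb again; the kernel may assign a different name after replugging):

ceph-volume lvm list /dev/sdb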

3. Deactivate the logical volume used by osd.2:
lvm lvchange -a n /dev/ceph-e6316ef4-99bf-40d7-ad9d-673bd450ed97/osd-block-c8e1aae8-707b-416f-b392-c6786ddd9a7a

4. Activate the osd.2 logical volume again:
lvm lvchange -a y /dev/ceph-e6316ef4-99bf-40d7-ad9d-673bd450ed97/osd-block-c8e1aae8-707b-416f-b392-c6786ddd9a7a
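
Before starting the OSD you can optionally verify that the logical volume is active again, for example with lvs (the volume group name is the one from step 2):

lvm lvs -o lv_name,vg_name,lv_active ceph-e6316ef4-99bf-40d7-ad9d-673bd450ed97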

5. Start osd.2 again, passing the "osd id" and "osd fsid" values obtained in step 2 to the command:
ceph-volume lvm activate 2 c8e1aae8-707b-416f-b392-c6786ddd9a7a
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-e6316ef4-99bf-40d7-ad9d-673bd450ed97/osd-block-c8e1aae8-707b-416f-b392-c6786ddd9a7a --path /var/lib/ceph/osd/ceph-2 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-e6316ef4-99bf-40d7-ad9d-673bd450ed97/osd-block-c8e1aae8-707b-416f-b392-c6786ddd9a7a /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-1
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/systemctl enable ceph-volume@lvm-2-c8e1aae8-707b-416f-b392-c6786ddd9a7a
Running command: /usr/bin/systemctl enable --runtime ceph-osd@2
Running command: /usr/bin/systemctl start ceph-osd@2
--> ceph-volume lvm activate successful for osd ID: 2
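
If you want to check the daemon itself before looking at the cluster view, the systemd unit that ceph-volume started in the last step can be inspected directly:

systemctl status ceph-osd@2   # should now be active (running)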

6. Check that all OSDs are working:
ceph osd tree
ID CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       0.09357 root default
-7       0.03119     host proxmox1
 4   ssd 0.01559         osd.4          up  1.00000 1.00000
 5   ssd 0.01559         osd.5          up  1.00000 1.00000
-5       0.03119     host proxmox2
 2   ssd 0.01559         osd.2          up  1.00000 1.00000
 3   ssd 0.01559         osd.3          up  1.00000 1.00000
-3       0.03119     host proxmox3
 0   ssd 0.01559         osd.0          up  1.00000 1.00000
 1   ssd 0.01559         osd.1          up  1.00000 1.00000
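
Optionally also check the overall cluster status; once osd.2 has rejoined and any recovery has finished, it should report HEALTH_OK again:

ceph -s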
 
Labynko!

Thank you so much! This helped me restore two mislabeled OSDs! The wrong serial number was printed on the caddy/tray and I only noticed about 2 minutes later! It's safe to say that if you can run these commands within about 5 minutes you won't crash your OSD; I crashed one after around the 5-8 minute mark.
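
For what it's worth, if you know a disk will be out for a while, you can temporarily set the noout flag so the cluster doesn't mark the down OSD out and start rebalancing (by default that happens after mon_osd_down_out_interval, 600 seconds). These are standard Ceph maintenance flags, not part of labynko's procedure:

ceph osd set noout      # stop down OSDs from being marked out while you work
# ...replug the disk and reactivate the OSD as described above...
ceph osd unset noout    # restore normal behaviour afterwards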

Tmanok
 
I'd also like to thank Labynko for that amazing post.

This is still a working solution on Debian-based systems for reconnecting flaky disks to Ceph clusters without rebooting. It is extremely useful when a disk keeps disconnecting randomly, since it allows the OSD to be restarted at any time.
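
If a disk does this regularly, the steps from labynko's post can be wrapped in a small helper script. This is only a sketch built from the exact commands above; the OSD id, OSD fsid and LV path are passed in by hand (taken from ceph-volume lvm list), nothing is auto-detected:

#!/bin/bash
# reactivate-osd.sh - re-enable an OSD's logical volume and restart it after a disk replug
# usage:   ./reactivate-osd.sh <osd-id> <osd-fsid> <lv-path>
# example: ./reactivate-osd.sh 2 c8e1aae8-707b-416f-b392-c6786ddd9a7a \
#            /dev/ceph-e6316ef4-99bf-40d7-ad9d-673bd450ed97/osd-block-c8e1aae8-707b-416f-b392-c6786ddd9a7a
set -euo pipefail

OSD_ID="$1"
OSD_FSID="$2"
LV_PATH="$3"

lvm lvchange -a n "$LV_PATH"                    # deactivate the logical volume
lvm lvchange -a y "$LV_PATH"                    # activate it again
ceph-volume lvm activate "$OSD_ID" "$OSD_FSID"  # prime the OSD dir and start ceph-osd@<id>
ceph osd tree                                   # confirm the OSD is back up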