Removing an OSD on failed hardware leaves the OSD systemd service behind

pemadiot

Hello all,

Here is the situation:
We have a Ceph cluster on top of Proxmox on Dell hardware.
One of the Dell virtual disks failed, and hence the corresponding OSD failed.
This is an HDD, not NVMe, and "thankfully" its BlueStore DB/WAL was not split out onto the local NVMe disks.

Anyway, we followed the recommendation to properly remove the OSD from Ceph before getting the physical drive replaced.
For that we followed the directions given in: pve-docs/chapter-pveceph.html#_replace_osds

Related Red Hat article: https://people.redhat.com/bhubbard/nature/default/rados/operations/add-or-rm-osds/
However, we cannot run ceph-volume lvm zap /dev/sdb, as there is no /dev/sdb anymore.
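
For reference, a rough sketch of the removal sequence we went through per those docs (osd.4 being the failed OSD; the last step is the one we could not run since the device is gone):
Bash:
# mark the OSD out and let the cluster rebalance off it
ceph osd out 4
# once the data has been migrated, stop the daemon on the node
systemctl stop ceph-osd@4.service
# remove the OSD from CRUSH, its auth key and the OSD map
pveceph osd destroy 4
# normally the backing device would then be wiped, but there is no /dev/sdb anymore
ceph-volume lvm zap /dev/sdb --destroy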

Things seemed to go well, but what we did not notice is that although OSD.4 was indeed removed, the systemd service associated with it remained:
Bash:
# ceph osd crush ls proxmoxNode
osd.6
osd.5
osd.7
osd.16
osd.17
osd.18
osd.19
osd.20
osd.36

# systemctl list-units --type service |grep ceph
  ceph-crash.service     loaded active running Ceph crash dump collector
  ceph-osd@16.service    loaded active running Ceph object storage daemon osd.16
  ceph-osd@17.service    loaded active running Ceph object storage daemon osd.17
  ceph-osd@18.service    loaded active running Ceph object storage daemon osd.18
  ceph-osd@19.service    loaded active running Ceph object storage daemon osd.19
  ceph-osd@20.service    loaded active running Ceph object storage daemon osd.20
  ceph-osd@36.service    loaded active running Ceph object storage daemon osd.36
● ceph-osd@4.service     loaded failed failed  Ceph object storage daemon osd.4
  ceph-osd@5.service     loaded active running Ceph object storage daemon osd.5
  ceph-osd@6.service     loaded active running Ceph object storage daemon osd.6
  ceph-osd@7.service     loaded active running Ceph object storage daemon osd.7

The device associated with this OSD (formerly /dev/sdb) was indeed removed from the system, yet listing the LVM PVs still produces errors for its leftover Ceph volume:
Bash:
# pvs
  Error reading device /dev/ceph-b4dd4437-ca3b-4676-9476-43d479e91b80/osd-block-1d61dd7b-4fff-434f-aad3-1d27273fe45b at 0 length 512.
  Error reading device /dev/ceph-b4dd4437-ca3b-4676-9476-43d479e91b80/osd-block-1d61dd7b-4fff-434f-aad3-1d27273fe45b at 0 length 4096.
  PV           VG                                        Fmt  Attr PSize  PFree
  /dev/nvme0n1 ceph-9ad5af69-2af0-4d77-9297-46cd579b3589 lvm2 a--  <1.46t 424.00m
  /dev/nvme1n1 ceph-b2cd2032-4a87-4e83-bd9b-982568385b1e lvm2 a--  <1.46t 424.00m
  /dev/sda3    pve                                       lvm2 a--   1.09t <16.00g
  /dev/sdc     ceph-8d9d0312-1de4-4617-917e-c1640f68b570 lvm2 a--  10.69t      0
  /dev/sdd     ceph-c72b9c09-3dfe-4b41-9c07-23a8a089e897 lvm2 a--  10.69t      0
  /dev/sde     ceph-6d001322-fe2f-461f-b068-db9c833571b9 lvm2 a--  10.69t      0
  /dev/sdf     ceph-4fc85931-8fa4-496a-9312-c9da6e38b910 lvm2 a--  10.69t      0
  /dev/sdg     ceph-0613871a-a88f-45d7-9544-05334867aacf lvm2 a--  10.69t      0
  /dev/sdh     ceph-cb93a5d9-93f8-4596-82ca-175750cd7a01 lvm2 a--  10.69t      0
  /dev/sdi     ceph-9d0b7e0e-db70-4ba0-aa23-173847631d7c lvm2 a--  10.69t      0
  /dev/sdj     ceph-a2eb34ea-fd05-46da-b693-29c64a790748 lvm2 a--  10.69t      0
  /dev/sdk     ceph-521bbd29-951a-4aa5-95f1-6a679c5ebcb0 lvm2 a--  10.69t      0
  /dev/sdl     SQL_RAID                                  lvm2 a--  21.38t 508.00m
....
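
If I read those errors correctly, they come from the device-mapper mapping of the vanished PV still being active. A possible cleanup could look like this (only a sketch, we have not run it yet; the dm name is the VG and LV from the errors above with their internal dashes doubled):
Bash:
# confirm the stale mapping left behind by the vanished disk
dmsetup ls | grep 1d61dd7b
# remove it so LVM stops complaining (name copied from the output above)
dmsetup remove ceph--b4dd4437--ca3b--4676--9476--43d479e91b80-osd--block--1d61dd7b--4fff--434f--aad3--1d27273fe45b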

Now the question is: before "forcefully" removing the systemd unit "ceph-osd@4.service", could anyone share a similar prior experience and the actions taken to recreate the missing OSD?
Knowing that the new disk now shows up in the system not under /dev/sdb but under /dev/sdm, would it be safe to create a new Ceph OSD volume for OSD.4 on the /dev/sdm physical device?
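
To be explicit, by removing the unit "forcefully" we mean something along these lines (only a sketch of what we are considering, not executed yet):
Bash:
# drop the leftover instance from systemd and clear its failed state
systemctl disable ceph-osd@4.service
systemctl reset-failed ceph-osd@4.service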

Looking forward to your thoughts.

Best regards
 
Wrong forum....
 
Hello,
Just so you know, we recreated the OSD from Proxmox once the replacement disk was available (/dev/sd<newletter>), and somehow it got associated back to OSD.4.
It's not completely clear to me how this happened, but it did, and all is good now.
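
For anyone landing on this thread later: recreating the OSD from the Proxmox GUI should be roughly equivalent to the command below (using /dev/sdm as in our case). My guess is that osd.4 came back because Ceph hands out the lowest free OSD ID, and 4 had just been freed by the removal.
Bash:
# create a new BlueStore OSD on the replacement disk
pveceph osd create /dev/sdm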