Hello all,
Here is the situation:
We have a Ceph cluster on top of Proxmox on Dell hardware.
One of the Dell virtual disks failed, and hence the corresponding OSD failed.
It is an HDD, not NVMe, and "thankfully" its BlueStore was not split out onto the local NVMe disks.
Anyway, we followed the recommendation to properly remove the OSD from Ceph before getting the physical drive replaced.
For that we followed the directions given in pve-docs/chapter-pveceph.html#_replace_osds
and the related Red Hat article: https://people.redhat.com/bhubbard/nature/default/rados/operations/add-or-rm-osds/
but we could not run ceph-volume lvm zap /dev/sdb, as there is no /dev/sdb anymore; the steps we actually ran are sketched below.
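For reference, the removal sequence was roughly the following (a sketch from memory, assuming osd.4 and the Proxmox CLI; the final zap is the step that failed):
Bash:
# mark the OSD out and let the cluster rebalance
ceph osd out osd.4
# stop the daemon on the node
systemctl stop ceph-osd@4.service
# remove the OSD from the CRUSH map, auth and OSD map (Proxmox wrapper)
pveceph osd destroy 4
# wipe the old device -- this is where it failed, /dev/sdb was already gone
ceph-volume lvm zap /dev/sdb --destroy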
Everything seemed to go well, but we did not notice that, while osd.4 was indeed removed from the cluster, the systemd service associated with it remained:
Bash:
# ceph osd crush ls proxmoxNode
osd.6
osd.5
osd.7
osd.16
osd.17
osd.18
osd.19
osd.20
osd.36
# systemctl list-units --type service |grep ceph
ceph-crash.service loaded active running Ceph crash dump collector
ceph-osd@16.service loaded active running Ceph object storage daemon osd.16
ceph-osd@17.service loaded active running Ceph object storage daemon osd.17
ceph-osd@18.service loaded active running Ceph object storage daemon osd.18
ceph-osd@19.service loaded active running Ceph object storage daemon osd.19
ceph-osd@20.service loaded active running Ceph object storage daemon osd.20
ceph-osd@36.service loaded active running Ceph object storage daemon osd.36
● ceph-osd@4.service loaded failed failed Ceph object storage daemon osd.4
ceph-osd@5.service loaded active running Ceph object storage daemon osd.5
ceph-osd@6.service loaded active running Ceph object storage daemon osd.6
ceph-osd@7.service loaded active running Ceph object storage daemon osd.7
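Before touching that unit we would also like to double-check that osd.4 is really gone from the cluster maps; we plan to verify with something like this (just a sketch):
Bash:
# all three should come back empty / with an ENOENT error if the purge completed
ceph osd tree | grep -w osd.4
ceph auth ls 2>/dev/null | grep -w osd.4
ceph osd metadata 4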
The device backing this OSD (it used to be /dev/sdb) was indeed removed from the system, yet listing the LVM PVs still shows the following:
Bash:
# pvs
Error reading device /dev/ceph-b4dd4437-ca3b-4676-9476-43d479e91b80/osd-block-1d61dd7b-4fff-434f-aad3-1d27273fe45b at 0 length 512.
Error reading device /dev/ceph-b4dd4437-ca3b-4676-9476-43d479e91b80/osd-block-1d61dd7b-4fff-434f-aad3-1d27273fe45b at 0 length 4096.
PV VG Fmt Attr PSize PFree
/dev/nvme0n1 ceph-9ad5af69-2af0-4d77-9297-46cd579b3589 lvm2 a-- <1.46t 424.00m
/dev/nvme1n1 ceph-b2cd2032-4a87-4e83-bd9b-982568385b1e lvm2 a-- <1.46t 424.00m
/dev/sda3 pve lvm2 a-- 1.09t <16.00g
/dev/sdc ceph-8d9d0312-1de4-4617-917e-c1640f68b570 lvm2 a-- 10.69t 0
/dev/sdd ceph-c72b9c09-3dfe-4b41-9c07-23a8a089e897 lvm2 a-- 10.69t 0
/dev/sde ceph-6d001322-fe2f-461f-b068-db9c833571b9 lvm2 a-- 10.69t 0
/dev/sdf ceph-4fc85931-8fa4-496a-9312-c9da6e38b910 lvm2 a-- 10.69t 0
/dev/sdg ceph-0613871a-a88f-45d7-9544-05334867aacf lvm2 a-- 10.69t 0
/dev/sdh ceph-cb93a5d9-93f8-4596-82ca-175750cd7a01 lvm2 a-- 10.69t 0
/dev/sdi ceph-9d0b7e0e-db70-4ba0-aa23-173847631d7c lvm2 a-- 10.69t 0
/dev/sdj ceph-a2eb34ea-fd05-46da-b693-29c64a790748 lvm2 a-- 10.69t 0
/dev/sdk ceph-521bbd29-951a-4aa5-95f1-6a679c5ebcb0 lvm2 a-- 10.69t 0
/dev/sdl SQL_RAID lvm2 a-- 21.38t 508.00m
....
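We suspect the two read errors above come from a stale device-mapper mapping for the old OSD's LV, which is still active even though the backing disk is gone. Something like the following should confirm and clear it, but we have not run it yet (a sketch, mapping name taken from the error messages above):
Bash:
# check whether the mapping for the vanished LV is still present
dmsetup ls | grep osd--block--1d61dd7b
# if so, remove the stale mapping (use the name reported by the previous command)
dmsetup remove ceph--b4dd4437--ca3b--4676--9476--43d479e91b80-osd--block--1d61dd7b--4fff--434f--aad3--1d27273fe45b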
Now the question is: before "forcefully" removing the systemd unit ceph-osd@4.service, could anyone share a similar prior experience and the actions taken to recreate the missing OSD?
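By removing it "forcefully" we mean roughly this (a sketch, not executed yet):
Bash:
# stop and disable the leftover unit instance
systemctl stop ceph-osd@4.service
systemctl disable ceph-osd@4.service
# clear the "failed" state so it no longer shows up in list-units
systemctl reset-failed ceph-osd@4.service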
Knowing that the new physical disk is now present in the system, not under /dev/sdb but under /dev/sdm, would it be safe to create a new Ceph OSD volume (osd.4) on the /dev/sdm physical device?
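What we have in mind is roughly the following (again just a sketch; reusing the old ID with --osd-id assumes osd.4 was fully purged):
Bash:
# wipe the replacement disk first
ceph-volume lvm zap /dev/sdm --destroy
# then create the OSD, either via the Proxmox wrapper...
pveceph osd create /dev/sdm
# ...or directly with ceph-volume, explicitly reusing the freed ID
ceph-volume lvm create --data /dev/sdm --osd-id 4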
Looking forward to your thoughts.
Best regards