Broken OSD in Ceph

JBE

Hello,
We have an OSD that won't start on one of our three nodes. It's very strange: when I try to start it, the system throws no errors and the logs say it has started OK, yet it won't start. I am thinking of changing the hardware, i.e. swapping the SSD itself for a replacement, then creating a new OSD from it and letting Ceph rebuild. However, can I add the physical disk into the node without switching it off? It's a live production environment and I am trying to minimise downtime. Obviously, if I get the OSD up and running, that may be the better option, but as you can see, the node where the disk is down has only four disks on it anyway. Any advice would be great.

Here is a screenshot:


Screenshot 2020-03-12 at 12.40.47.png
 
The OSDs are replaceable without downtime for the VMs/CTs. But in any case, what does systemctl status ceph-osd@12.service say? The /var/log/ceph/ceph-osd.12.log should also give some hints.
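For reference, a minimal diagnostic sequence, assuming the failing OSD has id 12 as in this thread (run on the node that hosts it):

# Service state as systemd sees it
systemctl status ceph-osd@12.service

# Recent messages from the unit itself
journalctl -u ceph-osd@12.service -n 50

# The OSD's own log is usually the most detailed source
tail -n 100 /var/log/ceph/ceph-osd.12.log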
 
ceph osd tree should show you whether the OSD is up and in. If so, then it is running properly, and ceph -s should also show HEALTH_OK.
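For a quick health check, the two commands above, with comments on what to look for (illustrative, not output from this cluster):

# An OSD that is up and in shows STATUS "up"; a broken one shows "down"
ceph osd tree

# The "health:" line should read HEALTH_OK; degraded PGs point at the down OSD
ceph -s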
 
Thanks Alwin,
I am a Proxmox noob and have inherited this system. I thought that I could destroy the OSD and then re-create it. However, when I try this in the GUI, it says there are no disks to create it on. Yet when I look in the disks list, osd.12 is listed. Do I have to find a way to totally remove it before I can add it again? I can see from the log that my storage is almost full. I think I could have some worn disks, but looking at the wearout it does not seem that bad, so I am totally confused. Please see screenshots.

Screen Shot 2020-03-12 at 14.04.53.png
Screen Shot 2020-03-12 at 14.02.07.png
 
In any case, you will need one more disk in zeus, as otherwise the distribution is uneven when an OSD fails (usage level > 90%).
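To see how full each OSD is and whether the data distribution is lopsided, the standard Ceph commands should be enough (nothing Proxmox-specific assumed):

# Per-OSD utilisation (%USE) and PG count, grouped by host
ceph osd df tree

# Overall raw and per-pool usage
ceph df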

Run sgdisk -Z /dev/sdX to delete the partition table. Then the disk will show up as empty, and at that point it should be possible to create OSD 12 again.
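A minimal sketch of the full wipe-and-recreate sequence, assuming the disk is /dev/sde (as mentioned later in this thread) and that nothing on it is still needed; the exact pveceph syntax depends on your PVE version:

# Destroy the GPT and MBR data structures on the disk
sgdisk -Z /dev/sde

# Re-create the OSD (PVE 6.x syntax; older releases used "pveceph createosd")
pveceph osd create /dev/sde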
 
Hi,
When I try that command, I get an error that says "Problem opening /dev/sdx for reading, error is 2".
 
Would this work? I don't want to try it and make things worse:
ceph-volume lvm zap /dev/sde (sde being the device name of the disk that is still showing), then reboot the machine.

Screen Shot 2020-03-12 at 14.38.34.png
 
That command does the same as sgdisk -Z.
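If a plain zap still leaves the disk occupied because the old OSD's LVM volumes survive on it, ceph-volume can tear those down as well. This is destructive, so double-check the device name first:

# --destroy additionally removes any VGs/LVs left behind by the previous OSD
ceph-volume lvm zap /dev/sde --destroy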
 
OK. I got the OSD back online by trying the command posted above. However, I think I still need to add an additional one to balance the cluster. Thanks for your help!

Screenshot 2020-03-13 at 08.53.17.png