Broken OSD in Ceph

JBE

Hello,
We have an OSD that won't start on one of our three nodes. It's very strange: when I try to start it, the system throws no errors and the logs say it has started OK, yet it does not come up. I am thinking of changing the hardware, i.e. swapping the SSD itself for a replacement, creating a new OSD on it, and letting Ceph rebuild. However, can I add the physical disk to the node without switching it off? It's a live production environment and I am trying to minimise downtime. Obviously, if I can get the OSD up and running, that may be a better option, but as you can see, the node where the disk is down has only four disks on it anyway. Any advice would be great.

Here is a screenshot:


Screenshot 2020-03-12 at 12.40.47.png
 
The OSDs are replaceable without downtime to the VM/CT. But in any case, what does systemctl status ceph-osd@12.service say? The log at /var/log/ceph/ceph-osd.12.log should also give some hints.
 
ceph osd tree should show you whether the OSD is up and in. If so, it is running properly and ceph -s should also show HEALTH_OK.
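As a rough sketch, the checks mentioned above could be bundled like this. The function name check_osd is made up for illustration, and the journalctl line is an extra that was not part of the original suggestion; the OSD id is passed as an argument.

```shell
# Sketch: collect first-look information for a single OSD.
# check_osd is a hypothetical helper name; call it as: check_osd 12
check_osd() {
    local id="$1"
    # Service state as seen by systemd
    systemctl --no-pager status "ceph-osd@${id}.service"
    # Recent journal entries for the unit (addition, not from the thread)
    journalctl --no-pager -n 50 -u "ceph-osd@${id}.service"
    # Tail of the OSD's own log file
    tail -n 100 "/var/log/ceph/ceph-osd.${id}.log"
    # Cluster view: is the OSD up/in, and what is the overall health?
    ceph osd tree
    ceph -s
}
```

Run it as root on the node that hosts the OSD, for example check_osd 12.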
 
Thanks Alwin
I am a Proxmox noob and have inherited this system. I thought that I could destroy the OSD and then re-create it. However, when I try this in the GUI, it says there are no disks available to create one on, yet osd.12 is listed in the disks list. Do I have to find a way to remove it completely before I can add it again? I can see from the log that my storage is almost full. I also suspected some worn disks, but looking at the wearout values it does not seem that bad, so I am totally confused. Please see the screenshots.
Screen Shot 2020-03-12 at 14.04.53.png
Screen Shot 2020-03-12 at 14.02.07.png
 
In any case, you will need one more disk in zeus, as otherwise the distribution is uneven when an OSD fails (usage level > 90%).

Run sgdisk -Z /dev/sdX to delete the partition table. The disk will then show up as empty, and it should be possible to create OSD 12 again.
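Note that /dev/sdX is a placeholder for the real device name. A minimal sketch of the step with a guard against typos might look like this; wipe_partition_table is a hypothetical helper name, and the command is destructive, so double-check the device first.

```shell
# Sketch: zap the partition table of a disk so it shows up as unused.
# DESTRUCTIVE. wipe_partition_table is a hypothetical helper name.
wipe_partition_table() {
    local dev="$1"
    # Refuse anything that is not an existing block device
    # (this catches a literal /dev/sdX placeholder or a mistyped name).
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 1
    fi
    lsblk "$dev"       # show what is about to be wiped
    sgdisk -Z "$dev"   # --zap-all: destroy the GPT and MBR data structures
}
```

To find the right name beforehand, something like lsblk -o NAME,SIZE,TYPE,MOUNTPOINT lists the disks on the node.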
 
Hi,
When I try that command I get an error that says: Problem opening /dev/sdx for reading error is 2.
 
Would this work? I don't want to try it and make things worse:
ceph-volume lvm zap /dev/sde
(sde being the device name of the disk that is still showing)

Screen Shot 2020-03-12 at 14.38.34.png
reboot machine
 
The command does the same.
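For reference, a hedged sketch of the ceph-volume route followed by re-creating the OSD from the command line. The function name zap_and_recreate_osd is invented here, the --destroy flag (which also removes LVM volumes and partitions on the device) was not used in the thread, and pveceph osd create assumes a Proxmox VE 6.x node; older releases used pveceph createosd.

```shell
# Sketch: wipe a former OSD disk with ceph-volume and create a new OSD on it.
# DESTRUCTIVE. zap_and_recreate_osd is a hypothetical helper name.
zap_and_recreate_osd() {
    local dev="$1"
    # Only continue if the argument is an existing block device.
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 1
    fi
    # --destroy is an assumption beyond the thread: it also tears down
    # LVM volumes and partitions left over from the old OSD.
    ceph-volume lvm zap "$dev" --destroy || return 1
    # Assumes PVE 6.x syntax for creating the OSD on the cleaned disk.
    pveceph osd create "$dev"
}
```

With the device from this thread that would be zap_and_recreate_osd /dev/sde, after confirming that sde really is the disk of the dead OSD.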
 
OK. I got the OSD back online by running the command posted above. However, I think I still need to add an additional one to balance the cluster. Thanks for your help.
Screenshot 2020-03-13 at 08.53.17.png
 
