remove stuck OSD

sc3705

Member
Jul 3, 2020
I have an odd issue here where an OSD won't delete. My cluster is supposed to have 5 disks per node, but I've been running with 4 disks per node since I created the cluster a couple of weeks ago (I added a 5th disk to one node but removed it a short time later). Today we installed the hardware one of my nodes needed so we can use all 5 drives. After installing the new SAS controller, we added the 5th OSD on all nodes but had trouble with the last one (why is it always the last one...). We had a heck of a hard time getting the 5th disk added to the last node. It kept hanging at "-> ceph-volume lvm prepare successful for: /dev/sde". I tried it via both the GUI and the command line, and in the end we rebooted the node.

When it came back up, the disk-to-LVM "mapping" had changed, likely because the SAS controller was changed during the reboot. What I mean is that before the reboot I was having trouble with /dev/sde, but after the reboot /dev/sde was part of the cluster and had a functional OSD, while /dev/sda was showing as an available disk. I was then able to add the 5th disk to the last node. Now all 3 nodes have 5 disks each, each with 1 OSD. There is, however, an additional OSD that's not assigned to anything: OSD 14, and I can't seem to remove it. I issued the destroy command and it comes back successful every time, but the OSD is still there when I run osd tree, and it's listed as a down OSD in the cluster. The final OSD I was having trouble with ended up as OSD 15 once I rebooted and got it added.

Full disclosure: 2 nodes were worked on today, and we added SAS cards to both of them. On the 1st server, we added the SAS card, it came up with no issues, and we were able to use all 5 disks without error. The 2nd server already had all 5 disks connected, but we were only using 4. We only added the SAS card to the 2nd node because we had to reboot it anyway and figured that, since it was already down, we might as well do it now so we wouldn't have to restart it again in a week. The issue existed before we installed the new SAS card, and we didn't think it was related to the controller, so we thought it would be safe.

How do I remove an OSD when the "ceph osd destroy 14 --force" command reports success but the OSD is still there?

1596399099840.png
 
