I have an odd issue here where an OSD won't delete. My cluster is supposed to have 5 disks per node, but I've been running with 4 disks since I created the cluster a couple of weeks ago (I added a 5th disk to one node but removed it a short time later). Today we installed the hardware needed for one of my nodes so we can utilize all 5 drives. After installing the new SAS controller, we added the 5th OSD to all nodes but had trouble with the last one (why is it always the last one...). It kept hanging at "-> ceph-volume lvm prepare successful for: /dev/sde". I tried it via the GUI and the command line, and in the end we rebooted the node.
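From memory, the CLI attempt on the affected node looked roughly like the below. I'm not certain of the exact invocation any more (it may have been the GUI equivalent of this), so treat the pveceph command as approximate:

    # Proxmox CLI on the affected node -- roughly what I ran, from memory
    pveceph osd create /dev/sde
    ...
    -> ceph-volume lvm prepare successful for: /dev/sde
    (hangs here and never returns)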
When the node came back up, the disk-to-LVM "mapping" had changed, likely because the SAS controller was swapped and the drive lettering shifted after the reboot. What I mean is that before the reboot I was having trouble with /dev/sde, but after the reboot /dev/sde was part of the cluster with a functional OSD, and /dev/sda was instead showing as an available disk. I was then able to add the 5th disk to the last node. Now all 3 nodes have 5 disks each, each disk with 1 OSD. There is, however, an additional OSD that isn't assigned to anything: OSD 14, and I can't seem to remove it. I issue the destroy command and it comes back successful every time, but the OSD remains when I do osd tree and is listed as a down OSD in the cluster. The final OSD I was having trouble with eventually came up as OSD 15 once I rebooted and got it added.
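For reference, this is the removal attempt and the check I'm doing afterwards (output omitted since I don't have it in front of me):

    # reports success every time
    ceph osd destroy 14 --force

    # ...but osd.14 is still listed as a down OSD afterwards
    ceph osd tree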
Full disclosure: 2 nodes were worked on today; we added SAS cards to both of them. On the 1st server, we added the SAS card, it came up with no issues, and we were able to utilize all 5 disks without error. The 2nd server already had all 5 disks connected, but we were only using 4. We only added the SAS card to the 2nd node because we had to reboot it anyway and figured that, since it was down, we might as well take the time now so we don't have to restart it again in a week. The issue existed before we installed the new SAS card, and since we didn't think it was related to the controller, we figured it would be safe.
How do I remove an OSD when "ceph osd destroy 14 --force" reports success but the OSD never actually goes away?