[SOLVED] Ceph OSD change DB Disk

AlexLup

Well-Known Member
Mar 19, 2018
Hi,
I ran into issues when I put in new journal disks and wanted to move existing OSDs from the old journal disk to the new ones.

The issue was: I set the OSD to Out, then stopped the OSD and destroyed it.
Recreating the OSD with the new DB device resulted in an OSD that never showed up!

This is a known bug in the Ceph Luminous release: https://tracker.ceph.com/issues/22354

To work around it, I set the OSD to Out, stopped it, destroyed it in PVE to remove the 1GB journal partition, and then ran this script:

#!/bin/bash
# Zap a disk and remove its OSD from the cluster.

lsblk
read -p "Enter /dev/HDD<name> to be zapped: " devname

ceph osd tree
read -p "Enter osd.<nr> to be zapped: " osdnr

echo -e "*** Running ...\tsystemctl stop ceph-osd@$osdnr"
systemctl stop ceph-osd@$osdnr

echo -e "*** Running ...\tumount /var/lib/ceph/osd/ceph-$osdnr"
umount /var/lib/ceph/osd/ceph-$osdnr

echo -e "*** Running ...\tdd if=/dev/zero of=/dev/$devname bs=1M count=2048"
dd if=/dev/zero of=/dev/$devname bs=1M count=2048

echo -e "*** Running ...\tsgdisk -Z /dev/$devname"
sgdisk -Z /dev/$devname

echo -e "*** Running ...\tsgdisk -g /dev/$devname"
sgdisk -g /dev/$devname

echo -e "*** Running ...\tpartprobe /dev/$devname"
partprobe /dev/$devname

echo -e "*** Running ...\tceph-disk zap /dev/$devname"
ceph-disk zap /dev/$devname

echo -e "*** Running ...\tceph osd out $osdnr"
ceph osd out $osdnr

echo -e "*** Running ...\tceph osd crush remove osd.$osdnr"
ceph osd crush remove osd.$osdnr

echo -e "*** Running ...\tceph auth del osd.$osdnr"
ceph auth del osd.$osdnr

echo -e "*** Running ...\tceph osd rm $osdnr"
ceph osd rm $osdnr

echo -e "*** Running ...\tpartprobe /dev/$devname"
partprobe /dev/$devname
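
After the zap, the OSD can be recreated against the new DB disk with something along these lines (sdd / nvme0n1 are just placeholder device names here, and the exact option name differs between PVE versions, so check pveceph help first):

pveceph createosd /dev/sdd --journal_dev /dev/nvme0n1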

Hope it helps someone on the interwebz :D
 
/etc/pve/proxmox_custom_conf/zap_disk.sh (putting the script here syncs it to all other nodes as well!)
Warning: the database behind /etc/pve holds only 30 MB (and lives in RAM); keep in mind that filling up /etc/pve can bring your cluster to a halt.
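If you want to see how much of that space is in use, df works on the pmxcfs mount, e.g.:

df -h /etc/pve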

A comment about your script: while it is a handy replacement for running all those commands by hand, please be careful. You are stopping and wiping the disk before removing the OSD from the cluster, and nothing in the script verifies that the OSD ID actually belongs to the disk you want to remove.
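One way to add such a check (just a sketch, assuming a ceph-disk/BlueStore layout where /var/lib/ceph/osd/ceph-<nr>/block is a symlink to the data partition) would be to show which device the OSD actually uses and ask for confirmation before the destructive steps:

# hypothetical check, to be placed before the stop/umount/dd part of the script
actualdev=$(readlink -f /var/lib/ceph/osd/ceph-$osdnr/block)
echo "osd.$osdnr is backed by $actualdev, you entered /dev/$devname"
read -p "Really zap /dev/$devname? [y/N] " answer
[ "$answer" = "y" ] || exit 1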
 
Thanks for the heads up!
I will add some warnings to the script and update it on the forum here.

It would be good if the destroy button actually destroyed the disk from the GUI.
Is it possible for me to edit the commands of the GUI button in Proxmox?
 
It would be good if the destroy button actually destroyed the disk from the GUI.
Is it possible for me to edit the commands of the GUI button in Proxmox?
Depends on how good your programming skills are. Patches are always welcome. ;)

The CLI/GUI does not use dd to remove the leftover parts of an OSD afterwards; that is usually only needed when the same disk is reused as an OSD. As ceph-disk is now deprecated (since Mimic) in favor of ceph-volume, OSD create/destroy will change in the future anyway.
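For reference, with ceph-volume an OSD with a separate DB device is created roughly like this (device names are placeholders; the pveceph tooling will wrap this once it switches over):

ceph-volume lvm create --data /dev/sdd --block.db /dev/nvme0n1p1
ceph-volume lvm zap /dev/sdd --destroy    # zapping when the disk is reused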

But you can shorten your script by using 'pveceph destroyosd <NUM>' and running dd afterwards.
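A shortened variant along those lines could look like this (untested sketch; OSD number and device name are placeholders):

pveceph destroyosd 3                           # out, stop, and remove the example osd.3 from the cluster
dd if=/dev/zero of=/dev/sdd bs=1M count=2048   # wipe leftover data at the start of the example disk
sgdisk -Z /dev/sdd                             # clear its partition tables
partprobe /dev/sdd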
 
