PVE8.2 how to properly reinstall Ceph?

_--James--_

Running through a few DR scenarios, and the scripts we used to reinstall Ceph (both at the cluster level and per node) are not working under 8. While we can pull Ceph off the node/cluster, when we go to add the second node back in (on either an existing or a new install), Ceph blows up at the cluster level and all Ceph members revert to an unconfigured state.

Below is the purge scripting we use; what are we missing?
Mind you, we want to be able to do both a full Ceph rebuild across the cluster if needed and a reinstall on a single PVE node without pulling Ceph off the rest of the cluster.


ceph osd down 0 && ceph osd destroy 0 --force
ceph osd down 1 && ceph osd destroy 1 --force
ceph osd down 2 && ceph osd destroy 2 --force
rm /etc/ceph/ceph.conf
systemctl stop ceph-mon@pve1
systemctl stop ceph-mon@pve2
systemctl stop ceph-mon@pve3
systemctl disable ceph-mon@pve1
systemctl disable ceph-mon@pve2
systemctl disable ceph-mon@pve3
umount /var/lib/ceph/osd/ceph-0
umount /var/lib/ceph/osd/ceph-1
umount /var/lib/ceph/osd/ceph-2
rm -rf /var/lib/ceph
killall -9 ceph-mon ceph-mgr ceph-mds
rm -rf /etc/systemd/system/ceph*
reboot

pveceph purge
apt purge ceph-mon ceph-osd ceph-mgr ceph-mds
apt purge ceph-base ceph-mgr-modules-core
rm -rf /var/lib/ceph/mon/ /var/lib/ceph/mgr/ /var/lib/ceph/mds/
rm -f /etc/pve/ceph/*
rm -rf /etc/pve/ceph
rm -r /etc/pve/ceph.conf
rm -r /etc/ceph
rm -rf /etc/pve/priv/ceph.*
reboot
 
So, that worked fine up to 8.1-4 but no longer completely works with 8.2.2. It still uninstalls Ceph, but components are left behind that prevent the reinstall and reconfiguration. We are trying to find out what changed between the builds, and it seems to be in corosync.
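
For anyone comparing notes, a quick way to see what is actually left behind after a purge is to check the paths and units the scripts in this thread touch (just a diagnostic sketch, adjust the paths to your layout):

systemctl list-unit-files 'ceph*'
ls -ld /var/lib/ceph /etc/ceph /etc/pve/ceph /etc/pve/ceph.conf /etc/pve/priv/ceph.* 2>/dev/null
dpkg -l | grep -i ceph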

So far, it seems that we have to completely nuke and pave the target host we want to rebuild Ceph on, then pull the host from the cluster and the Ceph config, etc. That works as expected. But if we want to do a Ceph reinstall on an existing target host, it will destroy Ceph for the existing cluster unless we nuke and pave.
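
For reference, the "pull the host from the cluster" part is just the standard PVE node removal, run from one of the remaining cluster members with the target node powered off (node name is an example):

pvecm delnode pve3
pvecm status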
 

Came across your thread having a similar issue. Did you try this?
https://forum.proxmox.com/threads/h...fig-and-start-from-scratch.68949/#post-310629

rm -rf /etc/systemd/system/ceph*
killall -9 ceph-mon ceph-mgr ceph-mds
rm -rf /var/lib/ceph/mon/ /var/lib/ceph/mgr/ /var/lib/ceph/mds/
pveceph purge
apt -y purge ceph-mon ceph-osd ceph-mgr ceph-mds
rm /etc/init.d/ceph
rm -rf /etc/ceph

for i in $(apt search ceph | grep installed | awk -F/ '{print $1}'); do apt reinstall $i; done
dpkg-reconfigure ceph-base
dpkg-reconfigure ceph-mds
dpkg-reconfigure ceph-common
dpkg-reconfigure ceph-fuse
for i in $(apt search ceph | grep installed | awk -F/ '{print $1}'); do apt reinstall $i; done

(I added the rm -rf /etc/ceph.) It got me to the point of being able to reinstall Ceph from the GUI, but it's still having issues.
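
If the GUI install keeps stalling, the CLI route may show more detail; the repository/version flags depend on your setup, so treat this as a sketch:

pveceph install --repository no-subscription
pveceph status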
 
Thanks for the reply; I forgot to update this thread with our findings!

Here is what we are now doing to pull Ceph off a node so it can be reconfigured back into the cluster. The entire goal was to be able to DR Ceph if we needed to reinstall it on the cluster and/or any given host. Prior to 8.2 we had Ceph upgrade failures that sometimes required purging hosts to reinstall Ceph at the desired version.

(all of this was tested with and without the corosync host purge process)

#on host to be purged
#mark OSDs down, edit for all target OSDs
ceph osd down 4 && ceph osd destroy 4 --force
ceph osd down 5 && ceph osd destroy 5 --force

#stop services and delete config/key files - add ceph-mon@hostid as required
systemctl stop ceph-mon@pve3
systemctl disable ceph-mon@pve3
rm -rf /etc/pve/ceph.conf
rm /etc/ceph/ceph.conf
rm -rf /etc/systemd/system/ceph*
rm -rf /var/lib/ceph
killall -9 ceph-mon ceph-mgr ceph-mds

## Look for - Removed /etc/systemd/system/ceph-mon.target.wants/ceph-mon@labnode1.service.
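
#(optional sanity check) confirm no ceph units are still loaded or enabled on this host
systemctl list-units --all 'ceph*'
systemctl list-unit-files 'ceph*'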

#purge ceph from node
pveceph purge

#clean up OSD LVMs - add OSD IDs as needed (ceph-#), add /dev/nvme* as needed.
umount /var/lib/ceph/osd/ceph-4
umount /var/lib/ceph/osd/ceph-5
ceph-volume lvm zap /dev/nvme4n1 --destroy && ceph-volume lvm zap /dev/nvme5n1 --destroy
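#(optional) confirm the zap removed the ceph LVs from the devices; 'ceph-volume lvm list' run before the
#zap also shows which device backs which OSD ID (device paths here are our lab examples)
ceph-volume lvm list
lsblk /dev/nvme4n1 /dev/nvme5n1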

#remove Ceph install
apt purge ceph-mon ceph-osd ceph-mgr ceph-mds
apt purge ceph-base ceph-mgr-modules-core

#clean up left over ceph data
rm -rf /var/lib/ceph/mon/ /var/lib/ceph/mgr/ /var/lib/ceph/mds/
rm -f /etc/pve/ceph/*
rm -rf /etc/pve/ceph
rm -r /etc/pve/ceph.conf
rm -r /etc/ceph
rm -rf /etc/pve/priv/ceph.*

#reboot was not necessary during testing, but in a production environment I would reboot purged hosts


And here is what we are now doing to pull the desired node and its OSDs out of the Ceph cluster:

#on a clustered ceph node
#destroy the purged host's OSDs, repeat for all target OSDs
ceph osd purge 4 --yes-i-really-mean-it
ceph osd purge 5 --yes-i-really-mean-it
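#note: 'ceph osd purge' already removes each osd from the crush map, deletes its auth key and drops it
#from the osd map, so the 'auth del' / 'osd rm' steps below are belt-and-braces (removing the host bucket
#via 'osd crush remove' is still a separate step); with many osds a loop helps, e.g.:
#for id in 4 5; do ceph osd purge "$id" --yes-i-really-mean-it; done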

#remove purged node from the ceph crush map
ceph osd crush remove pve3

#remove purged osd auth keys
ceph auth del osd.4
ceph auth del osd.5

#remove purged osd from ceph_db
ceph osd rm 4
ceph osd rm 5

#remove purged monitors
ceph mon remove pve3
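
#(sanity check) the purged node should now be gone from the mon map and the crush tree
ceph mon dump
ceph osd tree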

#validate global config, if still listed remove references to purged host/osds/monitors
cat /etc/pve/ceph.conf

#validate ceph config, should match global config
cat /etc/ceph/ceph.conf


#reinstall Ceph on purged node(s)
#if errors during reinstall
mkdir /var/lib/ceph
mkdir /var/lib/ceph/bootstrap-osd
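#on a stock install /var/lib/ceph and bootstrap-osd are owned by ceph:ceph, so if the reinstall
#complains about permissions, fixing ownership may help (sketch, verify against a healthy node first)
chown ceph:ceph /var/lib/ceph /var/lib/ceph/bootstrap-osd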

#if errors during configuration
mkdir /etc/pve/ceph

#add monitors/osds back in
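#re-adding can be done from the GUI or with pveceph, roughly like this (device paths are examples,
#match them to your own layout)
pveceph mon create
pveceph mgr create
pveceph osd create /dev/nvme4n1
pveceph osd create /dev/nvme5n1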
 
This worked for me on 8.2.7. The only difference is I couldn't unmount and destroy the volumes; I had to destroy the OSDs after reinstallation and then add them back.
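
For anyone else hitting that, the CLI equivalent of the destroy/re-add dance is roughly the following (OSD ID and device path are examples; check which disk backs which OSD first):

pveceph osd destroy 4 --cleanup
pveceph osd create /dev/nvme4n1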
 
