CEPH cluster-^%&**

May 12, 2023
Excuse the bad language - but the situation calls for it.

So, we have a 3-node cluster with Ceph (well, had). It was decided to implement a full-mesh setup for Ceph, separating the storage traffic from the rest... All good, no worries... until we had a catastrophic host HDD failure, forcing us to re-install PVE and re-join the cluster, also not much of an issue.
UNTIL... Ceph was installed on the new node... and froze up completely... forcing a reboot... which somehow annihilated the whole cluster... monitors dead, mgr dead, MDS... yes, also dead.
I have attempted purge and install with no luck.
ceph -s times out
create mon times out

PVE cluster service and corosync all fine

I have the OSD data, and I do have a copy of ONE monitor store and the keyrings.
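Since the OSD data is intact, the Ceph troubleshooting docs describe a last-resort path: rebuilding the monitor store from the OSDs themselves with ceph-objectstore-tool and ceph-monstore-tool. A rough sketch of that procedure (paths, and the keyring location, are placeholders; run on each node holding OSDs while the daemons are stopped):

```shell
# Sketch only: rebuild the mon store from surviving OSDs (paths are placeholders).
ms=/root/mon-store
mkdir -p "$ms"

# Gather cluster map info from every local OSD into the scratch store
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path "$osd" --no-mon-config \
    --op update-mon-db --mon-store-path "$ms"
done

# Rebuild a monitor DB from it, using a keyring that holds the mon. and admin keys
ceph-monstore-tool "$ms" rebuild -- --keyring /path/to/admin.keyring
```

The rebuilt store then replaces store.db under /var/lib/ceph/mon/&lt;mon-id&gt;/ before starting the monitor. This loses some metadata (e.g. MDS maps need re-creating), so treat it as the option of last resort the docs say it is.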

Running contingency - essential service VMs restored from backup to "static", un-Ceph'd drives - very much not ideal.

Please, someone send up a flare in the darkness and help a noob-ish guy sort this nightmare out.
 
There was a procedure for restoring from one monitor, but I cannot find it here; there is probably a way to search through the forum. But some logs are still needed.
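For reference, the procedure this is probably referring to is the one in the Ceph docs for recovering quorum when only one monitor survives: extract the monmap from the surviving mon's store, remove the dead monitors from it, and inject it back so the survivor stops waiting for them. Roughly (the mon ID "survivor" and the dead mon names are placeholders):

```shell
# Sketch only: shrink the monmap down to the surviving monitor (IDs are placeholders).
systemctl stop ceph-mon@survivor

# Extract the monmap from the surviving monitor's store
ceph-mon -i survivor --extract-monmap /tmp/monmap

# Inspect it, then remove the dead monitors
monmaptool --print /tmp/monmap
monmaptool /tmp/monmap --rm dead-mon-1
monmaptool /tmp/monmap --rm dead-mon-2

# Inject the trimmed map back and restart the monitor
ceph-mon -i survivor --inject-monmap /tmp/monmap
systemctl start ceph-mon@survivor
```

Once that single mon is up and has quorum by itself, new monitors can be added back one at a time.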
 
I decided to re-install Ceph on all nodes using:
# restart the PVE stats daemon so the GUI keeps up with the changes
systemctl restart pvestatd
# remove the systemd units and kill any surviving Ceph daemons
rm -rf /etc/systemd/system/ceph*
killall -9 ceph-mon ceph-mgr ceph-mds
# wipe the Ceph configuration and state, then let PVE purge the rest
rm -rf /etc/ceph /etc/pve/ceph.conf /etc/pve/priv/ceph* /var/lib/ceph
pveceph purge
systemctl restart pvestatd
# remove the daemon packages and the leftover init script
apt purge ceph-mon ceph-osd ceph-mgr ceph-mds
systemctl restart pvestatd
rm /etc/init.d/ceph
# reinstall whatever Ceph packages are still installed
for i in $(apt search ceph | grep installed | awk -F/ '{print $1}'); do apt reinstall $i; done
dpkg-reconfigure ceph-base
dpkg-reconfigure ceph-mds
dpkg-reconfigure ceph-common
dpkg-reconfigure ceph-fuse
for i in $(apt search ceph | grep installed | awk -F/ '{print $1}'); do apt reinstall $i; done
systemctl restart pvestatd
# recreate the expected directory layout
mkdir -p /etc/ceph
mkdir -p /var/lib/ceph/bootstrap-osd
mkdir -p /var/lib/ceph/mgr
mkdir -p /var/lib/ceph/mon
# install and initialize Ceph through PVE again
pveceph install
systemctl restart pvestatd
pveceph init
systemctl restart pvestatd


which worked - until after creating the first monitor; then none of the other nodes could reach the Ceph cluster, all timeouts.
I confirmed comms on the configured IPs; cluster services are all OK.
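One thing worth double-checking, given the full-mesh setup: that the freshly generated ceph.conf on every node actually points at the mesh addresses. If public_network / mon_host still reference the old management network (or the pre-failure monitor IPs), every node other than the one hosting the new mon will time out exactly like this. Something along these lines, where the 10.10.10.0/24 addresses are made up for illustration:

```
[global]
    # these must match the mesh/storage subnet, not the management LAN
    public_network  = 10.10.10.0/24
    cluster_network = 10.10.10.0/24
    mon_host        = 10.10.10.1
    fsid            = <your cluster fsid>
```

Since /etc/pve/ceph.conf is shared across the cluster via pmxcfs, a wrong entry there propagates to all nodes at once.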
I am at a loss.