CEPH cluster-^%&**

May 12, 2023
Excuse the bad language - but the situation calls for it.

So, we have a 3-node cluster with Ceph (well, had). It was decided to implement a full-mesh setup for Ceph, separating the storage traffic from the rest. All good, no worries... until we had a catastrophic host HDD failure, forcing us to re-install PVE and re-join the cluster, which was also not much of an issue.
UNTIL... Ceph was installed on the new node and froze up completely, forcing a reboot, which somehow annihilated the whole cluster: monitors dead, mgr dead, mds... yes, also dead.
I have attempted a purge and re-install with no luck.
ceph -s times out
Creating a monitor times out

The PVE cluster service and corosync are all fine.

I have the OSD data, and I do have a copy of ONE monitor store and the keyrings.

Running contingency - essential service VMs restored from backup to "static" un-Cephed drives - very much not ideal.

Please someone send up a flare in the darkness and help a noob-ish guy out to sort this nightmare out.
 
There was a procedure for restoring from one monitor, but I cannot find it here; there is probably a way to search through the forum for it. Some logs are still needed, though.
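For reference, the single-monitor approach described in the Ceph documentation (as far as I recall, under removing monitors from an unhealthy cluster) comes down to three steps: extract the monmap from the surviving monitor, remove the dead monitors from it, and inject it back. Below is a rough sketch, not a tested recipe. It assumes the saved monitor store has been copied back to /var/lib/ceph/mon/ceph-<id> and that ceph-mon is stopped. The function names, the /tmp/monmap path and the monitor IDs in the usage comment are placeholders. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: shrink the monmap to the one surviving monitor.
# Assumption: the saved mon store is back in /var/lib/ceph/mon/ceph-<survivor>
# and the ceph-mon service is stopped before any of this is run for real.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: shrink_monmap <surviving-mon-id> <dead-mon-id>...
shrink_monmap() {
    local survivor="$1" map="/tmp/monmap" dead
    shift
    # 1. Extract the current monmap from the surviving monitor's store.
    run ceph-mon -i "$survivor" --extract-monmap "$map"
    # 2. Remove every monitor that no longer exists.
    for dead in "$@"; do
        run monmaptool "$map" --rm "$dead"
    done
    # 3. Show the result, then inject it back into the survivor.
    run monmaptool --print "$map"
    run ceph-mon -i "$survivor" --inject-monmap "$map"
}

# Example (placeholder IDs): shrink_monmap pve1 pve2 pve3
```

Review the printed commands first; only set DRY_RUN=0 once the monitor IDs match what is really in the saved store.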
 
I decided to re-install Ceph on all nodes using:
systemctl restart pvestatd
rm -rf /etc/systemd/system/ceph*
killall -9 ceph-mon ceph-mgr ceph-mds
rm -rf /etc/ceph /etc/pve/ceph.conf /etc/pve/priv/ceph* /var/lib/ceph
pveceph purge
systemctl restart pvestatd
apt purge ceph-mon ceph-osd ceph-mgr ceph-mds
systemctl restart pvestatd
rm /etc/init.d/ceph
for i in $(apt search ceph | grep installed | awk -F/ '{print $1}'); do apt reinstall $i; done
dpkg-reconfigure ceph-base
dpkg-reconfigure ceph-mds
dpkg-reconfigure ceph-common
dpkg-reconfigure ceph-fuse
for i in $(apt search ceph | grep installed | awk -F/ '{print $1}'); do apt reinstall $i; done
systemctl restart pvestatd
mkdir -p /etc/ceph
mkdir -p /var/lib/ceph/bootstrap-osd
mkdir -p /var/lib/ceph/mgr
mkdir -p /var/lib/ceph/mon
pveceph install
systemctl restart pvestatd
pveceph init
systemctl restart pvestatd


This worked until after creating the first monitor; then none of the other nodes could reach the Ceph cluster, all timeouts.
I confirmed comms on the configured IPs, and the cluster services are all OK.
I am at a loss.
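For anyone retracing the connectivity check, here is a minimal sketch of probing the monitor ports from each node, since a ping on the mesh IPs does not prove the monitor ports are reachable. It assumes the monitors listen on the default Ceph ports 3300 (msgr2) and 6789 (msgr1), and that mon_host in /etc/pve/ceph.conf lists plain IPv4 addresses. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: list the monitor IPs from a ceph.conf and probe the default mon ports.
# Assumption: mon_host holds plain space- or comma-separated IPv4 addresses.

# Print one monitor address per line from the given config file.
mon_hosts() {
    local conf="${1:-/etc/pve/ceph.conf}"
    grep -E '^[[:space:]]*mon[_ ]host[[:space:]]*=' "$conf" \
        | head -n 1 \
        | cut -d= -f2- \
        | tr ', ' '\n\n' \
        | sed '/^$/d'
}

# Try a TCP connect to both default monitor ports on every listed address.
probe_mons() {
    local conf="${1:-/etc/pve/ceph.conf}" host port
    for host in $(mon_hosts "$conf"); do
        for port in 3300 6789; do
            # bash's /dev/tcp pseudo-device opens a TCP connection; give up after 3 s.
            if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
                echo "$host:$port reachable"
            else
                echo "$host:$port NOT reachable"
            fi
        done
    done
}

# Example: run "probe_mons" on every node and compare the results.
```

If a port shows as not reachable from one node but reachable from another, that points at the mesh links or a firewall rather than at Ceph itself.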
 
