CEPH cluster hangs. PLEASE HELP! URGENTLY!

albert_a

Well-Known Member
Mar 22, 2018
Hello, everybody!

Serious trouble.

`ceph status` hangs, every ceph command hangs!

PVE shows "got timeout"

I reinstalled the node and tried to rejoin it just by copying the keys, but could not create the monitor (error: monitor existed). I then removed the monitor from the monmap, and at that point the whole cluster hung.
 
I played around with the following commands:
Code:
# modified monmap
ceph-mon -i <NODE> --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm mon.<NODE> /tmp/monmap
monmaptool --add mon.<NODE> <NODE_IP>:6789 /tmp/monmap
monmaptool --print /tmp/monmap
ceph-mon -i <NODE> --inject-monmap /tmp/monmap

# restarted the monitor, checked status, and ran it manually
systemctl restart ceph-mon@<NODE>
systemctl status ceph-mon@<NODE>
ceph-mon -f --cluster ceph --id <NODE> --setuser ceph --setgroup ceph

# fixed the permissions (which were changed by some of the above commands):
chown ceph:ceph -R /var/lib/ceph/mon/ceph-<NODE>/store.db
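
For reference, here is a rough sketch of the same monmap edit as one ordered sequence. It rests on a few assumptions that are not stated in the post: the monitor daemon is stopped before its monmap is extracted or injected, the monitor is named by its bare ID (which is how Ceph's documentation examples call `monmaptool --rm` and `--add`), and ownership of the monitor directory is restored before the daemon starts again. The `run` helper, the `fix_monmap` function and the `DRY_RUN` switch are hypothetical names made up for this illustration; with `DRY_RUN=1` (the default) the commands are only printed.

```shell
#!/bin/sh
# Sketch only: rewrite one monitor's entry in its local monmap.
# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# fix_monmap <monitor-id> <monitor-ip>
fix_monmap() {
    node="$1"
    ip="$2"
    map="/tmp/monmap"

    # Assumption: the monitor must not be running while its store is read or written.
    run systemctl stop "ceph-mon@${node}"

    run ceph-mon -i "$node" --extract-monmap "$map"
    run monmaptool --print "$map"

    # Assumption: bare monitor ID, without a "mon." prefix.
    run monmaptool --rm "$node" "$map"
    run monmaptool --add "$node" "${ip}:6789" "$map"

    run ceph-mon -i "$node" --inject-monmap "$map"

    # Running ceph-mon as root can leave root-owned files behind.
    run chown -R ceph:ceph "/var/lib/ceph/mon/ceph-${node}"
    run systemctl start "ceph-mon@${node}"
}
```

Calling `fix_monmap <NODE> <NODE_IP>` prints the eight steps in order; review them, then set `DRY_RUN=0` only once the printed commands match what the cluster actually needs.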

I tried to build a correct monmap on all the nodes and restarted the monitors several times, and some of these manipulations worked.
 
So things work now?
 
@Alwin Yes, it works now, thanks! But I'm afraid some inconsistencies might remain... So just to be sure, I will recreate the cluster node by node, while simultaneously copying data from the shrinking old Ceph cluster to the growing new one.

By the way, is there any way to do what I originally wanted?
I mean reinstalling the system on all the nodes in a cluster (for example, for security reasons) while keeping the Ceph data and keeping the Ceph cluster intact.
Is it possible?
 
You can re-install nodes, but they need to be removed from the Ceph & Corosync cluster first.
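
As an illustration of what "removed from the Ceph & Corosync cluster first" could look like, here is a rough sketch using the standard `ceph` and `pvecm` commands. It is not taken from this thread: the order of steps, the `remove_node` function, the `run` and `wait_safe` helpers and the `DRY_RUN` switch are all assumptions made for this example. It also assumes a Ceph release that has `ceph osd safe-to-destroy` and `ceph osd purge` (Luminous or later). With `DRY_RUN=1` (the default) nothing is executed.

```shell
#!/bin/sh
# Sketch only: take one node out of Ceph and Corosync before reinstalling it.
# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: block until Ceph reports the OSD can be removed safely.
wait_safe() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ (wait until) ceph osd safe-to-destroy osd.$1"
    else
        until ceph osd safe-to-destroy "osd.$1"; do
            sleep 60
        done
    fi
}

# remove_node <node-name> <osd-id>...
remove_node() {
    node="$1"
    shift

    for id in "$@"; do
        # Mark the OSD out so its data migrates to the remaining OSDs.
        run ceph osd out "$id"
        wait_safe "$id"
        # This step has to run on the node that hosts the OSD.
        run systemctl stop "ceph-osd@${id}"
        # Remove the OSD from the CRUSH map, auth keys and OSD map.
        run ceph osd purge "$id" --yes-i-really-mean-it
    done

    # Drop the node's monitor from the cluster while quorum still exists.
    run ceph mon remove "$node"

    # Run from a remaining node, after the old node is shut down.
    run pvecm delnode "$node"
}
```

For example, `remove_node <NODE> 3 4` prints the plan for a node hosting OSDs 3 and 4. After the reinstall, the node would be added back to both clusters as a new member, with fresh OSDs and a fresh monitor.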