[SOLVED] Ceph crash, monitors full

Fred Saunier

Well-Known Member
Aug 24, 2017
Brussels, BE
Hello all,

We have been successfully running a 5-node Ceph Nautilus cluster on Proxmox 6.4 for a while. Unfortunately, we suffered a serious power failure that crashed Ceph, with 5 of the 27 OSDs being ejected from the cluster. These could only be reinserted after being destroyed and re-created.
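
For reference, roughly the commands involved in destroying and re-creating an ejected OSD on Proxmox (a sketch; osd id 12 and /dev/sdX are placeholders for your own ids and devices):
Bash:
# mark the failed OSD out and make sure its daemon is stopped
ceph osd out 12
systemctl stop ceph-osd@12
# destroy it, wiping its data, then re-create it on the same disk
pveceph osd destroy 12 --cleanup
pveceph osd create /dev/sdX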

Ceph has been rebuilding since then.

However, the Ceph cluster then became completely unresponsive after the root partitions of the 3 monitors filled up. I noticed that /var/log/ceph/ceph.log had grown huge on all 3 monitors; deleting this file freed the space and got Ceph running (and still rebuilding) again.
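
In case it helps someone: instead of deleting the log, it can be emptied in place, which avoids leaving the running daemons with a stale file handle (a sketch, paths as above):
Bash:
# check how large the cluster log has grown on each monitor
du -sh /var/log/ceph/ceph.log
# empty it in place; the daemons keep a valid handle on the file
truncate -s 0 /var/log/ceph/ceph.log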

Ceph has again become unresponsive, this time because the monitor database was inflating at an alarming rate. I tried the command
Bash:
ceph tell mon.prox5 compact
but nothing happened: the command never completed and had to be aborted.
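
The growth can be watched directly on the monitor's store (assuming the default data path for a monitor named prox5):
Bash:
# default mon data path is /var/lib/ceph/mon/<cluster>-<id>
watch du -sh /var/lib/ceph/mon/ceph-prox5/store.db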

I destroyed and re-created the monitor having the issue, leaving the other 2 running. The monitor was destroyed and the space was recovered, but as soon as it was re-created, store.db inflated again and filled up the entire / partition within a couple of minutes.
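
Destroying and re-creating a monitor on Proxmox looks roughly like this (a sketch, for the prox5 monitor above):
Bash:
# remove the monitor from the cluster and delete its local store
pveceph mon destroy prox5
# re-create it on the same node
pveceph mon create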

Any suggestion as to how to solve this?

Thanks
 
I am posting here what I did to solve my issue, as it may help someone else.

Upon investigation, it would appear it is not unusual for store.db to grow quite large during major rebuild operations. What I did to solve the issue was to connect an extra drive to each monitor node (2 TB drives to be safe, since I had them on hand, but smaller disks would probably have done the job as well). One of the monitors has its extra drive connected externally through a USB dock, for lack of other available space. I formatted each drive as ext4, mounted it temporarily under /mnt, and rsynced the content of /var/lib/ceph/mon to it like so:

Bash:
# format the new partition, no reserved blocks, labelled for clarity
mkfs.ext4 -m 0 -L "CEPHMON" /dev/sdh1
mkdir /mnt/sdh1
mount /dev/sdh1 /mnt/sdh1
# copy the mon data, preserving permissions, ACLs, and ownership
rsync -uaAvz --progress /var/lib/ceph/mon/ /mnt/sdh1

Declare the drive as a mount point in /etc/fstab
Bash:
UUID=XXXXXXX    /var/lib/ceph/mon    ext4    defaults     0    0
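
The UUID for the fstab entry can be read off the new partition with blkid:
Bash:
blkid /dev/sdh1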

Stop the monitor daemon (the monitor here being called 'prox3'):
Bash:
systemctl stop ceph-mon@prox3
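
Since the first rsync ran while the monitor was still writing to store.db, it is safer to do one final pass now that the daemon is stopped, so the copy is consistent before the original is deleted:
Bash:
# final sync with the mon stopped; --delete removes anything gone since the first pass
rsync -uaAv --delete /var/lib/ceph/mon/ /mnt/sdh1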

Delete the content of /var/lib/ceph/mon to reclaim space on /:
Bash:
# the mon daemon is stopped at this point; its data is already on the new drive
cd /var/lib/ceph/mon
rm -fr *

Mount the external drive on /var/lib/ceph/mon (using the fstab entry added above):
Bash:
mount /var/lib/ceph/mon
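
A quick check that the new drive is mounted and the mon data is visible before restarting:
Bash:
df -h /var/lib/ceph/mon
ls /var/lib/ceph/mon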

Restart the daemon and check that it is now running normally:
Bash:
systemctl start ceph-mon@prox3
systemctl status ceph-mon@prox3

I then did the same on the other 2 monitors, prox4 and prox5. After a couple of minutes, Ceph became available again and resumed its rebuild operations.
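
Once the rebuild finishes, the inflated monitor stores should compact back down; the same compact command that was hanging during recovery can be retried, and the cluster state checked with:
Bash:
ceph -s
ceph tell mon.prox3 compact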
 