[SOLVED] Ceph crash, monitors full

Fred Saunier

Well-Known Member
Aug 24, 2017
Brussels, BE
Hello all,

We have been successfully running a 5-node Ceph Nautilus cluster on Proxmox 6.4 for a while. Unfortunately, we suffered a serious power failure that crashed Ceph, with 5 of the 27 OSDs being ejected from the cluster. These could only be reinserted after being destroyed and re-created.
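
For reference, roughly the commands involved in destroying and re-creating an ejected OSD on Proxmox (a sketch; osd id 12 and /dev/sdX are placeholders for your own ids and devices):
Bash:
# mark the failed OSD out and make sure its daemon is stopped
ceph osd out 12
systemctl stop ceph-osd@12
# destroy it, wiping its data, then re-create it on the same disk
pveceph osd destroy 12 --cleanup
pveceph osd create /dev/sdX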

Ceph has been rebuilding since then.

However, the Ceph cluster then became completely unresponsive after the root partitions of the 3 monitors filled up. I noticed that /var/log/ceph/ceph.log had grown huge on all 3 monitors; deleting this file freed the space and got Ceph running (and still rebuilding) again.
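
In case it helps someone: instead of deleting the log, it can be emptied in place, which avoids leaving the running daemons with a stale file handle (a sketch, paths as above):
Bash:
# check how large the cluster log has grown on each monitor
du -sh /var/log/ceph/ceph.log
# empty it in place; the daemons keep a valid handle on the file
truncate -s 0 /var/log/ceph/ceph.log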

Ceph has again become unresponsive, this time because the monitor database was inflating at an alarming rate. I tried the command
Bash:
ceph tell mon.prox5 compact
but nothing happened: the command never completed and had to be aborted.
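
The growth can be watched directly on the monitor's store (assuming the default data path for a monitor named prox5):
Bash:
# default mon data path is /var/lib/ceph/mon/<cluster>-<id>
watch du -sh /var/lib/ceph/mon/ceph-prox5/store.db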

I destroyed and re-created the monitor having the issue, leaving the other 2 running. The monitor was destroyed and the space was recovered, but as soon as it was re-created, store.db inflated again and filled up the entire / partition within a couple of minutes.
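
Destroying and re-creating a monitor on Proxmox looks roughly like this (a sketch, for the prox5 monitor above):
Bash:
# remove the monitor from the cluster and delete its local store
pveceph mon destroy prox5
# re-create it on the same node
pveceph mon create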

Any suggestion as to how to solve this?

Thanks
 
I am posting here what I did to solve my issue, as it may help someone else.

Upon investigation, it would appear it is not unusual for store.db to grow quite large during major rebuild operations. What I did to solve the issue was to connect an extra drive to each monitor node (2 TB drives to be safe, since I had them on hand, but smaller disks would probably have done the job as well). One of the monitors has its extra drive connected externally through a USB dock, for lack of other available space. I formatted each drive as ext4, mounted it temporarily under /mnt, and rsynced the content of /var/lib/ceph/mon to it like so:

Bash:
# format the new partition, no reserved blocks, labelled for clarity
mkfs.ext4 -m 0 -L "CEPHMON" /dev/sdh1
mkdir /mnt/sdh1
mount /dev/sdh1 /mnt/sdh1
# copy the mon data, preserving permissions, ACLs, and ownership
rsync -uaAvz --progress /var/lib/ceph/mon/ /mnt/sdh1

Declare the drive as a mount point in /etc/fstab
Bash:
UUID=XXXXXXX    /var/lib/ceph/mon    ext4    defaults     0    0
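
The UUID for the fstab entry can be read off the new partition with blkid:
Bash:
blkid /dev/sdh1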

Stop the monitor daemon (the monitor here being called 'prox3'):
Bash:
systemctl stop ceph-mon@prox3
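
Since the first rsync ran while the monitor was still writing to store.db, it is safer to do one final pass now that the daemon is stopped, so the copy is consistent before the original is deleted:
Bash:
# final sync with the mon stopped; --delete removes anything gone since the first pass
rsync -uaAv --delete /var/lib/ceph/mon/ /mnt/sdh1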

Delete the content of /var/lib/ceph/mon to reclaim space on /:
Bash:
# the mon daemon is stopped at this point; its data is already on the new drive
cd /var/lib/ceph/mon
rm -fr *

Mount the external drive on /var/lib/ceph/mon (using the fstab entry added above):
Bash:
mount /var/lib/ceph/mon
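
A quick check that the new drive is mounted and the mon data is visible before restarting:
Bash:
df -h /var/lib/ceph/mon
ls /var/lib/ceph/mon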

Restart the daemon and check that it is now running normally:
Bash:
systemctl start ceph-mon@prox3
systemctl status ceph-mon@prox3

I then did the same on the other 2 monitors, prox4 and prox5. After a couple of minutes, Ceph became available again and resumed its rebuild operations.
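
Once the rebuild finishes, the inflated monitor stores should compact back down; the same compact command that was hanging during recovery can be retried, and the cluster state checked with:
Bash:
ceph -s
ceph tell mon.prox3 compact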
 