ceph monitor stopped cannot restart..troubleshooting suggestions?

jnewman33 · May 19, 2024

Hey all,

Hobbyist user here. I have a three node cluster with ceph and after being away on vacation returrned to find a monitor down and the following status message:

root@pve01:~# systemctl status ceph-mon@pve01
× ceph-mon@pve01.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
└─ceph-after-pve-cluster.conf
Active: failed (Result: signal) since Sun 2024-05-19 15:07:58 EDT; 22min ago
Duration: 86ms
Process: 5432 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id pve01 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
Main PID: 5432 (code=killed, signal=ABRT)
CPU: 60ms

May 19 15:07:58 pve01 systemd[1]: ceph-mon@pve01.service: Scheduled restart job, restart counter is at 6.
May 19 15:07:58 pve01 systemd[1]: Stopped ceph-mon@pve01.service - Ceph cluster monitor daemon.
May 19 15:07:58 pve01 systemd[1]: ceph-mon@pve01.service: Start request repeated too quickly.
May 19 15:07:58 pve01 systemd[1]: ceph-mon@pve01.service: Failed with result 'signal'.
May 19 15:07:58 pve01 systemd[1]: Failed to start ceph-mon@pve01.service - Ceph cluster monitor daemon.
May 19 15:28:17 pve01 systemd[1]: ceph-mon@pve01.service: Start request repeated too quickly.
May 19 15:28:17 pve01 systemd[1]: ceph-mon@pve01.service: Failed with result 'signal'.
May 19 15:28:17 pve01 systemd[1]: Failed to start ceph-mon@pve01.service - Ceph cluster monitor daemon.
root@pve01:~#

I have been looking at other posts but I may be in over my head. Not sure what logs would be helpful here? Any suggestions on where to start troubleshooting this?

Thanks,
James

gurubert · May 20, 2024

There should be a more detailed logfile in /var/log/ceph

jnewman33 · May 20, 2024

gurubert-

Thank you so much for your reply. I did have a look at the logs you referenced but failed to recognize anything as the obvious error starting point.

I have also been looking at the monitor syslog via the GUI and see the following ALOT in that log:

May 14 11:28:53 pve01 ceph-mon[1362]: ./src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_transaction(TransactionRef)' thread 7d24f91296c0 time 2024-05-14T11:28:53.634348-0400
May 14 11:28:53 pve01 ceph-mon[1362]: ./src/mon/MonitorDBStore.h: 355: ceph_abort_msg("failed to write to db")
May 14 11:28:53 pve01 ceph-mon[1362]: ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
May 14 11:28:53 pve01 ceph-mon[1362]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd4) [0x7d24ff866dd3]
May 14 11:28:53 pve01 ceph-mon[1362]: 2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0xa8a) [0x5b24aed32a4a]

This is a small snippet. Does this help in determng my next steps?

Thanks again for you help,
James

gurubert · May 20, 2024

"failed to write to db" hints to a filesystem issue.

Is the filesystem for /var/lib/ceph on that node OK? Full maybe?

jnewman33 · May 20, 2024

The /var/lib/ceph filesystem appears to be OK and disk utilization is at 1%.

Is it possible to just delete and re-create a new monitor? Will that restore quorum?

gurubert · May 21, 2024

Is there a store.db directory in /var/lib/ceph/mon/ceph-pve01 ?

jnewman33 · May 21, 2024

gurubert-

Yes, that directory exists.

gurubert · May 22, 2024

jnewman33 said:
Is it possible to just delete and re-create a new monitor? Will that restore quorum?

Yes to both questions. It looks like the only viable option now.

You do have two other MONs still running, correct?

jnewman33 · May 22, 2024

gurubert-

Yes, the other two nodes and their respective OSD and monitors are healthy.

Search

Search

ceph monitor stopped cannot restart..troubleshooting suggestions?

jnewman33

New Member

gurubert

Distinguished Member

jnewman33

New Member

gurubert

Distinguished Member

jnewman33

New Member

gurubert

Distinguished Member

jnewman33

New Member

gurubert

Distinguished Member

jnewman33

New Member

We value your privacy