ceph monitor stopped, cannot restart... troubleshooting suggestions?

jnewman33

Hey all,

Hobbyist user here. I have a three-node cluster with Ceph, and after being away on vacation I returned to find a monitor down and the following status message:

root@pve01:~# systemctl status ceph-mon@pve01
× ceph-mon@pve01.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
└─ceph-after-pve-cluster.conf
Active: failed (Result: signal) since Sun 2024-05-19 15:07:58 EDT; 22min ago
Duration: 86ms
Process: 5432 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id pve01 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
Main PID: 5432 (code=killed, signal=ABRT)
CPU: 60ms

May 19 15:07:58 pve01 systemd[1]: ceph-mon@pve01.service: Scheduled restart job, restart counter is at 6.
May 19 15:07:58 pve01 systemd[1]: Stopped ceph-mon@pve01.service - Ceph cluster monitor daemon.
May 19 15:07:58 pve01 systemd[1]: ceph-mon@pve01.service: Start request repeated too quickly.
May 19 15:07:58 pve01 systemd[1]: ceph-mon@pve01.service: Failed with result 'signal'.
May 19 15:07:58 pve01 systemd[1]: Failed to start ceph-mon@pve01.service - Ceph cluster monitor daemon.
May 19 15:28:17 pve01 systemd[1]: ceph-mon@pve01.service: Start request repeated too quickly.
May 19 15:28:17 pve01 systemd[1]: ceph-mon@pve01.service: Failed with result 'signal'.
May 19 15:28:17 pve01 systemd[1]: Failed to start ceph-mon@pve01.service - Ceph cluster monitor daemon.
root@pve01:~#

I have been looking at other posts, but I may be in over my head. I'm not sure which logs would be helpful here. Any suggestions on where to start troubleshooting this?
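
Would it make sense to try running the monitor in the foreground to see if it prints anything more useful, using the ExecStart command from the unit above (with ${CLUSTER} filled in as the default "ceph")?

root@pve01:~# /usr/bin/ceph-mon -f --cluster ceph --id pve01 --setuser ceph --setgroup ceph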


Thanks,
James
 
gurubert,

Thank you so much for your reply. I did have a look at the logs you referenced, but I couldn't pick out an obvious starting point for the error.
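
For reference, this is roughly how I pulled them, assuming the default Ceph log locations:

root@pve01:~# journalctl -u ceph-mon@pve01 -b --no-pager
root@pve01:~# less /var/log/ceph/ceph-mon.pve01.log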

I have also been looking at the monitor syslog via the GUI and see the following a lot in that log:

May 14 11:28:53 pve01 ceph-mon[1362]: ./src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_transaction(TransactionRef)' thread 7d24f91296c0 time 2024-05-14T11:28:53.634348-0400
May 14 11:28:53 pve01 ceph-mon[1362]: ./src/mon/MonitorDBStore.h: 355: ceph_abort_msg("failed to write to db")
May 14 11:28:53 pve01 ceph-mon[1362]: ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
May 14 11:28:53 pve01 ceph-mon[1362]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd4) [0x7d24ff866dd3]
May 14 11:28:53 pve01 ceph-mon[1362]: 2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0xa8a) [0x5b24aed32a4a]

This is only a small snippet. Does this help in determining my next steps?

Thanks again for your help,
James
 
The /var/lib/ceph filesystem appears to be OK and disk utilization is at 1%.
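
For reference, I checked with roughly the following (the exact mon store path is my assumption based on the default layout):

root@pve01:~# df -h /var/lib/ceph
root@pve01:~# du -sh /var/lib/ceph/mon/ceph-pve01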

Is it possible to just delete this monitor and create a new one? Will that restore quorum?
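
If so, would the rough sequence be something like this, run on pve01 once the other two monitors are confirmed healthy (guessing at the exact pveceph syntax from the Proxmox docs, so please correct me)?

root@pve01:~# systemctl stop ceph-mon@pve01
root@pve01:~# pveceph mon destroy pve01
root@pve01:~# pveceph mon create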
 