Ceph monitor stopped, cannot restart... troubleshooting suggestions?

jnewman33

Hey all,

Hobbyist user here. I have a three-node cluster with Ceph, and after being away on vacation I returned to find a monitor down and the following status message:

root@pve01:~# systemctl status ceph-mon@pve01
× ceph-mon@pve01.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: signal) since Sun 2024-05-19 15:07:58 EDT; 22min ago
   Duration: 86ms
    Process: 5432 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id pve01 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
   Main PID: 5432 (code=killed, signal=ABRT)
        CPU: 60ms

May 19 15:07:58 pve01 systemd[1]: ceph-mon@pve01.service: Scheduled restart job, restart counter is at 6.
May 19 15:07:58 pve01 systemd[1]: Stopped ceph-mon@pve01.service - Ceph cluster monitor daemon.
May 19 15:07:58 pve01 systemd[1]: ceph-mon@pve01.service: Start request repeated too quickly.
May 19 15:07:58 pve01 systemd[1]: ceph-mon@pve01.service: Failed with result 'signal'.
May 19 15:07:58 pve01 systemd[1]: Failed to start ceph-mon@pve01.service - Ceph cluster monitor daemon.
May 19 15:28:17 pve01 systemd[1]: ceph-mon@pve01.service: Start request repeated too quickly.
May 19 15:28:17 pve01 systemd[1]: ceph-mon@pve01.service: Failed with result 'signal'.
May 19 15:28:17 pve01 systemd[1]: Failed to start ceph-mon@pve01.service - Ceph cluster monitor daemon.
root@pve01:~#

I have been looking at other posts, but I may be in over my head. I'm not sure which logs would be helpful here. Any suggestions on where to start troubleshooting this?
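In case it helps, here is roughly what I have checked so far (command paths assumed from a standard Proxmox/Ceph install; pve01 is both my node name and the monitor ID):

root@pve01:~# ceph -s                                  # overall cluster health and quorum
root@pve01:~# journalctl -u ceph-mon@pve01 -e          # systemd journal for the failed monitor
root@pve01:~# less /var/log/ceph/ceph-mon.pve01.log    # per-daemon Ceph log on this node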


Thanks,
James
 
gurubert,

Thank you so much for your reply. I did have a look at the logs you referenced, but I couldn't pick out an obvious starting point for the error.

I have also been looking at the monitor's syslog via the GUI, and I see the following a lot in that log:

May 14 11:28:53 pve01 ceph-mon[1362]: ./src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_transaction(TransactionRef)' thread 7d24f91296c0 time 2024-05-14T11:28:53.634348-0400
May 14 11:28:53 pve01 ceph-mon[1362]: ./src/mon/MonitorDBStore.h: 355: ceph_abort_msg("failed to write to db")
May 14 11:28:53 pve01 ceph-mon[1362]: ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
May 14 11:28:53 pve01 ceph-mon[1362]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd4) [0x7d24ff866dd3]
May 14 11:28:53 pve01 ceph-mon[1362]: 2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0xa8a) [0x5b24aed32a4a]

This is a small snippet. Does this help in determining my next steps?
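Since the abort message is "failed to write to db", I assume the monitor's RocksDB store (or the disk underneath it) is the problem, so my next thought was to look for I/O errors, something like this (the device name is a placeholder for whatever actually backs /var/lib/ceph):

root@pve01:~# dmesg -T | grep -iE 'i/o error|ata|nvme'   # kernel messages about failing reads/writes
root@pve01:~# smartctl -a /dev/sda                       # SMART health of the underlying disk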

Thanks again for your help,
James
 
The /var/lib/ceph filesystem appears to be OK and disk utilization is at 1%.
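For reference, that is based on the following checks (mon data path assumed for a default Proxmox/Ceph setup):

root@pve01:~# df -h /var/lib/ceph                    # free space on the filesystem holding the mon store
root@pve01:~# du -sh /var/lib/ceph/mon/ceph-pve01/   # size of this monitor's data directory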

Is it possible to just delete the failed monitor and create a fresh one? Will that restore quorum?
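From what I have read, the procedure on Proxmox would be roughly the following, assuming the other two monitors still form a quorum between themselves (please correct me if this is wrong):

root@pve01:~# ceph quorum_status --format json-pretty   # confirm the surviving mons have quorum
root@pve01:~# systemctl stop ceph-mon@pve01             # make sure the broken monitor is stopped
root@pve01:~# pveceph mon destroy pve01                 # remove the failed monitor from the cluster
root@pve01:~# pveceph mon create                        # create a fresh monitor on this node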
 