I have a 2-node cluster running Ceph (I know that's not ideal). On one machine, the ceph-mon service keeps crashing. Looking at the syslog around the most recent crash, it was preceded by:
Code:
Dec 09 00:00:46 <node> ceph-mon[1207324]: 2024-12-09T00:00:46.474-0800 7ebf05a006c0 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror (PID: 1847586) UID: 0
Dec 09 00:00:46 <node> ceph-mon[1207324]: 2024-12-09T00:00:46.474-0800 7ebf05a006c0 -1 mon.<node>@1(peon) e25 *** Got Signal Hangup ***
Dec 09 00:00:46 <node> ceph-mon[1207324]: 2024-12-09T00:00:46.499-0800 7ebf05a006c0 -1 received signal: Hangup from (PID: 1847587) UID: 0
Dec 09 00:00:46 <node> ceph-mon[1207324]: 2024-12-09T00:00:46.499-0800 7ebf05a006c0 -1 mon.<node>@1(peon) e25 *** Got Signal Hangup ***
This seems to occur each day at 00:00, which I assume is just a daily cleanup step.
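If it helps to confirm, I believe the `killall -q -1 ...` at midnight comes from the Ceph logrotate postrotate script; this is roughly how I'd verify that on this node (the `/etc/logrotate.d/ceph-common` path is an assumption based on the Debian packaging and may differ):
Code:
# Rough check of where the midnight SIGHUP originates (paths may differ per distro/packaging)
cat /etc/logrotate.d/ceph-common        # postrotate script should contain the 'killall -q -1 ceph-mon ...' line
systemctl list-timers logrotate.timer   # confirm logrotate is triggered daily at 00:00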
However, on the days that it crashes, it seems unable to restart some time after the hangup signal. The full log is attached as `log_08122024`.
Code:
Dec 09 00:32:59 <node> ceph-mon[1207324]: [281B blob data]
Dec 09 00:32:59 <node> ceph-mon[1207324]: PutCF( prefix = paxos key = '97673388' value size = 668)
Dec 09 00:32:59 <node> ceph-mon[1207324]: PutCF( prefix = paxos key = 'pending_v' value size = 8)
Dec 09 00:32:59 <node> ceph-mon[1207324]: PutCF( prefix = paxos key = 'pending_pn' value size = 8)
Dec 09 00:32:59 <node> ceph-mon[1207324]: ./src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_transaction(TransactionRef)' thread 7ebefec006c0 time 2024-12-09T00:32:59.070747-0800
Dec 09 00:32:59 <node> ceph-mon[1207324]: ./src/mon/MonitorDBStore.h: 355: ceph_abort_msg("failed to write to db")
Dec 09 00:32:59 <node> ceph-mon[1207324]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd4) [0x7ebf094c8bad]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0xa8a) [0x6064b39c85ca]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 3: (Paxos::handle_begin(boost::intrusive_ptr<MonOpRequest>)+0x390) [0x6064b3acb4b0]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 4: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x363) [0x6064b3ad68f3]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x11c9) [0x6064b3993f29]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 6: (Monitor::_ms_dispatch(Message*)+0x3e4) [0x6064b3994534]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 7: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x45) [0x6064b39c9f25]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 8: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x108) [0x7ebf09766208]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 9: (DispatchQueue::entry()+0x63f) [0x7ebf0976408f]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ebf0982e78d]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 11: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7ebf090a81c4]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7ebf0912885c]
Dec 09 00:32:59 <node> ceph-mon[1207324]: *** Caught signal (Aborted) **
Dec 09 00:32:59 <node> ceph-mon[1207324]: in thread 7ebefec006c0 thread_name:ms_dispatch
Dec 09 00:32:59 <node> ceph-mon[1207324]: ceph version 19.2.0 (3815e3391b18c593539df6fa952c9f45c37ee4d0) squid (stable)
Dec 09 00:32:59 <node> ceph-mon[1207324]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7ebf0905b050]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 2: /lib/x86_64-linux-gnu/libc.so.6(+0x8aebc) [0x7ebf090a9ebc]
Dec 09 00:32:59 <node> ceph-mon[1207324]: 3: gsignal()
Dec 09 00:32:59 <node> ceph-mon[1207324]: 4: abort()
Dec 09 00:32:59 <node> ceph-mon[1207324]: 5: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x18a) [0x7ebf094c8c63]
...
Dec 09 00:32:59 <node> systemd[1]: ceph-mon@<node>.service: Main process exited, code=killed, status=6/ABRT
Dec 09 00:32:59 <node> systemd[1]: ceph-mon@<node>.service: Failed with result 'signal'.
Dec 09 00:32:59 <node> systemd[1]: ceph-mon@<node>.service: Consumed 6min 45.288s CPU time.
Dec 09 00:33:09 <node> systemd[1]: ceph-mon@<node>.service: Scheduled restart job, restart counter is at 1.
Dec 09 00:33:09 <node> systemd[1]: Stopped ceph-mon@<node>.service - Ceph cluster monitor daemon.
Dec 09 00:33:09 <node> systemd[1]: ceph-mon@<node>.service: Consumed 6min 45.288s CPU time.
Dec 09 00:33:09 <node> systemd[1]: Started ceph-mon@<node>.service - Ceph cluster monitor daemon.
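Since the abort comes from `ceph_abort_msg("failed to write to db")`, i.e. the mon's RocksDB store rejected a write, my next step is to look at the disk and filesystem under the mon store on that node. A rough sketch of the checks I have in mind (the mon data path assumes a package-based install like mine, and /dev/sdX is a placeholder):
Code:
# Hedged sketch of checks on the storage backing the mon store
dmesg -T | grep -iE 'error|i/o|remount'              # kernel-level disk/filesystem errors around the crash
df -h /var/lib/ceph/mon && df -i /var/lib/ceph/mon   # free space and inodes under the mon data path
du -sh /var/lib/ceph/mon/ceph-*/store.db             # size of the mon's RocksDB store
smartctl -a /dev/sdX                                 # SMART health of the underlying device (replace /dev/sdX)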
I recently upgraded from Ceph Reef to Squid, but the crashes happened on both. The logs from before the upgrade don't show any obvious error, but they do indicate that ceph-mon was unable to start after 6 retries. The full log is attached as `log_05122024`.
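The "unable to start after 6 retries" part looks like systemd's start-rate limiting kicking in rather than a Ceph-level error; the limits it applies to the mon unit can be checked like this (property names are standard systemd, the values depend on the shipped unit file):
Code:
# Check the restart/start-limit settings systemd applies to the mon unit
systemctl show ceph-mon@<node> -p Restart -p StartLimitBurst -p StartLimitIntervalUSec
# Pull the journal around the failed start attempts (date range is illustrative)
journalctl -u ceph-mon@<node> --since "2024-12-05" --until "2024-12-06"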
On the other machine, the mon daemon has not had any issues. I'd love any help figuring out what might be causing this. The crashes only started happening this frequently recently, maybe within the past 2 months or so. It may also be worth noting that only the mon daemon crashes; the MDS and OSD daemons running on the same node seem fine (other than the degraded cluster). Also, after manually removing the mon, recovering the cluster, and recreating the mon (rough sketch of what I do below), it runs without issue for some time before the next failure.
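For reference, the remove/recreate part of that cycle is roughly the standard manual procedure (this glosses over the quorum recovery on the surviving mon; names and paths are placeholders for my package-based setup):
Code:
# Rough sketch of the mon remove/recreate cycle (names/paths are placeholders)
systemctl stop ceph-mon@<node>
ceph mon remove <node>                        # remove it from the monmap
rm -rf /var/lib/ceph/mon/ceph-<node>          # wipe the old (presumably corrupt) store
ceph mon getmap -o /tmp/monmap                # fetch the current monmap
ceph auth get mon. -o /tmp/mon.keyring        # and the mon keyring
ceph-mon -i <node> --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
systemctl start ceph-mon@<node>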