Good evening,
We are running Ceph 16.2.9 on Proxmox VE 7.3 with 69 OSDs and CephFS. Since an unexpected power outage in our town, which lasted longer than the 90 minutes our diesel backup could cover, we can no longer access one of our CephFS pools. The MDS daemons start, reconnect and replay, and then end up stopped: each attempt aborts on a failed assertion, and systemd eventually gives up restarting the service.
Has anybody seen log entries like these before? I cannot make sense of them at all:
Code:
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: -1> 2023-02-07T14:52:22.773+0000 7f16e7efa700 -1 ./src/mds/StrayManager.cc: In function 'void StrayManager::_eval_stray_remote(CDentry*, CDentry*)' thread 7f16e7efa700 time 2023-02-07T14:52:22.771445+0000
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: ./src/mds/StrayManager.cc: 619: FAILED ceph_assert(stray_in->get_inode()->nlink >= 1)
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7f16ed3e2fde]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 2: /usr/lib/ceph/libceph-common.so.2(+0x251169) [0x7f16ed3e3169]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 3: (StrayManager::_eval_stray_remote(CDentry*, CDentry*)+0x5d4) [0x5617f42c83c4]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 4: (StrayManager::_eval_stray(CDentry*)+0x605) [0x5617f42c8dc5]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 5: (StrayManager::eval_stray(CDentry*)+0x1f) [0x5617f42c915f]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 6: (MDCache::scan_stray_dir(dirfrag_t)+0x338) [0x5617f4225de8]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 7: (MDSContext::complete(int)+0x5b) [0x5617f442f86b]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 8: (MDSRank::_advance_queues()+0x80) [0x5617f411d720]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1cf) [0x5617f411dfff]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x58) [0x5617f411e9e8]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x5617f40f8b6f]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x468) [0x7f16ed612cb8]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 13: (DispatchQueue::entry()+0x5ef) [0x7f16ed6103bf]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f16ed6cf4bd]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f16ed13dea7]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 16: clone()
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 0> 2023-02-07T14:52:22.777+0000 7f16e7efa700 -1 *** Caught signal (Aborted) **
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: in thread 7f16e7efa700 thread_name:ms_dispatch
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f16ed149140]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 2: gsignal()
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 3: abort()
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x7f16ed3e3028]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 5: /usr/lib/ceph/libceph-common.so.2(+0x251169) [0x7f16ed3e3169]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 6: (StrayManager::_eval_stray_remote(CDentry*, CDentry*)+0x5d4) [0x5617f42c83c4]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 7: (StrayManager::_eval_stray(CDentry*)+0x605) [0x5617f42c8dc5]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 8: (StrayManager::eval_stray(CDentry*)+0x1f) [0x5617f42c915f]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 9: (MDCache::scan_stray_dir(dirfrag_t)+0x338) [0x5617f4225de8]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 10: (MDSContext::complete(int)+0x5b) [0x5617f442f86b]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 11: (MDSRank::_advance_queues()+0x80) [0x5617f411d720]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 12: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x1cf) [0x5617f411dfff]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 13: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x58) [0x5617f411e9e8]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 14: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x5617f40f8b6f]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 15: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x468) [0x7f16ed612cb8]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 16: (DispatchQueue::entry()+0x5ef) [0x7f16ed6103bf]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 17: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f16ed6cf4bd]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 18: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f16ed13dea7]
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: 19: clone()
Feb 07 14:52:22 hyper-4-1 ceph-mds[2388349]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Feb 07 14:52:22 hyper-4-1 systemd[1]: ceph-mds@hyper-4-1.service: Main process exited, code=killed, status=6/ABRT
Feb 07 14:52:22 hyper-4-1 systemd[1]: ceph-mds@hyper-4-1.service: Failed with result 'signal'.
Feb 07 14:52:22 hyper-4-1 systemd[1]: ceph-mds@hyper-4-1.service: Consumed 3.474s CPU time.
Feb 07 14:52:23 hyper-4-1 systemd[1]: ceph-mds@hyper-4-1.service: Scheduled restart job, restart counter is at 130.
Feb 07 14:52:23 hyper-4-1 systemd[1]: Stopped Ceph metadata server daemon.
Feb 07 14:52:23 hyper-4-1 systemd[1]: ceph-mds@hyper-4-1.service: Consumed 3.474s CPU time.
Feb 07 14:52:23 hyper-4-1 systemd[1]: ceph-mds@hyper-4-1.service: Start request repeated too quickly.
Feb 07 14:52:23 hyper-4-1 systemd[1]: ceph-mds@hyper-4-1.service: Failed with result 'signal'.
Feb 07 14:52:23 hyper-4-1 systemd[1]: Failed to start Ceph metadata server daemon.
Feb 07 14:57:30 hyper-4-1 systemd[1]: ceph-mds@hyper-4-1.service: Start request repeated too quickly.
Feb 07 14:57:30 hyper-4-1 systemd[1]: ceph-mds@hyper-4-1.service: Failed with result 'signal'.
Feb 07 14:57:30 hyper-4-1 systemd[1]: Failed to start Ceph metadata server daemon.
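For anyone who wants to pull the essentials out of a dump like this, here is a minimal Python sketch. It assumes the journald line format shown above. The script name `mds_crash_summary.py` and every function and field name in it are my own hypothetical choices, not part of Ceph. It extracts the failed `ceph_assert` (file, line, expression), the function it fired in, the symbolized frames of the first backtrace, and the systemd restart counter.

```python
import re
import sys
from dataclasses import dataclass, field
from typing import Iterable, List, Optional

# Hypothetical helper, not part of Ceph. Assumes journald lines of the form
# "<Mon> <DD> <HH:MM:SS> <host> <proc>[<pid>]: <message>", as in the log above.
JOURNAL_RE = re.compile(r"^\w{3} +\d+ [\d:]+ \S+ [^\s\[:]+(?:\[\d+\])?:\s+(?P<msg>.*)$")
ASSERT_RE = re.compile(r"^(?P<file>\S+): (?P<line>\d+): FAILED ceph_assert\((?P<expr>.*)\)$")
FUNC_RE = re.compile(r"In function '(?P<func>[^']+)'")
# Symbolized frame, e.g. "3: (StrayManager::_eval_stray_remote(...)+0x5d4) [0x5617f42c83c4]"
FRAME_RE = re.compile(r"^\s*\d+: \((?P<sym>.+)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]$")
RESTART_RE = re.compile(r"restart counter is at (?P<n>\d+)")


@dataclass
class CrashSummary:
    assert_file: Optional[str] = None
    assert_line: Optional[int] = None
    assert_expr: Optional[str] = None
    function: Optional[str] = None
    frames: List[str] = field(default_factory=list)  # innermost frame first
    restart_counter: Optional[int] = None
    gave_up: bool = False  # systemd hit its start rate limit


def summarize(lines: Iterable[str]) -> CrashSummary:
    s = CrashSummary()
    in_assert_dump = False  # True between "FAILED ceph_assert" and "Caught signal"
    for raw in lines:
        m = JOURNAL_RE.match(raw.strip())
        if not m:
            continue  # skip anything that is not a journald line
        msg = m.group("msg")
        if s.function is None and (f := FUNC_RE.search(msg)):
            s.function = f.group("func")
        if s.assert_expr is None and (a := ASSERT_RE.match(msg)):
            s.assert_file, s.assert_expr = a.group("file"), a.group("expr")
            s.assert_line = int(a.group("line"))
            in_assert_dump = True
            continue
        if "Caught signal" in msg:
            in_assert_dump = False  # second dump repeats the same frames
        if in_assert_dump and (fr := FRAME_RE.match(msg)):
            # Keep only the qualified name, dropping the argument list.
            s.frames.append(fr.group("sym").split("(")[0])
        if r := RESTART_RE.search(msg):
            s.restart_counter = int(r.group("n"))
        if "Start request repeated too quickly" in msg:
            s.gave_up = True
    return s


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        summary = summarize(fh)
    print(f"assert: {summary.assert_expr} at {summary.assert_file}:{summary.assert_line}")
    print(f"in: {summary.function}")
    print("backtrace: " + " <- ".join(summary.frames))
    print(f"systemd restarts: {summary.restart_counter}, gave up: {summary.gave_up}")
```

To use it, save the journal with `journalctl -u ceph-mds@hyper-4-1 --no-pager > mds.log` and run `python3 mds_crash_summary.py mds.log`. For the excerpt above, the failing check is `stray_in->get_inode()->nlink >= 1` at `./src/mds/StrayManager.cc:619`. In other words, a stray inode's link count is expected to be at least 1 while `StrayManager::_eval_stray_remote` handles it. That code path is reached from `MDCache::scan_stray_dir` via `StrayManager::eval_stray`.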
If more details are needed, let me know.
Thanks a lot!