All MDS's in standby

cfgmgr

Greetings!

Yesterday I upgraded from 8.2.0 to 8.4.5. Then I needed to upgrade Ceph from 17 to 18 and then from 18 to 19. The upgrade and the subsequent Ceph upgrades went without a hitch.

I ran into an issue this morning: at some point after the upgrade, all the MDS daemons were sitting in standby and none were active.

I restarted where the primary MDS was, but it just went to standby right away.

I ran the following, which brought everything back online.

Code:
ceph fs set cephfs allow_standby_replay false
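
For anyone hitting the same thing, here is a rough sketch of how the state could be checked before and after toggling the setting. It assumes the filesystem is named cephfs as in the command above. The function names show_mds_state and set_standby_replay are made up for illustration; the ceph subcommands they call are ordinary CLI calls.

```shell
# Hypothetical helper names; only the ceph subcommands are real CLI calls.

# Show which MDS daemons are active vs. standby for a filesystem.
show_mds_state() {
    fs="${1:-cephfs}"
    ceph fs status "$fs"   # per-rank state plus the standby list
    ceph mds stat          # one-line summary of the MDS map
}

# Toggle standby-replay on a filesystem (value must be true or false).
set_standby_replay() {
    fs="$1"
    value="$2"
    case "$value" in
        true|false) ceph fs set "$fs" allow_standby_replay "$value" ;;
        *) echo "value must be true or false" >&2; return 1 ;;
    esac
}
```

Running show_mds_state cephfs before and after set_standby_replay cephfs false should make it obvious whether a rank actually went active. The same helper with true would put the setting back later.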

This Red Hat KB is almost a dead match for what we saw:
https://access.redhat.com/solutions/7090429

Has anyone seen this after upgrading to Squid? I did review the logs, but it wasn't clear where the issue was. Per the Red Hat KB above, they are still investigating. It also doesn't indicate which version of Ceph they were running, other than the product names.

Thanks!
 
Checking back this morning - everything has been fine. I will probably set the above setting back to true today.
 
Well, the MDS's were "fine", but apparently there have been a couple of crashes today on only one HV. However, this is the manager, not the MDS.
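
In case it helps anyone collect the same data, this is a small sketch of how the crash reports could be pulled in one go. The helper name dump_new_crashes is made up, and it assumes the crash ID is the first column of the ceph crash ls-new output and starts with a four-digit year, as in the ID shown in this post.

```shell
# Hypothetical helper; `ceph crash ls-new` and `ceph crash info` are crash-module commands.
# Print full info for every crash that has not been archived yet.
dump_new_crashes() {
    # Keep only rows whose first column looks like a crash ID (starts with YYYY-),
    # which skips any header line.
    ceph crash ls-new \
        | awk '$1 ~ /^[0-9][0-9][0-9][0-9]-/ { print $1 }' \
        | while read -r id; do
            ceph crash info "$id" < /dev/null
        done
}
```

Once the reports have been looked at, ceph crash archive with the crash ID (or ceph crash archive-all) clears the recent-crash health warning without deleting the records.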

Code:
# ceph crash info 2025-08-01T06:10:00.370935Z_93ce2fcc-5c7b-4e12-b6b1-4acfe945b9e2
{
    "assert_condition": "nref == 0",
    "assert_file": "./src/common/RefCountedObj.cc",
    "assert_func": "virtual ceph::common::RefCountedObject::~RefCountedObject()",
    "assert_line": 14,
    "assert_msg": "./src/common/RefCountedObj.cc: In function 'virtual ceph::common::RefCountedObject::~RefCountedObject()' thread 729edfee16c0 time 2025-08-01T06:10:00.368960+0000\n./src/common/RefCountedObj.cc: 14: FAILED ceph_assert(nref == 0)\n",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x729ef825b050]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x8aeec) [0x729ef82a9eec]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17b) [0x729ef86c784c]",
        "/usr/lib/ceph/libceph-common.so.2(+0x2c798f) [0x729ef86c798f]",
        "/usr/lib/ceph/libceph-common.so.2(+0x3c0c15) [0x729ef87c0c15]",
        "(MMgrCommand::~MMgrCommand()+0x7a) [0x59c1f881277a]",
        "(ceph::common::RefCountedObject::put() const+0x1ad) [0x729ef87c0f2d]",
        "(TrackedOp::put()+0x25a) [0x59c1f866fb3a]",
        "(OpHistoryServiceThread::entry()+0x143) [0x59c1f86e77f3]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x729ef82a81f5]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x729ef832889c]"
    ],
    "ceph_version": "19.2.2",
    "crash_id": "2025-08-01T06:10:00.370935Z_93ce2fcc-5c7b-4e12-b6b1-4acfe945b9e2",
    "entity_name": "mgr.hv01",
    "os_id": "12",
    "os_name": "Debian GNU/Linux 12 (bookworm)",
    "os_version": "12 (bookworm)",
    "os_version_id": "12",
    "process_name": "ceph-mgr",
    "stack_sig": "34573e4c3543433958d462fb8fbe67add0880797713bcf5217f9638195366242",
    "timestamp": "2025-08-01T06:10:00.370935Z",
    "utsname_hostname": "hv01",
    "utsname_machine": "x86_64",
    "utsname_release": "6.8.12-13-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.8.12-13 (2025-07-22T10:00Z)"